Advanced

LangSmith

LangSmith is LangChain's observability and evaluation platform. It provides tracing for debugging, datasets for testing, evaluation for quality measurement, and annotation queues for human review. This domain accounts for ~20% of the certification; practice questions are included at the end of the section.

Tracing

Tracing captures every step of your LLM application's execution, including prompts, model responses, tool calls, latency, and token usage. It is essential for debugging and monitoring.

# LangSmith Tracing Setup
import os

# Enable tracing with environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__your_api_key"
os.environ["LANGCHAIN_PROJECT"] = "my-project"  # Optional: organize by project

# That's it! All LangChain calls are now traced automatically.
# Every chain.invoke(), model.invoke(), retriever.invoke() is logged.

# View traces at: https://smith.langchain.com

# What tracing captures:
tracing_captures = {
    "Input/Output": "Full prompt and response for every LLM call",
    "Latency": "Time taken for each step and total chain execution",
    "Token usage": "Input tokens, output tokens, and cost",
    "Tool calls": "Which tools were called, with what arguments, and results",
    "Errors": "Full stack traces for any failures",
    "Nested runs": "Parent-child relationships between chain steps",
    "Metadata": "Custom tags, session IDs, user IDs"
}

# Add custom metadata to traces
from langchain_core.tracers.context import tracing_v2_enabled

with tracing_v2_enabled(project_name="my-project", tags=["production"]):
    result = chain.invoke({"question": "What is LangSmith?"})

# Or add metadata per-call
result = chain.invoke(
    {"question": "What is LangSmith?"},
    config={"metadata": {"user_id": "user-123", "session_id": "sess-456"}}
)
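To make the "nested runs" idea concrete, here is a toy sketch in plain Python (not the LangSmith SDK — the `Run` class and `flatten` helper are illustrative stand-ins) of the parent-child run tree a trace records, with per-step latency the way the LangSmith UI renders it:

```python
# Conceptual sketch of a trace's run tree: each step is a run with latency
# and child runs for sub-steps. Not the LangSmith SDK — names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Run:
    name: str
    latency_ms: float = 0.0
    tokens: int = 0
    children: list = field(default_factory=list)

def flatten(run, depth=0):
    """Walk the tree the way the trace UI renders it: indented children."""
    rows = [(depth, run.name, run.latency_ms)]
    for child in run.children:
        rows.extend(flatten(child, depth + 1))
    return rows

# A chain invocation produces a root run with one child per sub-step
trace = Run("rag_chain", latency_ms=990.0, children=[
    Run("retriever", latency_ms=120.0),
    Run("llm", latency_ms=850.0, tokens=412),
])

for depth, name, ms in flatten(trace):
    print("  " * depth + f"{name}: {ms:.0f} ms")
```

The root run's latency covers the whole chain; each child run carries its own latency and token counts, which is what lets the UI attribute time and cost to individual steps.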

Datasets & Evaluation

# LangSmith Datasets and Evaluation
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="qa-test-set",
    description="QA evaluation dataset for RAG pipeline"
)

# Add examples to the dataset
examples = [
    {
        "inputs": {"question": "What is LangChain?"},
        "outputs": {"answer": "LangChain is a framework for building LLM applications."}
    },
    {
        "inputs": {"question": "What is RAG?"},
        "outputs": {"answer": "RAG is retrieval-augmented generation."}
    }
]

for example in examples:
    client.create_example(
        inputs=example["inputs"],
        outputs=example["outputs"],
        dataset_id=dataset.id
    )

# Run evaluation
from langsmith.evaluation import evaluate

def predict(inputs: dict) -> dict:
    """Run your chain on the input."""
    result = rag_chain.invoke(inputs["question"])
    return {"answer": result}

# Custom evaluator
def correctness(run, example) -> dict:
    """Check if the prediction matches the reference."""
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1 if reference.lower() in prediction.lower() else 0
    return {"key": "correctness", "score": score}

# Run evaluation
results = evaluate(
    predict,
    data="qa-test-set",
    evaluators=[correctness],
    experiment_prefix="rag-v1"
)

# Built-in evaluators you should know (wrappers over LangChain string evaluators):
from langsmith.evaluation import LangChainStringEvaluator

# e.g. LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"})

evaluators = {
    "criteria": "LLM-based evaluation against criteria (correctness, helpfulness)",
    "embedding_distance": "Cosine distance between prediction and reference embeddings",
    "string_distance": "Edit distance between prediction and reference strings",
    "exact_match": "Exact string match between prediction and reference"
}
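The evaluator contract can be exercised locally with stand-in run/example objects. The sketch below uses `SimpleNamespace` as a mock (the real objects come from LangSmith) and adds plain-Python analogues of the exact-match and string-distance scores listed above — `difflib` here is a cheap stand-in for a proper edit-distance metric:

```python
# Runnable sketch of the evaluator contract with mock run/example objects.
# SimpleNamespace stands in for LangSmith's Run and Example types.
from types import SimpleNamespace
from difflib import SequenceMatcher

def correctness(run, example) -> dict:
    """Containment check: does the prediction include the reference?"""
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1 if reference.lower() in prediction.lower() else 0
    return {"key": "correctness", "score": score}

def exact_match(run, example) -> dict:
    """Strict equality between prediction and reference."""
    score = 1 if run.outputs["answer"] == example.outputs["answer"] else 0
    return {"key": "exact_match", "score": score}

def string_similarity(run, example) -> dict:
    """difflib ratio as a stand-in for edit-distance-based scoring."""
    ratio = SequenceMatcher(
        None, run.outputs["answer"], example.outputs["answer"]
    ).ratio()
    return {"key": "string_similarity", "score": ratio}

run = SimpleNamespace(outputs={"answer": "RAG is retrieval-augmented generation."})
example = SimpleNamespace(outputs={"answer": "retrieval-augmented generation"})

print(correctness(run, example))   # containment holds -> score 1
print(exact_match(run, example))   # strings differ -> score 0
```

Because evaluators are plain functions of `(run, example)`, they can be unit-tested like this before being passed to evaluate().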

Annotation Queues

# Annotation Queues - Human review of LLM outputs
from langsmith import Client

client = Client()

# Create an annotation queue
queue = client.create_annotation_queue(
    name="review-queue",
    description="Queue for reviewing RAG responses"
)

# Add runs to the queue for review
# (This is typically done through the LangSmith UI or programmatically)

# Annotation queue use cases:
annotation_uses = {
    "Quality review": "Human reviewers score LLM outputs for quality",
    "Dataset curation": "Approved examples are added to evaluation datasets",
    "Feedback loop": "Use human feedback to improve prompts and chains",
    "Compliance": "Review outputs for safety, accuracy, and policy compliance",
    "Edge case collection": "Identify and catalog failure cases"
}

# LangSmith features overview:
langsmith_features = {
    "Tracing": "Automatic logging of all LLM calls, chains, and tools",
    "Datasets": "Create test sets for systematic evaluation",
    "Evaluation": "Run automated evaluations with custom metrics",
    "Annotation Queues": "Human review workflows for quality assurance",
    "Monitoring": "Track latency, cost, error rates in production",
    "Prompt Hub": "Version and manage prompt templates",
    "Playground": "Test prompts interactively in the browser"
}
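The "dataset curation" loop above can be sketched in plain Python. This is a toy model, not the LangSmith API: runs enter a queue, a reviewer scores each one, and approved outputs are promoted into a dataset while failures are set aside for triage:

```python
# Toy sketch of the annotation-queue workflow (not the LangSmith API):
# review runs, promote approved ones to a dataset, collect rejects.
from collections import deque

queue = deque([
    {"question": "What is RAG?", "answer": "Retrieval-augmented generation."},
    {"question": "What is LangSmith?", "answer": "A weather API."},  # wrong
])
dataset, rejected = [], []

def review(run):
    """Stand-in for a human reviewer's approve/reject decision."""
    return "weather" not in run["answer"].lower()

while queue:
    run = queue.popleft()
    (dataset if review(run) else rejected).append(run)

print(len(dataset), len(rejected))  # 1 approved, 1 rejected
```

In LangSmith the same loop runs through the UI: reviewers work the queue, and approved runs can be added to an evaluation dataset with a click.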

Prompt Management

# LangSmith Prompt Hub - Version and manage prompts
from langchain import hub

# Pull a prompt from the Hub
prompt = hub.pull("rlm/rag-prompt")

# Push a prompt to the Hub
from langchain_core.prompts import ChatPromptTemplate

my_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer based on context:\n{context}"),
    ("human", "{question}")
])

hub.push("my-org/rag-prompt-v2", my_prompt)

# Prompt versioning benefits:
prompt_versioning = {
    "Version control": "Track changes to prompts over time",
    "A/B testing": "Compare prompt versions with evaluation",
    "Rollback": "Revert to a previous prompt version if quality drops",
    "Collaboration": "Team members can share and review prompts",
    "Audit trail": "See who changed what and when"
}
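The versioning benefits above can be illustrated with a minimal in-memory registry. This is a conceptual sketch, not the Prompt Hub API — hub.push()/hub.pull() are the real calls; `PromptRegistry` below just shows what versioned storage with rollback means:

```python
# Minimal sketch of versioned prompt storage with rollback.
# Illustrative only — not the LangSmith Prompt Hub API.
class PromptRegistry:
    def __init__(self):
        self._versions = {}  # prompt name -> list of prompt strings

    def push(self, name, prompt):
        """Append a new version; returns the 1-based version number."""
        self._versions.setdefault(name, []).append(prompt)
        return len(self._versions[name])

    def pull(self, name, version=None):
        """Latest version by default; pin an older one to roll back."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]

registry = PromptRegistry()
registry.push("rag-prompt", "Answer using context:\n{context}\n\nQ: {question}")
registry.push("rag-prompt", "Use ONLY the context below.\n{context}\n\nQ: {question}")

latest = registry.pull("rag-prompt")          # newest version
v1 = registry.pull("rag-prompt", version=1)   # rollback: pin an earlier version
```

The key design point is that prompts live outside application code: deploying a new prompt version, or rolling one back, does not require a code change.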

Practice Questions

💡
Test your knowledge of LangSmith:
Q1: What environment variables enable LangSmith tracing?

Answer: Set LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY=ls__your_key. Optionally set LANGCHAIN_PROJECT to organize traces by project. Once these are set, all LangChain operations are automatically traced without any code changes. Traces are viewable at smith.langchain.com.

Q2: How do you create a custom evaluator in LangSmith?

Answer: Write a function that takes run (containing the prediction) and example (containing the reference) as arguments and returns a dictionary with "key" (metric name) and "score" (numeric value). Pass this function in the evaluators list when calling evaluate(). The evaluator can use simple string matching, LLM-based grading, or any custom logic.

Q3: What is the purpose of annotation queues?

Answer: Annotation queues enable human reviewers to systematically evaluate LLM outputs. Runs are added to the queue, and reviewers score or label them. Use cases include: quality assurance, dataset curation (approved outputs become evaluation examples), feedback loops for improving prompts, compliance checking, and collecting edge cases for robustness testing.

Q4: What information does a LangSmith trace capture?

Answer: A trace captures: (1) full input/output for every LLM call, (2) latency for each step, (3) token usage and estimated cost, (4) tool calls with arguments and results, (5) error stack traces, (6) nested parent-child relationships between chain steps, and (7) custom metadata (tags, user IDs, session IDs). This provides complete observability into chain execution.

Q5: How does the LangSmith Prompt Hub help with prompt management?

Answer: The Prompt Hub provides version control for prompts. You can push prompts with hub.push() and pull them with hub.pull(). Benefits include: tracking changes over time, A/B testing different versions with evaluation, rolling back if quality drops, team collaboration and review, and maintaining an audit trail. It decouples prompts from application code.

Key Takeaways

💡
  • Enable tracing with two environment variables — LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY
  • Create datasets with examples (inputs + expected outputs) for systematic evaluation
  • Write custom evaluators that return {"key": "name", "score": value}
  • Annotation queues enable human review workflows for quality assurance
  • The Prompt Hub provides version control and collaboration for prompt templates