Use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in LangChain and other LLM applications.
In my recent introduction to LangChain, I touched briefly on LangSmith. Here, we'll take a closer look at the platform, which works in tandem with LangChain and can also be used with other LLM frameworks.
My quick take on LangSmith is that you can use it to trace and evaluate LLM applications and intelligent agents and move them from prototype to production. Here's what the LangSmith documentation says about it:
LangSmith is a platform for building production-grade LLM applications. It lets you debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework and seamlessly integrates with LangChain, the go-to open-source framework for building with LLMs.
As of this writing, there are two official implementations of LangChain in different programming languages: one in Python and one in TypeScript/JavaScript. We'll use the Python implementation for our examples.
LangSmith with LangChain
So, basics. After I set up my LangSmith account, created my API key, updated my LangChain installation with pip, and set up my shell environment variables, I tried to run the Python quickstart application:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI()
llm.predict("Hello, world!")
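For reference, the environment setup mentioned above boils down to a handful of variables that tell LangChain to send traces to LangSmith. Here's a minimal sketch that sets them from Python with placeholder values; you could equally export them in your shell, and the variable names follow the LangSmith quickstart documentation:
import os
# Placeholder values; substitute your own keys and project name.
os.environ["LANGCHAIN_TRACING_V2"] = "true"  # turn on LangSmith tracing
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "default"  # traces land in this project
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
# Any LangChain calls made after this point, such as the quickstart above,
# are logged to the named LangSmith project.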
Before we discuss the results, take a look at the LangSmith hub:
Figure 1. The LangSmith hub acts as a repository for prompts, models, use cases, and other LLM artifacts.
Moving on to the next tab, here is the trace list from the default project:
Figure 2. The Python project trace list shows logs of my six attempts to run the quickstart. The first five were unsuccessful: the Python output indicated a timeout from OpenAI.
I took the hint from the timeouts, went to my OpenAI account, and upgraded my ChatGPT plan to ChatGPT Plus ($20 per month). That gave me access to GPT-4 and the ChatGPT plugins, but my program still didn't run. I left the upgrade in place: I suspect I'll need the additional capabilities.
Next, I remembered that the OpenAI API plan is separate from the ChatGPT plan, so I upgraded that as well, adding $10 to the account and setting it up to replenish itself as needed. Now the Python program ran to completion, and I was able to see the successful results in LangSmith:
Figure 3. A successful prediction run, finally. Note the Playground button at the top right of the screen.
Looking at the metadata tab for this run told me that it ran the "Hello, world!" prompt against the gpt-3.5-turbo model at a sampling temperature of 0.7. Lower temperatures make the output more deterministic and focused; higher temperatures make it more random.
Figure 4. The metadata for a successful prediction. In addition to the YAML block at the bottom, there's a JSON block with the same information.
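If you'd rather not rely on the default temperature, you can set it, along with the model, explicitly when you construct the chat model. A minimal sketch:
from langchain.chat_models import ChatOpenAI
# A lower temperature makes the completions more deterministic.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)
llm.predict("Hello, world!")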
Overview of LangSmith
LangSmith logs all calls to LLMs, chains, agents, tools, and retrievers in a LangChain or other LLM program. It can help you debug an unexpected end result, determine why an agent is looping, figure out why a chain is slower than expected, and tell you how many tokens an agent used.
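The same trace data is available programmatically through the langsmith Python client, which is handy when you want to audit token usage or errors in bulk. A rough sketch, keeping in mind that the exact fields on a run can vary by SDK version:
from langsmith import Client

client = Client()
# List recent runs in a project and print basic stats.
# "default" is the project name used earlier; substitute your own.
for run in client.list_runs(project_name="default"):
    tokens = getattr(run, "total_tokens", None)  # populated for LLM runs
    print(run.name, run.run_type, run.start_time, tokens, run.error)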
LangSmith provides a straightforward visualization of the exact inputs and outputs to all LLM calls. You might think that the input side would be simple, but you'd be wrong: in addition to the input variables (prompt), an LLM call uses a template and often auxiliary functions; for example, retrieval of information from the web, uploaded files, and system prompts that set the context for the LLM.
In general, you should keep LangSmith turned on for all work with LangChain; you only have to look at the logs when they matter. One of the useful things you can try, if an input prompt doesn't give you the results you need, is to take the prompt to the Playground, which is shown in Figure 5 below. Use the button at the top right of the LangSmith trace page (shown in Figure 4) to navigate to the Playground.
Figure 5. The LangSmith Playground allows you to interactively edit your input, change model and temperature, adjust other parameters, add function calls, add stop sequences, and add human, AI, system, function, and chat messages. This is a time saver compared to editing all of this in a Python program.
Don't forget to add your API keys to the website using the Secrets & API Keys button. Note that playground runs are stored in a separate LangSmith project.
LangSmith LLMChain example
In my introduction to LangChain, I gave the example of an LLMChain that combines a ChatOpenAI call with a simple comma-separated list parser. Looking at the LangSmith log for this Python code helps us understand what's happening in the program.
The parser is a subclass of the BaseOutputParser class. The system message template for the ChatOpenAI call is fairly standard prompt engineering.
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain.schema import BaseOutputParser

class CommaSeparatedListOutputParser(BaseOutputParser):
    """Parse the output of an LLM call to a comma-separated list."""

    def parse(self, text: str):
        """Parse the output of an LLM call."""
        return text.strip().split(", ")

template = """You are a helpful assistant who generates comma separated lists.
A user will pass in a category, and you should generate 5 objects in that category in a comma separated list.
ONLY return a comma separated list, and nothing more."""
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
chain = LLMChain(
    llm=ChatOpenAI(),
    prompt=chat_prompt,
    output_parser=CommaSeparatedListOutputParser()
)
chain.run("colors")
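The parser itself is easy to sanity-check in isolation, since it just splits the model's text on ", ". Using the class defined above:
parser = CommaSeparatedListOutputParser()
parser.parse("red, orange, yellow, green, blue")
# -> ['red', 'orange', 'yellow', 'green', 'blue']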
Figure 6. The Run tab for the top-level chain shows the human input, the parsed output, the latency (under a second), and the tokens used, as well as the clock time and call status.
Diving down to the ChatOpenAI LLM call provides additional information, shown in Figure 7.
Figure 7. At the LLM level, we see the system input and the output produced by the LLM before parsing.
We can glean even more information from the metadata, shown in Figure 8.
Figure 8. The metadata for the ChatOpenAI call tells us the model used (gpt-3.5-turbo), the sampling temperature (0.7), and the runtime version numbers.
LangSmith evaluation quickstart
This walkthrough evaluates a chain using a dataset of examples. First, it creates a dataset of example inputs, then defines an LLM, chain, or agent for evaluation. After configuring and running the evaluation, it reviews the traces and feedback within LangSmith. Let's start with the code. Note that the dataset creation step can only be run once, as it lacks the ability to detect an existing dataset by the same name; a simple workaround is sketched after the listing.
from langsmith import Client

example_inputs = [
    "a rap battle between Atticus Finch and Cicero",
    "a rap battle between Barbie and Oppenheimer",
    "a Pythonic rap battle between two swallows: one European and one African",
    "a rap battle between Aubrey Plaza and Stephen Colbert",
]

client = Client()
dataset_name = "Rap Battle Dataset"

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name, description="Rap battle prompts.",
)
for input_prompt in example_inputs:
    # Each example must be unique and have inputs defined.
    # Outputs are optional.
    client.create_example(
        inputs={"question": input_prompt},
        outputs=None,
        dataset_id=dataset.id,
    )

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
def create_chain():
    llm = ChatOpenAI(temperature=0)
    return LLMChain.from_string(llm, "Spit some bars about {input}.")

from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness".
        "criteria",
        # Or you can configure the evaluator.
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {"cliche": "Are the lyrics cliche?"
             " Respond Y if they are, N if they're entirely unique."}
        ),
    ]
)
run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    verbose=True,
    project_name="llmchain-test-1",
)
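As noted above, client.create_dataset() will fail if you run the script a second time with the same dataset name. One defensive workaround, assuming client.read_dataset() raises an exception when no dataset by that name exists, is to look the dataset up first and create it only when the lookup fails:
from langsmith import Client

client = Client()
dataset_name = "Rap Battle Dataset"

try:
    # read_dataset raises if no dataset with this name exists yet.
    dataset = client.read_dataset(dataset_name=dataset_name)
except Exception:  # the specific exception type varies by SDK version
    dataset = client.create_dataset(
        dataset_name=dataset_name, description="Rap battle prompts.",
    )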
We have a lot more to look at for this example than the last one. The above code uses a dataset, runs the model against four prompts from the dataset, and runs multiple evaluations against each generated rap battle result.
Here are the evaluation statistics, which were printed in the terminal during the run:
Eval quantiles:
             0.25  0.5  0.75  mean  mode
harmfulness  0.00  0.0   0.0  0.00   0.0
helpfulness  0.75  1.0   1.0  0.75   1.0
cliche       1.00  1.0   1.0  1.00   1.0
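You can also pull the underlying evaluator feedback out of LangSmith with the client, rather than reading it off the dashboard. A rough sketch, assuming the list_runs and list_feedback methods behave as in recent versions of the langsmith SDK:
from langsmith import Client

client = Client()

# Collect the runs that run_on_dataset logged to the test project.
runs = list(client.list_runs(project_name="llmchain-test-1"))

# Each evaluator result is stored as feedback attached to a run.
for feedback in client.list_feedback(run_ids=[run.id for run in runs]):
    print(feedback.key, feedback.score)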
Somebody had fun creating the rap battle prompts, as shown in the dataset below:
Figure 9. The key-value dataset created by the client.create_dataset() call.
As an aside, I had to look up Aubrey Plaza, who played the deadpan comic character April Ludgate in Parks and Recreation.
This code used its own project name, llmchain-test-1, so thatโs where we look for results:
Figure 10. The first line in each pair is the LLM chain result, and the second is the LLM result.
Here is the Barbie vs. Oppenheimer rap battle, as generated by gpt-3.5-turbo.
Figure 11. This is the end of the Barbie/Oppenheimer rap battle text generated by the LLM chain. It won't win any prizes.
The LangSmith Cookbook
While the standard LangSmith documentation covers the basics, the LangSmith Cookbook repository delves into common patterns and real-world use cases. You should clone or fork the repo to run the code. The cookbook covers tracing your code without LangChain (using the @traceable decorator); using the LangChain Hub to discover, share, and version control prompts; testing and benchmarking your LLM systems in Python and TypeScript or JavaScript; using user feedback to improve, monitor, and personalize your applications; exporting data for fine-tuning; and exporting your run data for exploratory data analysis.
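The @traceable decorator mentioned above is the simplest way to log non-LangChain code to LangSmith. A minimal sketch, assuming the tracing environment variables shown earlier are set; the shout function is just a stand-in for whatever code you want traced:
from langsmith import traceable

@traceable(run_type="chain")
def shout(text: str) -> str:
    # Any function decorated with @traceable gets its inputs, outputs,
    # and timing logged as a run in LangSmith.
    return text.upper()

shout("hello, langsmith")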
Conclusion
LangSmith is a platform that works in tandem with LangChain or by itself. In this article, you've seen how to use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in a production-grade LLM application.


