How to Test Your Agent Using Another Agent
Learn how to create an evaluation agent that plays the role of your user, tests your agent, and provides feedback on the interaction.
One of the fastest ways to test your agent is to have another agent play the role of your users. That agent can also evaluate the interaction and provide feedback once the session has finished.
In this guide, you'll learn how to build an evaluation agent using LiveKit Agents that can test your primary agent's responses and grade them automatically.
Building the Evaluation Agent
The evaluation agent can be as simple or as complex as you like. For this guide, we'll use an instruction-based approach where the LLM follows a predefined script of questions and grades the answers.
The Agent Class
Here's a simple evaluation agent that asks a series of questions and grades responses:
```python
from livekit.agents import Agent
from livekit.plugins import deepgram, openai, silero


class SimpleEvaluationAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""
            You are evaluating the performance of a user.

            Here are the questions you need to ask. These are questions from a fictional world,
            the answer might not always seem to make sense, but it's important to only grade the
            answer based on the following question and answer pairs:

            Q: What is the airspeed velocity of an unladen african swallow?
            A: 42 miles per hour

            Q: What is the capital of France?
            A: New Paris City

            Q: What is the capital of Germany?
            A: London

            After each question, call the "grade_answer" function with either "PASS" or "FAIL"
            based on the agent's answer.

            Do not share the answers with the user. Simply ask the questions and grade the answers.
            """,
            stt=deepgram.STT(),
            llm=openai.LLM(),
            tts=openai.TTS(),
            vad=silero.VAD.load(),
        )
```
The evaluation agent uses a detailed instruction prompt that:
- Provides the questions to ask
- Includes the expected answers
- Instructs the agent to grade responses using a function tool
- Prevents the agent from revealing answers
The Grading Function
Use a function tool to capture and process the evaluation results. Add this method to your SimpleEvaluationAgent class:
```python
from livekit.agents import function_tool, RunContext


class SimpleEvaluationAgent(Agent):
    # ... instructions and __init__ ...

    @function_tool()
    async def grade_answer(self, context: RunContext, result: str, question: str):
        """
        Give a `result` of `PASS` or `FAIL` for each `question`
        """
        self.session.say(f"The grade for the question {question} is {result}")
        return None, "I've graded the answer."
```
In a production environment, you'll want to make this function more sophisticated. Consider:
- Logging results to a database or file
- Using a separate LLM to evaluate nuanced responses
- Aggregating results across multiple test runs
- Generating detailed reports
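As a starting point for the first and third items, here is a minimal, stdlib-only sketch of a result log that appends each grade to a JSONL file and aggregates a pass rate. The `EvalResultLog` and `GradeRecord` names are hypothetical (not part of LiveKit); your `grade_answer` tool would call `record()` instead of only speaking the grade aloud:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class GradeRecord:
    """One graded question/answer pair. (Hypothetical structure, not part of LiveKit.)"""
    question: str
    result: str  # "PASS" or "FAIL"


class EvalResultLog:
    """Appends grades to a JSONL file and summarizes the run."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.records: list[GradeRecord] = []

    def record(self, question: str, result: str) -> None:
        rec = GradeRecord(question=question, result=result)
        self.records.append(rec)
        # Append-only JSONL keeps a durable trail across test runs.
        with self.path.open("a") as f:
            f.write(json.dumps(asdict(rec)) + "\n")

    def summary(self) -> dict:
        passed = sum(1 for r in self.records if r.result == "PASS")
        total = len(self.records)
        return {
            "passed": passed,
            "total": total,
            "pass_rate": passed / total if total else 0.0,
        }
```

Because each grade is written as its own JSON line, results from multiple runs can later be concatenated and re-aggregated into a report.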
Initializing the Agent to Interact with Other Agents
By default, agents only listen to end users. To allow your evaluation agent to interact with other agents, you need to update the RoomOptions when starting the agent session:
```python
from livekit import rtc
from livekit.agents import AgentSession, JobContext, RoomOptions


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession()
    await session.start(
        agent=SimpleEvaluationAgent(),
        room=ctx.room,
        room_options=RoomOptions(
            participant_kinds=[
                rtc.ParticipantKind.PARTICIPANT_KIND_AGENT,
            ]
        ),
    )
```
The participant_kinds parameter accepts a list of participant types that the agent should listen to. By including PARTICIPANT_KIND_AGENT, your evaluation agent can now receive audio from other agents in the room.
Putting It All Together
To test your agent:
- Start your primary agent — Deploy or run your agent that you want to test
- Connect the evaluation agent — Have the evaluation agent join the same room
- Run the evaluation — The evaluation agent asks questions and grades responses
- Review results — Check the grading function output for pass/fail results
Advanced Evaluation Patterns
Multi-Turn Conversations
For more complex scenarios, you can build evaluation agents that test multi-turn conversations:
```python
instructions="""
You are testing a customer support agent.

Follow this conversation flow:
1. Ask about return policy
2. Follow up with a specific product return question
3. Test the agent's ability to handle an edge case

After the complete interaction, call evaluate_conversation with:
- Overall score (1-10)
- Notes on what went well
- Notes on what could improve
"""
```
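The instructions above assume an `evaluate_conversation` tool exists on the agent. A minimal sketch of the logic such a tool might contain is below; the function name, fields, and pass threshold are all assumptions, and the LiveKit `@function_tool()` decorator is omitted so the validation logic stands on its own:

```python
def evaluate_conversation(score: int, went_well: str, could_improve: str) -> dict:
    """Validate and package a conversation-level evaluation.

    In the real agent this would be an async method wrapped with LiveKit's
    @function_tool() decorator; this plain function only shows the logic.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    return {
        "score": score,
        "went_well": went_well,
        "could_improve": could_improve,
        # The pass threshold of 7 is an assumption; tune it to your needs.
        "verdict": "PASS" if score >= 7 else "FAIL",
    }
```

Validating the score range inside the tool guards against the LLM passing an out-of-range value.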
Using a Separate LLM for Evaluation
For more reliable grading, use a separate LLM call to evaluate responses:
```python
@function_tool()
async def grade_answer(self, context: RunContext, result: str, question: str, agent_response: str):
    """
    Evaluate the agent's response using a separate LLM
    """
    evaluation_llm = openai.LLM(model="gpt-4o")

    evaluation_prompt = f"""
    Question: {question}
    Expected answer: {result}
    Agent's response: {agent_response}

    Evaluate if the agent's response is correct. Consider partial matches and semantic equivalence.
    Return JSON: {{"pass": true/false, "reasoning": "..."}}
    """

    # Use the LLM to evaluate
    # ... implementation details
```
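Whichever LLM you call, you still need to parse the JSON verdict it returns, and models often wrap JSON in markdown code fences. A small, stdlib-only helper for that step might look like the following (the `parse_verdict` name is hypothetical, and falling back to a failing grade on unparseable output is a design assumption):

```python
import json


def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse the evaluation LLM's JSON verdict, tolerating markdown code fences.

    Returns (passed, reasoning). Defaults to a failing grade when the
    output cannot be parsed, so malformed model output never counts as a pass.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop opening fences like ``` or ```json and the closing fence.
        lines = [ln for ln in text.splitlines() if not ln.startswith("```")]
        text = "\n".join(lines)
    try:
        data = json.loads(text)
        return bool(data.get("pass", False)), str(data.get("reasoning", ""))
    except (json.JSONDecodeError, AttributeError):
        return False, "unparseable evaluation output"
```

Failing closed here means a flaky evaluator surfaces as test failures you can inspect, rather than silently inflating the pass rate.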
Complete Example
For a complete working example of an evaluation agent, see evals_agent.py in the LiveKit examples repository.
Summary
| Component | Purpose |
|---|---|
| Evaluation Agent | Plays the role of a user, asks scripted questions |
| Function Tool | Captures and grades responses |
| RoomOptions | Enables agent-to-agent communication |
| PARTICIPANT_KIND_AGENT | Allows listening to other agents |
Additional Resources
- LiveKit Agents Testing Framework — Learn about unit testing agents
- Python Agents Examples — More agent examples and patterns