Output Evaluator
Overview
The OutputEvaluator is an LLM-based evaluator that assesses the quality of agent outputs against custom criteria. It uses a judge LLM to evaluate responses based on a user-defined rubric, making it ideal for evaluating subjective qualities like safety, relevance, accuracy, and completeness. A complete example can be found here.
Key Features
- Flexible Rubric System: Define custom evaluation criteria tailored to your use case
- LLM-as-a-Judge: Leverages a language model to perform nuanced evaluations
- Structured Output: Returns standardized evaluation results with scores and reasoning
- Async Support: Supports both synchronous and asynchronous evaluation
- Input Context: Optionally includes input prompts in the evaluation for context-aware scoring
When to Use
Use the OutputEvaluator when you need to:
- Evaluate subjective qualities of agent responses (e.g., helpfulness, safety, tone)
- Assess whether outputs meet specific business requirements
- Check for policy compliance or content guidelines
- Compare different agent configurations or prompts
- Evaluate responses where ground truth is not available or difficult to define
Parameters
rubric (required)
- Type: str
- Description: The evaluation criteria that define what constitutes a good response. Should include scoring guidelines (e.g., “Score 1 if…, 0.5 if…, 0 if…”).
model (optional)
- Type: Union[Model, str, None]
- Default: None (uses the default Bedrock model)
- Description: The model to use as the judge. Can be a model ID string or a Model instance.
system_prompt (optional)
- Type: str
- Default: Built-in template
- Description: Custom system prompt to guide the judge model’s behavior. If not provided, uses a default template optimized for evaluation.
include_inputs (optional)
- Type: bool
- Default: True
- Description: Whether to include the input prompt in the evaluation context. Set to False if you only want to evaluate the output in isolation.
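To illustrate how these parameters fit together, here is a minimal sketch that configures a judge with a custom rubric, an explicit judge model, and a custom system prompt. The model ID is a placeholder to replace with one available in your environment, and the rubric and system prompt text are examples rather than defaults.

```python
from strands_evals.evaluators import OutputEvaluator

# Minimal sketch: a judge configured for policy-compliance checks.
compliance_evaluator = OutputEvaluator(
    rubric=(
        "Score 1.0 if the response fully complies with the content policy. "
        "Score 0.5 if it is borderline or omits a required disclaimer. "
        "Score 0.0 if it violates the policy."
    ),
    model="your-bedrock-model-id",  # placeholder: any model ID string or Model instance
    system_prompt=(
        "You are a strict policy reviewer. Apply the rubric literally "
        "and explain the score you assign."
    ),
    include_inputs=False,  # judge the output in isolation, without the input prompt
)
```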
Basic Usage
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Define your task function
def get_response(case: Case) -> str:
    agent = Agent(
        system_prompt="You are a helpful assistant.",
        callback_handler=None
    )
    response = agent(case.input)
    return str(response)

# Create test cases
test_cases = [
    Case[str, str](
        name="greeting",
        input="Hello, how are you?",
        expected_output="A friendly greeting response",
        metadata={"category": "conversation"}
    ),
]

# Create evaluator with custom rubric
evaluator = OutputEvaluator(
    rubric="""
    Evaluate the response based on:
    1. Accuracy - Is the information correct?
    2. Completeness - Does it fully answer the question?
    3. Clarity - Is it easy to understand?

    Score 1.0 if all criteria are met excellently.
    Score 0.5 if some criteria are partially met.
    Score 0.0 if the response is inadequate.
    """,
    include_inputs=True
)

# Create and run experiment
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)
reports[0].run_display()
```

Evaluation Output
The OutputEvaluator returns EvaluationOutput objects with:
- score: Float between 0.0 and 1.0 representing the evaluation score
- test_pass: Boolean indicating if the test passed (based on score threshold)
- reason: String containing the judge’s reasoning for the score
- label: Optional label categorizing the result
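As a small illustration of consuming these fields, the helper below formats a single result; the attribute names follow the list above, while the import path is an assumption about where strands_evals exposes the type.

```python
from strands_evals.types import EvaluationOutput  # import path is an assumption

def summarize(output: EvaluationOutput) -> str:
    """Format one evaluation result using the fields listed above."""
    status = "PASS" if output.test_pass else "FAIL"
    label = f" [{output.label}]" if output.label else ""
    return f"{status}{label} score={output.score:.2f}: {output.reason}"
```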
Best Practices
- Write Clear, Specific Rubrics: Include explicit scoring criteria and examples
- Use Appropriate Judge Models: Consider using stronger models for complex evaluations
- Include Input Context When Relevant: Set include_inputs=True for context-dependent evaluation
- Validate Your Rubric: Test with known good and bad examples to confirm they receive the expected scores (see the sketch after this list)
- Combine with Other Evaluators: Use alongside trajectory and tool evaluators for comprehensive assessment
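A sketch of the rubric-validation practice, reusing the Case, Experiment, and evaluator objects from Basic Usage: run the judge against canned responses of known quality and confirm the scores land where you expect. The case names and canned answers here are illustrative.

```python
# Sketch: validate the rubric against responses of known quality.
validation_cases = [
    Case[str, str](
        name="known-good",
        input="What is 2 + 2?",
        expected_output="A correct arithmetic answer",
        metadata={"canned": "2 + 2 equals 4."},
    ),
    Case[str, str](
        name="known-bad",
        input="What is 2 + 2?",
        expected_output="A correct arithmetic answer",
        metadata={"canned": "Not sure, maybe 5?"},
    ),
]

def canned_response(case: Case) -> str:
    # Return the pre-written answer instead of calling a live agent.
    return case.metadata["canned"]

validation = Experiment[str, str](cases=validation_cases, evaluators=[evaluator])
validation_reports = validation.run_evaluations(canned_response)
validation_reports[0].run_display()  # expect "known-good" to score higher than "known-bad"
```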
Related Evaluators
- TrajectoryEvaluator: Evaluates the sequence of actions/tools used
- FaithfulnessEvaluator: Checks if responses are grounded in conversation history
- HelpfulnessEvaluator: Specifically evaluates helpfulness from user perspective
- GoalSuccessRateEvaluator: Evaluates if user goals were achieved