# Goal Success Rate Evaluator

## Overview

The `GoalSuccessRateEvaluator` evaluates whether all user goals were successfully achieved in a conversation. It provides a holistic assessment of whether the agent accomplished what the user set out to do, considering the entire conversation session.
## Key Features

- **Session-Level Evaluation**: Evaluates the entire conversation session
- **Goal-Oriented Assessment**: Focuses on whether user objectives were met
- **Binary Scoring**: Simple Yes/No evaluation for clear success/failure determination
- **Structured Reasoning**: Provides step-by-step reasoning for the evaluation
- **Async Support**: Supports both synchronous and asynchronous evaluation
- **Holistic View**: Considers all interactions in the session
## When to Use

Use the `GoalSuccessRateEvaluator` when you need to:

- Measure overall task completion success
- Evaluate whether user objectives were fully achieved
- Assess end-to-end conversation effectiveness
- Track success rates across different scenarios
- Identify patterns in successful vs. unsuccessful interactions
- Optimize agents for goal achievement
## Evaluation Level

This evaluator operates at the `SESSION_LEVEL`, meaning it evaluates the entire conversation session as a whole rather than individual turns or tool calls.
## Parameters

### `model` (optional)

- **Type**: `Union[Model, str, None]`
- **Default**: `None` (uses the default Bedrock model)
- **Description**: The model to use as the judge. Can be a model ID string or a `Model` instance.

### `system_prompt` (optional)

- **Type**: `str | None`
- **Default**: `None` (uses the built-in template)
- **Description**: Custom system prompt to guide the judge model's behavior.
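Putting the two parameters together, a minimal sketch of constructing the evaluator with and without overrides (the model ID and prompt text below are placeholders, not real values — substitute ones available in your environment):

```python
from strands_evals.evaluators import GoalSuccessRateEvaluator

# Default judge: built-in Bedrock model and built-in prompt template
default_evaluator = GoalSuccessRateEvaluator()

# Custom judge model (as a model ID string) and custom system prompt
custom_evaluator = GoalSuccessRateEvaluator(
    model="your-judge-model-id",  # placeholder model ID
    system_prompt=(
        "You are a strict judge. Answer Yes only if every user goal "
        "in the session was fully achieved."
    ),
)
```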
## Scoring System

The evaluator uses a binary scoring system:

- **Yes (1.0)**: All user goals were successfully achieved
- **No (0.0)**: User goals were not fully achieved

A session passes the evaluation only if the score is `1.0` (all goals achieved).
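The binary rule above can be modeled in a few lines of plain Python (a simplified sketch of the result shape, not the library's actual implementation):

```python
def goal_success_result(goals_achieved: bool) -> dict:
    """Model the binary scoring rule: all goals achieved -> 1.0, else 0.0."""
    score = 1.0 if goals_achieved else 0.0
    return {
        "label": "Yes" if goals_achieved else "No",
        "score": score,
        # A session passes only with a perfect score.
        "test_pass": score >= 1.0,
    }

goal_success_result(True)   # {'label': 'Yes', 'score': 1.0, 'test_pass': True}
goal_success_result(False)  # {'label': 'No', 'score': 0.0, 'test_pass': False}
```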
## Basic Usage

```python
from strands import Agent

from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

# Define task function
def user_task_function(case: Case) -> dict:
    memory_exporter.clear()

    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )
    agent_response = agent(case.input)

    # Map spans to session
    finished_spans = memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Create test cases
test_cases = [
    Case[str, str](
        name="math-1",
        input="What is 25 * 4?",
        metadata={"category": "math", "goal": "calculate_result"}
    ),
    Case[str, str](
        name="math-2",
        input="Calculate the square root of 144",
        metadata={"category": "math", "goal": "calculate_result"}
    ),
]

# Create evaluator
evaluator = GoalSuccessRateEvaluator()

# Run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
```

## Evaluation Output
The `GoalSuccessRateEvaluator` returns `EvaluationOutput` objects with:

- `score`: `1.0` (Yes) or `0.0` (No)
- `test_pass`: `True` if the score is `1.0`, `False` otherwise
- `reason`: Step-by-step reasoning explaining the evaluation
- `label`: `"Yes"` or `"No"`
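Given a batch of results shaped like the fields above, an overall success rate falls out directly (the `results` data here is hypothetical, not output captured from the library):

```python
# Hypothetical per-case results with the score field described above
results = [
    {"score": 1.0, "label": "Yes"},
    {"score": 0.0, "label": "No"},
    {"score": 1.0, "label": "Yes"},
]

# Because scores are binary, the mean score IS the goal success rate
success_rate = sum(r["score"] for r in results) / len(results)
print(f"Goal success rate: {success_rate:.0%}")  # Goal success rate: 67%
```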
## What Gets Evaluated

The evaluator examines:

- **Available Tools**: Tools that were available to the agent
- **Conversation Record**: Complete history of all messages and tool executions
- **User Goals**: Implicit or explicit goals from the user's queries
- **Final Outcome**: Whether the conversation achieved the user's objectives

The judge determines whether the agent successfully helped the user accomplish their goals by the end of the session.
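To make the judge's inputs concrete, here is a simplified sketch of how the pieces above might be assembled into a single prompt. The real prompt template is built into the library; this function is purely illustrative:

```python
def build_judge_input(tools: list[str], messages: list[tuple[str, str]]) -> str:
    """Assemble the judge's view of a session: the available tools plus
    the full conversation record, from which user goals and the final
    outcome are inferred."""
    lines = ["Available tools: " + ", ".join(tools), "", "Conversation:"]
    for role, text in messages:
        lines.append(f"{role}: {text}")
    lines.append("")
    lines.append("Did the agent achieve all of the user's goals? Answer Yes or No.")
    return "\n".join(lines)

prompt = build_judge_input(
    tools=["calculator"],
    messages=[("user", "What is 25 * 4?"), ("assistant", "25 * 4 = 100")],
)
```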
## Best Practices

- **Use with Proper Telemetry Setup**: The evaluator requires trajectory information captured via OpenTelemetry
- **Define Clear Goals**: Ensure test cases have clear, measurable objectives
- **Capture Complete Sessions**: Include all conversation turns in the trajectory
- **Test Various Complexity Levels**: Include both simple and complex goal scenarios
- **Combine with Other Evaluators**: Use alongside helpfulness and trajectory evaluators
## Common Patterns

### Pattern 1: Task Completion

Evaluate whether specific tasks were completed successfully.

### Pattern 2: Multi-Step Goals

Assess achievement of goals requiring multiple steps.

### Pattern 3: Information Retrieval

Determine whether users obtained the information they needed.
## Example Scenarios

### Scenario 1: Successful Goal Achievement

```
User: "I need to book a flight from NYC to LA for next Monday"
Agent: [Searches flights, shows options, books selected flight]
Final: "Your flight is booked! Confirmation number: ABC123"

Evaluation: Yes (1.0) - Goal fully achieved
```

### Scenario 2: Partial Achievement

```
User: "I need to book a flight from NYC to LA for next Monday"
Agent: [Searches flights, shows options]
Final: "Here are available flights. Would you like me to book one?"

Evaluation: No (0.0) - Goal not completed (booking not finalized)
```

### Scenario 3: Failed Goal

```
User: "I need to book a flight from NYC to LA for next Monday"
Agent: "I can help with general travel information."

Evaluation: No (0.0) - Goal not achieved
```

### Scenario 4: Complex Multi-Goal Success

```
User: "Find the cheapest flight to Paris, book it, and send confirmation to my email"
Agent: [Searches flights, compares prices, books cheapest option, sends email]
Final: "Booked the €450 flight and sent confirmation to your email"

Evaluation: Yes (1.0) - All goals achieved
```

## Common Issues and Solutions
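The four scenarios above reduce to a simple score table; a sketch of tallying them (scenario names and scores transcribed from the examples, the tallying logic is just illustrative):

```python
# Binary scores for the four scenarios: only fully achieved goals earn 1.0
scenarios = {
    "successful_booking": 1.0,   # booking completed, confirmation given
    "partial_booking": 0.0,      # options shown, but booking not finalized
    "failed_booking": 0.0,       # agent deflected the request entirely
    "multi_goal_success": 1.0,   # cheapest flight found, booked, email sent
}

passed = [name for name, score in scenarios.items() if score >= 1.0]
print(f"{len(passed)}/{len(scenarios)} sessions passed")  # 2/4 sessions passed
```

Note that the partial booking counts as a failure, matching the all-or-nothing scoring described earlier.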
### Issue 1: No Evaluation Returned

**Problem**: The evaluator returns empty results.
**Solution**: Ensure the trajectory contains a complete session with at least one agent invocation span.

### Issue 2: Ambiguous Goals

**Problem**: It is unclear what constitutes "success" for a given query.
**Solution**: Provide clearer test case descriptions or expected outcomes in metadata.

### Issue 3: Partial Success Scoring

**Problem**: The agent partially achieves goals, but the evaluator marks the session as a failure.
**Solution**: This is by design: the evaluator requires full goal achievement. Consider using the `HelpfulnessEvaluator` to assess partial success.
## Differences from Other Evaluators

- **vs. HelpfulnessEvaluator**: Goal success is binary (achieved/not achieved); helpfulness is graduated
- **vs. OutputEvaluator**: Goal success evaluates overall achievement; output evaluates response quality
- **vs. TrajectoryEvaluator**: Goal success evaluates the outcome; trajectory evaluates the path taken
## Use Cases

### Use Case 1: Customer Service

Evaluate whether customer issues were fully resolved.

### Use Case 2: Task Automation

Measure the success rate of automated task completion.

### Use Case 3: Information Retrieval

Assess whether users obtained all the information they needed.

### Use Case 4: Multi-Step Workflows

Evaluate completion of complex, multi-step processes.
## Related Evaluators

- **HelpfulnessEvaluator**: Evaluates the helpfulness of individual responses
- **TrajectoryEvaluator**: Evaluates the sequence of actions taken
- **OutputEvaluator**: Evaluates overall output quality against custom criteria
- **FaithfulnessEvaluator**: Evaluates whether responses are grounded in context