Helpfulness Evaluator
Overview
The HelpfulnessEvaluator evaluates the helpfulness of agent responses from the user’s perspective. It assesses whether responses effectively address user needs, provide useful information, and contribute positively to achieving the user’s goals. A complete example can be found here.
Key Features
- Trace-Level Evaluation: Evaluates the most recent turn in the conversation
- User-Centric Assessment: Focuses on helpfulness from the user’s point of view
- Seven-Level Scoring: Detailed scale from “Not helpful at all” to “Above and beyond”
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
- Async Support: Supports both synchronous and asynchronous evaluation
- Context-Aware: Considers conversation history when evaluating helpfulness
When to Use
Use the HelpfulnessEvaluator when you need to:
- Assess user satisfaction with agent responses
- Evaluate if responses effectively address user queries
- Measure the practical value of agent outputs
- Compare helpfulness across different agent configurations
- Identify areas where agents could be more helpful
- Optimize agent behavior for user experience
Evaluation Level
This evaluator operates at the TRACE_LEVEL, meaning it evaluates the most recent turn in the conversation (the last agent response and its context).
Parameters
model (optional)
- Type: Union[Model, str, None]
- Default: None (uses the default Bedrock model)
- Description: The model to use as the judge. Can be a model ID string or a Model instance.
system_prompt (optional)
- Type: str | None
- Default: None (uses the built-in template)
- Description: Custom system prompt to guide the judge model’s behavior.
include_inputs (optional)
- Type: bool
- Default: True
- Description: Whether to include the input prompt in the evaluation context.
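As a quick illustration of these parameters, the sketch below constructs the evaluator with its defaults and then with a custom judge configuration. The model ID and system prompt are placeholders, not recommendations; substitute values appropriate for your environment.

```python
from strands_evals.evaluators import HelpfulnessEvaluator

# Defaults: default Bedrock judge model, built-in prompt template, inputs included.
evaluator = HelpfulnessEvaluator()

# Customized judge. The model ID and prompt text below are placeholders.
strict_evaluator = HelpfulnessEvaluator(
    model="<your-judge-model-id>",  # a model ID string or a Model instance
    system_prompt="You are a strict judge of how helpful a support assistant's replies are.",
    include_inputs=True,
)
```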
Scoring System
The evaluator uses a seven-level categorical scoring system:
- Not helpful at all (0.0): Response is completely unhelpful or counterproductive
- Very unhelpful (0.167): Response provides minimal or misleading value
- Somewhat unhelpful (0.333): Response has some issues that limit helpfulness
- Neutral/Mixed (0.5): Response is adequate but not particularly helpful
- Somewhat helpful (0.667): Response is useful and addresses the query
- Very helpful (0.833): Response is highly useful and well-crafted
- Above and beyond (1.0): Response exceeds expectations with exceptional value
A response passes the evaluation if the score is >= 0.5.
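For reference, the scale and pass threshold above can be expressed as a simple mapping. This is purely illustrative and not part of the strands_evals API.

```python
# Illustrative only: the documented labels, their numeric scores, and the pass rule.
HELPFULNESS_SCALE = {
    "Not helpful at all": 0.0,
    "Very unhelpful": 0.167,
    "Somewhat unhelpful": 0.333,
    "Neutral/Mixed": 0.5,
    "Somewhat helpful": 0.667,
    "Very helpful": 0.833,
    "Above and beyond": 1.0,
}

def passes(score: float) -> bool:
    """A response passes the evaluation when its score is at least 0.5."""
    return score >= 0.5
```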
Basic Usage
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

# Define task function
def user_task_function(case: Case) -> dict:
    memory_exporter.clear()

    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )
    agent_response = agent(case.input)

    # Map spans to session
    finished_spans = memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Create test cases
test_cases = [
    Case[str, str](
        name="knowledge-1",
        input="What is the capital of France?",
        metadata={"category": "knowledge"}
    ),
    Case[str, str](
        name="knowledge-2",
        input="What color is the ocean?",
        metadata={"category": "knowledge"}
    ),
]

# Create evaluator
evaluator = HelpfulnessEvaluator()

# Run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
```
Evaluation Output
The HelpfulnessEvaluator returns EvaluationOutput objects with:
- score: Float between 0.0 and 1.0 (0.0, 0.167, 0.333, 0.5, 0.667, 0.833, or 1.0)
- test_pass: True if score >= 0.5, False otherwise
- reason: Step-by-step reasoning explaining the evaluation
- label: One of the categorical labels (e.g., “Very helpful”, “Somewhat helpful”)
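A minimal sketch of working with these fields is shown below. The field names match the list above; how you retrieve EvaluationOutput objects from a report may differ by version.

```python
# Sketch: summarize a single EvaluationOutput using the documented fields
# (score, test_pass, reason, label).
def summarize(output) -> str:
    status = "PASS" if output.test_pass else "FAIL"
    return f"[{status}] {output.label} ({output.score:.3f}): {output.reason}"
```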
What Gets Evaluated
The evaluator examines:
- Previous Turns: Earlier conversation context (if available)
- Target Turn: The user’s query and the agent’s response
- Helpfulness Factors:
- Relevance to the user’s query
- Completeness of the answer
- Clarity and understandability
- Actionability of the information
- Tone and professionalism
The judge determines how helpful the response is from the user’s perspective.
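To give the judge earlier turns to consider, the task function can drive several exchanges before the target turn. The sketch below reuses the telemetry setup, Agent, and mapper from Basic Usage and assumes a hypothetical metadata["history"] key that holds prior user messages; adapt it to however your cases carry conversation history.

```python
def multi_turn_task_function(case: Case) -> dict:
    memory_exporter.clear()

    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id,
        },
        callback_handler=None,
    )

    # Replay earlier turns so they appear in the trajectory as conversation context.
    for prior_message in case.metadata.get("history", []):  # hypothetical metadata key
        agent(prior_message)

    # Target turn: the response the evaluator will score.
    agent_response = agent(case.input)

    finished_spans = memory_exporter.get_finished_spans()
    session = StrandsInMemorySessionMapper().map_to_session(
        finished_spans, session_id=case.session_id
    )
    return {"output": str(agent_response), "trajectory": session}
```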
Best Practices
- Use with Proper Telemetry Setup: The evaluator requires trajectory information captured via OpenTelemetry
- Provide User Context: Include conversation history for context-aware evaluation
- Test Diverse Scenarios: Include various query types and complexity levels
- Consider Domain-Specific Needs: Adjust expectations based on your use case
- Combine with Other Evaluators: Use alongside accuracy and faithfulness evaluators (see the sketch below)
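For the last point, one way to pair helpfulness with faithfulness is sketched below. It reuses test_cases and user_task_function from Basic Usage and assumes FaithfulnessEvaluator is importable from strands_evals.evaluators; check your version for the exact import path.

```python
from strands_evals import Experiment
from strands_evals.evaluators import FaithfulnessEvaluator, HelpfulnessEvaluator

# Score each case for both user-perceived helpfulness and factual grounding.
experiment = Experiment[str, str](
    cases=test_cases,
    evaluators=[HelpfulnessEvaluator(), FaithfulnessEvaluator()],
)
reports = experiment.run_evaluations(user_task_function)
```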
Common Patterns
Pattern 1: Customer Service Evaluation
Assess helpfulness of customer support responses.
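For example, support-style test cases might look like the following sketch; the case names, inputs, and metadata are arbitrary illustrations built on the Case usage from Basic Usage.

```python
from strands_evals import Case

support_cases = [
    Case[str, str](
        name="billing-double-charge",
        input="I was charged twice for my subscription this month. How do I get a refund?",
        metadata={"category": "customer-service"},
    ),
    Case[str, str](
        name="shipping-delay",
        input="My order hasn't arrived after two weeks. What are my options?",
        metadata={"category": "customer-service"},
    ),
]
```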
Pattern 2: Information Retrieval
Evaluate if search or query responses meet user needs.
Pattern 3: Task Assistance
Measure how well agents help users complete tasks.
Example Scenarios
Scenario 1: Very Helpful Response
```
User: "How do I reset my password?"
Agent: "To reset your password: 1) Click 'Forgot Password' on the login page, 2) Enter your email, 3) Check your inbox for a reset link, 4) Follow the link and create a new password. The link expires in 24 hours."
Evaluation: Very helpful (0.833) - Clear, complete, actionable steps
```
Scenario 2: Somewhat Helpful Response
```
User: "How do I reset my password?"
Agent: "You can reset your password through the login page."
Evaluation: Somewhat helpful (0.667) - Correct but lacks detail
```
Scenario 3: Not Helpful Response
```
User: "How do I reset my password?"
Agent: "Passwords are important for security."
Evaluation: Not helpful at all (0.0) - Doesn't address the question
```
Scenario 4: Above and Beyond Response
```
User: "How do I reset my password?"
Agent: "I'll help you reset your password. Here are the steps: 1) Click 'Forgot Password' on the login page, 2) Enter your email, 3) Check your inbox for a reset link (check spam if not found), 4) Follow the link and create a new password. Tips: Use a strong password with 12+ characters, mix of letters/numbers/symbols. If you don't receive the email within 5 minutes, let me know and I can help troubleshoot."
Evaluation: Above and beyond (1.0) - Comprehensive, proactive, anticipates issues
```
Common Issues and Solutions
Issue 1: No Evaluation Returned
Problem: Evaluator returns empty results.
Solution: Ensure the trajectory contains at least one agent invocation span.
Issue 2: Inconsistent Scoring
Problem: Similar responses get different scores.
Solution: This is expected due to LLM non-determinism. Run multiple evaluations and aggregate the scores, as shown below.
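One way to aggregate is to repeat the experiment and average the score per case. The sketch below is illustrative only; the report and result attribute names (results, case_name, score) are assumptions rather than confirmed API, so adapt them to your version of strands_evals.

```python
import statistics

NUM_RUNS = 3
scores_per_case: dict[str, list[float]] = {}

for _ in range(NUM_RUNS):
    reports = experiment.run_evaluations(user_task_function)
    for result in reports[0].results:  # assumed attribute name
        scores_per_case.setdefault(result.case_name, []).append(result.score)  # assumed attributes

for case_name, scores in scores_per_case.items():
    print(f"{case_name}: mean helpfulness = {statistics.mean(scores):.3f}")
```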
Issue 3: Context Not Considered
Problem: Evaluation doesn’t account for conversation history.
Solution: Verify that telemetry captures the full conversation and that include_inputs=True.
Differences from Other Evaluators
- vs. FaithfulnessEvaluator: Helpfulness focuses on user value, faithfulness on factual grounding
- vs. OutputEvaluator: Helpfulness is user-centric, output evaluator uses custom rubrics
- vs. GoalSuccessRateEvaluator: Helpfulness evaluates individual turns, goal success evaluates overall achievement
Related Evaluators
- FaithfulnessEvaluator: Evaluates if responses are grounded in context
- OutputEvaluator: Evaluates overall output quality with custom criteria
- GoalSuccessRateEvaluator: Evaluates if overall user goals were achieved
- TrajectoryEvaluator: Evaluates the sequence of actions taken