Custom metrics let you evaluate your chatbot based on criteria specific to your use case. While Botster provides built-in metrics, custom metrics ensure you’re measuring what matters most.

Built-in vs Custom Metrics

Built-in Metrics

Botster includes these out of the box:
  • Topic adherence — Keeps conversations within defined subjects
  • Hallucination — Detects factually incorrect information
  • Content safety — Identifies harmful or offensive content
  • Financial advice — Prevents unauthorized financial recommendations
  • Self-harm detection — Flags discussions of self-harm or suicidal thoughts

When to Create Custom Metrics

Create custom metrics when you need to measure:
  • Domain-specific accuracy (medical facts, legal compliance, etc.)
  • Brand voice and tone consistency
  • Task completion success rates
  • Custom safety or compliance requirements
  • User satisfaction indicators

Creating Custom Metrics

Custom metrics use an LLM-based judge to evaluate conversations based on your criteria.

Setup Steps

  1. Navigate to Metrics → Create Metric
  2. Name your metric and add a description
  3. Enter an evaluation prompt (or use “High-Level Criteria” and let Botster generate it)
  4. Define a scoring scale (e.g., 1-5)
  5. Optionally add tags to organize your metrics
  6. Choose an LLM and parameters for the judge
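
Under the hood, the judge is just an LLM call configured with your metric's settings. The sketch below is illustrative only, not the Botster API: `build_judge_request` and the model name are hypothetical, but it shows how the name, description, evaluation prompt, scale, and judge parameters from the steps above fit together.

```python
# Hypothetical sketch of the request a custom metric sends to its LLM judge.
# Function and parameter names here are illustrative, not Botster's API.

def build_judge_request(name, description, eval_prompt, scale=(1, 5),
                        model="gpt-4o", temperature=0.0):
    """Bundle metric settings into a single judge request payload."""
    lo, hi = scale
    system = (
        f"You are an evaluation judge for the metric '{name}'.\n"
        f"Description: {description}\n"
        f"Score each conversation on a {lo}-{hi} scale.\n"
        f"Respond with 'Score: <n>' followed by a one-sentence rationale."
    )
    return {
        "model": model,
        "temperature": temperature,  # low temperature keeps scoring consistent
        "system": system,
        "instructions": eval_prompt,
    }

request = build_judge_request(
    "Response Accuracy",
    "Checks factual correctness of chatbot answers.",
    "Evaluate whether the chatbot's response contains accurate information.",
)
```

A low temperature is a common choice for judges: scoring should be repeatable, not creative.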

Writing Effective Evaluation Prompts

Be Specific About Criteria

Good:
“Rate how well the chatbot provides accurate medical information without giving diagnoses. Look for: factual accuracy, appropriate disclaimers, referrals to professionals when needed.”
Bad:
“Rate how good the medical advice is.”

Include Examples

“Excellent (5): Chatbot provides accurate information, includes disclaimers, suggests consulting a doctor. Poor (1): Gives specific diagnoses or contradicts medical consensus.”

Focus on Observable Behaviors

Good:
“Rate based on: specific facts mentioned, sources cited, confidence level expressed”
Bad:
“Rate how knowledgeable the chatbot seems”

Use Clear Scoring Scales

Good:
“Score 1-5 based on factual accuracy. 1 = Completely incorrect, 5 = Completely correct”
Bad:
“Rate the quality”
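
The guidelines above combine mechanically: specific criteria, score anchors, and a clear scale each become one section of the final prompt. A minimal sketch (the `compose_eval_prompt` helper is hypothetical, not a Botster feature):

```python
def compose_eval_prompt(criteria, examples, scale_note):
    """Join specific criteria, score anchors, and a scoring scale
    into a single evaluation prompt."""
    lines = ["Evaluate the conversation against these criteria:"]
    lines += [f"- {c}" for c in criteria]
    lines.append("")
    lines.append("Score anchors:")
    lines += [f"- {label}: {desc}" for label, desc in examples.items()]
    lines.append("")
    lines.append(scale_note)
    return "\n".join(lines)

# Built from the medical-accuracy examples above
prompt = compose_eval_prompt(
    criteria=["factual accuracy", "appropriate disclaimers",
              "referrals to professionals when needed"],
    examples={
        "Excellent (5)": "accurate information, includes disclaimers, "
                         "suggests consulting a doctor",
        "Poor (1)": "gives specific diagnoses or contradicts medical consensus",
    },
    scale_note="Score 1-5. 1 = completely incorrect, 5 = completely correct.",
)
```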

Example Metrics

Brand Voice Consistency

Using High-Level Criteria:
Criteria: Does the chatbot maintain our brand voice?
- Professional but approachable tone
- Avoids jargon and technical terms
- Uses "we" when referring to the company
- Empathetic when users express frustration

Response Accuracy

Using Custom Evaluation Prompt:
Evaluate whether the chatbot's response contains accurate information.

Scoring:
5 - All information is accurate and complete
4 - Mostly accurate with minor omissions
3 - Contains some inaccuracies that don't significantly mislead
2 - Contains notable inaccuracies or misleading information
1 - Response is largely inaccurate or harmful

Consider:
- Factual correctness of claims
- Completeness of the answer
- Appropriate uncertainty when information is unclear
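
A rubric like this also defines what counts as a valid judge verdict. As a sketch, assuming the judge replies in a "Score: <n>" format (an assumption for illustration, not Botster's actual output format), you could validate its answer against the rubric like so:

```python
import re

# The Response Accuracy rubric above, as data
ACCURACY_RUBRIC = {
    5: "All information is accurate and complete",
    4: "Mostly accurate with minor omissions",
    3: "Contains some inaccuracies that don't significantly mislead",
    2: "Contains notable inaccuracies or misleading information",
    1: "Response is largely inaccurate or harmful",
}

def parse_score(judge_output, rubric=ACCURACY_RUBRIC):
    """Extract the first 'Score: <n>' integer and check it is in the rubric."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if not match:
        raise ValueError("Judge output contains no 'Score: <n>' line")
    score = int(match.group(1))
    if score not in rubric:
        raise ValueError(f"Score {score} is outside the rubric range")
    return score

score = parse_score("Score: 4 - mostly accurate, one minor omission")  # → 4
```

Rejecting out-of-range or missing scores, rather than guessing, keeps a malformed judge reply from silently skewing your results.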

Task Completion

Evaluate whether the user's goal was achieved.

Scoring:
5 - Goal fully achieved, user explicitly confirmed satisfaction
4 - Goal achieved with minor issues
3 - Goal partially achieved or unclear outcome
2 - Goal not achieved but user received helpful direction
1 - Goal not achieved, user left frustrated

Look for:
- Explicit confirmation from the user
- Successful completion of requested action
- Clear next steps provided when needed

Tone and Empathy

Evaluate the chatbot's tone when users express frustration.

Scoring:
5 - Highly empathetic, acknowledges feelings, offers solutions
4 - Empathetic response with appropriate tone
3 - Neutral tone, neither empathetic nor dismissive
2 - Slightly dismissive or overly formal
1 - Dismissive, robotic, or makes frustration worse

Viewing Results

Each custom metric gets its own dashboard in simulation results:
  • Individual metric view — Detailed breakdown of scores per conversation
  • Overview dashboard — Performance across all metrics (built-in + custom)
  • Conversation-level scoring — How each individual conversation performed against your criteria
  • Distribution charts — Score patterns across all simulated conversations
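
The distribution chart boils down to a frequency count of per-conversation scores. A minimal sketch of that aggregation, assuming a 1-5 scale and a plain list of scores from one simulation run:

```python
from collections import Counter

def score_distribution(scores, lo=1, hi=5):
    """Count how many conversations landed on each score
    (the data behind a distribution bar chart)."""
    counts = Counter(scores)
    return {s: counts.get(s, 0) for s in range(lo, hi + 1)}

# e.g. one score per simulated conversation
dist = score_distribution([5, 4, 4, 3, 5, 2, 4])
# dist maps each score 1-5 to its frequency: {1: 0, 2: 1, 3: 1, 4: 3, 5: 2}
```

Including zero-count buckets keeps the chart's x-axis stable across runs, which makes run-to-run comparisons easier.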

Best Practices

  • Start with 2-3 metrics — Focus on your core goals first, then add more as needed
  • Be specific — Vague criteria lead to inconsistent evaluations
  • Use examples — Show what good and bad scores look like
  • Iterate — Review results and refine your prompts based on what you observe
  • Align with business goals — Measure what actually matters for your chatbot’s success

Next Steps