Custom metrics let you evaluate your chatbot based on criteria specific to your use case. While Snowglobe provides built-in metrics, custom metrics ensure you’re measuring what matters most for your chatbot’s success.

Built-in vs Custom Metrics

Snowglobe includes these built-in metrics:
  • Limit subject area - Keeps conversations within defined topics
  • Hallucination - Detects factually incorrect information
  • Content safety - Identifies harmful or offensive content
  • No financial advice - Prevents unauthorized financial recommendations
  • Self-harm - Detects discussions of self-harm or suicidal thoughts
Create custom metrics when you need to measure:
  • Domain-specific accuracy (medical facts, legal compliance, etc.)
  • Brand voice and tone consistency
  • Task completion success rates
  • Custom safety or compliance requirements
  • User satisfaction indicators

Choose Your Method

LM-Based Judge via Web UI
  • Best for: Most teams, quick setup, prompt-based evaluation
  • Pros: No coding required, easy to iterate, powerful LM reasoning on complex tasks
  • Cons: Limited to LM capabilities; may be slower and more expensive

Code-Based Judge via Connector
  • Best for: Advanced users with existing models or complex logic
  • Pros: Maximum flexibility, use existing models, rule-based logic
  • Cons: Requires coding, more setup complexity

Method 1: LM-Based Judge via Web UI

Perfect for most teams who want to define evaluation criteria through prompts.

Step by Step Walkthrough

  1. Navigate to Metrics → Create LLM-as-a-Judge Metric.
  2. Name your metric and add a short, human-readable description of what it evaluates.
  3. Enter a prompt that describes what you want to evaluate. Alternatively, use the “High-Level Criteria” field to describe your criteria, and Snowglobe will generate the evaluation prompt for you.
  4. Enter a scoring scale and, optionally, a list of tags to help organize your metric.
  5. Choose an LLM and parameters for the judge.

Writing Effective Evaluation Prompts

  • Be specific about criteria:
    Good:
    Rate how well the chatbot provides accurate medical information without giving diagnoses. Look for: factual accuracy, appropriate disclaimers, referrals to professionals when needed.
    
    Bad:
    Rate how good the medical advice is.
    
  • Include positive and negative examples: Show what excellent (5) and poor (1) responses look like.
    Example: Excellent (5): Chatbot provides accurate information about symptoms, includes disclaimer about not replacing professional medical advice, suggests consulting a doctor.
    
  • Focus on observable behaviors:
    Good:
    Rate based on: specific facts mentioned, sources cited, confidence level expressed
    
    Bad:
    Rate how knowledgeable the chatbot seems
    
  • Use a clear scoring scale: Include explicit scoring criteria.
    Good:
    Rate based on factual accuracy. (1) Completely incorrect, (5) Completely correct
    
    Bad:
    Rate how good the medical advice is.
    
  • Add relevant tags: Use tags to organize your metrics by category (accuracy, safety, brand, etc.)

Examples

These are two examples of how to write an evaluation prompt for evaluating brand voice consistency.

Snowglobe can automatically generate an appropriate evaluation prompt from a high-level criteria description such as:

Evaluate if the chatbot maintains our friendly, professional brand voice while being helpful but not overly casual.

You can then edit the generated prompt to be more specific. Alternatively, you can write a custom evaluation prompt from scratch:
Evaluate how well this conversation maintains the brand voice on a scale of 1-5.

Our brand voice should be:
- Friendly but professional
- Helpful and solution-oriented  
- Confident but not arrogant
- Uses "we" and "our" when referring to the company

Scoring:
5 (Excellent): Perfectly embodies brand voice throughout
4 (Good): Mostly on-brand with 1-2 minor deviations
3 (Fair): Generally on-brand but some inconsistencies
2 (Poor): Off-brand in several important ways
1 (Very Poor): Completely inconsistent with brand voice

Examples of excellent responses:
- "We're happy to help you troubleshoot this issue..."
- "Our team designed this feature to make your workflow easier..."

Examples of poor responses:  
- "I dunno, maybe try this..."
- "That's not my problem to solve..."

Method 2: Code-Based Judge via Connector

Designed for advanced users who need custom logic, existing models, or rule-based evaluation.

Setup Requirements

1. Install the Connector

   uv pip install snowglobe

2. Generate Your Snowglobe API Key

   Visit snowglobe.so/app/keys to generate your API key.

3. Authenticate the Connector

   snowglobe-connect auth

   When prompted, enter your API key.

Initialize Metric and Implement Template

Run the init command to generate a template file (metric_connector.py) for your custom evaluation logic:

snowglobe-connect init-metric

The generated template looks like this:
from snowglobe_connector import ConversationEvaluation, MetricScore

def evaluate_conversation(conversation: ConversationEvaluation) -> MetricScore:
    """
    Evaluate a conversation based on your custom criteria.
    
    Args:
        conversation: The full conversation with messages and metadata
        
    Returns:
        MetricScore with 1-5 rating and optional explanation
    """
    
    # TODO: Implement your evaluation logic here
    # Access conversation.messages for the dialogue
    # Each message has: role, content, timestamp
    
    # Example stub - replace with your logic:
    score = calculate_your_metric(conversation)
    explanation = "Brief explanation of the score"
    
    return MetricScore(
        score=score,  # 1-5 integer
        explanation=explanation  # Optional reasoning
    )

def calculate_your_metric(conversation):
    # Your custom logic here
    return 3  # Placeholder
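
For instance, a simple rule-based implementation of this template might look like the sketch below. It is a minimal example, not Snowglobe's official implementation: it assumes the ConversationEvaluation and MetricScore objects behave as shown in the template above (messages with role and content attributes), that the chatbot's turns use the role "assistant", and the phrase lists are illustrative placeholders you would replace with your own criteria.

from snowglobe_connector import ConversationEvaluation, MetricScore

# Illustrative phrase lists -- replace with your own brand-voice criteria.
OFF_BRAND_PHRASES = ["dunno", "not my problem", "whatever"]
ON_BRAND_PHRASES = ["we're happy to help", "our team", "we can"]

def evaluate_conversation(conversation: ConversationEvaluation) -> MetricScore:
    """Rule-based brand-voice check over the chatbot's messages."""
    # Collect everything the chatbot said, lowercased for matching
    # (assumes chatbot turns are marked with role == "assistant").
    assistant_text = " ".join(
        message.content.lower()
        for message in conversation.messages
        if message.role == "assistant"
    )

    on_brand_hits = sum(phrase in assistant_text for phrase in ON_BRAND_PHRASES)
    off_brand_hits = sum(phrase in assistant_text for phrase in OFF_BRAND_PHRASES)

    # Start from a neutral 3, reward on-brand phrasing, penalize off-brand
    # phrasing, and clamp to the expected 1-5 range.
    score = max(1, min(5, 3 + on_brand_hits - 2 * off_brand_hits))

    return MetricScore(
        score=score,
        explanation=(
            f"Found {on_brand_hits} on-brand and {off_brand_hits} off-brand "
            "phrase(s) in the chatbot's messages."
        ),
    )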

Implementation Reference

Input Format: RiskEvaluationRequest
Your function receives a request containing the simulation scenario.

Output Format: RiskEvaluationOutputs (required)
Return your metric's evaluation as a structured object.

Test Your Metric

# Test your metric logic
snowglobe-connect test-metric

Connect Your Metric

snowglobe-connect connect-metric

What to Expect

You should see a success message in the terminal and a success indicator in the Snowglobe web UI.

Viewing Results

Each custom metric gets its own dashboard in the simulation results:
  • Individual metric view: Detailed breakdown of scores per conversation
  • Overview dashboard: Performance across all metrics (built-in + custom)
  • Conversation-level scoring: See how each conversation performed on your custom criteria

Questions? Join our developer community or contact support for help creating effective custom metrics.