Custom metrics let you evaluate your chatbot based on criteria specific to your use case. While Snowglobe provides built-in metrics, custom metrics ensure you’re measuring what matters most for your chatbot’s success.

Built-in vs Custom Metrics

Snowglobe includes these built-in metrics:
  • Limit subject area - Keeps conversations within defined topics
  • Hallucination - Detects factually incorrect information
  • Content safety - Identifies harmful or offensive content
  • No financial advice - Prevents unauthorized financial recommendations
  • Self-harm - Detects discussions of self-harm or suicidal thoughts
Create custom metrics when you need to measure:
  • Domain-specific accuracy (medical facts, legal compliance, etc.)
  • Brand voice and tone consistency
  • Task completion success rates
  • Custom safety or compliance requirements
  • User satisfaction indicators

Choose Your Method

LM-Based Judge via Web UICode-Based Judge via Connector
Best forMost teams, quick setup, prompt-based evaluationAdvanced users with existing models or complex logic
ProsNo coding required, easy to iterate, powerful LM reasoning on complex tasksMaximum flexibility, use existing models, rule-based logic
ConsLimited to LM capabilities, may be slower, more expensiveRequires coding, more setup complexity

Method 1: LM-Based Judge via Web UI

Perfect for most teams who want to define evaluation criteria through prompts.

Step by Step Walkthrough

  1. Navigate to MetricsCreate LLM-as-a-Judge Metric
  2. Name your metric and add a short human-readable description of what you want to evaluate.
  3. Enter a prompt that describes what you want to evaluate. You can also use the “High-Level Criteria” field to describe your criteria and Snowglobe will generate the evaluation prompt for you.
  4. Enter a scoring scale and optionally, a list of tags to help organize your metric.
  5. Choose an LLM and parameters for the judge.

Writing Effective Evaluation Prompts

  • Be specific about criteria:
    Good:
    Rate how well the chatbot provides accurate medical information without giving diagnoses. Look for: factual accuracy, appropriate disclaimers, referrals to professionals when needed.
    
    Bad:
    Rate how good the medical advice is.
    
  • Include positive and negative examples: Show what excellent (5) and poor (1) responses look like.
    Example: Excellent (5): Chatbot provides accurate information about symptoms, includes disclaimer about not replacing professional medical advice, suggests consulting a doctor.
    
  • Focus on observable behaviors:
    Good:
    Rate based on: specific facts mentioned, sources cited, confidence level expressed
    
    Bad:
    Rate how knowledgeable the chatbot seems
    
  • Use clear scoring scale: Include explicit scoring criteria.
    Good:
    Rate based on factual accuracy. (1) Completely incorrect, (5) Completely correct
    
    Bad:
    Rate how good the medical advice is.
    
  • Add relevant tags: Use tags to organize your metrics by category (accuracy, safety, brand, etc.)

Examples

These are two examples of how to write an evaluation prompt for evaluating brand voice consistency.

Method 2: Code-Based Judge via Connector

Designed for advanced users who need custom logic, existing models, or rule-based evaluation.

Setup Requirements

1

Install the Connector

uv pip install snowglobeor
2

Generate Your Snowglobe API Key

Visit snowglobe.so/app/keys to generate your API key.
3

Authenticate the Connector

snowglobeor auth
When prompted, enter your API key.

Initialize Metric and Implement Template

The connector creates this template for you to implement:
snowglobeor init-metric

What this command does

  • Creates a template file (metric_connector.py) for your custom evaluation logic.

What to expect

Template File
from snowglobe_connector import ConversationEvaluation, MetricScore

def evaluate_conversation(conversation: ConversationEvaluation) -> MetricScore:
    """
    Evaluate a conversation based on your custom criteria.
    
    Args:
        conversation: The full conversation with messages and metadata
        
    Returns:
        MetricScore with 1-5 rating and optional explanation
    """
    
    # TODO: Implement your evaluation logic here
    # Access conversation.messages for the dialogue
    # Each message has: role, content, timestamp
    
    # Example stub - replace with your logic:
    score = calculate_your_metric(conversation)
    explanation = "Brief explanation of the score"
    
    return MetricScore(
        score=score,  # 1-5 integer
        explanation=explanation  # Optional reasoning
    )

def calculate_your_metric(conversation):
    # Your custom logic here
    return 3  # Placeholder

Implementation Reference

Input Format
RiskEvaluationRequest
Your function receives a request containing the simulation scenario.
Output Format
RiskEvaluationOutputs
required
Return your chatbot’s response as a structured object.

Test Your Metric

# Test your metric logic
snowglobeor test-metric

Connect Your Metric

snowglobeor connect-metric

What to Expect

$ Success!

Success Indicator

You should see a success indicator in the Snowglobe web UI.

Viewing Results

Each custom metric gets its own dashboard in the simulation results:
  • Individual metric view: Detailed breakdown of scores per conversation
  • Overview dashboard: Performance across all metrics (built-in + custom)
  • Conversation-level scoring: See how each conversation performed on your custom criteria

Resources


Questions? Join our developer community or contact support for help creating effective custom metrics.