Built-in vs Custom Metrics
Snowglobe includes these built-in metrics:
- Limit subject area - Keeps conversations within defined topics
- Hallucination - Detects factually incorrect information
- Content safety - Identifies harmful or offensive content
- No financial advice - Prevents unauthorized financial recommendations
- Self-harm - Detects discussions of self-harm or suicidal thoughts
Custom metrics let you evaluate criteria specific to your application, such as:
- Domain-specific accuracy (medical facts, legal compliance, etc.)
- Brand voice and tone consistency
- Task completion success rates
- Custom safety or compliance requirements
- User satisfaction indicators
Choose Your Method
| | LM-Based Judge via Web UI | Code-Based Judge via Connector |
| --- | --- | --- |
| Best for | Most teams, quick setup, prompt-based evaluation | Advanced users with existing models or complex logic |
| Pros | No coding required, easy to iterate, powerful LM reasoning on complex tasks | Maximum flexibility, use existing models, rule-based logic |
| Cons | Limited to LM capabilities, may be slower, more expensive | Requires coding, more setup complexity |
Method 1: LM-Based Judge via Web UI
Perfect for most teams who want to define evaluation criteria through prompts.
Step by Step Walkthrough
- Navigate to Metrics → Create LLM-as-a-Judge Metric
- Name your metric and add a short human-readable description of what you want to evaluate.
- Enter a prompt that describes what you want to evaluate. You can also use the “High-Level Criteria” field to describe your criteria and Snowglobe will generate the evaluation prompt for you.
- Enter a scoring scale and, optionally, a list of tags to help organize your metric.
- Choose an LLM and parameters for the judge.
Writing Effective Evaluation Prompts
- Be specific about criteria.
- Include positive and negative examples: Show what excellent (5) and poor (1) responses look like.
- Focus on observable behaviors.
- Use a clear scoring scale: Include explicit scoring criteria.
- Add relevant tags: Use tags to organize your metrics by category (accuracy, safety, brand, etc.)
Examples
These are two examples of how to write an evaluation prompt for brand voice consistency.
Using High-Level Criteria
Snowglobe automatically generates an appropriate evaluation prompt from a high-level criteria description. You can then edit the prompt to be more specific.
Using Custom Evaluation Prompt
You can also write a custom evaluation prompt from scratch.
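For illustration, a custom brand-voice prompt might look like the sketch below. The tone descriptors, 1-5 scale, and wording are assumptions to adapt to your own brand, not a prompt Snowglobe generates.

```text
Evaluate whether the assistant's response matches our brand voice.

Our brand voice is friendly, concise, and jargon-free: helpful and warm,
but never slangy or overly casual.

Score the response from 1 to 5:
5 - Fully on-brand: friendly, concise, no jargon or slang
3 - Partially on-brand: generally appropriate but wordy, stiff, or slightly too casual
1 - Off-brand: cold, jargon-heavy, sarcastic, or uses slang

Return the score and a one-sentence justification.
```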
Method 2: Code-Based Judge via Connector
Designed for advanced users who need custom logic, existing models, or rule-based evaluation.
Setup Requirements
1. Install the Connector
2. Generate Your Snowglobe API Key
Visit snowglobe.so/app/keys to generate your API key.
3. Authenticate the Connector
Initialize Metric and Implement Template
The connector creates this template for you to implement.
What this command does
- Creates a template file (metric_connector.py) for your custom evaluation logic.
Implementation Reference
Your function receives a request containing the simulation scenario.
Return your metric's result as a structured object.
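Since the generated template itself isn't shown here, the following is only a rough sketch of what a rule-based implementation could look like. The function name, the MetricResult fields, and the request shape are illustrative assumptions, not Snowglobe's actual connector API.

```python
# Hypothetical sketch only -- the real metric_connector.py template generated by the
# connector defines the actual function signature and return type. All names below
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class MetricResult:
    score: float       # numeric score on whatever scale your metric defines, e.g. 1-5
    reasoning: str     # short explanation surfaced alongside the score

def evaluate_metric(request: dict) -> MetricResult:
    """Score one simulated conversation with custom, rule-based logic."""
    # Assume the request carries the conversation as a list of {"role", "content"} turns.
    messages = request.get("messages", [])
    assistant_text = " ".join(
        m["content"] for m in messages if m.get("role") == "assistant"
    ).lower()

    # Example rule: flag responses that mention unreleased product names.
    banned_terms = {"project-x", "internal-beta"}
    hits = [term for term in banned_terms if term in assistant_text]

    score = 1.0 if hits else 5.0
    reasoning = f"Found banned terms: {hits}" if hits else "No banned terms found"
    return MetricResult(score=score, reasoning=reasoning)
```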
Test Your Metric
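Before connecting, one lightweight way to sanity-check your logic is to call it locally with a mock request. This continues the hypothetical sketch above, so the module name, function, and request shape are assumptions.

```python
# Local sanity check for the hypothetical sketch above
# (assumes the sketch lives in metric_connector.py).
from metric_connector import evaluate_metric

mock_request = {
    "messages": [
        {"role": "user", "content": "Tell me about your roadmap."},
        {"role": "assistant", "content": "We can't discuss project-x yet."},
    ]
}

result = evaluate_metric(mock_request)
print(result.score, "-", result.reasoning)  # expected: 1.0 - Found banned terms: ['project-x']
```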
Connect Your Metric
What to Expect
Success Indicator
You should see a success indicator in the Snowglobe web UI.
Viewing Results
Each custom metric gets its own dashboard in the simulation results:
- Individual metric view: Detailed breakdown of scores per conversation
- Overview dashboard: Performance across all metrics (built-in + custom)
- Conversation-level scoring: See how each conversation performed on your custom criteria
Resources
- Join the Community: Connect with other developers, ask questions, and share feedback.
- See Examples: See real examples of custom metrics in action.
- Align Custom Metrics: Learn how to validate and improve your custom metric accuracy.
- Integrate with CI/CD: Learn how to integrate your custom metric with your CI/CD pipeline.
Questions? Join our developer community or contact support for help creating effective custom metrics.