Built-in vs Custom Metrics
Snowglobe includes these built-in metrics:
- Limit subject area - Keeps conversations within defined topics
- Hallucination - Detects factually incorrect information
- Content safety - Identifies harmful or offensive content
- No financial advice - Prevents unauthorized financial recommendations
- Self-harm - Detects discussions of self-harm or suicidal thoughts
Custom metrics let you evaluate criteria specific to your application, such as:
- Domain-specific accuracy (medical facts, legal compliance, etc.)
- Brand voice and tone consistency
- Task completion success rates
- Custom safety or compliance requirements
- User satisfaction indicators
Choose Your Method
| | LM-Based Judge via Web UI | Code-Based Judge via Connector |
| --- | --- | --- |
| Best for | Most teams, quick setup, prompt-based evaluation | Advanced users with existing models or complex logic |
| Pros | No coding required, easy to iterate, powerful LM reasoning on complex tasks | Maximum flexibility, use existing models, rule-based logic |
| Cons | Limited to LM capabilities, may be slower, more expensive | Requires coding, more setup complexity |
Method 1: LM-Based Judge via Web UI
Perfect for most teams who want to define evaluation criteria through prompts.
Step by Step Walkthrough
- Navigate to Metrics → Create LLM-as-a-Judge Metric
- Name your metric and add a short human-readable description of what you want to evaluate.
- Enter a prompt that describes what you want to evaluate. You can also use the “High-Level Criteria” field to describe your criteria and Snowglobe will generate the evaluation prompt for you.
- Enter a scoring scale and, optionally, a list of tags to help organize your metric.
- Choose an LLM and parameters for the judge.
Writing Effective Evaluation Prompts
- Be specific about criteria.
- Include positive and negative examples: Show what excellent (5) and poor (1) responses look like.
- Focus on observable behaviors.
- Use a clear scoring scale: Include explicit scoring criteria.
- Add relevant tags: Use tags to organize your metrics by category (accuracy, safety, brand, etc.)
Examples
These are two examples of how to write an evaluation prompt for brand voice consistency.
Using High-Level Criteria
Snowglobe automatically generates an appropriate evaluation prompt from a high-level criteria description. You can then edit the prompt to be more specific.
Using Custom Evaluation Prompt
You can also write a custom evaluation prompt from scratch.
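For illustration, a custom brand-voice prompt might look like the sketch below. The tone descriptors, 1-5 scale, and wording are assumptions to adapt to your own brand, not a prompt Snowglobe generates.

```text
Evaluate whether the assistant's response matches our brand voice.

Our brand voice is friendly, concise, and jargon-free: helpful and warm,
but never slangy or overly casual.

Score the response from 1 to 5:
5 - Fully on-brand: friendly, concise, no jargon or slang
3 - Partially on-brand: generally appropriate but wordy, stiff, or slightly too casual
1 - Off-brand: cold, jargon-heavy, sarcastic, or uses slang

Return the score and a one-sentence justification.
```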
Method 2: Code-Based Judge via Connector
Designed for advanced users who need custom logic, existing models, or rule-based evaluation.
Setup Requirements
1. Install the Connector
2. Generate Your Snowglobe API Key
Visit snowglobe.so/app/keys to generate your API key.
3. Authenticate the Connector
Initialize Metric and Implement Template
The connector creates this template for you to implement.
What this command does
- Creates a template file (metric_connector.py) for your custom evaluation logic.
Implementation Reference
Your function receives a request containing the simulation scenario.
Return your metric's result as a structured object.
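Since the generated template itself isn't shown here, the following is only a rough sketch of what a rule-based implementation could look like. The function name, the MetricResult fields, and the request shape are illustrative assumptions, not Snowglobe's actual connector API.

```python
# Hypothetical sketch only -- the real metric_connector.py template generated by the
# connector defines the actual function signature and return type. All names below
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class MetricResult:
    score: float       # numeric score on whatever scale your metric defines, e.g. 1-5
    reasoning: str     # short explanation surfaced alongside the score

def evaluate_metric(request: dict) -> MetricResult:
    """Score one simulated conversation with custom, rule-based logic."""
    # Assume the request carries the conversation as a list of {"role", "content"} turns.
    messages = request.get("messages", [])
    assistant_text = " ".join(
        m["content"] for m in messages if m.get("role") == "assistant"
    ).lower()

    # Example rule: flag responses that mention unreleased product names.
    banned_terms = {"project-x", "internal-beta"}
    hits = [term for term in banned_terms if term in assistant_text]

    score = 1.0 if hits else 5.0
    reasoning = f"Found banned terms: {hits}" if hits else "No banned terms found"
    return MetricResult(score=score, reasoning=reasoning)
```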
Test Your Metric
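Before connecting, one lightweight way to sanity-check your logic is to call it locally with a mock request. This continues the hypothetical sketch above, so the module name, function, and request shape are assumptions.

```python
# Local sanity check for the hypothetical sketch above
# (assumes the sketch lives in metric_connector.py).
from metric_connector import evaluate_metric

mock_request = {
    "messages": [
        {"role": "user", "content": "Tell me about your roadmap."},
        {"role": "assistant", "content": "We can't discuss project-x yet."},
    ]
}

result = evaluate_metric(mock_request)
print(result.score, "-", result.reasoning)  # expected: 1.0 - Found banned terms: ['project-x']
```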
Connect Your Metric
What to Expect
Success Indicator
You should see a success indicator in the Snowglobe web UI.
Viewing Results
Each custom metric gets its own dashboard in the simulation results:
- Individual metric view: Detailed breakdown of scores per conversation
- Overview dashboard: Performance across all metrics (built-in + custom)
- Conversation-level scoring: See how each conversation performed on your custom criteria
Resources
- Join the Community: Connect with other developers, ask questions, and share feedback.
- See Examples: See real examples of custom metrics in action.
- Align Custom Metrics: Learn how to validate and improve your custom metric accuracy.
- Integrate with CI/CD: Learn how to integrate your custom metric with your CI/CD pipeline.
Questions? Join our developer community or contact support for help creating effective custom metrics.