Comparing LLM Performance with Snowglobe
This guide shows you how to use Snowglobe to compare different LLM providers on identical conversation scenarios. Learn to objectively evaluate how different models handle the same interactions and make data-driven decisions about which LLM works best for your use case.

Overview
This approach allows you to:

- A/B test different LLM providers on identical conversation scenarios
- Compare model performance objectively using consistent test cases
- Make data-driven LLM selection decisions based on real conversation data
- Scale from simple prompts to complex chatbots using the same testing framework
Why Compare LLMs?
Different LLM providers excel at different tasks:

- Response quality varies significantly between models
- Conversation style differs (formal vs. casual, verbose vs. concise)
- Domain expertise varies (technical, creative, analytical)
- Cost and latency trade-offs affect production decisions
- Consistency in following instructions varies
Architecture
The comparison setup works by:

- Creating a simple chatbot that can switch between LLM providers
- Using Snowglobe to run identical test scenarios against each provider
- Comparing responses to make informed provider selection decisions
- Scaling the same approach to test complex multi-agent systems
Prerequisites
- Python 3.10 or higher
- API keys for the LLM providers you want to compare
- `openai`, `anthropic`, and `snowglobe` packages installed
Installation
Install the required packages:
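Assuming the package names listed in the prerequisites above, a typical install looks like this (adjust for your Python package manager):

```bash
pip install openai anthropic snowglobe
```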
Step 1: Create Your Snowglobe App

Before setting up any code:

- Navigate to the Snowglobe web interface
- Create a new app
- Select the “Connect to local process” option
Step 2: Authenticate with Snowglobe
Authenticate with Snowglobe:
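The exact authentication command isn't reproduced here. Assuming the connector CLI exposes a conventional login-style subcommand, it may look like the sketch below; check the Snowglobe documentation or `snowglobe-connect --help` for the exact command.

```bash
# Placeholder: the real subcommand may differ
snowglobe-connect login
```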
Step 3: Initialize the Snowglobe Connector

Initialize your connector:
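Initialization is an interactive CLI step. Assuming an init-style subcommand (the exact name may differ; see `snowglobe-connect --help`):

```bash
# Placeholder: the real subcommand may differ
snowglobe-connect init
```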
When prompted about stateful connections, select `n` (No). This ensures each test scenario starts fresh without conversation history carryover.
Step 4: Create Your LLM Comparison Framework

Replace the generated connector code with a framework that can switch between different LLM providers:
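A minimal sketch is shown below. It keeps the provider-switching logic in one place and exposes a `process_scenario` function for the connector to call. The `process_scenario(message) -> str` shape, the `SYSTEM_PROMPT`, and the model identifiers are illustrative assumptions; keep the wiring from the code generated in Step 3 and adapt this sketch to its actual entry point.

```python
import os

from openai import OpenAI
from anthropic import Anthropic

# Switch these two values between test runs to change which model
# handles every scenario.
PROVIDER = "openai"        # "openai" or "anthropic"
MODEL = "gpt-4o-mini"      # model identifier for the chosen provider

# Example system prompt; replace with your chatbot's actual instructions.
SYSTEM_PROMPT = "You are a helpful customer support assistant."

# Both clients read OPENAI_API_KEY / ANTHROPIC_API_KEY from the environment.
openai_client = OpenAI()
anthropic_client = Anthropic()


def generate_response(user_message: str) -> str:
    """Route a single user message to the configured provider."""
    if PROVIDER == "openai":
        response = openai_client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_message},
            ],
            temperature=0.7,
            max_tokens=512,
        )
        return response.choices[0].message.content
    if PROVIDER == "anthropic":
        response = anthropic_client.messages.create(
            model=MODEL,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_message}],
            temperature=0.7,
            max_tokens=512,
        )
        return response.content[0].text
    raise ValueError(f"Unknown provider: {PROVIDER}")


def process_scenario(message: str) -> str:
    # Called for each scenario turn. Add database lookups, API calls,
    # or multi-step workflows here as your system grows.
    return generate_response(message)
```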
Scaling Beyond Simple Prompts: This same framework works for complex chatbots. You can add database connections, API integrations, multi-step workflows, or any custom logic in the `process_scenario` function. The key is that Snowglobe will run identical scenarios against whatever system you build.
Step 5: Test Your Setup

Test your connector to verify it's working with GPT-4o Mini:
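One way to sanity-check the provider wiring before launching any simulations is to call `process_scenario` directly. The file and import names below are placeholders; adjust them to match the connector file generated in Step 3.

```python
# quick_check.py — hypothetical smoke-test script
from main import process_scenario  # replace "main" with your connector module's name

print(process_scenario("Hi, I was double-charged on my last invoice. Can you help?"))
```

If a sensible GPT-4o Mini response prints, the OpenAI path is working.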
Step 6: Start Your First Test Run

Start the Snowglobe connector:
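```bash
snowglobe-connect start
```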
Keep this connector running throughout your testing session.

Step 7: Run Your First Comparison (GPT-4o Mini)
Now run your first set of test scenarios:

- Navigate to the Snowglobe web UI
- Select your app and create test scenarios that matter for your use case:
  - Customer service inquiries
  - Technical support questions
  - Sales conversations
  - Complex multi-turn dialogues
  - Edge cases and difficult requests
- Launch the simulation

As the simulation runs, evaluate:

- Response quality: How helpful and accurate are the responses?
- Conversation flow: Does the model maintain context well?
- Tone and style: Does it match your desired brand voice?
- Edge case handling: How does it handle unusual requests?
- Consistency: Are responses reliable across similar scenarios?
Step 8: Switch to Claude Sonnet 4
Now let's compare with Claude Sonnet 4. Stop your current connector (Ctrl+C) and update your provider configuration:
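With the sketch from Step 4, that means changing two configuration values. The Claude model identifier below is illustrative; check Anthropic's model list for the current Sonnet 4 name.

```python
# In your connector file:
PROVIDER = "anthropic"
MODEL = "claude-sonnet-4-20250514"  # illustrative; confirm the exact model id with Anthropic
```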
Step 9: Test the New Model

Test your updated connector:
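If you created the hypothetical smoke-test script from Step 5, rerunning it should now return a Claude Sonnet 4 response:

```bash
python quick_check.py
```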
Step 10: Start Your Second Test Run

Restart the connector with the new model:
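```bash
snowglobe-connect start
```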
Step 11: Run Your Comparison (Claude Sonnet 4)

Run the exact same test scenarios from your first comparison:

- Use identical scenarios from your GPT-4o Mini test
- Launch the simulation in the Snowglobe UI
- Document the differences you observe
Making Your LLM Decision
Now you have data from both models on identical scenarios. Compare them across dimensions that matter for your use case:

Response Quality Comparison
- Accuracy: Which model provides more correct information?
- Helpfulness: Which responses better solve user problems?
- Completeness: Which model provides appropriate level of detail?
Conversation Style Comparison
- Tone: Which model better matches your desired brand voice?
- Clarity: Which responses are easier to understand?
- Conciseness: Which model finds the right balance of detail?
Reliability Comparison
- Consistency: Which model gives similar quality responses across scenarios?
- Instruction following: Which model better follows your guidelines?
- Edge case handling: Which model gracefully handles unusual requests?
Practical Considerations
- Cost: Compare API costs for your expected usage
- Latency: Which model responds faster for your needs?
- Rate limits: Which model’s limits work for your scale?
Expanding Your Comparisons
Testing More Models
You can easily test additional models by updating the provider and model configuration:
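With the Step 4 sketch, each new run is just another pair of values; the identifiers below are examples, and other providers can be added by extending `generate_response` with another branch.

```python
# Examples of alternative configurations (model ids are illustrative):
PROVIDER, MODEL = "openai", "gpt-4o"
# PROVIDER, MODEL = "anthropic", "claude-3-5-haiku-latest"
```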
Make sure you have the appropriate API keys set up for each provider you want to test.

Scaling to Complex Systems

This same comparison framework scales to test:

- Complex Chatbots: Add database connections, API integrations, and multi-step workflows
Advanced Comparison Techniques
Testing Configuration Variations
Compare the same LLM with different settings:
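A minimal way to do this with the Step 4 sketch is to parameterize the sampling settings and switch the active configuration between runs. The names and values below are illustrative.

```python
# Illustrative configuration variants for the same model
CONFIGS = {
    "precise":  {"temperature": 0.2, "max_tokens": 256},
    "creative": {"temperature": 0.9, "max_tokens": 512},
}
ACTIVE = "precise"  # switch to "creative" for the second run, then restart the connector

# Inside generate_response, pass the active settings to the API call:
#   temperature=CONFIGS[ACTIVE]["temperature"],
#   max_tokens=CONFIGS[ACTIVE]["max_tokens"],
```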
Creating Custom Test Scenarios

Design scenarios that test specific capabilities:

- Domain expertise: Technical questions in your field
- Conversation memory: Multi-turn conversations with context
- Edge cases: Unusual requests or error conditions
- Brand voice: Scenarios that test tone and personality
- Complex reasoning: Multi-step problem solving
Troubleshooting
OpenAI SDK connection issues:

- Verify your `OPENAI_API_KEY` is correctly set
- Check that you have quota/credits with OpenAI
- Ensure the model name is correct for OpenAI
Anthropic SDK connection issues:

- Verify your `ANTHROPIC_API_KEY` is correctly set
- Check that you have quota/credits with Anthropic
- Ensure the model name is correct for Anthropic
Unexpected or low-quality responses:

- Review your system prompt for clarity
- Check if the model supports the features you're using
- Verify your `temperature` and `max_tokens` settings
Snowglobe connection issues:

- Ensure your connector is running (`snowglobe-connect start`)
- Check that your app is properly configured in the Snowglobe UI
- Review logs for any error messages
This comparison framework gives you objective data to make LLM selection decisions. Whether you're choosing between providers for a simple chatbot or evaluating complex multi-agent systems, Snowglobe ensures you're comparing apples to apples with identical test scenarios.

The same approach works at any scale, from testing system prompts to evaluating enterprise-grade conversational AI systems.