Comparing LLM Performance with Snowglobe

This guide shows you how to use Snowglobe to compare different LLM providers on identical conversation scenarios. Learn to objectively evaluate how different models handle the same interactions and make data-driven decisions about which LLM works best for your use case.

Overview

This approach allows you to:
  • A/B test different LLM providers on identical conversation scenarios
  • Compare model performance objectively using consistent test cases
  • Make data-driven LLM selection decisions based on real conversation data
  • Scale from simple prompts to complex chatbots using the same testing framework

Why Compare LLMs?

Different LLM providers excel at different tasks:
  • Response quality varies significantly between models
  • Conversation style differs (formal vs. casual, verbose vs. concise)
  • Domain expertise varies (technical, creative, analytical)
  • Cost and latency trade-offs affect production decisions
  • Consistency in following instructions varies
Snowglobe lets you test these differences systematically with real conversation scenarios.

Architecture

The comparison setup works by:
  1. Creating a simple chatbot that can switch between LLM providers
  2. Using Snowglobe to run identical test scenarios against each provider
  3. Comparing responses to make informed provider selection decisions
  4. Scaling the same approach to test complex multi-agent systems

Prerequisites

  • Python 3.10 or higher
  • API keys for the LLM providers you want to compare
  • openai, anthropic, and snowglobe packages installed

Installation

Install the required packages:
pip install openai anthropic snowglobe
Set up your environment variables for the LLM providers you want to test:
export OPENAI_API_KEY="your-openai-api-key-here"
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
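Optionally, you can confirm that both keys are visible to Python before going further. This is a quick throwaway check, not part of the connector:
import os

# Quick sanity check: confirm both API keys are set in this shell
for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")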

Step 1: Create Your Snowglobe App

Before setting up any code:
  1. Navigate to the Snowglobe web interface
  2. Create a new app
  3. Select the “Connect to local process” option
This will prepare Snowglobe to connect to your local chatbot.

Step 2: Authenticate with Snowglobe

Authenticate with Snowglobe:
snowglobe-connect auth

Step 3: Initialize the Snowglobe Connector

Initialize your connector:
snowglobe-connect init
When prompted about stateful connections, select n (No). This ensures each test scenario starts fresh without conversation history carryover.
This will create the necessary files and structure for your Snowglobe connector.

Step 4: Create Your LLM Comparison Framework

Replace the generated connector code with a framework that can switch between different LLM providers:
# main.py (or your connector file)
from openai import OpenAI
from anthropic import Anthropic
from snowglobe.client import CompletionRequest, CompletionFunctionOutputs
import logging
import os

LOGGER = logging.getLogger(__name__)

# Initialize clients
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
anthropic_client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

# Your chatbot configuration - customize this for your use case
CHATBOT_PROMPT = """You are a helpful customer service representative for TechCorp, a software company. 

Your guidelines:
- Always be polite and professional
- Ask clarifying questions when requests are unclear
- Offer specific solutions when possible
- If you can't help, politely direct them to the appropriate department
- Keep responses concise but helpful
- Use a friendly but professional tone"""

# Current provider configuration - change these to test different models
CURRENT_PROVIDER = "openai"  # "openai" or "anthropic"
CURRENT_MODEL = "gpt-4o-mini"  # Start with GPT-4o Mini

def call_openai(messages):
    """Call OpenAI API"""
    response = openai_client.chat.completions.create(
        model=CURRENT_MODEL,
        messages=messages,
        temperature=0.7,
        max_tokens=500
    )
    return response.choices[0].message.content

def call_anthropic(messages):
    """Call Anthropic API"""
    # Convert messages format for Anthropic
    system_message = ""
    conversation_messages = []
    
    for msg in messages:
        if msg["role"] == "system":
            system_message = msg["content"]
        else:
            conversation_messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
    
    response = anthropic_client.messages.create(
        model=CURRENT_MODEL,
        system=system_message,
        messages=conversation_messages,
        temperature=0.7,
        max_tokens=500
    )
    return response.content[0].text

def process_scenario(request: CompletionRequest) -> CompletionFunctionOutputs:
    """
    Process incoming Snowglobe scenarios and return LLM responses.
    
    This is where you can customize your chatbot logic:
    - Add complex conversation handling
    - Integrate with databases or APIs
    - Implement multi-step workflows
    - Add custom business logic
    """
    
    try:
        # Build conversation with your chatbot prompt
        messages = [{"role": "system", "content": CHATBOT_PROMPT}]
        
        for msg in request.messages:
            messages.append({
                "role": msg.role,
                "content": msg.content
            })
        
        # Call the current LLM provider
        if CURRENT_PROVIDER == "openai":
            ai_response = call_openai(messages)
        elif CURRENT_PROVIDER == "anthropic":
            ai_response = call_anthropic(messages)
        else:
            raise ValueError(f"Unknown provider: {CURRENT_PROVIDER}")
        
        print(f"Provider: {CURRENT_PROVIDER} | Model: {CURRENT_MODEL}")
        print(f"Response: {ai_response}")
        print("---")
        
        return CompletionFunctionOutputs(response=ai_response)
        
    except Exception as e:
        LOGGER.error(f"Error with {CURRENT_PROVIDER}/{CURRENT_MODEL}: {str(e)}")
        return CompletionFunctionOutputs(
            response="I'm sorry, I'm experiencing technical difficulties. Please try again later."
        )
Scaling Beyond Simple Prompts: This same framework works for complex chatbots. You can add database connections, API integrations, multi-step workflows, or any custom logic in the process_scenario function. The key is that Snowglobe will run identical scenarios against whatever system you build.

Step 5: Test Your Setup

Test your connector to verify it’s working with GPT-4o Mini:
snowglobe-connect test
You should see a successful test response from the GPT-4o Mini model via the OpenAI SDK.
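If the test fails, you can exercise the OpenAI path directly in a Python shell, independent of Snowglobe. This reuses the call_openai helper defined above and assumes your connector file is named main.py:
# Run from the same directory as your connector file
from main import call_openai, CHATBOT_PROMPT

reply = call_openai([
    {"role": "system", "content": CHATBOT_PROMPT},
    {"role": "user", "content": "Hi, I can't log in to my TechCorp account."},
])
print(reply)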

Step 6: Start Your First Test Run

Start the Snowglobe connector:
snowglobe-connect start
Keep this connector running throughout your testing session.

Step 7: Run Your First Comparison (GPT-4o Mini)

Now run your first set of test scenarios:
  1. Navigate to the Snowglobe web UI
  2. Select your app and create test scenarios that matter for your use case:
    • Customer service inquiries
    • Technical support questions
    • Sales conversations
    • Complex multi-turn dialogues
    • Edge cases and difficult requests
  3. Launch the simulation
Document the results:
  • Response quality: How helpful and accurate are the responses?
  • Conversation flow: Does the model maintain context well?
  • Tone and style: Does it match your desired brand voice?
  • Edge case handling: How does it handle unusual requests?
  • Consistency: Are responses reliable across similar scenarios?
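One lightweight way to keep these notes comparable across runs is to log each scenario's outcome to a file as you review it. The sketch below is a hypothetical helper, not part of Snowglobe; the ratings and notes are whatever you record:
import csv
from pathlib import Path

# Hypothetical helper: append one row of review notes per scenario
def log_result(provider, model, scenario, quality, notes, path="llm_comparison.csv"):
    new_file = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["provider", "model", "scenario", "quality_1_to_5", "notes"])
        writer.writerow([provider, model, scenario, quality, notes])

log_result("openai", "gpt-4o-mini", "refund request", 4, "Polite, but skipped the clarifying question")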

Step 8: Switch to Claude 3.5 Sonnet

Now let’s compare with Claude 3.5 Sonnet. Stop your current connector (Ctrl+C) and update your provider configuration:
# Update these lines in your main.py file
CURRENT_PROVIDER = "anthropic"  # Switch to Anthropic
CURRENT_MODEL = "claude-3-5-sonnet-20241022"  # Switch to Claude 3.5 Sonnet

Step 9: Test the New Model

Test your updated connector:
snowglobe-connect test
Verify that Claude 3.5 Sonnet is now responding.

Step 10: Start Your Second Test Run

Restart the connector with the new model:
snowglobe-connect start

Step 11: Run Your Comparison (Claude 3.5 Sonnet)

Run the exact same test scenarios from your first comparison:
  1. Use identical scenarios from your GPT-4o Mini test
  2. Launch the simulation in the Snowglobe UI
  3. Document the differences you observe

Making Your LLM Decision

Now you have data from both models on identical scenarios. Compare them across dimensions that matter for your use case:

Response Quality Comparison

  • Accuracy: Which model provides more correct information?
  • Helpfulness: Which responses better solve user problems?
  • Completeness: Which model provides appropriate level of detail?

Conversation Style Comparison

  • Tone: Which model better matches your desired brand voice?
  • Clarity: Which responses are easier to understand?
  • Conciseness: Which model finds the right balance of detail?

Reliability Comparison

  • Consistency: Which model gives similar quality responses across scenarios?
  • Instruction following: Which model better follows your guidelines?
  • Edge case handling: Which model gracefully handles unusual requests?

Practical Considerations

  • Cost: Compare API costs for your expected usage
  • Latency: Which model responds faster for your needs?
  • Rate limits: Which model’s limits work for your scale?
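If you want a single number to compare against, you can roll your per-dimension judgments into a weighted score. The dimensions, weights, and ratings below are illustrative placeholders; choose your own based on what matters for your product:
# Illustrative scoring sketch: 1-5 ratings per dimension, weighted by importance
weights = {"accuracy": 0.3, "helpfulness": 0.25, "tone": 0.15, "consistency": 0.15, "cost": 0.15}

scores = {
    "gpt-4o-mini": {"accuracy": 4, "helpfulness": 4, "tone": 3, "consistency": 4, "cost": 5},
    "claude-3-5-sonnet-20241022": {"accuracy": 5, "helpfulness": 4, "tone": 5, "consistency": 4, "cost": 3},
}

for model, dims in scores.items():
    total = sum(weights[d] * dims[d] for d in weights)
    print(f"{model}: {total:.2f}")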

Expanding Your Comparisons

Testing More Models

You can easily test additional models by updating the provider and model configuration:
# OpenAI models:
CURRENT_PROVIDER = "openai"
CURRENT_MODEL = "gpt-4o"           # GPT-4o
# CURRENT_MODEL = "gpt-4o-mini"    # GPT-4o Mini  
# CURRENT_MODEL = "gpt-4-turbo"    # GPT-4 Turbo

# Anthropic models:
CURRENT_PROVIDER = "anthropic"
CURRENT_MODEL = "claude-3-5-sonnet-20241022"  # Claude 3.5 Sonnet
# CURRENT_MODEL = "claude-3-haiku-20240307"   # Claude 3 Haiku
# CURRENT_MODEL = "claude-3-opus-20240229"    # Claude 3 Opus
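If editing main.py between runs becomes tedious, one option is to read the provider and model from environment variables instead, so each run only needs a different shell export. A minimal sketch, assuming you keep the defaults shown above (the LLM_PROVIDER and LLM_MODEL variable names are just examples):
import os

# Read the provider/model from the environment, falling back to the defaults above
CURRENT_PROVIDER = os.getenv("LLM_PROVIDER", "openai")
CURRENT_MODEL = os.getenv("LLM_MODEL", "gpt-4o-mini")
With this in place you could start a run with, for example, LLM_PROVIDER=anthropic LLM_MODEL=claude-3-5-sonnet-20241022 snowglobe-connect start.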

Scaling to Complex Systems

This same comparison framework scales to test:
Complex Chatbots: Add database connections, API integrations, and multi-step workflows, as in the sketch below.
def process_scenario(request: CompletionRequest) -> CompletionFunctionOutputs:
    # Add your complex chatbot logic here:
    # - Database lookups
    # - API integrations
    # - Multi-agent coordination
    # - Custom business logic

    # The LLM comparison happens at the end. `messages`, `context_data`, and the
    # *_with_context helpers are placeholders for whatever your system builds.
    if CURRENT_PROVIDER == "openai":
        ai_response = call_openai_with_context(messages, context_data)
    elif CURRENT_PROVIDER == "anthropic":
        ai_response = call_anthropic_with_context(messages, context_data)

    return CompletionFunctionOutputs(response=ai_response)
Multi-Agent Systems: Compare how different LLMs perform in agent roles.
RAG Systems: Test which LLM better handles your retrieval-augmented generation pipeline.
Custom Workflows: Evaluate LLMs on your specific business processes.
Make sure you have the appropriate API keys set up for each provider you want to test.

Advanced Comparison Techniques

Testing Configuration Variations

Compare the same LLM with different settings:
# Temperature comparison:
response = openai_client.chat.completions.create(
    model=CURRENT_MODEL,
    messages=messages,
    temperature=0.1,  # Conservative vs 0.9 (Creative)
    max_tokens=500
)

# System prompt variations:
CHATBOT_PROMPT_V1 = "You are helpful and concise..."
CHATBOT_PROMPT_V2 = "You are helpful and detailed..."
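To keep variation runs organized, you can treat the prompt and temperature as part of the same switchable configuration you use for providers. A small sketch building on the V1/V2 prompts above (variable names are illustrative):
# Illustrative: select a prompt/temperature variant per run, just like the provider
VARIANTS = {
    "v1_concise": {"prompt": CHATBOT_PROMPT_V1, "temperature": 0.1},
    "v2_detailed": {"prompt": CHATBOT_PROMPT_V2, "temperature": 0.9},
}
ACTIVE_VARIANT = VARIANTS["v1_concise"]

# Then use ACTIVE_VARIANT["prompt"] as the system message and
# ACTIVE_VARIANT["temperature"] in your call_openai/call_anthropic functions.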

Creating Custom Test Scenarios

Design scenarios that test specific capabilities:
  • Domain expertise: Technical questions in your field
  • Conversation memory: Multi-turn conversations with context
  • Edge cases: Unusual requests or error conditions
  • Brand voice: Scenarios that test tone and personality
  • Complex reasoning: Multi-step problem solving

Troubleshooting

OpenAI SDK connection issues:
  • Verify your OPENAI_API_KEY is correctly set
  • Check that you have quota/credits with OpenAI
  • Ensure the model name is correct for OpenAI
Anthropic SDK connection issues:
  • Verify your ANTHROPIC_API_KEY is correctly set
  • Check that you have quota/credits with Anthropic
  • Ensure the model name is correct for Anthropic
Model not responding as expected:
  • Review your system prompt for clarity
  • Check if the model supports the features you’re using
  • Verify temperature and max_tokens settings
Snowglobe connection problems:
  • Ensure your connector is running (snowglobe-connect start)
  • Check that your app is properly configured in the Snowglobe UI
  • Review logs for any error messages
This comparison framework gives you objective data for LLM selection decisions. Whether you’re choosing between providers for a simple chatbot or evaluating complex multi-agent systems, Snowglobe ensures you’re comparing apples to apples with identical test scenarios. The same approach works at any scale, from testing system prompts to evaluating enterprise-grade conversational AI systems.