Aug 14, 2025
Testing Changi Airport's Chatbot
As part of the AI Verify Pilot, Singapore's Changi Airport set out to test how AskMax, its virtual concierge chatbot, performed on realistic simulated scenarios.
Powered by a large language model (LLM), the chatbot was designed to provide reliable, context-aware responses across key domains such as check-in, transit, retail and transport. It handles enquiries from passengers and the general public across multiple platforms, including the Changi Airport website and the Changi Mobile Application. AskMax helps reduce the workload on frontline teams while improving the accessibility of airport information.
Using Snowglobe, Changi Airport performed large-scale simulation testing, generating realistic, diverse scenarios to probe critical failure modes including hallucinations, off-topic responses and policy violations. The platform delivered synthetic coverage capabilities that surpass what any manual test set can achieve, enabling thorough evaluation of AI system performance across a wide range of potential interactions.
"Snowglobe simulated hundreds of conversations to test for AI risks such as hallucination and toxicity, helping us identify previously overlooked or under-tested cases. Their risk report was also informative by highlighting areas that need further improvements." - Joe Chiu, Vice President, Data Management Systems, Changi Airport Group
Testing was driven by synthetic prompts that emulated real customer interactions, to probe user behavior in a controlled, repeatable fashion and to measure three areas of concern: hallucination, toxic speech, and excessive refusal.
Key characteristics of synthetic test data:
Realism: Each prompt mimicked natural language, intent, and tone used by real users.
Diversity: Prompts systematically explored use cases, linguistic variations, and structurally unusual queries that are rare in real data.
Topic Coverage: Prompts were grounded in the full set of knowledge base topics supplied by the deployer (e.g., billing, policy, technical troubleshooting).
Each topic area was tested with approximately 100 multi-turn conversations, using prompts generated by Snowglobe's proprietary algorithm. This ensured statistical robustness without inflating manual review efforts. Conversations were grounded in the content provided by Changi Airport and contextualized to specific use cases.
The following table summarizes the Testing Strategy:
Test breakdown | One full simulation run per topic |
Test data volume | ~100 multi-turn conversations per topic, yielding statistically robust samples without inflating manual-review load |
Content sources | Deployer-provided knowledge base plus application- and topic-specific context |
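To give a sense of what a sample of roughly 100 conversations per topic can and cannot resolve, the sketch below computes a 95% Wilson score interval for an observed failure rate. This is an illustration only: the Wilson interval is a standard statistical technique chosen here for the example, not a method attributed to Snowglobe or Changi Airport, and the failure count is hypothetical rather than a result from the pilot.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a failure rate.

    failures: number of conversations flagged as failing (hypothetical here)
    n:        number of conversations tested for the topic
    z:        normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= failures <= n:
        raise ValueError("failures must be between 0 and n")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 5 flagged conversations out of 100 for one topic.
low, high = wilson_interval(failures=5, n=100)
print(f"Observed rate 5.0%, 95% interval {low:.1%} to {high:.1%}")
```

With these made-up numbers the interval spans several percentage points, which suggests that a run of about 100 conversations is suited to detecting recurring failure patterns rather than to estimating very rare ones precisely.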
Lessons Learned
Reprioritizing Focus Areas Through Simulation Testing
Simulated conversations provided deeper insights into how the chatbot handled a wide range of user interactions, including everyday and less common scenarios. Initial testing priorities were adjusted midway through the pilot after observing recurring patterns. This flexibility ensured that effort was directed to areas with the greatest impact on user experience. The experience reinforced the importance of maintaining an adaptive, data-driven approach when assessing the behavior of generative AI in live environments.
Test Design - Realistic and diverse test data is critical
Static golden datasets and red-teaming datasets cover only a narrow slice of user behavior. Changi Airport needed long-tail conversations that look “normal” but still stress-test the system in unexpected ways, in order to uncover what its most probable failure cases look like.
Test Implementation
Aligning automated judges with human expectations was a critical enabler of large-scale evaluation. The large number of simulated user conversations would have been hard to analyze without these automated judges, yet the judges themselves were often hard to define and implement precisely.
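One common way to check whether an automated judge matches human expectations is to have people label a small sample of conversations and measure chance-corrected agreement with the judge. The sketch below uses Cohen's kappa for this. It is a minimal illustration under stated assumptions: the case study does not say which agreement measure was used, and the labels and function name are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between an automated judge and a human reviewer."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge_labels)
    # Fraction of conversations where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    categories = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[c] * human_counts[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight conversations on one risk area (e.g. hallucination).
judge = ["fail", "pass", "pass", "fail", "pass", "pass", "pass", "fail"]
human = ["fail", "pass", "pass", "pass", "pass", "pass", "fail", "fail"]
print(f"kappa = {cohens_kappa(judge, human):.2f}")
```

A low kappa on the human-labelled sample would indicate that the judge's criteria need refinement before its verdicts are trusted across the full set of simulated conversations.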
Read the full case study done as part of the AI Verify pilot.