Evaluating AI Agents Like Products, Not Prompts
In the age of Agentic AI, shipping without deeply understanding the user experience isn't just a risk; it's an existential threat to your product.
Product development used to be a fast loop: ship, track, fix, repeat. A few A/B tests, some bugs logged, and you were iterating on the order of days. But today's AI-native products don't just have bugs; they have cascading behavioural failures that can't be hot-fixed. When your AI agent makes a wrong assumption or misunderstands user intent, the ripple effect can take months to fix, and by then your users' trust is gone. In this post, we share what we learned from conversations with leaders building AI products, along with insights from our own experience building a conversational AI chatbot at Zealth.
The shift with Agentic AI systems is that they become the user interface itself, rather than just a backend enabler like traditional machine learning systems. Previously, fixing a bug or usability issue was a simple code change: typically resolved in an hour, at most a day, and you were good to go. But with AI agents acting as the interface, an incorrectly classified intent or a poorly handled flow requires either retraining (or fine-tuning) the underlying model or modifying the prompt. Either way, you must re-test all user flows, since they are all driven by the same model. This can take days or even months, depending on your use case, and by then, user trust is already lost.
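To make that re-testing cost concrete, here is a minimal sketch of a regression harness that replays recorded conversation flows after any prompt or model change. The flow format and the `run_agent` call are hypothetical stand-ins for whatever your own stack exposes, not a description of our production setup.

```python
# Minimal sketch: after any prompt or model change, replay every recorded
# flow, because every flow is driven by the same underlying model.
# RECORDED_FLOWS and run_agent are hypothetical placeholders for your stack.

RECORDED_FLOWS = [
    [("I am feeling sick", "symptom_report")],
    [("What should I eat because I'm feeling nausea", "diet_query")],
]

def run_agent(message: str) -> str:
    """Placeholder for a call into your agent; returns the predicted intent."""
    raise NotImplementedError

def replay_all_flows() -> list[str]:
    """Return a description of every flow step whose intent no longer matches."""
    failures = []
    for flow in RECORDED_FLOWS:
        for user_msg, expected_intent in flow:
            predicted = run_agent(user_msg)
            if predicted != expected_intent:
                failures.append(
                    f"{user_msg!r}: expected {expected_intent}, got {predicted}"
                )
    return failures
```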
Real-world Example
While building our symptom-checker chatbot at Zealth, we developed an entire knowledge graph of symptoms and their corresponding home-management protocols. We also built an LLM-powered flow that could understand the user's intent, extract symptoms, ask follow-up questions, and suggest the most relevant management protocol, such as diet or exercise recommendations, for home care. We had accounted for symptom-related queries like "I am feeling sick" or "What should I eat because I'm feeling nausea." For non-symptom queries, such as "Zyada protein ki cheeze batao" ("Suggest foods with more protein"), we provided a standard out-of-context response in the first version.

However, as soon as we went live, the very first user query was "Sir k pichle wale hisse me dard hai" ("sar", sometimes romanized as "sir", means "head" in Hindi, so the sentence means "There is pain in the back of my head"). Our chatbot read "Sir" as the English honorific instead of the Hindi word for head, and the conversation went in a totally different direction, leading to a poor user experience. The issue stemmed from how we had structured our prompts. In hindsight, we should have handled such inputs by asking follow-up questions, but in our rush to release the first version and learn from it, we hadn't accounted for this edge case during testing.
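In retrospect, even a simple guard for known ambiguous tokens would have turned this failure into a clarifying question. The sketch below is illustrative only; the token list and wording are made up for this post and are not taken from our production prompts.

```python
# Hedged sketch: if a message contains a token with conflicting readings across
# languages (e.g. "sir" as the English honorific vs. Hindi "sar", head), ask a
# follow-up question instead of committing to an intent. Illustrative only.

AMBIGUOUS_TOKENS = {
    "sir": ['the back of your head ("sar")', "addressing someone as Sir"],
}

def clarifying_question(message: str) -> str | None:
    """Return a follow-up question if the message contains an ambiguous token."""
    for token, readings in AMBIGUOUS_TOKENS.items():
        if token in message.lower().split():
            return f"Just to confirm, did you mean {readings[0]} or {readings[1]}?"
    return None

print(clarifying_question("Sir k pichle wale hisse me dard hai"))
# -> asks the user to disambiguate instead of misrouting the conversation
```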
Unlike traditional software, which has a finite number of pathways through a given product or feature, AI agents have an effectively infinite number of pathways, making it impossible to test them all!
High Stakes & Real Consequences
A user hitting an unhappy flow can be extremely costly, especially in high-stakes applications. In our case, the first patient of a doctor we had been pursuing for almost a year, a patient who also happened to be a close friend of that doctor, ran into exactly such a scenario. They discussed it with the doctor, and as a result we lost the doctor's trust and the pilot opportunity. This one bug taught us more about real-world agent UX than 200 happy-path evals ever could. Similar situations with other healthcare or finance bots highlight the need for exhaustive testing of boundary cases. It's critical to ensure, right from day one, that the bot does not make diagnoses or give recommendations that are non-compliant with industry standards.
Why does this problem exist?
Current evaluations rely on manually reviewing user queries and selecting test cases, a process that’s time-consuming and limited by team bandwidth. At Zealth, we started by manually chatting with patients on WhatsApp to gather real conversations. From 200 such chats, we created our training and test datasets. But with only 30 conversations in the final test set, we barely scratched the surface of real-world scenarios.
And it's not just us; this challenge resonates across the industry. The most common way to assess the usability of AI agents today is to set up evaluations (evals). Yet the most popular LLM eval tools, such as MLflow, DeepEval, and Arize AI, focus only on evaluating individual models or prompts within an agentic system. As one Lead ML Engineer at a major transport company in Southeast Asia put it: "Testing individual ML models is easy. Testing how they interact in a real product is the hard part."
The biggest risk isn’t broken flows. It’s misunderstood users. Current eval tools miss emergent UX failures that happen when multiple agents, models, or prompts interact in unpredictable ways.
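To make that gap concrete, here is a rough, tool-agnostic sketch contrasting the single-turn checks most eval libraries focus on with the conversation-level checks a product actually needs. Both functions are illustrative stubs, not any specific library's API.

```python
# A single-turn eval scores one prompt/response pair against a reference.
def eval_single_turn(response: str, reference: str) -> float:
    """Roughly what per-model or per-prompt eval tooling measures."""
    return 1.0 if reference.lower() in response.lower() else 0.0

# A conversation-level eval looks for emergent failures across the whole flow.
def eval_conversation(turns: list[tuple[str, str]]) -> list[str]:
    """Illustrative flow-level checks on (user, bot) turn pairs."""
    issues = []
    if not any("?" in bot for _, bot in turns):
        issues.append("bot never asked a clarifying question")
    if len(turns) > 10:
        issues.append("user needed more than 10 turns to reach their goal")
    return issues
```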

How do we solve this?
This problem can be reframed from a data science perspective: what if you could simulate a wide range of possible interactions with your chatbot, based on user personas learned from their digital behaviour, in a matter of minutes? You could then evaluate these simulated flows against your key performance indicators (KPIs), automatically flagging unhappy paths or risky interactions and fixing them before real users ever encounter them. Such an approach could even enable proactive testing for bad actors and edge cases, including scenarios where non-compliant or unsafe recommendations might surface, which is critical for high-stakes domains like healthcare and finance.
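Here is a minimal sketch of that idea, under the assumption that you have an LLM-backed user simulator and programmatic access to your agent. The persona descriptions and KPI checks are illustrative examples, not a finished framework.

```python
# Persona-driven simulation sketch. simulate_user_turn and run_agent are
# placeholders for an LLM-backed user simulator and your production agent;
# the personas and KPI checks below are illustrative only.

PERSONAS = [
    "post-surgery patient who mixes Hindi and English",
    "anxious caregiver asking about diet",
    "bad actor trying to extract a diagnosis the bot must never give",
]

def simulate_user_turn(persona: str, history: list[tuple[str, str]]) -> str:
    """Placeholder: prompt an LLM to write the next message as this persona."""
    raise NotImplementedError

def run_agent(message: str, history: list[tuple[str, str]]) -> str:
    """Placeholder: your production agent."""
    raise NotImplementedError

def simulate_and_flag(max_turns: int = 8) -> dict[str, list[str]]:
    """Run one simulated conversation per persona and flag KPI violations."""
    flags: dict[str, list[str]] = {}
    for persona in PERSONAS:
        history: list[tuple[str, str]] = []
        for _ in range(max_turns):
            user_msg = simulate_user_turn(persona, history)
            bot_msg = run_agent(user_msg, history)
            history.append((user_msg, bot_msg))
        bot_text = " ".join(bot for _, bot in history).lower()
        # Illustrative KPIs: compliance and clarification behaviour.
        if "diagnosis" in bot_text:
            flags.setdefault(persona, []).append("possible non-compliant diagnosis")
        if not any("?" in bot for _, bot in history):
            flags.setdefault(persona, []).append("never asked a clarifying question")
    return flags
```

The flagged transcripts then become your manual-review queue, shrinking the "read every chat" step described earlier to only the conversations that actually look risky.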
If you're tackling similar challenges or exploring solutions in this space, I'd be excited to exchange ideas. Let’s connect.