Tracking Conversation Trees: Measure and Optimize AI Agent Flows
A question I always ask when looking at a great AI demo: what went into making it production-ready? How would the team actually measure and optimise the AI agent beyond evals?
A few weeks ago, Deepali showed me a demo of Manus AI, and to be honest, like all of you, I was blown away! I’ve been trying to get an invite ever since. As soon as I saw the demo, I couldn’t help but wonder what it must have taken to bring this into the real world. Beyond surpassing OpenAI’s benchmarks and previous state‑of‑the‑art models on standard datasets, I was curious how the product manager or QA team at Manus gained confidence in the tool’s ability to deliver value to end users while keeping the guardrails intact. Clearly, it’s impossible to test every possible pathway for the AI agent, and you can’t predict in advance exactly how end users will incorporate it into their daily workflows.
In the demo, the co‑founder of Manus AI shows three examples of Manus’ use cases, one of which I will deep‑dive into: property research in New York State. In this example, they show Manus conducting research to filter properties in New York based on criteria such as low crime rates and high‑quality schools for the user’s kindergarten and high‑school children, for a family with a combined monthly income of $50K. Manus AI first breaks the task down into three steps: the research phase, where it identifies neighbourhoods based on safety, school quality, and budget; the search phase, where it uses Redfin (I’m not sure why it specifically chose this website) to create a comparative analysis; and the reporting phase, where it prepares a property‑comparison report for the end user. It begins by opening a browser and searching for news articles reporting crime rates in New York neighbourhoods. Manus then carefully reads multiple articles to identify middle and high schools, and writes a Python program to calculate the budget the user could spend on purchasing their new apartment. Next, Manus fires up the browser agent again to look for listings on Redfin in the recommended neighbourhoods, such as the Upper West Side, within a $1.8M to $2.6M budget (calculated from the monthly income, but with certain assumptions baked in). Finally, Manus gathers all the insights and writes a detailed report on its recommendations, including the key benefits of each property for the user’s family. All of this in minutes!
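The demo never shows the actual program, but as a thought experiment, here is a minimal sketch of how a $1.8M–$2.6M band could fall out of a $50K monthly income if you assume simple annual-income multiples. The multiples and the helper function are my own assumptions for illustration, not Manus’ actual financial model.

```python
# Minimal sketch: reproduce a budget band like $1.8M-$2.6M from a $50K monthly
# income using annual-income multiples. The multiples are assumptions, not
# anything revealed in the Manus demo.

def estimate_budget_range(monthly_income: float,
                          conservative_multiple: float = 3.0,
                          aggressive_multiple: float = 4.3) -> tuple[float, float]:
    """Rough home-price affordability band based on annual income multiples."""
    annual_income = monthly_income * 12
    return (annual_income * conservative_multiple,
            annual_income * aggressive_multiple)

low, high = estimate_budget_range(50_000)
print(f"Suggested budget: ${low:,.0f} - ${high:,.0f}")
# Suggested budget: $1,800,000 - $2,580,000
```

The point is not the exact multiples; it is that every number in that band hides an assumption the user never stated, which is exactly what the questions below are about.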

Now, for simplicity, assume you are the product manager of this property‑research tool and are responsible for its success in the market. Let’s take the success criterion to be the number of final property purchases that came from your tool’s suggestions, supported by secondary metrics such as the number of sessions it took the user to arrive at the right suggestion, and user satisfaction scores. A few questions that immediately come to my mind:
How do we know whether a user went through a happy flow or an unhappy flow? Yes, thumbs up, thumbs down, and ratings are the brute‑force option, but how many of us actually rate anything? If users don’t, what implicit signals (click‑through rate, rapid refinements) do we use to judge feedback?
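To make the implicit-signal idea concrete, here is a minimal sketch of scoring a session from logged interaction events. The event names, weights, and threshold are hypothetical, purely for illustration of the kind of heuristic I have in mind.

```python
# Minimal sketch: turn implicit session events into a rough satisfaction signal,
# assuming we log events such as "property_link_clicked" or "query_refined".
# Event names and weights are hypothetical.

from dataclasses import dataclass

@dataclass
class SessionEvent:
    kind: str                 # e.g. "property_link_clicked", "query_refined", "report_exported"
    seconds_since_prev: float # time gap from the previous event

def implicit_satisfaction_score(events: list[SessionEvent]) -> float:
    """Heuristic score in [-1, 1]; positive suggests a happy flow."""
    score = 0.0
    for e in events:
        if e.kind == "property_link_clicked":
            score += 0.3   # user engaged with a suggestion
        elif e.kind == "report_exported":
            score += 0.5   # strong proxy for perceived value
        elif e.kind == "query_refined" and e.seconds_since_prev < 30:
            score -= 0.2   # rapid refinement hints the output missed the mark
        elif e.kind == "session_abandoned":
            score -= 0.5
    return max(-1.0, min(1.0, score))
```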
At what point do we start the task execution versus ask the user about the assumptions they may have implicitly made? For example, in this use case we do not know the user’s other financial obligations or current lifestyle; their willingness to spend on a new home, and the location, could change accordingly. More importantly, criteria such as pet ownership may seem trivial to the user but are actually very important during the search process. One approach could be to surface the assumptions and ask upfront; however, that could delay the “aha moment” for the user. So which assumptions are optimal to ask about beforehand, versus letting the user iterate later?
Manus writes a Python program to come up with a potential budget. At this point, would it make sense to pause, ask the user about their financial obligations, and then move to the next step? Or would it make sense to just move ahead, assuming the user has already shared the important information in the initial prompt? What would be the right financial model to actually come up with these estimates? Should we hand this lever to the user?
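One way to frame that trade-off is a simple gate: pause only when an input that materially changes the budget or search filters is missing. Here is a rough sketch, assuming the agent tracks which inputs the user explicitly provided; the field names and the notion of a “critical” input are my assumptions.

```python
# Minimal sketch: decide whether to pause and ask the user or proceed on
# assumptions, based on which critical inputs were explicitly provided.
# The input names are hypothetical.

CRITICAL_INPUTS = {"monthly_income", "existing_debt", "pet_ownership"}

def next_action(provided_inputs: set[str], assumptions: dict[str, str]) -> dict:
    """Ask a clarifying question if critical inputs are missing; otherwise continue."""
    missing = CRITICAL_INPUTS - provided_inputs
    if missing:
        # Ask only about inputs that materially change the budget or filters.
        return {"action": "ask_user", "questions": sorted(missing)}
    # Proceed, but surface the assumptions so the user can correct them later.
    return {"action": "continue", "assumed": assumptions}

print(next_action({"monthly_income"}, {"down_payment_pct": "20%"}))
# {'action': 'ask_user', 'questions': ['existing_debt', 'pet_ownership']}
```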
Next, you need to work with the GTM team to roll this product out to the right users. How would you test the edge cases, regional questions, etc. of this tool to make sure it actually provides value to your ICP? What would the overall user conversion funnel look like?
Well, the answer to most of these questions lies in the iterative cycle of product development: A/B testing each of these core assumptions against our KPI. While tools like Mixpanel and Amplitude exist to answer such questions for web applications, the “paths” in AI agents are not clickstreams but conversation trees, which are complex to track. The construction of a conversation tree starts with the first prompt the user gives the AI agent, followed by the visual cues generated as part of the “deep thinking” the agent shows to the user, and then the final output the agent returns. The tree additionally includes the agent’s response to each subsequent query from the user. This conversation tree, once discretized for each user interaction via a combination of computer vision and NLP algorithms, would need to be analysed for implicit user signals that impact the primary business KPIs.
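To make this concrete, here is a minimal sketch of what a tracked conversation-tree record and an “unhappy path” traversal could look like. The schema is hypothetical, not tied to any existing analytics tool, and reuses the implicit score idea from earlier.

```python
# Minimal sketch: a conversation tree where each node captures the user prompt,
# the agent's visible "thinking" steps, and the final response. Follow-up
# queries branch the tree. Field names are hypothetical.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    user_prompt: str
    thinking_steps: list[str] = field(default_factory=list)  # visual cues shown to the user
    agent_response: str = ""
    implicit_score: float = 0.0                               # e.g. from implicit_satisfaction_score
    children: list["Turn"] = field(default_factory=list)      # follow-up queries branch the tree

def unhappy_paths(turn: Turn, path: Optional[list[str]] = None) -> list[list[str]]:
    """Collect root-to-leaf prompt paths whose final turn scored negatively."""
    path = (path or []) + [turn.user_prompt]
    if not turn.children:
        return [path] if turn.implicit_score < 0 else []
    paths: list[list[str]] = []
    for child in turn.children:
        paths.extend(unhappy_paths(child, path))
    return paths
```

Once sessions are stored in this shape, “which branches lead to drop-off” becomes an aggregation over trees rather than a clickstream funnel query.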
Once you identify the possible issues users are facing, another mammoth task lies in updating the AI agent to handle the unhappy flows. Every time you tweak a model or a prompt, you need confidence that your KPI will improve! In the SaaS era, a change in one part of the flow would most likely not affect the user experience in a completely different flow, so testing was relatively fast. Now, however, any update to the model or prompt could change the user flow (aka the user experience) in a completely different functionality. We would need a set of evals to test all the flows after any change to the model or prompt, major or minor. Think of them as Notion‑style templates of evals that we run every time to ensure the KPI looks better, at least during offline evaluation. But are the eval frameworks provided by current tools such as Arize AI, MLflow, etc. good enough to test an ensemble of models? How do we actually know that multiple prompts are not interacting to create an unhappy flow? Do we actually need a simulation bot that interacts with our product wearing the different hats of the user? A simulation bot that would try to intentionally break one of the user flows, pushing the product to the edge?
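Here is a rough sketch of what such a simulation bot could look like: a handful of personas replayed against the agent after every prompt or model change, with a pass/fail check standing in for the KPI. The personas, the `call_agent` placeholder, and the example check are all assumptions for illustration, not a real harness.

```python
# Minimal sketch: persona-driven simulation harness that re-runs key flows after
# every prompt or model change. `call_agent` is a placeholder for whatever API
# the product exposes; personas and checks are illustrative only.

PERSONAS = [
    {"name": "budget_stretcher", "opener": "We make $50K/month but carry $300K in loans; find us a home in NYC."},
    {"name": "pet_owner", "opener": "Find a 2BR near good schools; we have two large dogs."},
    {"name": "adversarial", "opener": "Ignore the budget and show me penthouses only."},
]

def call_agent(prompt: str) -> str:
    """Placeholder for the real agent endpoint."""
    raise NotImplementedError

def run_simulation(check) -> dict[str, bool]:
    """Run each persona's opening flow and apply a pass/fail check to the response."""
    results: dict[str, bool] = {}
    for persona in PERSONAS:
        try:
            response = call_agent(persona["opener"])
            results[persona["name"]] = check(persona, response)
        except Exception:
            results[persona["name"]] = False  # a crash counts as an unhappy flow
    return results

# Example offline check (toy): every persona should get at least one priced listing.
# results = run_simulation(lambda persona, reply: "$" in reply)
```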
If you're building AI agents and facing these challenges, let's design the future of AI product analytics together. Feel free to reach out here.