Adventures in Agentic AI

Introduction: Star Trek

In 1969, NBC cancelled Star Trek (TOS / The Original Series).

Three years earlier, when the series launched, it quickly became a sensation and garnered an extremely passionate audience. So passionate, in fact, that NBC had to beg the fans to stop writing letters. NBC’s problem was that the production was too costly, and in return they didn’t really get the ratings they were hoping for.

Or at least they thought they didn’t.

You see, Nielsen ratings back then didn’t have demographics. NBC didn’t realize that the numbers they were seeing — and therefore the numbers they were using to make informed decisions — were flawed and incomplete.

Yes, they hoped the numbers would be higher, because higher usually meant better. What they didn’t know, though, was that they had a stranglehold on what would eventually be identified as the prized 18–34 audience demographic for advertisers.

It was the early days of TV ratings benchmarks. The tool was crude and incomplete, and so the benchmark score they got led them to make the seemingly reasonable decision to cancel a show, a blunder that would eventually land on various “Top TV cancellation blunders” lists.

It wasn’t 100% NBC’s fault. Their ability to make good decisions was just impaired by the technology of their day.

When your measurement tools are crude and incomplete, or simply aren't measuring what's actually important, then the indicators you get will actively mislead you. 

Agentic AI Benchmarking Problems

I’ve been thinking about LLM benchmarking a lot since the early Llama 2 days. Even then, LLMs showed extreme potential to transform enterprise processes, which was my focus, but unfortunately benchmarking LLMs is a difficult task.

Over time, this state of affairs did not improve. Lots of LLM benchmarks appeared (MMLU, HumanEval, GLUE, etc.), but matching benchmark scores to actual use-case performance was essentially impossible.

There’s also the tricky problem of benchmark data getting hoovered up into LLM pre-training data. This compounded the existing problem (what do benchmark scores even mean, and how do they relate to real-world performance?) by adding a new one: are the scores improving because LLMs are getting that much smarter, or because the publicly available, widespread benchmark data is being memorized during pretraining?

And this is extremely important for agentic AI deployments. When we give LLMs tools and expect them to work autonomously and be successful, the difference between “they memorized this by overfitting on contaminated training data” and “they are genuinely good at this” spells the difference between a successful, production-level enterprise deployment and something that is, and will always be, a fantastic demo and nothing more.

The Kamiwaza Agentic Merit Index

What if we actually measured real-world enterprise capability?

And this is how I ended up with the crazy idea of trying to measure real-world enterprise task performance - which actually isn’t all that crazy. It was just very hard.

First, I had to create tech that could enable that. This is what eventually solidified into the PICARD framework (see paper here). It later became the foundation for Kamiwaza’s fantastic Agentic AI simulator technology. Simply put, this tech lets us see how various models perform on various enterprise tasks, and shows the quantitative and qualitative effects of changing models, tweaking instructions, and tweaking tools.
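To make the idea concrete, here is a minimal, hypothetical sketch of the kind of harness such a simulator enables. None of the names below (run_episode, pass_rates, the task IDs) come from PICARD or Kamiwaza; they are placeholders for “let the model work a simulated enterprise task with its tools, verify the outcome, and compare pass rates when you swap models, instructions, or tools.”

```python
# Hypothetical sketch only: not the PICARD or Kamiwaza API.
import random
from typing import Callable

def run_episode(model: str, task: str, instructions: str, tools: list[str]) -> bool:
    # Placeholder scorer. A real harness would drive the model against the
    # task autonomously, then deterministically verify the end state.
    random.seed(hash((model, task)))
    return random.random() > 0.5

def pass_rates(models: list[str], tasks: list[str], instructions: str,
               tools: list[str],
               episode: Callable[..., bool] = run_episode) -> dict[str, float]:
    # Hold tasks, instructions, and tools fixed so any difference in pass
    # rate is attributable to the model swap (and vice versa when you vary
    # instructions or tools with the model fixed).
    return {
        m: sum(episode(m, t, instructions, tools) for t in tasks) / len(tasks)
        for m in models
    }

if __name__ == "__main__":
    print(pass_rates(
        models=["model-a", "model-b"],
        tasks=["reconcile-invoices", "triage-support-tickets"],
        instructions="Follow company policy.",
        tools=["sql", "email"],
    ))
```

The design point is simply controlled comparison: one variable changes per run, and the same deterministic verification scores every episode.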

Then, with the framework in place, I went to work creating what we envision becoming the enterprise standard for LLM and agentic AI benchmarking: the Kamiwaza Agentic Merit Index. Whether you call it KAMI or AMI, by any other name it would still be the amazing thing that it is: it shows how different models perform autonomously in real-world tests, thanks to Kamiwaza’s agentic AI simulation technology.

Of course, that endeavor is too great for just one science nerd like me - the spirit is willing, but the body, wallet, inference capability and available time are weak. So we sought out partners - and one of them was Signal65. These amazing folks were instrumental in getting KAMI v0.1 out. You can read their KAMI v0.1 whitepaper here. For a deeper dive into the design (as well as the research behind LLM benchmarking issues and why traditional benchmarks are so unreliable), see our KAMI v0.1 research paper here.

KAMI v0.1 Insights: Surprises Abound

Here are a few surprising things we found in KAMI v0.1. (For exhaustive coverage, refer to the papers linked above.)

  • The original Qwen3 models (those released in April 2025, not the July refresh) drastically underperformed their older Qwen2.5 counterparts.
  • Llama 3.1 and 3.3 70B were both slightly ahead of Qwen2.5 72B. (I’m sure Meta wishes the KAMI benchmark had existed way back then - it would have helped Llama 3’s image in the eyes of the community! As it was, the only benchmarks around kept showing Llama 3 in a bad light, when in fact those models were already very solid at autonomous enterprise work. So solid, in fact, that Llama 4 was not really any improvement!)
  • There were lots of instances where bigger models didn’t beat their smaller brethren.

When you measure the wrong thing, data-driven decisions will be bad

Whether we looked at tool-calling benchmarks, or an aggregate of popular LLM benchmarks, none of them were really any good at predicting the real-world performance we saw in KAMI. 
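To show what “not predictive” means in practice, here is an illustrative sketch. The model names and scores are made-up placeholders, not real KAMI or benchmark numbers; the method is the point: rank models by a traditional benchmark and by observed autonomous-task performance, then check whether the first ranking predicts the second.

```python
# Illustrative only: placeholder scores, not real KAMI or benchmark results.
from statistics import correlation  # Pearson; Python 3.10+

def ranks(xs: list[float]) -> list[float]:
    # Convert raw scores to ranks (1 = lowest); ties ignored for brevity.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

benchmark_scores = {"model-a": 78.0, "model-b": 71.0, "model-c": 85.0}  # placeholders
agentic_scores   = {"model-a": 0.62, "model-b": 0.66, "model-c": 0.41}  # placeholders

models = list(benchmark_scores)
rho = correlation(ranks([benchmark_scores[m] for m in models]),
                  ranks([agentic_scores[m] for m in models]))
# Pearson on ranks is Spearman's rho (no ties here). A value near zero or
# below means the benchmark ranking tells you little, or points the wrong way.
print(f"Spearman rank correlation: {rho:+.2f}")
```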

In other words - they would have misled you. You would have thought you were making good, data-driven decisions, but those decisions would actually be incredibly bad. This could have meant either far more expensive LLM deployments than you needed, or far less capable ones than you wanted, or - if you were really, really lucky! - you could have gotten both: far more expense, for far less capability. Hooray!

Wait, where did we hear this before? Oh yeah, Star Trek!

Like NBC, we were being told one thing by our old tools - traditional benchmarks, community sentiment, vibes. Nielsen ratings told NBC that Star Trek wasn’t performing well enough, when in fact it owned the most valuable demographic, 18–34. And here, it turns out a lot of pre-existing superstitions were far off and would have misled enterprise decision makers. Llama 3.3 70B is not as hopeless, and Qwen2.5 was not as all-conquering, as I had previously believed myself. And despite the absolutely amazing splash of the original Qwen3 release, it regressed heavily in autonomous enterprise performance compared to Qwen2.5 and Llama 3 (both older-generation LLMs). It took a quick refresh a few months later to bring it up to speed and make it excellent.

I’d say that’s surprising - but I guess that’s just what happens when measurement tools improve and you start measuring the thing you actually want and need, instead of just the thing that’s easy to measure.
