
Getting Started with AI Strategy
The competition is stiff among models that can be used for chatbots. With more than 100 models to choose from, where do you even start? My recommendation: Start with your highest priority and winnow down the list by other priorities.
Case in point: In my AI Strategy app, for the task of chat, I found five main metrics that most benchmarks measure (and document in their corresponding leaderboards). I’ve listed them below in descending order of importance for most projects I’ve worked on:
- Quality
- Cost
- Speed
- Latency
- Context window
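If you want to make that winnowing explicit, here’s a minimal sketch of the idea in Python. The model names, scores, and weights are placeholder values I made up for illustration, not real benchmark results.

```python
# Hypothetical shortlisting sketch: rank candidate models by weighted priorities.
# All numbers below are made-up placeholders, not real benchmark results.

# Normalized 0-1 scores per metric (higher is better; cost/latency are inverted upstream).
candidates = {
    "model-a": {"quality": 0.92, "cost": 0.40, "speed": 0.70, "latency": 0.65, "context": 0.80},
    "model-b": {"quality": 0.85, "cost": 0.75, "speed": 0.90, "latency": 0.80, "context": 0.60},
    "model-c": {"quality": 0.78, "cost": 0.95, "speed": 0.60, "latency": 0.70, "context": 0.90},
}

# Weights mirror the priority order above: quality first, context window last.
weights = {"quality": 0.40, "cost": 0.25, "speed": 0.15, "latency": 0.10, "context": 0.10}

def weighted_score(scores: dict[str, float]) -> float:
    return sum(weights[metric] * value for metric, value in scores.items())

ranked = sorted(candidates.items(), key=lambda kv: weighted_score(kv[1]), reverse=True)
for name, scores in ranked:
    print(f"{name}: {weighted_score(scores):.3f}")
```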

There’s a lot of info in this chart (you can learn more about how to use it in the post where I introduced it). Let’s take a minute to break the node types down by color:
- Pink: the central node that holds the network graph together
- Orange: the task, in this case chat
- Blue: the benchmark you want to compare models against
- Green: the leaderboard
- Gray: the individual benchmark included in that leaderboard
Note: Selecting a green leaderboard node will open a modal with more information about that node, including links, tips, and benchmark definitions. One of the biggest challenges I’ve seen with these leaderboards is that most leaderboard authors fail to make their benchmark definitions accessible. They’re usually buried deep in an arXiv paper or blog post, if a definition is included at all. But how can you successfully select a model if you don’t understand the benchmarks in the leaderboard?
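If it helps to think of the chart as data, here’s a minimal sketch of how that node/edge structure might be represented. The node names, edges, and metadata are illustrative stand-ins, not the AI Strategy app’s actual data model.

```python
# Illustrative sketch of the network-graph structure described above.
# Node names, edges, and metadata are stand-ins, not the app's actual data model.
nodes = [
    {"id": "ai-strategy",     "color": "pink",   "kind": "central"},
    {"id": "chat",            "color": "orange", "kind": "task"},
    {"id": "quality",         "color": "blue",   "kind": "benchmark"},
    {"id": "aa-leaderboard",  "color": "green",  "kind": "leaderboard"},
    {"id": "mmlu",            "color": "gray",   "kind": "individual benchmark"},
]

edges = [
    ("ai-strategy", "chat"),
    ("chat", "quality"),
    ("quality", "aa-leaderboard"),
    ("aa-leaderboard", "mmlu"),
]

# Selecting a green node would surface its metadata (links, tips, definitions).
leaderboard_meta = {
    "aa-leaderboard": {
        "links": ["https://example.com/leaderboard"],  # placeholder URL
        "tips": "Use the bubble charts to weigh quality against cost and speed.",
        "benchmark_definitions": {"mmlu": "Multiple-choice knowledge test across many subjects."},
    }
}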
Caveat About Model Providers
The fallout I’ve witnessed is that engineers and data scientists skip this step altogether and suggest models that are often poorly fitted for the function and/or too expensive to justify. Even when you think you’ve nailed down a model, the price and performance can vary wildly from one API provider to another (e.g., Google’s Vertex AI, Microsoft Azure, Amazon, etc.). For example, for Meta’s open Llama 3.3 Instruct 70B model, there are 16 API providers competing for your business, with prices at the time of writing ranging from $0.20 to $0.94 per 1M tokens and output speeds ranging from 24 to 2,195 output tokens/sec (source). It’s like the 🐢 vs. the 🐇, except in this case you want the hare.

By comparison, the DeepSeek R1 model has nine API providers competing, with prices at the time of writing ranging from $0.96 to $7.68 per 1M tokens and output speeds ranging from 9 to 64 output tokens/sec (source).
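To put that spread in concrete terms, here’s a rough back-of-the-envelope sketch using the Llama 3.3 figures above. The monthly token volume is a made-up assumption for illustration.

```python
# Back-of-the-envelope provider spread for Llama 3.3 Instruct 70B (figures from the text above).
# The monthly volume is an assumed workload, purely for illustration.
monthly_tokens = 500_000_000            # assumed: 500M tokens/month

cheapest_price, priciest_price = 0.20, 0.94   # USD per 1M tokens
slowest_speed, fastest_speed = 24, 2_195      # output tokens/sec

cheapest_bill = monthly_tokens / 1_000_000 * cheapest_price
priciest_bill = monthly_tokens / 1_000_000 * priciest_price
print(f"Monthly bill spread: ${cheapest_bill:,.0f} to ${priciest_bill:,.0f}")

# Time to stream a 1,000-token response at each extreme.
for label, speed in [("slowest", slowest_speed), ("fastest", fastest_speed)]:
    print(f"{label} provider: ~{1_000 / speed:.1f}s for a 1,000-token response")
```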

We’ll Start with Quality
Having addressed that sizable caveat, let’s start investigating some of your quality benchmark options.
Artificial Analysis’ Quality Benchmark
I generally like starting with the Artificial Analysis (AA) leaderboard for chatbots because it provides a variety of bubble charts, so you’re not just analyzing quality in a vacuum; you can see how various models measure up vis-à-vis other metrics, like cost or output speed.
AA’s quality benchmark calculates the average result across their evaluations covering different dimensions of model intelligence, which currently include:
- Massive Multitask Language Understanding (MMLU): Evaluates a model’s knowledge and reasoning across a wide range of academic and professional subjects via multiple-choice questions.
- Graduate-Level Google-Proof Question Answering (GPQA): Evaluates a model’s ability to answer multiple-choice questions written by domain experts in biology, physics, and chemistry.
- MATH-500: Evaluates a model’s ability to solve challenging competition-style math problems. (Note: Different leaderboards use different math benchmarks even though they may all just call them “math”…because apparently choosing a model isn’t complicated enough.)
- HumanEval: Evaluates a model’s ability to generate syntactically correct and functional Python code based on problem statements.
Depending on the type of chatbot you’re creating, AA’s Quality benchmark may be way more academic brawn than you need. 🤓📚
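Since AA’s headline quality number is just an average across those evaluations, the math is as simple as it sounds. Here’s a minimal sketch; the per-benchmark scores below are placeholders, not real results.

```python
# AA's quality score is described as the average across its intelligence evaluations.
# The per-benchmark scores below are placeholders, not real results.
evaluation_scores = {
    "MMLU": 0.86,
    "GPQA": 0.48,
    "MATH-500": 0.74,
    "HumanEval": 0.88,
}

quality_score = sum(evaluation_scores.values()) / len(evaluation_scores)
print(f"Quality (average of {len(evaluation_scores)} evals): {quality_score:.2f}")
```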
Chatbot Arena’s Quality Benchmarks
Chatbot Arena is an open platform for crowdsourced AI benchmarking, developed by researchers at UC Berkeley SkyLab and LMArena. With more than 1M user votes, the platform ranks LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards. By default, their tables are sorted by rank, but I like to use the Arena Score. #ymmv

You’ll find that quite a few leaderboards adopt this arena approach. Imagine a gladiator fight where, instead of two gladiators, you have two models battling it out. Whichever model wins a user’s vote gets credit for the win, and those wins add up over time.
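For the curious, here’s a minimal sketch of how pairwise “gladiator” votes can be turned into scores with a Bradley-Terry fit. The vote counts are fabricated for illustration, and this is not Chatbot Arena’s actual pipeline.

```python
# Minimal Bradley-Terry fit on fabricated pairwise vote counts (not Chatbot Arena's real pipeline).
# wins[a][b] = number of times model a beat model b in head-to-head votes.
wins = {
    "model-a": {"model-b": 60, "model-c": 70},
    "model-b": {"model-a": 40, "model-c": 55},
    "model-c": {"model-a": 30, "model-b": 45},
}
models = list(wins)
strength = {m: 1.0 for m in models}  # initial Bradley-Terry strengths

for _ in range(100):  # simple minorization-maximization iterations
    new_strength = {}
    for i in models:
        total_wins = sum(wins[i].values())
        denom = sum(
            (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
            for j in models if j != i
        )
        new_strength[i] = total_wins / denom
    norm = sum(new_strength.values())  # normalize so strengths sum to 1
    strength = {m: s / norm for m, s in new_strength.items()}

for model, s in sorted(strength.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {s:.3f}")
```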
I find that the last three columns are particularly insightful when analyzing a chatbot model’s performance:
- Instruction Following: Evaluates how well a model follows explicit user instructions.
- Longer Query: Evaluates a model’s effectiveness in handling and responding accurately to longer, more complex queries.
- Multi-Turn: Evaluates a model’s performance in multi-turn conversations, reflecting conversational consistency and coherence.
You can find a breakdown of their other benchmarks (which they refer to as ‘categories’) in this blog post. If there’s a particular benchmark that’s more important to you, you can sort the table by that benchmark by selecting its header.
HELM Lite
The HELM framework is the most expansive framework I’ve come across. Its ‘Lite’ leaderboard is one of 11 leaderboards—and is anything but light.

Warning: Some of the HELM leaderboards are very outdated, so always check the last-updated date in the upper-right corner.
One feature that’s quite extraordinary, imo, is that each score in the table links to the actual questions and responses from the evaluation. You can even see the number of tokens the model used, the runtime (in seconds), and other parameter settings (e.g., temperature, max_tokens), which are nested under ‘Request details’.
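To make ‘Request details’ concrete, those parameters map onto the same knobs you’d set when calling a model yourself. This is a generic, hypothetical payload for illustration, not HELM’s exact schema.

```python
# Hypothetical request payload illustrating the kinds of parameters HELM surfaces
# under 'Request details' -- not HELM's exact schema.
request_details = {
    "model": "example-provider/example-model",  # placeholder identifier
    "prompt": "Answer the following multiple-choice question...",
    "temperature": 0.0,   # deterministic decoding is common for evaluations
    "max_tokens": 512,    # cap on generated tokens
    "stop_sequences": ["\n\n"],
}

# HELM also reports per-request usage, e.g. token counts and runtime (illustrative values).
usage = {"prompt_tokens": 318, "completion_tokens": 42, "runtime_seconds": 1.7}
```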
There are many benchmarks to choose from in this leaderboard—including critical safety benchmarks—so I’d encourage you to check out that leaderboard in this filtered view of the AI Strategy app.
Note: You may need to untangle the nodes a bit first. I then use the orientation slider to center the main node so it’s not lopsided, but if your brain doesn’t require symmetry, you can ignore this step. ⚖️😌
As you can see in the leaderboard modal for HELM Lite, it features many benchmarks. This is the only modal where I added 🚩s, but it was just to differentiate the safety benchmarks from the rest. (The benchmark list is generated dynamically in alphabetical order, so I didn’t want to break that logic for one modal.)
Rinse and repeat with the rest of the quality benchmarks available to you.

We’ll Shift to Cost
Once you have a few models as frontrunners, you can use another benchmark to narrow down your final selection. So if it’s cost, for example, you may want to check out AA’s pricing charts. Its leaderboard breaks pricing down by input and output cost…

Or cached input prompts…
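If it helps, here’s how those three price dimensions combine into a per-request cost. All per-token prices and token counts below are made-up placeholders.

```python
# Blended per-request cost from input, cached-input, and output pricing.
# Prices and token counts are made-up placeholders for illustration.
price_per_1m = {"input": 3.00, "cached_input": 0.75, "output": 15.00}  # USD per 1M tokens

def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    return (
        input_tokens * price_per_1m["input"]
        + cached_tokens * price_per_1m["cached_input"]
        + output_tokens * price_per_1m["output"]
    ) / 1_000_000

# Example: a long system prompt that is mostly served from cache.
print(f"${request_cost(input_tokens=2_000, cached_tokens=6_000, output_tokens=800):.4f}")
```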

Quick Sanity Check on Other Benchmarks
It’s still worth perusing other benchmarks, even if they’re not top of mind, because they could reveal insights you’d otherwise miss, like the fact that DeepSeek’s R1 model has far higher latency than comparable models.

Even if speed isn’t a top priority, an app saddled with longer lag times could be more prone to churn. Once I’m down to two models, I like to fire up the Vellum leaderboard because it allows you to easily compare two models across a panoply of options. (KLU’s leaderboard also offers this option, but it isn’t updated as often. That said, if you’re reading this post around the publish date, they just updated their leaderboard, so I’d check it out.)

Summary
Selecting a model requires significant research to ensure you’re getting the most bang for your buck and aren’t surprised by gotchas later in the development process, when it’s more difficult to course correct.