
So you want to build an AI app that users can use to generate images. Maybe these images are architecture diagrams from your codebase, tattoo designs, marketing assets, fashion designs for an ecommerce site, or images for whimsical fun. Whatever the purpose, there are lots of models to choose from, and it can be intimidating to know where to start.
Start with Strategy
I built an app to help with this process. You can access it filtered for the ‘Generate images’ task.

Disregard the fact that it looks like a figure-skating stick chick. That will probably never happen again in my lifetime. You can learn how to use this app in my original blog post where I announced it.
Also, I checked out more image leaderboards, but some are just too outdated to include. For example, the HEIM leaderboard by HELM hasn’t been updated since August 2023, rendering it useless. I also checked out Labelbox’s image leaderboard, but it’s both outdated and sparse. However, it’s the only one I’ve found to date that includes Google’s elusive Imagen model.

Artificial Analysis Leaderboard
What I recommend is starting with quality (of course, if that’s your highest priority) and then looking at other metrics with quality in mind. My fave leaderboard for text-to-image generation is the Artificial Analysis Text-to-Image AI Model & Provider leaderboard (AA). It has quite a few bar charts and line charts, which provide a more one-dimensional view, but it also includes a generous number of bubble charts that combine quality, price, and speed in a variety of views, which makes decisions easier, imo.
Let’s check out their Quality vs. Price chart.

Note: As I alluded to earlier, at the time of writing, Google’s Imagen models aren’t included in either of the leaderboards in my app. Hopefully that will change in time. You can view their API pricing page, but for context, their Imagen 3 model comes out to $40 per 1,000 images, which is how the AA leaderboard compares pricing. That puts it right smack dab in the middle of the x-axis of their Quality vs. Price graph.
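If you want to put a provider’s per-image pricing on the same footing as AA’s per-1,000-images axis, the conversion is trivial. Here’s a quick sketch; the $0.04 per image comes from the $40-per-1,000 figure above, and any other entries you add would be your own quotes:

```python
# Normalize per-image API pricing to the "$ per 1,000 images" scale AA uses.
def price_per_1k_images(price_per_image_usd: float) -> float:
    return price_per_image_usd * 1_000

# Imagen 3 works out to $0.04 per image, i.e. $40 per 1,000 images.
print(price_per_1k_images(0.04))  # -> 40.0
```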
Another cool feature of the AA bubble charts is that they highlight the ideal quadrant with a green background and the least attractive quadrant with gray. DALL-E 3 HD is desperately trying to avoid that lava pit.
When you compare the FLUX.1 [dev] model (an open-source model developed by Black Forest Labs, with API provider Nebius) to DALL-E 3 HD or Ideogram v2, it’s less than 1/10th the price with a higher quality score (although it just barely edges out the Ideogram v2 model).
It should be noted that both leaderboards included in my app measure quality using Elo scores, a rating system borrowed from gaming. Traditionally, it’s used to measure the relative skill levels of players in competitive games, like chess, where points are gained or lost based on the outcome of matches and the opponent’s rating. Many video games have adopted this system as well. It just happens to work well for ranking text-to-image and text-to-video models because users are enthusiastic about participating in these competitions. (They’re fun, if you haven’t tried one. You can try your hand at AA’s arena here or imgsys’s here.)
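If you’re curious what’s happening under the hood of those arena rankings, here’s a minimal sketch of the textbook Elo update for a single head-to-head “match” between two models. The leaderboards may use their own K-factor, tie handling, or a Bradley–Terry-style variant, so treat this as illustrative only:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one head-to-head comparison."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Example: the lower-rated model wins one arena vote and gains points.
print(elo_update(1000, 1100, a_won=True))  # -> roughly (1020.5, 1079.5)
```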
I suspect Labelbox’s leaderboard is sparse because its arena just didn’t gain traction. These leaderboards also usually require a model to appear in a minimum number of competitions before it’s added to the leaderboard.
Anyway, back to data…
For this particular graph, AA also indicates the relative speed of a model using node size. The smaller the node, the lower the generation time. I really wish these speed values were also surfaced when you hover over a node. However, you can look at other combinations of bubble charts with the models you select from the filter in the upper-right corner of any one of these charts. And you really want to check out those other models.

imgsys Leaderboard
At first blush, the imgsys leaderboard appears to be a basic rank chart. However, they shoehorn a lot of features into the chart. You can get more details by clicking on the imgsys node in the AI Strategy app. (You’ll need to scroll to view all the tips.)
That said, I’ll break down some of the highlight features below.
- Tooltip: Click to open a modal with model details, such as the number of inference steps the model takes to generate an image (higher generally means higher quality but also higher latency; see the sketch after this list), image size, and safety tolerance. Some models have more details than others, but if the model provider makes this data available, this is where you’ll find it. It also includes a link to the model’s playground page, where you can take it for a test drive (seriously, so cool!), as well as really helpful API, Requests, and Analytics tabs. This modal basically leaves no stone unturned.
- Appearance count: Includes the number of times the model made it into the arena.
- Stats: Opening this modal allows you to compare the selected model against any other model in the leaderboard.
- Playground: This is a sandbox that allows you to see how the model responds to a prompt in real time.
- Link: This is a link to the model’s homepage, Hugging Face page, etc.
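To make that inference-steps tradeoff concrete, here’s a minimal local sketch using the FLUX.1 [dev] weights via Hugging Face’s diffusers library. That’s my assumption for illustration, not something imgsys requires; the hosted playgrounds do the same thing behind the scenes with their own defaults, and running this yourself needs a hefty GPU plus the model license accepted on Hugging Face:

```python
import time
import torch
from diffusers import FluxPipeline

# Load FLUX.1 [dev] locally (large GPU required; accept the license on Hugging Face first).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "an isometric architecture diagram of a three-tier web app, clean vector style"

# Same prompt, different step counts: more steps usually means more detail but higher latency.
for steps in (10, 25, 50):
    start = time.perf_counter()
    image = pipe(prompt, num_inference_steps=steps, guidance_scale=3.5).images[0]
    print(f"{steps} steps took {time.perf_counter() - start:.1f}s")
    image.save(f"flux_dev_{steps}_steps.png")
```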




Conclusion
Between these two leaderboards alone, you are given many opportunities to thoroughly research text-to-image models before incorporating them into your app. However, I still recommend curating a list of your top three models and taking them each for a spin in your app to see how they work with the constraints of your project. Each model may perform differently depending on factors like prompt complexity, image resolution, and style consistency. Testing them in real-world scenarios will help you identify strengths and weaknesses that aren’t always apparent from leaderboard rankings alone. Also, pay close attention to generation speed, output quality, and any potential biases that could affect your use case. By experimenting with a few top contenders, you’ll gain valuable insights into which model aligns best with your app’s needs and delivers the most reliable results.
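If it helps, here’s the rough shape of that bake-off as a tiny harness. The generate_image function is a deliberate placeholder (every provider’s client looks different); the idea is just to push the same prompts through each shortlisted model and capture latency alongside the outputs so you can compare them side by side:

```python
import time
from pathlib import Path

# Placeholder: wire this up to each provider's real client or API.
def generate_image(model: str, prompt: str) -> bytes:
    raise NotImplementedError(f"hook up the API client for {model}")

models = ["FLUX.1 [dev]", "Ideogram v2", "DALL-E 3 HD"]  # your shortlist
prompts = [
    "architecture diagram of a microservices checkout flow",
    "minimalist line-art tattoo of a hummingbird",
    "hero image for a summer sneaker campaign",
]

out_dir = Path("bakeoff")
out_dir.mkdir(exist_ok=True)

for model in models:
    for i, prompt in enumerate(prompts):
        start = time.perf_counter()
        image_bytes = generate_image(model, prompt)
        elapsed = time.perf_counter() - start
        (out_dir / f"{model.replace(' ', '_')}_{i}.png").write_bytes(image_bytes)
        print(f"{model} | prompt {i} | {elapsed:.1f}s")
```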
Image credit: jhin5