
I’ve had the pleasure of working with quite a few teams now on AI projects spanning five industries. I’ve seen projects be wildly successful, and I’ve seen them crash on the rocks of unmet client expectations. Across projects and industries, I’ve noticed some common denominators that lead to failure. I’ll highlight the more egregious issues here.
Not Working from a Single Source of Truth
To be completely frank, this is the issue that has surprised me the most, because in my opinion it’s the easiest to solve for. Agencies, in particular, seem to adore PowerPoint presentations. On one project I worked on last year, the first three months were consumed by creating PowerPoint presentations. At the risk of sounding unforgivably self-aggrandizing, in that same amount of time after bouncing from that project, I created three apps working solo. And I’m releasing my fourth, for machine learning strategy, next week. Sneak peek below.

Anyway, back to our regularly scheduled program…
Besides the obvious risk of losing the client’s interest with a bombardment of deliverables describing what we were fixin’ to do, this approach meant that internally we’d often be working from different documents because there were so many in play. We’d also have multiple versions of the same draft that we’d then need to sync up. Sure, there were the MSA and SOW, but it’s easy to tweak a presentation to suit the preferences of one stakeholder, only to have another stakeholder contradict that marching order when said PowerPoint is presented.
The internal result, at times, was a lack of direction, as well as work redone at the 11th hour to realign it with the latest marching orders.
Given my gravitational pull toward gathering requirements into interactive apps, I created a project management app for one project that was starting to go off the rails. Hovering over a node in the graph provided rudimentary notes, and clicking it filtered the table to that particular category, subcategory, or deliverable and surfaced far more detail. (Not surprisingly, if you’ve followed my work: if a project has a natural hierarchy to it, I’m using a network graph, treemap, sunburst chart, or flowchart to corral it into some kind of order.)

Even if you don’t have someone on your team who can create interactive apps that everyone can follow along with (like yours truly), I highly recommend having a single source of truth that’s ideally more visually interesting than a spreadsheet and less obtrusive than a rambling PowerPoint presentation, to ensure your project remains a transparent, well-oiled machine. That said, creating a lightweight app is quite straightforward; it communicates finesse to stakeholders and keeps them engaged. Most importantly, it’s transparent, which can go a long way toward managing expectations and avoiding legal threats for breach of contract.
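If you’re wondering how lightweight “lightweight” can be, here’s a minimal sketch of a filterable status table using Streamlit; the categories, deliverables, and column names are hypothetical, and a network graph, treemap, or flowchart could just as easily drive the same filter.

```python
# Minimal single-source-of-truth tracker sketch (hypothetical project data).
# Assumes Streamlit and pandas are installed: pip install streamlit pandas
import pandas as pd
import streamlit as st

# Illustrative hierarchy: category -> subcategory -> deliverable
data = pd.DataFrame(
    [
        {"category": "Data", "subcategory": "Ingestion", "deliverable": "API connector", "status": "In progress", "owner": "DS"},
        {"category": "Data", "subcategory": "Cleanup", "deliverable": "dbt models", "status": "Not started", "owner": "DS"},
        {"category": "App", "subcategory": "Backend", "deliverable": "Streaming endpoint", "status": "In review", "owner": "Eng"},
        {"category": "App", "subcategory": "Frontend", "deliverable": "Chat UI", "status": "In progress", "owner": "Eng"},
    ]
)

st.title("Project status: single source of truth")

# Filter the table by category, mirroring the click-to-filter behavior
# described above (a clickable network graph could drive the same filter).
category = st.selectbox("Category", ["All"] + sorted(data["category"].unique()))
filtered = data if category == "All" else data[data["category"] == category]

st.dataframe(filtered, use_container_width=True)
```

Run it with `streamlit run app.py` and everyone on the team is looking at the same, current picture of the project.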
Not Clearly Communicating Ramifications of Resource Decisions
For the first few projects I worked on, for reasons I found baffling, the team failed to disclose to the client that they wouldn’t be enabling streaming for the minimum viable product (MVP) version of the deliverable. In the context of AI apps, streaming refers to transmitting a response incrementally rather than waiting for the entire response to be generated before delivering it to the user. This allows for real-time or near-real-time interactions, which improves responsiveness and user experience.
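For context, here’s a minimal sketch of what streaming looks like on the client side, assuming the OpenAI Python SDK; most providers follow a similar pattern, and the model name here is just an example.

```python
# Streaming sketch: tokens arrive incrementally instead of in one final payload.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; swap in whatever your project uses
    messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
    stream=True,  # this flag is the difference users feel
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render tokens as they arrive
```

The API call is nearly identical either way; the work is usually in the backend and frontend plumbing needed to pass those chunks through to the user.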
Engineers and data scientists on these projects underestimated client expectations with respect to the performance of AI apps in their embryonic stages of development. Invariably, when the client excitedly got their hands on the MVP app to test internally, their reaction reminded me of this adorable video.
However, clients are infinitely less endearing after they experience what feels to them like a bait and switch. Technically, no SOW was violated. Because clients often don’t realize that the streaming they experience in most AI apps isn’t a given, they don’t know to request the feature. But when they aren’t given the chance to decide whether to enable it (which may require backend adjustments to support), the disappointment of sometimes waiting 15+ seconds for a response taints the rest of the project and contributes to a more strained, us-against-them dynamic.
Using the Wrong Model for the Job
If you follow my blog, you know I talk a lot about AI strategy. To this end, I’ve created two tools (with a third on the way) to aid AI practitioners, because the industry, as well as its many technologies, is moving at a breakneck pace. The reason I created the AI Strategy app, in particular, is that the leaderboards that contain performance metrics are nearly impossible to interpret. For one, benchmark providers often bury their benchmark definitions deep in an arXiv paper or an unlinked blog post, if they include them at all. With benchmark names like TAL-SCQ5K-EN, Dolphin1878, and GSM8K, my Google Spreadsheet quickly became insufficient for chasing down these metric definitions. I’ve highlighted a few below, but they’re a small sample of a trend I’ve seen across leaderboards.

Every once in a while, a leaderboard will provide tooltips with benchmark definitions on hover, but it’s very rare. Kudos, Vellum!

My Machine Learning Strategy app features two different types of tooltips and modals. Here’s just one example that’s good for providing high-level definitions to guide users.
What I’ve seen as a result of the difficulty of navigating these leaderboards is an over-dependence on models with more brawn (and hence a higher price tag) than a particular task calls for, or the converse (i.e., a model that’s not capable enough for the task).
Case in point: At the time of writing, the FLUX.1 [dev] text-to-image model offered by API provider Nebius is reported by Artificial Analysis (AA) as having a price of $7 per 1,000 images generated, compared to Ideogram v2 and DALLE 3 HD’s price of $80. It also edged out both of their quality scores, and it’s a faster model, as indicated by the size of its node in comparison to theirs. So it’s less than 10% of the cost, with a higher Elo score, and faster.

Note: AA also used to include Midjourney on this graph, with a caveat that it doesn’t have an API, but they have since removed it. It’s not even an option in the model filter in the upper-right corner (which you should definitely check out, since their charts only show a small fraction of the available models in the default view). However, when Midjourney was included, it consistently performed very similarly to Ideogram and DALLE in cost, quality, and speed.
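If you want to sanity-check that “less than 10%” claim, here’s the back-of-the-envelope version using the prices quoted above.

```python
# Relative cost check using the Artificial Analysis prices quoted above
# (USD per 1,000 generated images at the time of writing).
flux_dev = 7.0          # FLUX.1 [dev] via Nebius
ideogram_dalle = 80.0   # Ideogram v2 / DALLE 3 HD

ratio = flux_dev / ideogram_dalle
print(f"FLUX.1 [dev] costs {ratio:.2%} of the pricier models")  # 8.75%, i.e., under 10%
```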
Now let’s say you’re trying to decide between Google’s Gemini 1.5 Pro model vs Meta’s Llama 3.1 405B model and fire up the KLU leaderboard.
For most tasks represented in the KLU leaderboard, Llama 3.1 wins out, yet Gemini 1.5 Pro is more than 4x the cost.

Tip: If you just want to find the definition of a benchmark in this comparison widget, you can search the AI Strategy app and activate the toggle to include all tasks in the search. Any node with a search result will be highlighted, and the keyword will be highlighted in leaderboard modals for ease of scanning.

Of course, it’s incumbent upon you to verify all of these metrics, but the KLU leaderboard was updated the day I wrote this post, so at least their data wasn’t stale (which can be an issue with these leaderboards).
Not Using a Dialog Management System
Over-reliance on language models for tasks like routing a user’s intent or managing conversation flows for more complex apps is oftentimes an expensive trap. This is where a dialog management system (DMS) like Rasa or Dialogflow can be a boon.
For example, let’s say your app helps users book flights. Instead of asking an LLM to determine if the user wants to search flights, book a ticket, or ask for a refund, a DMS can route those intents with rules, machine learning models, or AI. This approach is oftentimes faster, cheaper, more traceable, and less prone to error and dead-ends. Keep LLMs for tasks that genuinely require deep understanding, and let a dedicated system handle the rest.
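To make that concrete, here’s a minimal, purely illustrative sketch of rule-based intent routing in Python. It isn’t Rasa or Dialogflow syntax; real DMS platforms layer trained NLU models, slots, and conversation flows on top of this basic idea, and the intents and patterns below are hypothetical.

```python
# Illustrative rule-based intent router for a flight-booking assistant.
import re

# Hypothetical intents and the keyword patterns that route to them.
INTENT_RULES = {
    "search_flights": re.compile(r"\b(find|search|look for)\b.*\bflights?\b", re.I),
    "book_ticket": re.compile(r"\b(book|buy|purchase)\b.*\b(ticket|flight)\b", re.I),
    "request_refund": re.compile(r"\b(refund|money back|cancel)\b", re.I),
}

def route_intent(utterance: str) -> str:
    """Return the first matching intent; only ambiguous cases fall through to the LLM."""
    for intent, pattern in INTENT_RULES.items():
        if pattern.search(utterance):
            return intent
    return "fallback_to_llm"

if __name__ == "__main__":
    print(route_intent("I want to book a flight to Lisbon"))   # book_ticket
    print(route_intent("Can I get a refund for my ticket?"))   # request_refund
    print(route_intent("What is your baggage policy?"))        # fallback_to_llm
```

The point isn’t that regexes are the answer; it’s that the cheap, deterministic layer handles the predictable traffic, and the LLM only sees what genuinely needs it.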
There are also risks associated with Jinja2 formatting if your developers aren’t experienced with it. To wit, one AI app I worked on was behaving very strangely and communicating with odd turns of phrase. So I got the keys to the kingdom and audited the prompt instructions. There were formatting issues all through the Jinja2 template files because apostrophes weren’t escaped, garbling the instructions that were supposed to tell the LLM how to help students through math problems Socratically (which is tough for models because they just want to give you the answer).

Side note: We could’ve turned the references it made to “Defeats fuel victories!” into a drinking game. I found that ill-fated prompt instruction and kicked it to the curb because it made the tool sound like a colossal dork.
Another issue was that options which should’ve been handled with variables on the backend were hardcoded into the template files. It was a hot mess. Whatever money we saved the client by skipping a DMS was surely offset by lost productivity and client consternation.
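For contrast, here’s a minimal sketch of the variables-on-the-backend approach with Jinja2; the template text, option names, and question are all hypothetical.

```python
# Keep prompt options in backend variables instead of hardcoding them into templates.
from jinja2 import Template

PROMPT_TEMPLATE = Template(
    "You are a math tutor. Guide the student Socratically; don't give the answer outright.\n"
    "The student may choose one of these options: {{ options | join(', ') }}.\n"
    "Student's question: {{ question }}"
)

# Options live in backend config, not in the template file itself,
# so they can change without anyone touching the template.
options = ["show a hint", "check my work", "try a similar problem"]

prompt = PROMPT_TEMPLATE.render(
    options=options,
    question="Why isn't x = 3 a solution to 2x + 1 = 9?",  # apostrophes in variables render cleanly
)
print(prompt)
```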
If you do go the DIY route, at minimum, factor in an auditing step by someone who’s very comfortable with engineering backend logic, creating template files, and diagramming intuitive and inclusive flows.
Not Prioritizing the Role Data Cleanup Plays in the Project
I’ve witnessed both the tendency for project owners to be too concerned with the state of their data and the tendency to sweep critical issues under the rug, with a greater lean toward the former. As a data aficionado, you’d think I’d be more concerned about the state of the union with regard to a client’s data. However, whether they’re using a data transformation solution like dbt (Data Build Tool), Apache Spark, or Fivetran, or just haphazardly joining data in their data storage platform (something I liken to building a website without CSS), it’s pretty straightforward to introduce a data transformation layer that gets data from different sources, such as data lakes, databases, and APIs, to align.
I’ve only seen one situation that was truly unworkable because the client had a custom storage platform that was quite outdated and didn’t have an API, making it an island. But building an API is surprisingly straightforward, especially if you only need a few endpoints for your app. (I know because I had to for one of my courses.)
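As an illustration of how little code “a few endpoints” can be, here’s a minimal FastAPI sketch; the records and routes are hypothetical stand-ins for whatever a legacy platform actually stores.

```python
# Minimal API bridge over a legacy data source (hypothetical data and routes).
# Assumes FastAPI and uvicorn are installed: pip install fastapi uvicorn
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Legacy data bridge")

# Stand-in for whatever query layer reaches the legacy storage platform.
FAKE_RECORDS = {
    1: {"id": 1, "name": "Widget A", "updated": "2024-05-01"},
    2: {"id": 2, "name": "Widget B", "updated": "2024-05-03"},
}

@app.get("/records")
def list_records():
    """Return all records so downstream transforms (dbt, Spark, etc.) can pull them."""
    return list(FAKE_RECORDS.values())

@app.get("/records/{record_id}")
def get_record(record_id: int):
    """Return a single record by id."""
    record = FAKE_RECORDS.get(record_id)
    if record is None:
        raise HTTPException(status_code=404, detail="Record not found")
    return record

# Run locally with: uvicorn main:app --reload
```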
In my portfolio I demo a simple pipeline I created in dbt using publicly available data (since I can’t demo any with proprietary data).
If your app combines data from different sources, or you’ve ever had merging issues because dates were formatted as strings (looking at you, Google Analytics), having data that’s not runway-ready is a speed bump, not a roadblock.
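Here’s a minimal pandas sketch of that kind of speed bump: two sources whose string dates won’t join cleanly until they’re parsed into real datetimes. The column names and formats are hypothetical.

```python
# Normalize string dates from two sources before merging (hypothetical columns).
import pandas as pd

analytics = pd.DataFrame({"date": ["20240501", "20240502"], "sessions": [120, 98]})
revenue = pd.DataFrame({"date": ["2024-05-01", "2024-05-02"], "revenue": [1500.0, 1320.0]})

# Parse both date columns into real datetimes so the join keys actually match.
analytics["date"] = pd.to_datetime(analytics["date"], format="%Y%m%d")
revenue["date"] = pd.to_datetime(revenue["date"], format="%Y-%m-%d")

merged = analytics.merge(revenue, on="date", how="inner")
print(merged)
```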
That said, getting a client to prioritize the cleanup steps necessary to get these data sources flowing together, using data definitions the team can agree on (e.g., how to handle null values, how to format dates, whether to use floats or integers for a particular metric, how to define segments, what you want users to be able to filter by in the app, etc.), can prove challenging when upper management is twitching to get something into the hands of their end users.
Kowtowing to Senior Management
This is a problem that’s not easy to solve for. Leadership is necessary, and at the end of the day someone needs to make a final decision. However, how those decisions are approached can make all the difference in how well the team works together, and ultimately how well the app works.
I had the privilege of being part of one project last year that was led so expertly, each person on the team was either comfortable with contributing or gently nudged to contribute. As a shy extrovert, I’m not a big talker in meetings, especially if I see tense dynamics. I’ll speak up if I feel like I have an idea that can ease someone’s burden or if I feel like a particular direction is putting the project at risk. Otherwise, I’d rather be behind the scenes in more of a support role, researching and coding.
Except with this one team.
One thing I loved about the team lead was her natural curiosity. Where I sometimes witnessed other team leads quell opposition and quash those who shared alternate perspectives, she would ask questions. One contribution I’d make was to explain some of the headier aspects of AI in simpler terms, to make it more approachable. I can say without hyperbole this team lead should be studied by the company she works for because her leadership style was a significant contributor to what was the most successful AI project I’ve worked on to date. The client was thrilled with our work, our team worked together swimmingly, and everyone contributed because it was safe.
One habit she had, if someone tentatively tossed out an idea, was that she’d tilt her head curiously and tell them, “Say more.” I thought it was a clever way to give someone a green light to share more details about their idea. And if she exercised her power of line-item veto, she was careful to do it in such a way as not to quash that person’s future participation. I never witnessed even a scintilla of backlash or retaliation. Instead, she’d say things like, “Maybe we could consider that for MLP (most lovable product).”
I personally believe the best way to ensure this dynamic doesn’t thwart your AI project is to monitor the morale of the team. It’s usually painfully obvious when a team is suffering under overbearing managers or team members: just look at the diversity of feedback in a meeting and at how talk time is distributed among members (a data point that AI-powered meeting platforms will often provide).
I would go so far as to say this factor has, more often than not, correlated with how smoothly an AI project progressed because, in my experience, team leads who inspire creativity and research internally also tend to be more collaborative and less cagey with the stakeholders they answer to.