Here’s how many AI projects start (and never recover):

  • Wow, that’s a great use case. What do we need for it?

  • A data pipeline, some API connections, and about $50K in dev budget.

  • How long?

  • About 8 weeks.

  • Ok, let’s try. Give me an update in 4.

3 months later:

  • The pipeline works like a charm. Everyone loves the demo.

  • Can we ship to production?

  • No, the AI is not good enough yet.

  • When will it be good enough?

  • We’re working on it.

The project doesn’t survive the next budget meeting.

The good news: this is completely avoidable – by testing the AI first and building the infrastructure later.

Why This Happens

A classic IT integration project has everything a project manager loves: clear scope, predictable budget, defined deliverables ("connect system X to system Y with specs ABC"). You can track everything nicely in Jira. It’s easy to compare different vendor quotes.

Beautiful!

The question "Will AI perform well enough on our data?" has none of that. It’s ambiguous, hard to plan, and the answer might be "no".

Nobody wants to put “we don’t know yet” on a management slide. So the uncomfortable question often gets skipped in favor of the more comfortable work.

The irony is that the integration and “plumbing” work carries around 80% of the overall project effort. In other words: you might spend 80% of the budget only to find out at the end that the AI solution doesn’t work as expected.

So how can you avoid that?

It's All Just Text to an API

Let’s look at what AI actually does at the lowest level.

Every AI solution does the same thing: take a bunch of data (usually text, but it might be images, video, or audio as well), send it to an API where it gets processed by a huge mathematical equation, and fetch the output – which, in the case of LLMs, is another piece of text. Repeat.

You can add bells and whistles to this process – for example, the generated text can be interpreted as an API call to another service that triggers an action (agentic workflows) – but in the end, that’s about it. Whether the text output is used for a classification, a summary, an extraction, or a draft doesn’t matter.

The key insight here is that the delivery mechanism changes nothing about AI performance. In other words: the AI does not care whether you hand-typed that text, copied it from an email, or piped it through an enterprise integration – the AI model sees the same thing. It doesn't know. It doesn't care.

This means you can simulate any AI workflow by manually assembling the input and sending it to a model – testing the thing that actually matters: the AI’s ability to produce useful output from your data.

The plumbing only decides how the text gets there. It has zero effect on whether the AI can do the job.
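
To see why, here’s a minimal sketch of that loop in Python – assuming the OpenAI SDK, with an illustrative model name and prompt. A hand-pasted email string gets exactly the same treatment the model would give text delivered by a full enterprise integration:

```python
# A hand-assembled email hits the model exactly like a piped-in one would.
# Sketch using the OpenAI Python SDK; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

email_text = """Subject: Order #4921 never arrived
Hi, I placed my order two weeks ago and still have nothing. Please help."""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Classify this support email as 'urgent' or 'not urgent'. "
                    "Answer with exactly one of those two labels."},
        {"role": "user", "content": email_text},
    ],
)

print(response.choices[0].message.content)  # e.g. "urgent"
```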

Example: Email Triage

Let's make this more tangible with a concrete example.

Imagine you want to build a classification system for your customer support inbox. (Say you have something like support@ that is managed by a team of people.) The goal: an AI classifies incoming emails and moves them into different folders – say, urgent vs. non-urgent.

Here's the way many companies would approach this (and spend weeks of work and thousands in budget along the way):

  1. Scope the integration

  2. Get access to the Microsoft Exchange Server

  3. Figure out folder management on that platform

  4. Build an automation with something like Power Platform that can move emails around

  5. Find out that there are a ton of issues with that

  6. In week 3 you're finally able to move emails around

  7. You test the AI performance for the first time

  8. You realize the performance is not good enough – the model confuses "urgent" with "non-urgent" because it lacked context, like whether someone already replied

Here's how to approach this with a test-first mindset instead – in about a day:

  1. Define a fixed acceptance baseline upfront. What accuracy does the AI need to hit to actually improve the process? (e.g., 90% correct classification at a given cost cap)

  2. Pull 100 representative emails from your inbox

  3. Export as raw .txt files (ignore attachments for now)

  4. Ingest them with a workflow automation tool like n8n, using a plain file read rather than a live inbox connection

  5. Define your classification buckets (such as "urgent" and "not urgent")

  6. Run the automation. See what the AI does with your actual data.

  7. Evaluate the performance. What number are you hitting?

  8. Try different prompts, models, or bucket definitions until performance hits your threshold

  9. Make the ship-or-stop decision based on that number

When you mock the input (static text files + n8n) instead of building the pipeline (live email integration), the same data hits the model. It gives the same output, just at a fraction of the effort.

Simple eval workflow in n8n
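
If you’d rather script the experiment than click it together in n8n, the same loop fits in a short Python script. Everything here – the emails/ folder, the labels.csv you hand-labeled, the model, the prompt – is an assumption for illustration:

```python
# Test-first eval loop: classify exported .txt emails, compare to hand labels.
# Assumes emails/*.txt and a labels.csv with columns: filename,label
import csv
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify this support email as 'urgent' or 'not urgent'. "
                        "Answer with exactly one of those two labels."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

# Ground truth from your manual labeling pass
with open("labels.csv", newline="") as f:
    labels = {row["filename"]: row["label"] for row in csv.DictReader(f)}

correct = 0
for path in sorted(Path("emails").glob("*.txt")):
    prediction = classify(path.read_text())
    correct += prediction == labels[path.name]

print(f"Accuracy: {correct / len(labels):.0%}")  # compare to your baseline
```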

Evaluating Your Solution Correctly

Even when you do the test-first version above, there's a high chance someone might run this experiment as a quick test, say "the AI got most of them right" and move on.

That's not enough for a business case. You need a metric – a proper validation with defensible numbers.

Typical numbers you'd look at in a classification scenario:

  • Accuracy: Out of all emails classified, how many did the AI sort into the correct bucket? (e.g., 47 out of 50 = 94%)

  • Precision: Of all emails the AI labeled "urgent," how many were actually urgent? High precision means few false alarms.

  • Recall: Of all emails that were actually urgent, how many did the AI catch? High recall means nothing critical slips through.

Why does the distinction matter? An AI with high precision but low recall might only flag 5 urgent emails – but gets all 5 right. Great, except there were 20 urgent emails it missed entirely. Depending on your use case, missing urgent emails might be worse than having a few false positives.
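
All three numbers fall out of simple counting. A quick sketch with made-up labels, so you can verify the arithmetic by hand:

```python
# Accuracy, precision, and recall for the "urgent" class (toy data).
truth = ["urgent", "not urgent", "urgent", "urgent", "not urgent"]
preds = ["urgent", "not urgent", "not urgent", "urgent", "not urgent"]

tp = sum(t == "urgent" and p == "urgent" for t, p in zip(truth, preds))      # caught
fp = sum(t == "not urgent" and p == "urgent" for t, p in zip(truth, preds))  # false alarms
fn = sum(t == "urgent" and p == "not urgent" for t, p in zip(truth, preds))  # missed

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%}")
# accuracy=80%, precision=100% (no false alarms), recall=67% (one urgent email missed)
```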

For more complex outputs – like email summaries, drafted replies, or Q&A answers – there's no simple "right or wrong." You could look at:

  • Relevance: Does the output contain the information that matters?

  • Completeness: Did it capture all key points, or miss critical details?

  • Factual accuracy: Did it hallucinate or misrepresent anything from the source?

It’s common to use another LLM here to evaluate the results – a technique called “LLM-as-a-judge”. You define your quality criteria, feed the output and the source to a second model, and let it score. It’s not perfect, but it’s scalable and surprisingly consistent once calibrated correctly.
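
Here’s a minimal sketch of what such a judge could look like – again assuming the OpenAI SDK; the prompt, model, and scoring scale are illustrative, and you’d spot-check its scores against human judgments before trusting them:

```python
# Minimal LLM-as-a-judge sketch: a second model scores a generated summary
# against its source on your quality criteria. Prompt and model are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI-generated summary against its source email.
Score each criterion from 1 (poor) to 5 (excellent):
- relevance: does it contain the information that matters?
- completeness: did it capture all key points?
- factual_accuracy: is everything supported by the source?
Reply as JSON: {"relevance": n, "completeness": n, "factual_accuracy": n}"""

def judge(source: str, summary: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"SOURCE:\n{source}\n\nSUMMARY:\n{summary}"},
        ],
    )
    return response.choices[0].message.content  # parse the JSON scores downstream

print(judge("Customer reports order #4921 missing for two weeks.",
            "Customer asks about a missing order."))
```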

In any case, you need a baseline to compare against – the minimum performance threshold you defined before you started. Then three things can happen:

  1. Performance sits well above your baseline → Green light. The AI works on your data. You have a number for the business case. Go ahead, build the plumbing.

  2. Performance is around your baseline → Promising, but not ready. Tweak prompts, try different models, adjust bucket definitions, get performance up or cost down. This iteration is still cheap and should take days, not weeks.

  3. Performance is well below your baseline → You just saved yourself months of work and thousands in budget. Park the project for now and revisit when better models become available – unless the business case is large enough to justify more experimentation (or even custom development).
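
Spelled out as code, that gate is almost trivially simple – which is exactly the point. A sketch with illustrative thresholds (use the baseline you fixed before the experiment):

```python
# The ship-or-stop gate. Thresholds are examples, not recommendations.
BASELINE = 0.90      # minimum acceptable accuracy, defined upfront
MARGIN = 0.03        # separates "well above/below" from "around" the baseline

def decide(accuracy: float) -> str:
    if accuracy >= BASELINE + MARGIN:
        return "green light: build the plumbing"
    if accuracy >= BASELINE - MARGIN:
        return "iterate: tweak prompts, models, bucket definitions"
    return "park it: revisit when better models are available"

print(decide(0.94))  # green light: build the plumbing
```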

The only losing outcome is discovering poor accuracy after you've built the infrastructure. Every other result, including "it doesn't work", is valuable information at near-zero cost.

Conclusion: Proof Before You Plumb

I call this approach Proof Before You Plumb. Before you invest in system integration and platform development, prove the AI can do the job with manually assembled data.

This applies far beyond email: contract review, invoice processing, support ticket routing, lead scoring, document classification – they all follow the same sequence:

  1. Assemble sample data manually

  2. Test with any available model

  3. Get proof with real eval metrics

  4. Only then scope the plumbing

The best AI projects don't start with huge platform investments but with a block of text and a humble question:

Does this actually work?

Because the pipeline always works – it's just a matter of time and effort. But for the AI part, there is no guarantee.

See you next Saturday,
Tobias

PS: If you want the full framework behind how to find, validate, and build AI solutions with business impact, check out my book The Profitable AI Advantage.
