THE BRIEFING ROOM

How to evaluate AI vendor claims without a data science degree

"Our AI learns from your data automatically." "Implementation takes weeks, not months." "Ninety percent accuracy, out of the box."

If you've sat through more than two AI vendor demos this year, you've heard at least one of those. Probably all three. And here's the thing - none of them are lies, exactly. They're just not telling you what you think they're telling you.

I've been in the room for a lot of these conversations lately. Managing partners, COOs, operations directors - smart, experienced people who run complex businesses and make high-stakes decisions every day - sitting across from a vendor who's showing them something genuinely impressive on a screen. The demo works beautifully. The slides are slick. The ROI projections look transformative. And then somewhere between the third coffee and the pricing slide, there's a moment where the buyer thinks: I don't really understand the technical detail here, but it looks like it works, and they seem to know what they're doing.

That moment is where the expensive mistakes happen.

I know because I've made one. A few years back, I was evaluating an AI-assisted workflow tool for a client and got genuinely swept up in the demo. The vendor was sharp, the use case was compelling, and the numbers looked solid. We moved quickly. Three months in, it became clear that the accuracy figures we'd been shown were measured on a curated dataset that bore almost no resemblance to the client's actual documents. We got there eventually, but it cost time, goodwill, and a fairly uncomfortable conversation with the client's COO. I've been more careful since.

The vendor wasn't dishonest. That's the thing. The problem is subtler: the vendor's incentive is to show you the best possible version of their product under the best possible conditions, and your incentive is to understand what it will actually do in your environment, with your data, on a Tuesday afternoon when someone's fed it a badly formatted spreadsheet. Those two things are not the same.

"But I don't have the technical background to evaluate an AI vendor properly. I'll have to rely on my IT team's assessment."

Your IT team should absolutely be involved. But what I've learned from watching these evaluations go sideways - and from getting one wrong myself - is that the questions separating a good AI vendor from one that's going to cost you six figures and eighteen months of frustration aren't technical questions. They're commercial, operational, and trust-based. You already have the skills to ask them. You just need to know which ones.

The honest translation of what you're being told

Let me give you a phrasebook. The phrases AI vendors use often mean something narrower than what a non-technical buyer would reasonably infer.

"Our AI learns from your data automatically."

What you hear: plug it in, point it at our files, and it'll get smarter over time without much effort from us.

What it actually means: the model was trained on generic data. It will need significant configuration and, in most cases, months of supervised calibration to produce useful results for your specific context. "Learns automatically" describes something that usually requires ongoing human oversight, correction, and retraining. The word "automatically" is doing a heroic amount of heavy lifting in that sentence.

"Implementation takes weeks, not months."

What you hear: we'll be up and running quickly.

What it actually means: the vendor's standard configuration - their out-of-the-box setup with demo data - takes weeks. Getting it to work with your specific data formats, your workflows, your integration requirements, your compliance constraints, and your team's actual working patterns? That takes months. The gap between "the product is installed" and "the product is useful" is where most of the time and money goes, and the vendor's timeline usually only covers the first bit.

"Ninety percent accuracy."

What you hear: it gets things right nine times out of ten.

What it actually means: 90% accuracy measured on a test set the vendor controlled, using a task definition that favours the model. The accuracy on your data, with your edge cases, your inconsistencies, and your definition of what "right" looks like, may be substantially lower. I sat with a firm last year that had been told a document review tool was "92% accurate." When we tested it on their actual client files - which included handwritten notes, scanned PDFs, and documents in three languages - the accuracy dropped to about 61%. The vendor's number wasn't wrong. It just wasn't relevant.

If any of those phrases sound familiar from a recent demo, don't panic. It doesn't mean the vendor is trying to con you. It means you need to dig deeper before you commit.
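If someone on your team is comfortable with a few lines of Python, checking an accuracy claim against your own files is not complicated. What follows is a minimal sketch, not any vendor's method. It assumes you have a sample of your own documents where a person has already recorded the right answer, and it breaks the accuracy figure down by document type. The function name, the document types, and every number in the sample are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall accuracy, accuracy per document type).

    Each record is a (doc_type, ai_answer, correct_answer) tuple.
    "Correct" here means an exact match with the answer a person recorded;
    your own definition of "right" may need to be looser or stricter.
    """
    if not records:
        raise ValueError("need at least one record to measure accuracy")

    totals = defaultdict(int)  # documents seen per type
    hits = defaultdict(int)    # documents the AI got right per type
    for doc_type, ai_answer, correct_answer in records:
        totals[doc_type] += 1
        if ai_answer == correct_answer:
            hits[doc_type] += 1

    per_group = {g: hits[g] / totals[g] for g in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_group


# Hypothetical sample: 20 documents, illustrative numbers only.
sample = (
    [("typed_contract", "ok", "ok")] * 9
    + [("typed_contract", "wrong", "ok")] * 1
    + [("scanned_pdf", "ok", "ok")] * 3
    + [("scanned_pdf", "wrong", "ok")] * 2
    + [("handwritten", "ok", "ok")] * 1
    + [("handwritten", "wrong", "ok")] * 4
)

overall, per_group = accuracy_by_group(sample)
print(f"overall: {overall:.0%}")
for doc_type, acc in per_group.items():
    print(f"{doc_type}: {acc:.0%}")
```

The breakdown is the useful part. A single headline number can hide the fact that the tool does well on clean, typed documents and poorly on everything else, and the mix of document types in your files is unlikely to match the mix in the vendor's test set.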

The questions that expose the gaps

These are the questions I'd ask in any AI vendor evaluation, and none of them require you to understand neural networks or transformer architectures. They require you to watch the vendor's reaction.

"Show me this working with data structured like ours."

Not their demo data. Not a curated example. Data that looks like what you'd actually feed into the system. The vendor who can do this on the spot - or within a day or two - is a very different proposition from the vendor who needs a two-week data preparation window. That gap tells you something important about how much work sits between the demo and reality.

"What happens when the AI is wrong?"

This is my favourite question, because the answer reveals everything. A vendor who deflects with "it's very accurate" hasn't thought carefully about the operational implications of errors in your context. The right answer describes specific failure modes, how often they occur, and how the system flags them for human review. If you're a law firm processing contracts, or a financial services firm handling compliance documents, a 10% error rate isn't a rounding error - it's a risk management problem. You need the vendor to talk about it like one.

"What data quality do you need from us?"

A vendor who says "we can work with any data" is, to put it politely, not being straight with you. Every AI system has data quality dependencies - format, completeness, consistency, labelling. A vendor who specifies those requirements honestly is telling you the truth about what it'll take to succeed. A vendor who minimises them is telling you what you want to hear. I know which one I'd rather work with, even if the honest answer is less comfortable.

"Can you show me a reference client of similar size and complexity who has been live for at least a year?"

A year. Not three months. Not a pilot. A year in production, in a business that looks something like yours. A vendor who cannot provide this is new to the market, hasn't retained clients past the initial contract, or is asking you to be an early adopter. None of those are necessarily disqualifying - but all of them affect how you should price the risk and structure the engagement.

The red flags that should change the conversation

Some signals shouldn't just prompt follow-up questions. They should fundamentally change how you approach the rest of the evaluation.

No pilot option, or active resistance to piloting. A vendor confident in their product's fit for your context will welcome a well-designed pilot. A vendor who pushes for a full contract upfront - or who agrees to a pilot but makes the terms so restrictive it can't produce a meaningful result - is managing their own risk at your expense. Full stop.

No specific metrics in the business case. If the vendor's ROI model is built on generic industry averages and vague productivity improvements rather than measurable outcomes tied to your specific workflows, they haven't actually thought about whether this will work for you. They've thought about how to close the deal.

"Just trust the algorithm." If you ask how the system reaches a particular output and the answer is some version of "it's complex but it works" - walk away. Actually, walk briskly. In any regulated or client-facing context, you need to explain your decisions. If you can't explain the AI's decisions, you can't use it. And a vendor who doesn't understand that hasn't worked with businesses like yours.

The demo environment that looks nothing like real-world usage. Clean data, perfect formatting, ideal conditions, pre-selected examples. If the vendor can only show you the product in this environment, they haven't successfully deployed it in a messy, real-world one. Which is exactly the environment you'll be deploying it in.

I've seen enough of these evaluations go wrong - and seen the aftermath when they do - to say with some confidence that the failures rarely come from bad technology. They come from exactly these kinds of gaps between what was demonstrated and what was actually being bought.

How to design a pilot that actually tells you something

So you've asked the questions, you haven't hit any deal-breakers, and the vendor's responses have been credible. Now you need a pilot. But not the kind where the vendor runs a controlled demonstration for six weeks and then presents a deck showing how well it went. You need a pilot that functions as a genuine test.

Four things make the difference.

A specific use case, not a general capability test. Don't pilot "the AI." Pilot a specific task. Something like: "We'll use this tool to process the 50 due diligence documents from our last completed matter and compare the AI output to what our fee earner produced." Or: "We'll run three months of client email triage through the system and measure categorisation accuracy against our current manual process." The more concrete, the more useful.

Success criteria defined before the pilot starts. This is the one that trips people up. If you don't agree what success looks like before the pilot begins, the vendor will define it afterwards based on whatever the pilot happened to produce. Write down the metrics, the thresholds, and the decision criteria. Get the vendor to sign off on them. If they push back on defining success upfront, that tells you something.

Realistic data. Not curated examples. Not the vendor's sample dataset. Your actual data, including the messy, incomplete, and inconsistently formatted stuff that makes up real life. If the AI can't handle your actual data quality, better to find out now than after you've signed a twelve-month contract.

A four-to-six-week timeline. Long enough to encounter normal variation in the task - the weird edge cases, the Friday afternoon rush, the document that's in a format nobody expected. Short enough to produce a decision rather than a prolonged evaluation that loses momentum and never quite concludes. Some use cases genuinely need longer, particularly if there's a seasonal pattern or the workflow has natural cycles. But for most initial evaluations, four to six weeks is a solid starting point.

A pilot that produces negative results hasn't failed, by the way. It's done exactly what it was designed to do. It's saved you from signing a contract that wouldn't have delivered.
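For teams that like to see this written down precisely, here is one way the "success criteria defined before the pilot starts" idea could be captured. It is a sketch under assumptions, not a standard: the `Criterion` class, the `evaluate_pilot` function, the metric names, and the thresholds are all hypothetical. The point it illustrates is that the thresholds are fixed before any results exist, and that a metric nobody measured counts as not met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: criteria cannot be edited once agreed
class Criterion:
    name: str
    threshold: float
    higher_is_better: bool = True

    def passed(self, measured: float) -> bool:
        """True if the measured value meets the agreed threshold."""
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


def evaluate_pilot(criteria, measured):
    """Return (overall decision, per-criterion results).

    `measured` maps metric names to the values the pilot produced.
    A criterion with no measurement is treated as not met.
    """
    results = {}
    for c in criteria:
        if c.name not in measured:
            results[c.name] = False
        else:
            results[c.name] = c.passed(measured[c.name])
    return all(results.values()), results


# Agreed and signed off BEFORE the pilot (hypothetical thresholds).
criteria = [
    Criterion("categorisation_accuracy", 0.85),
    Criterion("minutes_per_document", 12.0, higher_is_better=False),
]

# Filled in AFTER the pilot (hypothetical results).
measured = {"categorisation_accuracy": 0.81, "minutes_per_document": 9.5}

go, detail = evaluate_pilot(criteria, measured)
print("proceed" if go else "do not proceed", detail)
```

In this made-up example the tool is faster than the manual process but misses the agreed accuracy threshold, so the pilot returns a "do not proceed". That is the kind of result a loosely defined pilot tends to get talked around afterwards.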

What good vendor transparency actually looks like

I don't want this piece to read as anti-vendor. Some of the AI vendors I've encountered are doing genuinely good work, and the good ones behave in ways that are easy to recognise once you know what to look for.

They're honest about limitations before you ask. The vendor who volunteers what their product doesn't do well - and explains why - before you've raised the question is demonstrating the kind of honesty they'll need when the product is live and hits a problem. We worked with a firm recently where the AI vendor said, unprompted, "Our document extraction works well for structured contracts but struggles with handwritten annotations. If that's a significant part of your document set, we should talk about that before you commit." That conversation saved the client weeks of frustration and, frankly, made us trust the vendor more, not less.

They're clear about requirements rather than minimising them. The vendor who specifies exactly what data quality, system integration, and internal resource are required for success is telling you the truth. The vendor who says "it's minimal effort on your side" is not. Honest vendors know that underselling the implementation requirements leads to disappointed clients and cancelled contracts. They'd rather tell you the hard truth upfront.

They genuinely welcome a well-designed pilot. Not reluctantly. Not with conditions that gut the test. The confident vendor knows their product works and wants to prove it under realistic conditions. The nervous vendor wants to control the conditions so tightly that the pilot can't produce a meaningful negative result.

So where does this leave you?

The point of all this isn't to make you paranoid about AI vendors. It's to give you an approach that lets you evaluate them with the same rigour you'd apply to any significant business decision. You wouldn't hire a senior partner based solely on a great interview and a polished CV. You'd check references, look at their track record, and probably have them meet the team in an unstructured setting to see how they actually operate. Same principle.

Good vendors survive this process. In fact, good vendors appreciate buyers who run it, because it means they're competing on substance rather than on who gives the best demo.

The questions in this piece aren't difficult. They don't require a data science degree. They just require you to be as rigorous with an AI vendor as you would be with any other partner you're trusting with a meaningful investment. You already know how to do that. Now you know what to ask.