Your data isn't ready for AI (and what to fix first)

Let me tell you about a conversation I had a few months back. A CTO at a mid-market professional services firm - smart, well-prepared, genuinely excited - had just got budget approval for an AI initiative. Knowledge search tool, I think, plus some automated report generation. The vendor demos had been impressive. The board was on board. The project kicked off.

Around week three, I got a call. Someone had quietly discovered that the data the AI tool needed to work properly was, how do I put this, not quite what anyone had assumed.

Not missing. Not catastrophically broken. Just not good enough. Client records duplicated across systems. Matter types tagged differently by different teams. Financial data that needed a human to reconcile it before it meant anything. And the accumulated institutional knowledge of the firm - the good stuff, the "how we handled this last time" stuff - sitting in email inboxes and personal drives where no search tool could get near it.

The project didn't die. That was almost the worse outcome. It limped forward, producing outputs that looked polished and read confidently - but were subtly, dangerously wrong. And because AI outputs sound authoritative even when they're nonsense, nobody caught the errors until a client did.

I've seen this play out four times in the last six months alone. We're evaluating AI tools before we've established whether the data those tools depend on is fit for purpose. And it's costing firms more than they realise.

The dirty secret about your data

The objection I hear most often goes something like this: "Our data is fine. We have a CRM, a document management system, and financial reporting. The information is there."

I get it. And you're not wrong that the information exists. The problem is the state it's in.

Vague hand-waving about "data quality" is part of what's allowed this to persist, so let me be specific about what I actually see when we look under the bonnet at a typical mid-market professional services firm. Tell me if any of this sounds familiar.

Your CRM has the same client entered three times - once as "Barclays," once as "Barclays PLC," once as "Barclays Bank" - each with different contact records and interaction histories. Nobody's sure which is the master record, so everyone just creates a new entry when they're not sure.

Your practice management system has matter types that were tagged consistently for about eighteen months after implementation, then gradually diverged as new partners joined and teams reorganised. "Restructuring advisory" and "corporate restructuring" and "restructuring" are three separate categories that mean the same thing.

Your financial data technically exists in your reporting system, but half the management team has a personal spreadsheet that "corrects" the figures before they present to the board because the raw data needs manual reconciliation.

And the accumulated wisdom of your firm - the precedents, the methodologies, the knowledge that makes experienced professionals valuable - lives in individual email inboxes, desktop folders, and the heads of people who might leave next year.

None of this is a failure of technology. It's a natural consequence of how work actually gets done across growing professional services firms over five, ten, fifteen years. People are busy. They take shortcuts. Naming conventions drift. Nobody's job is to maintain data hygiene, so nobody does.

For years, this didn't really matter. The CRM was good enough for the people who used it daily because they knew which "Barclays" was the right one. The inconsistent matter tagging didn't cause problems because the partners who ran those matters understood their own categorisation. The knowledge in people's inboxes was accessible because you could just walk over and ask them.

AI changes the equation completely.

Why AI makes this worse, not better

There's a common assumption that AI will somehow help with the data problem. That it'll be clever enough to work around the inconsistencies, figure out that "Barclays" and "Barclays PLC" are the same entity, interpret the inconsistent tags correctly.

Some of it can. A well-configured AI system with proper entity resolution can handle straightforward deduplication. But that's the easy bit, and it requires someone to set it up deliberately - it doesn't happen by magic.
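
To make "set it up deliberately" a little more concrete, here is a minimal sketch of the simplest form of entity resolution: normalise each client name, then group the records that end up with the same key. The suffix list and the function names are illustrative assumptions, not a feature of any particular CRM, and a real cleanup would still need merge rules agreed by the people who own the client relationships.

```python
import re
from collections import defaultdict

# Hypothetical list of trailing words to ignore when comparing names.
# A real project would agree this list with the people who own the data.
IGNORED_SUFFIXES = {"plc", "ltd", "limited", "llp", "bank", "group"}


def normalise_name(name: str) -> str:
    """Lowercase, drop punctuation, and strip ignorable trailing words."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while len(words) > 1 and words[-1] in IGNORED_SUFFIXES:
        words.pop()
    return " ".join(words)


def group_duplicates(names: list[str]) -> dict[str, list[str]]:
    """Return only the normalised keys that match more than one record."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise_name(name)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}


crm_names = ["Barclays", "Barclays PLC", "Barclays Bank", "Acme Ltd"]
duplicates = group_duplicates(crm_names)
print(duplicates)
```

Even in this toy form, the judgment calls are visible: whether "Bank" is safe to ignore is a business decision, not a technical one, which is why the setup has to be deliberate.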

The harder truth is that AI amplifies data problems far more often than it solves them. And the mechanism is worth understanding, because it's what makes this a governance issue rather than just a technical annoyance.

When an AI system searches across your document library to find relevant precedents for a new matter, it produces results calibrated to the quality of that library. If documents are inconsistently tagged - or worse, not tagged at all and buried three folders deep in a structure that made sense to one person in 2019 - the search misses relevant documents. Not occasionally. Routinely. And you'll never know what it missed, because the whole point of the tool was to find things you didn't already know about.

When an AI model analyses your historical matter data to identify patterns - which clients are most likely to need additional services, which matter types are growing, where the firm's expertise is concentrating - it produces patterns that reflect the data, warts and all. If "restructuring advisory" is split across three categories, the model shows three smaller trends instead of one significant one. The insight evaporates.
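
The effect of split categories is easy to show with a small sketch. The mapping and the sample tags below are made up for illustration; the point is only that the same counts tell a different story once equivalent tags are consolidated.

```python
from collections import Counter

# Hypothetical mapping from the tags people actually typed to one agreed label.
CANONICAL_MATTER_TYPE = {
    "restructuring advisory": "restructuring",
    "corporate restructuring": "restructuring",
    "restructuring": "restructuring",
}


def canonical(tag: str) -> str:
    """Map a raw tag to its agreed label; unknown tags pass through unchanged."""
    cleaned = tag.strip().lower()
    return CANONICAL_MATTER_TYPE.get(cleaned, cleaned)


# Illustrative matter tags, not real firm data.
matters = (
    ["Restructuring advisory"] * 4
    + ["Corporate restructuring"] * 3
    + ["Restructuring"] * 3
    + ["Tax"] * 5
)

raw_counts = Counter(tag.strip().lower() for tag in matters)
merged_counts = Counter(canonical(tag) for tag in matters)

print(raw_counts.most_common(1))     # largest category before consolidation
print(merged_counts.most_common(1))  # largest category after consolidation
```

In this invented sample, tax looks like the biggest practice area until the three restructuring tags are merged, at which point restructuring is twice its size.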

And when a language model generates reports or client communications from your financial data, it produces confident-sounding outputs that reflect exactly what it was given. A quarterly report that slightly understates revenue from a key practice area. A client briefing that references the wrong engagement history. A proposal that quotes precedent work the firm didn't actually do - because the data said it did.

Garbage in, confidently wrong garbage out. That's not a metaphor. It's an operational risk description.

I saw a McKinsey number recently that stopped me cold: AI systems trained on low-quality datasets show reliability drops exceeding 40% in production environments. Forty per cent. Imagine deploying a tool that loses that much of its reliability once it meets your real data, except it still presents every answer with the same unshakeable confidence. Gartner reckons organisations lose around 5% of revenue annually due to poor data quality - and that's before you layer AI on top of it.

One caveat worth making: not all AI use cases are equally sensitive to this. If you're using a general-purpose language model to help draft pitch documents or tidy up internal communications, the state of your CRM doesn't much matter. The model is drawing on its own training data, not yours. But the moment you're asking AI to do anything that depends on your firm's specific data - client analytics, knowledge retrieval, financial reporting, matter analysis - you're exposed. And those are precisely the use cases where the real value lives.

The three foundations

Right. Enough about the problem. What actually needs to be in place before AI delivers reliably on your data?

The firms that crack this don't try to boil the ocean. They focus on three things.

Quality. Data that is clean and consistent. The same client appears once, not three times with different spellings. Matter types are tagged the same way across teams and offices. Financial codes are applied consistently so reporting doesn't require a human translator. This sounds basic. It is basic. It's also where most firms fall down, because maintaining data quality is unglamorous, ongoing work that nobody gets promoted for doing.

Accessibility. Data that AI tools can actually reach without someone manually extracting it into a spreadsheet first. API access to your core systems. Queryable databases rather than locked Excel files. Document libraries that are indexed and searchable rather than buried in nested folder hierarchies that only make sense to the person who created them. I worked with a firm last year that had brilliant institutional knowledge - genuinely impressive depth of precedent work. All of it sat in a SharePoint folder structure with seven levels of nesting and no consistent naming convention. As far as any AI tool was concerned, it didn't exist.
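
As a rough illustration of what "indexed rather than buried" means, the sketch below walks a folder tree and produces one flat record per file, including how deeply each file is nested. It assumes the documents are reachable as an ordinary file system path; the function names and the depth threshold are hypothetical, and a real search layer would also index document contents and metadata.

```python
import tempfile
from pathlib import Path


def build_index(root: Path) -> list[dict]:
    """Walk a folder tree and return one flat record per file."""
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root)
            records.append(
                {
                    "name": path.name,
                    "relative_path": relative.as_posix(),
                    # Number of folders between the root and the file.
                    "depth": len(relative.parts) - 1,
                }
            )
    return records


def deeply_nested(records: list[dict], max_depth: int = 3) -> list[dict]:
    """Files buried deeper than max_depth (an arbitrary, illustrative threshold)."""
    return [r for r in records if r["depth"] > max_depth]


# Build a small throwaway tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    deep = root / "a" / "b" / "c" / "d"
    deep.mkdir(parents=True)
    (deep / "precedent.docx").write_text("demo")
    (root / "readme.txt").write_text("demo")
    index = build_index(root)
    buried = deeply_nested(index)

print(len(index), [r["relative_path"] for r in buried])
```

A listing like this is often a useful first audit: it shows how much material sits below the depth at which anyone is likely to find it by browsing.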

Governance. Data that is owned and maintained by someone whose actual job it is to care about it. A named person responsible for CRM quality. A process for correcting errors when they're found, rather than working around them. A standard for new data entry that prevents the problem from regenerating itself the moment you finish cleaning it up. This is the one people skip. They do the cleanup, feel good about it, and six months later they're back where they started because nobody changed the habits that created the mess.
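
A "standard for new data entry" can be as simple as a check that runs before a record is saved. The sketch below is one hypothetical shape for that check; the field names and the allowed matter types are assumptions for illustration, and in practice the list would be owned and maintained by the named person responsible for the data.

```python
# Hypothetical agreed list of matter types, owned by a named person in practice.
ALLOWED_MATTER_TYPES = {"restructuring", "tax", "litigation"}
# Hypothetical required fields for a new matter record.
REQUIRED_FIELDS = ("client_id", "matter_type", "owner")


def validate_matter(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record can be saved."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    matter_type = str(record.get("matter_type") or "").strip().lower()
    if matter_type and matter_type not in ALLOWED_MATTER_TYPES:
        problems.append(f"unknown matter type: {matter_type}")
    return problems


good = {"client_id": "C-001", "matter_type": "Restructuring", "owner": "jsmith"}
bad = {"client_id": "C-002", "matter_type": "Corporate restructuring"}

print(validate_matter(good))
print(validate_matter(bad))
```

The check itself is trivial. The governance part is deciding who maintains the allowed list and what happens when someone needs a type that isn't on it.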

All three are required. Two out of three produces partial value and partially reliable outputs - which, depending on your risk appetite and your clients' tolerance for errors, might actually be worse than no AI at all. Because partial reliability creates false confidence. People start trusting the outputs, stop checking them, and the errors compound.

What to fix first

"Fix your data" is about as useful as "eat healthier." Everyone knows they should. Nobody knows where to start.

Most firms can't address quality, accessibility, and governance simultaneously. Nor should they try. The priority depends on which AI use case is highest value for your firm.

If your primary interest is knowledge search and precedent retrieval - helping fee earners find relevant past work, identify expertise across the firm, surface documents they didn't know existed - then document accessibility is your first priority. Getting documents out of nested folder structures and into an indexed, queryable system. This doesn't necessarily mean migrating everything to a new platform. Sometimes it's as straightforward as implementing a search layer across existing storage, combined with a focused effort to tag the most commercially valuable documents. Start with the last two years. Start with your top twenty clients. Start somewhere specific.

If your primary interest is client analytics or relationship intelligence - understanding which clients are growing, which are at risk, where cross-selling opportunities exist - then CRM data quality is your first priority. Deduplication, completeness, consistent tagging. I know. It's the least exciting project anyone's ever been asked to lead. But without it, every insight your AI tool produces about client relationships is built on sand. We worked with one firm where a simple CRM deduplication exercise revealed they had 40% fewer unique client relationships than they thought - because what they'd been counting as separate clients were actually the same organisation entered multiple times. Their entire cross-selling analysis had been wrong. For years.

If your primary interest is reporting automation - generating management reports, financial analyses, or board packs with less manual effort - then financial data quality and accessibility together are your starting point. The data needs to be both accurate and reachable via API, because a reporting AI that pulls from a clean but inaccessible source is just as useless as one that pulls from an accessible but dirty one.

The 80/20 principle applies here more than almost anywhere else. Identify the smallest data improvement that unlocks the highest-value use case, and start there. A focused sprint, not a grand data transformation programme.

Honest timelines

These are based on what we've actually seen across professional services firms, not theoretical estimates.

CRM deduplication and field standardisation for a firm of around 200 people: four to eight weeks of focused effort. That includes identifying the duplicates, agreeing the merge rules, executing the cleanup, and validating the results. The elapsed time is almost always determined by how quickly you can get decisions, not how quickly you can do the technical work. The person who needs to decide which "Barclays" is the master record is usually a partner with twelve other priorities. You'll spend more time chasing that conversation than doing the actual work.

Getting core documents into an indexed, queryable library from a folder-based storage structure: six to twelve weeks depending on volume and the state of your naming conventions. If you have 50,000 documents with meaningful filenames, you're at the shorter end. If you have 200,000 documents named things like "Final v3 ACTUAL final (2).docx," you're at the longer end - and probably questioning some life choices.

Building a data governance process that prevents the problem from regenerating: two to four weeks to design, but honestly six to twelve months to embed as a habit. You can write the governance framework in a fortnight. Getting 200 professionals to actually follow it - to enter data consistently, use the agreed tags, maintain the standards when they're busy and it feels like a chore - that's a culture change, not a technology project. It requires visible leadership support, regular reinforcement, and probably some slightly awkward conversations with senior people who think data entry is beneath them.

These aren't the glamorous timelines that AI vendors put in their pitch decks. But they're honest. And here's the thing: they're significantly shorter than the timelines produced by treating data readiness as a "phase 0" that must be complete before any AI work begins.

You don't need perfect data to start getting value from AI. You need good enough data for your highest-priority use case. That's a much smaller target, and it's achievable in weeks to months rather than years.

The firms that get this right

The pattern I see among firms that successfully adopt AI - actually adopt it, not just run a pilot and write a case study - is that they treat data readiness as part of the AI initiative, not a separate prerequisite. They pick a use case, assess the data that use case needs, fix the minimum required to make it work, and build the governance to maintain it going forward. Then they move to the next use case and repeat.

Not revolutionary. Not even particularly clever. But it works, because it connects the boring data work to a tangible outcome that people actually care about. "Clean up the CRM" is a project nobody wants to own. "Clean up the CRM so we can deploy the AI tool that saves every partner two hours a week" is a project with a sponsor, a deadline, and a reason to exist.

The AI tools are getting better every quarter. The vendors are getting more persuasive. The pressure from your board and your competitors to "do something with AI" is only going to increase. And when the time comes to move - and it will - the firms that spent eight weeks fixing their CRM data and twelve weeks indexing their document library will be running productive AI applications while everyone else is still trying to figure out why their shiny new tool keeps confidently getting things wrong.

If you want to understand where your firm's data estate actually stands - and which improvement would unlock the highest-value AI use case for your situation - book a data readiness assessment with us. It's the least exciting step in your AI journey. It's also the one that determines whether everything that follows actually works.
