THE BRIEFING ROOM

Agentic AI for service firms: what's real and what's next

Let me tell you about a conversation that went sideways on me last month.

I was sitting with the CTO of a 200-person consulting firm - sharp bloke, been in tech leadership for fifteen years - and he pulled up a demo from one of the big AI vendors. The pitch was slick: an "AI agent" that could take a client brief, research the topic across multiple databases, draft a proposal, check it against past engagements, and deliver a polished document ready for partner review. Autonomously. In minutes.

He turned to me and said, "Is this real?"

And the honest answer - the one I gave him, and the one I want to give you - is: sort of. Parts of it are genuinely working in production environments right now. Other parts are about eighteen months away from being reliable enough to trust. And a few bits are, frankly, still marketing theatre dressed up as capability.

What made that particular demo so frustrating wasn't that it was dishonest, exactly. It was that the vendor had stitched together the three things that work with the two things that don't, and presented the whole package as if the seams weren't there. The CTO walked out thinking he was six months from deploying something. He was probably eighteen months from deploying half of it.

The problem is that nobody in the vendor ecosystem has much incentive to tell you which is which.

"Agentic AI sounds like science fiction. The AI tools we've tried barely complete a single task reliably. The idea that they can plan and execute multi-step workflows seems very far away."

I hear this constantly, and I get it. If your experience of AI so far has been a language model that occasionally hallucinates a case citation or a chatbot that confidently gives wrong answers, giving AI more autonomy feels counterintuitive at best and reckless at worst. But that instinct leads you somewhere unhelpful, because agentic AI isn't a different technology from what you've already been using. It's the same underlying models, wired together in a workflow architecture that lets them do more between the moment you give an instruction and the moment you get a result back.

The question isn't whether agentic AI works. It's which applications work reliably enough, right now, to justify putting them into your operation - and which ones need another year of development before you'd want them anywhere near a client deliverable.

That's what this piece is about.

What "agentic" actually means (without the jargon)

The word gets thrown around a lot, so let me be precise.

A standard AI tool - the kind you're probably already using - does one thing when you ask it. You give it a document and say "summarise this." You paste in some notes and say "draft an email." One instruction, one task, one output. The human decides what happens next.

An agentic AI system works differently. You give it a higher-level objective - "research this topic and produce a briefing note" - and the system breaks that down into a sequence of steps. It might query three different databases, synthesise what it finds, identify gaps, go back for more information, structure the output, and check it against a set of criteria. All without you directing each individual step.

Think of it like the difference between asking someone to type a letter you've dictated versus asking them to prepare a briefing for a meeting. The first is a task. The second requires planning, judgement about what's relevant, and decisions along the way.

Those intermediate decisions are both what makes agentic AI genuinely powerful and what makes it genuinely risky. The system is making choices you don't review until the end - and those choices might be wrong in ways that aren't obvious from looking at the final output. I've seen a system produce a beautifully structured research note that had quietly discarded the three most relevant sources because they were in a file format it didn't handle well. The output looked authoritative. It wasn't.
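
One cheap defence against that kind of silent loss is to make the system report what it skipped alongside what it used. What follows is a minimal Python sketch, not a description of any particular product: the names `Source`, `SUPPORTED_FORMATS`, `gather_sources` and `coverage_note` are all hypothetical, and the list of supported formats is an assumption for illustration.

```python
from dataclasses import dataclass

# Hypothetical: the file formats this illustrative pipeline can parse.
SUPPORTED_FORMATS = {"pdf", "docx", "txt"}


@dataclass
class Source:
    name: str
    file_format: str


def gather_sources(candidates):
    """Split candidate sources into usable and skipped, keeping both lists."""
    used, skipped = [], []
    for source in candidates:
        if source.file_format.lower() in SUPPORTED_FORMATS:
            used.append(source)
        else:
            # Record the skip instead of dropping the source silently.
            skipped.append(source)
    return used, skipped


def coverage_note(used, skipped):
    """One-line summary a reviewer can read at the top of the output."""
    note = f"Sources used: {len(used)}; skipped: {len(skipped)}"
    if skipped:
        note += " (" + ", ".join(s.name for s in skipped) + ")"
    return note
```

The point of the design is that the reviewer sees the skipped list before reading the polished note, so a discarded source shows up as a line of text and not as an absence.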

What's working in production right now

I want to be careful here, because the gap between "working in a demo" and "working in a mid-market firm's actual environment" is wider than most vendors will admit. But there are three categories of agentic application where we're seeing genuine, sustained production use - not pilots, not proofs of concept, but operational deployment.

Multi-step research workflows. Probably the most mature agentic use case for service firms. The system receives a research brief, queries multiple sources - internal knowledge bases, external databases, regulatory filings - synthesises the findings, and produces a structured output. According to Clio's 2025 Legal Trends Report, 79% of legal professionals are now using AI tools, with agentic workflows emerging specifically in document review and legal research. The key qualifier: this works reliably when the research domain is well-defined and the source data is structured. Give it a focused regulatory research brief and it'll do a genuinely good job. Ask it to do open-ended strategic research across ambiguous sources and you'll get something that reads well but may miss what matters. The system doesn't know what it doesn't know.

Document workflow automation. Agentic systems that process incoming documents, extract relevant information, route them to the correct destination, and trigger follow-up actions. This is deployed at meaningful scale in financial services compliance and legal document management. We had a mid-market firm running an agentic workflow on incoming compliance documentation - classifying document type, extracting key data points, flagging exceptions against predefined rules, routing to the appropriate reviewer. It replaced about twelve hours of manual triage per week. But here's the bit that actually made it work: the scope was brutally tight. Six document types. When something arrived that didn't match those six patterns, it escalated to a human. Full stop. The team pushed back on that constraint in week two - "can't we just add a seventh type?" - and we held the line. Three months later, they were glad we did, because the seventh type turned out to have three subtypes with meaningfully different routing logic. That boundary wasn't a limitation. It was the thing that made the whole system trustworthy.
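
The "six types, everything else goes to a human" rule can be sketched as an allow-list. This is a minimal Python illustration under stated assumptions: the document type names and queue names below are invented for the example (the firm's actual six types aren't listed here), and `route_document` is a hypothetical helper.

```python
# Hypothetical document types and reviewer queues, for illustration only.
ROUTES = {
    "kyc_form": "onboarding_review",
    "aml_report": "compliance_review",
    "sanctions_check": "compliance_review",
    "audit_letter": "finance_review",
    "policy_attestation": "hr_review",
    "incident_report": "risk_review",
}

# Anything not on the allow-list lands here, with no attempt to guess.
HUMAN_QUEUE = "human_triage"


def route_document(doc_type):
    """Return the queue for a known type; escalate everything else."""
    return ROUTES.get(doc_type, HUMAN_QUEUE)
```

Adding a seventh type in a setup like this is a deliberate change to `ROUTES` that someone has to make and review, which is roughly the discipline the tight scope was enforcing.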

Client communication triage. Systems that receive incoming queries, categorise by urgency and topic, draft responses for human review, and escalate anything above defined thresholds. The triage and routing layer is the bit that works well - genuinely good at sorting incoming volume and making sure the urgent stuff surfaces fast. The draft response capability is useful for routine queries: appointment confirmations, status updates, standard information requests. But autonomous response generation for anything beyond the routine is firmly in "human reviews before it goes out" territory. And it should be.
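
As a rough sketch of that split between triage, drafting and escalation, assuming an upstream classifier has already assigned a topic and an urgency score: the `Query` type, the topic labels and the threshold of 4 are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical labels for the routine queries described above.
ROUTINE_TOPICS = {"appointment_confirmation", "status_update", "information_request"}

# Hypothetical threshold on a 1 (low) to 5 (high) urgency scale.
ESCALATION_URGENCY = 4


@dataclass
class Query:
    topic: str
    urgency: int  # assumed to come from an upstream classifier


def triage(query):
    """Decide what happens to an incoming query; nothing is sent unreviewed."""
    if query.urgency >= ESCALATION_URGENCY:
        return "escalate"
    if query.topic in ROUTINE_TOPICS:
        return "draft_for_review"
    return "human_handles"
```

Note that none of the three outcomes sends a reply on its own. Even the routine path only produces a draft for a person to approve.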

Each of these has something in common: clearly bounded scope, well-structured input data, and defined handoff points where humans step in. That's not a coincidence. It's the pattern.

What's not ready yet - and I mean it

This is where I think the piece earns its keep, because the temptation to oversell is enormous right now.

Fully autonomous client interactions. The current generation of agentic systems produces plausible-sounding responses to an impressively wide range of queries. That's the problem. They sound confident even when they're wrong. I was reviewing an output last year where the system had confidently cited a regulatory deadline that didn't exist - the date was plausible, the framing was authoritative, and if you didn't already know the regulation well enough to spot it, you'd have sent it straight to the client. In a professional services context, where a confidently wrong answer about a regulatory requirement or a contractual obligation creates real liability, this matters enormously. Human review before any client-facing communication isn't a temporary limitation. It's a design principle.

Complex professional judgement. An agentic system can research, synthesise, and structure information very effectively. What it cannot do - and this isn't going to be solved by the next model release - is replicate the judgement a qualified professional applies to interpreting that output. The analysis can be assisted. The output can be drafted. The judgement must remain human. If you're a partner at a law firm or a director at a consulting practice, your value isn't in assembling the information. It's in knowing what it means and what to do about it. Agentic AI is brilliant at the former and unreliable at the latter.

Unsupervised decision-making in consequential contexts. Any agentic system operating where errors have significant client, financial, or regulatory consequences requires human oversight of its outputs. I'm not saying this because the technology will never get there. I'm saying it because the governance and accountability structures that professional services firms operate under don't currently have a place for "the AI decided." And even when the technology improves, I suspect the oversight requirement will remain - because your clients and your regulators expect a human to be accountable.

These boundaries will shift. What I'm describing as "not ready" in mid-2025 may look different by late 2026. But right now, today, these are the honest lines.

Your infrastructure probably isn't ready either

Even for the viable use cases, there's a prerequisite conversation that most firms skip entirely. And skipping it is, in my experience, the single most common reason agentic pilots fail.

API access to your core systems. An agentic system that can't reliably connect to your document management platform, your CRM, your practice management system, or your communications tools is an agentic system that can't function. This sounds obvious, but I've lost count of the number of firms where the answer to "can we get API access to the DMS?" is either "no" or "we'd need to ask the vendor and it'll take three months." If your core systems don't have reliable, well-documented APIs, you're not ready for agentic AI. Fix that first. The familiar data readiness argument (a system is only as good as the data it can reach) applies with even greater force for agentic applications than for single-task AI, because the system is making autonomous decisions based on whatever data it can access. Bad access, bad decisions.

Structured, accessible data. The research and synthesis tasks that agentic systems do well depend entirely on data that is consistently formatted and queryable. If your knowledge base is a SharePoint folder with 14,000 documents in inconsistent formats and no taxonomy - and I have seen this, more than once, at firms that were genuinely surprised when I pointed it out - the agentic system will produce outputs that reflect that chaos. Garbage in, confidently structured garbage out.
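
A quick way to find out how bad the format problem is before anyone talks to a vendor is to count what's actually in the folder. A minimal Python sketch, assuming you can get a list of file names out of the share; `count_formats` is a hypothetical helper, not part of any SharePoint tooling.

```python
from collections import Counter
from pathlib import PurePath


def count_formats(filenames):
    """Count files by lowercase extension so format sprawl becomes visible."""
    counts = Counter()
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        counts[suffix or "(no extension)"] += 1
    return counts
```

For a locally synced copy, something like `count_formats(str(p) for p in Path(root).rglob("*") if p.is_file())` would feed it (with `Path` imported from `pathlib` and `root` pointing at the folder). It says nothing about taxonomy or content quality, but a long tail of odd extensions is usually an early warning.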

Defined handoff points. The most reliable agentic deployments I've seen in professional services are the ones where someone has sat down and explicitly mapped: here is where the system works autonomously, here is where a human reviews, here is the escalation path when the system encounters something outside its scope. If you haven't done that mapping before you start, you're not piloting agentic AI. You're just hoping.

How to pilot without creating a mess

If you've read this far and you're thinking one of those three viable use cases could work for your firm, here's how to start without creating operational or reputational risk.

Controlled environment first. The pilot operates on a defined subset of tasks - not your whole incoming document flow, but one document type in one practice area. A human monitors every output for the first 30 days before any autonomous processing is permitted. Thirty days. Not a week. I know that feels slow, but the failure modes in agentic systems often only emerge after the system has processed enough edge cases to reveal its blind spots. The firms that rush this phase are the ones that end up with a quietly broken process that nobody notices until something goes wrong with a client.

Structured sample review. After the initial monitoring period, you don't need to review every individual output. But you do need a structured sample review - weekly or fortnightly - that checks for systematic errors, scope drift, or failure patterns. Agentic systems can develop subtle biases in how they route or classify over time, and you won't catch that by spot-checking occasionally.
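
Here is one way the sample review might be sketched in Python, assuming each output has an identifier and a reviewer marks each sampled item as an error or not. The function names and the `(category, is_error)` record shape are assumptions for illustration.

```python
import random


def draw_sample(output_ids, sample_size, seed=None):
    """Pick a random sample of outputs for review; a seed makes it repeatable."""
    rng = random.Random(seed)
    ids = list(output_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def error_rates_by_category(reviewed):
    """Turn (category, is_error) pairs into an error rate per category."""
    totals, errors = {}, {}
    for category, is_error in reviewed:
        totals[category] = totals.get(category, 0) + 1
        errors[category] = errors.get(category, 0) + (1 if is_error else 0)
    return {category: errors[category] / totals[category] for category in totals}
```

Breaking the error rate out by category is what separates this from occasional spot-checking: a routing bias that only affects one document type shows up as one bad number, where an overall average would hide it.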

Explicit escalation triggers. Before the pilot begins - not when the first edge case shows up - define the specific conditions under which the system stops, flags for human review, and does not proceed. Document type it hasn't seen before? Stop. Query that touches a regulatory threshold? Stop. Confidence score below a defined level? Stop. These triggers should be written down, agreed by the project team, and reviewed monthly. If you're defining them reactively, you've already had the incident that should have been prevented.
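
Written down as code, the three example triggers above might look like the following. It's a minimal Python sketch under assumptions: the `Item` fields, the placeholder document types and the 0.85 confidence floor are hypothetical, and your own thresholds should come out of the project team's agreement, not from this example.

```python
from dataclasses import dataclass

# Hypothetical values; the real ones belong in the agreed trigger document.
KNOWN_DOCUMENT_TYPES = {"type_a", "type_b"}
MIN_CONFIDENCE = 0.85


@dataclass
class Item:
    doc_type: str
    touches_regulatory_threshold: bool
    confidence: float  # assumed to be reported by the system, 0.0 to 1.0


def stop_reasons(item):
    """Return every trigger the item hits; an empty list means proceed."""
    reasons = []
    if item.doc_type not in KNOWN_DOCUMENT_TYPES:
        reasons.append("unseen document type")
    if item.touches_regulatory_threshold:
        reasons.append("regulatory threshold")
    if item.confidence < MIN_CONFIDENCE:
        reasons.append("low confidence")
    return reasons
```

Returning all the reasons, not just the first, gives the monthly review something to work with: you can see which triggers fire most and whether any never fire at all.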

A failure log that you actually use. Every case where the system produces a wrong, incomplete, or inappropriate output gets recorded, reviewed, and used to refine the scope and the escalation triggers for the next phase. I've seen firms run pilots where problems get quietly fixed in the moment and never logged. That defeats the entire point. The failure log is how you learn whether this thing is getting better or developing new problems.
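
The log itself doesn't need to be sophisticated. A minimal Python sketch, assuming one JSON record per line in a plain file; the field names, the three failure kinds (taken from "wrong, incomplete, or inappropriate" above) and the helper names are illustrative choices, not a standard.

```python
import json
from collections import Counter
from datetime import date

FAILURE_KINDS = {"wrong", "incomplete", "inappropriate"}


def failure_record(when, kind, description, action_taken):
    """Build one log entry; reject kinds outside the agreed list."""
    if kind not in FAILURE_KINDS:
        raise ValueError(f"unknown failure kind: {kind}")
    return {
        "date": when.isoformat(),
        "kind": kind,
        "description": description,
        "action_taken": action_taken,
    }


def append_record(path, record):
    """Append the entry as one JSON line so nothing is overwritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def counts_by_month(records):
    """Count failures per (YYYY-MM, kind) to show whether things are improving."""
    return Counter((record["date"][:7], record["kind"]) for record in records)
```

The monthly counts are the part that answers the "getting better or developing new problems" question; the individual descriptions are what you use to tighten the scope and the escalation triggers.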

Where this leaves you

Agentic AI is real. The three use cases I've described - research workflows, document automation, and communication triage - are working in production at firms comparable to yours. If you're dismissing it entirely, you're falling behind a capability curve that's moving faster than most people appreciate.

But the distance between what vendors are marketing and what's actually reliable in a professional services environment is significant. If you're deploying agentic systems without understanding their current failure modes - or without the infrastructure to support them - you're creating risk that could be worse than the problem you're trying to solve.

The useful position is between those two extremes. Know what's viable. Know what's not. Get your data and your systems into shape. Pilot carefully, with proper safeguards, in a bounded environment.

The firms that start this work now - even modestly - will be in a dramatically different position eighteen months from now compared to those still debating whether it's real. That's not a prediction. It's already happening.

For deeper technical implementation detail and a step-by-step deployment roadmap, our Agentic AI at Work guide goes significantly further than I can in a single article. And if you want to assess whether your firm's current data, systems, and governance infrastructure is actually ready to support an agentic AI pilot - and which of the three viable use cases would be the highest-value starting point - book a readiness assessment. We've built an agentic AI readiness quick-check that covers the three infrastructure requirements with a traffic-light rating for each use case. It takes about 90 minutes and gives you a clear "pilot-ready" or "not yet ready" verdict. That's a better starting point than another vendor demo.