How resilient is your digital infrastructure? (Probably less than you think.)

On 19 July 2024, CrowdStrike pushed a faulty update that crashed an estimated 8.5 million Windows machines worldwide. Airlines grounded flights. Hospitals postponed surgeries. Banks couldn't process payments. The estimated cost? North of $5 billion globally. It has been widely described as the largest IT outage in history - and it wasn't caused by a cyberattack. It was caused by a configuration file.

That story dominated the news for about a week. Enterprise CIOs scrambled. Boards convened emergency sessions. And then, as these things tend to, it faded. The big firms patched, recovered, published their post-mortems, and moved on.

But I keep coming back to one question. If a single bad update can bring Delta Air Lines to its knees - a company with a dedicated IT workforce numbering in the thousands, redundancy across every layer, and contracts with every major infrastructure vendor on the planet - what would a similar incident do to your firm?

Not hypothetically. Specifically. Your firm. With your two-person IT team, your seven-year-old CMS, your single hosting provider, and the one developer who knows how everything connects because they built it in 2018 and nothing's been documented since.

That's the question most mid-market B2B service firms haven't properly asked. And honestly, I think it's because the answer is frightening enough that it's easier not to.

"We've never had a major outage. Our systems are reliable enough."

I hear this constantly. And look, it might be true - you might have been genuinely lucky. But "we've never had a problem" isn't a resilience strategy. It's survivorship bias dressed up as operational confidence. The firms that get hurt worst by platform failures aren't the ones that planned for them and got unlucky. They're the ones that never thought it would happen to them.

The mid-market exposure problem

Enterprise firms invest in resilience the way they invest in insurance - systematically, with dedicated budget lines and specialist teams. They run chaos engineering exercises. They maintain hot standby environments. They have disaster recovery plans that get tested quarterly, not filed in a drawer.

You probably don't have any of that. And I'm not saying that to be harsh - you don't have it because it's expensive, you've had other priorities, and nobody's made a compelling case for spending money on preventing something that hasn't happened yet. I get it. But the consequence is that mid-market firms carry a disproportionate amount of infrastructure risk relative to their size.

Enterprise resilience is built on redundancy - multiple servers, multiple data centres, multiple people who understand every system. When one thing fails, another thing catches it. Mid-market firms, by contrast, tend to have single points of failure everywhere. One hosting provider. One CMS administrator. One person who understands the integration between the website and the CRM. One backup process that nobody's tested since it was set up.

I was with a 200-person consulting firm earlier this year. Good firm, strong reputation, decent revenue. Their entire digital presence - website, client portal, document management - ran on a single virtual machine at a hosting provider they'd been with since 2016. No failover. No load balancing. One instance. When I asked what would happen if that server went down on a Monday morning, the IT manager paused for longer than you'd want and said, "We'd ring them and hope it came back quickly."

That's not a plan. That's a prayer.

What downtime actually costs you

When enterprise firms experience outages, the cost is measured primarily in lost transactions. Revenue per minute. It's big, it's visible, and it's recoverable - because their customers are usually locked into contracts or ecosystems that make switching impractical in the short term.

For a mid-market B2B service firm, the cost calculus is completely different. And worse.

Think about what happens if your website goes down for 48 hours. You lose whatever inbound enquiries would have come in during that window - and you'll never know how many, which is part of the problem. But that's the small bit. The real damage is what happens to client confidence.

If you're a law firm and your client portal is inaccessible during a deal completion, that client is going to remember. If you're a consulting firm and your project extranet disappears the week before a board presentation, your client is going to wonder what else might fall over. If you're a financial services firm and your compliance reporting tool goes dark during a regulatory review - well, I don't need to finish that sentence.

We worked with a specialist commercial lender a while back where 40% of broker applications contained errors because the portal was clunky and unreliable. Brokers didn't complain. They just quietly placed their business elsewhere. The cost wasn't visible in an outage report. It showed up as a slow bleed in application volumes that took months to diagnose.

Resilience failures follow the same pattern. Your clients won't send you a formal complaint. They'll just start thinking of you as the firm that's a bit... unreliable. And that thought will be sitting in the back of their mind next time they're choosing who to work with.

A single significant outage at a mid-market firm can easily cost tens of thousands in direct lost productivity and emergency remediation. The bit that's hardest to quantify - the trust deficit that lingers long after the systems come back online - is usually the most expensive part.

The platform age problem

Here's where this connects to something bigger. Older platforms aren't just missing features or looking dated. They're structurally less resilient.

A CMS installed in 2017 is running on dependencies that have aged seven years. PHP versions no longer actively supported. Plugins that haven't been updated because the developer moved on. Server configurations that made sense at the time but don't account for current traffic patterns or security threats. The integration with your CRM that was hand-coded by a contractor who left three years ago and whose mobile number no longer works.

Every year that passes, the platform accumulates what the industry calls technical debt. But the resilience dimension of technical debt is the one that gets least attention. Old platforms aren't just slow or frustrating to use - they're increasingly fragile. The number of things that can go wrong grows, while the number of people who understand how to fix them shrinks.

I spoke to the CTO of a mid-sized accountancy firm last year who described their platform situation as "a Jenga tower that nobody's allowed to touch." Everyone knew it was precarious. Nobody wanted to be the person who pulled the wrong block. So they just... left it. Added workarounds. Taped things together. And hoped.

That's not an unusual story. I'd guess at least half the mid-market firms we talk to are in some version of it. The platform works, mostly, until it doesn't. And when it doesn't, the recovery is brutal because nobody fully understands the system anymore.

If your CMS vendor has stopped issuing security patches and you're pretending that's fine - it's not fine.

Finding your single points of failure

Right. Enough doom. Let's talk about what you can actually do.

The first step is an honest audit of where your single points of failure are. And I mean genuinely honest - not the version you'd present to the board, but the version you'd admit to after a second glass of wine.

Start with hosting and infrastructure. How many environments are you running? Is there a failover? If your primary hosting provider has an outage, what actually happens? And when was the last time anyone tested the backup restore process? If the answer is "never" or "I'm not sure," that's your first action item - because I cannot tell you how many firms discover their backup process is broken only at the exact moment they need it to work.

Then there's access and credentials. Who has admin access to your CMS, your hosting control panel, your DNS provider, your SSL certificates? Is it documented somewhere that isn't one person's head? I've seen firms locked out of their own domain because the person who registered it left the company and used their personal email. That is not a fun Tuesday morning.

Documentation is the one everyone skips. Is there a current architecture diagram for your digital estate? Do you know what integrations exist, what data flows between systems, and what breaks if any one system goes down? I had a client last year - a 180-person professional services firm - where the only person who understood the full stack had been there eleven years and had never written any of it down. When I asked why, he shrugged and said, "Nobody ever asked." Nobody asked because nobody wanted to think about what happened if he got hit by a bus. Which is, of course, exactly the problem.

Dependency mapping is worth doing properly too. Which third-party services does your platform depend on? Payment processors, email services, analytics, search, CDN, API connections to your CRM or ERP. Each one is a potential failure point. Have you checked their uptime SLAs recently?
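If it helps to make that concrete, here's a minimal sketch of a dependency map you can interrogate. The system names and the shape of the map are invented for illustration, so swap in your own estate. It answers one question: if this thing fails, what else goes with it?

```python
from collections import deque

# Hypothetical dependency map: each system lists what it relies on directly.
# Every name here is illustrative; replace it with your own inventory.
DEPENDS_ON = {
    "website": ["cms", "cdn", "dns"],
    "client_portal": ["cms", "crm_api", "email_service"],
    "cms": ["hosting_vm"],
    "crm_api": ["crm"],
    "cdn": [],
    "dns": [],
    "crm": [],
    "email_service": [],
    "hosting_vm": [],
}


def impacted_by(failed, depends_on):
    """Return every system that directly or indirectly relies on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(system)

    # Breadth-first walk outwards from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, ()):
            if system not in seen:
                seen.add(system)
                queue.append(system)
    return sorted(seen)


print(impacted_by("hosting_vm", DEPENDS_ON))
```

In this made-up estate, losing the single hosting VM takes out the CMS, the website and the client portal in one go. The components with the longest impact lists are the ones worth looking at first.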

And then there's the one people find most uncomfortable: key-person dependencies. If your IT manager, or your one developer, or the agency contact who knows your setup - if they were suddenly unavailable for two weeks, what would happen? Could someone else step in? One thing I'd add here: make sure you have independent access to everything - hosting, source code, DNS. I've seen relationships with agencies break down and firms discover they don't actually own their own infrastructure. That's a conversation you want to have before there's a crisis, not during one.
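One rough way to put numbers on the access and key-person questions is a simple access register. The sketch below assumes a hand-maintained mapping of systems to the people who hold admin credentials; the names and systems are placeholders, not a recommendation for how to store this.

```python
# Hypothetical access register: system -> people who hold admin credentials.
# All names and systems are placeholders for illustration.
ADMIN_ACCESS = {
    "cms": ["alex"],
    "hosting_control_panel": ["alex", "agency_contact"],
    "dns_provider": ["alex"],
    "ssl_certificates": ["alex", "sam"],
    "source_code": ["agency_contact"],
}


def single_admin_systems(access):
    """Systems where fewer than two distinct people hold admin access."""
    return sorted(
        system for system, admins in access.items() if len(set(admins)) < 2
    )


def systems_lost_without(person, access):
    """Systems nobody could administer if `person` were unavailable."""
    return sorted(
        system for system, admins in access.items() if set(admins) <= {person}
    )


print("Only one admin:", single_admin_systems(ADMIN_ACCESS))
print("Stuck without alex:", systems_lost_without("alex", ADMIN_ACCESS))
```

Anything that turns up in either list is a key-person dependency. If the only name against your source code or DNS is an agency contact, that's the ownership conversation to have now.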

If you're going through all that and feeling slightly queasy, that's actually a good sign. It means you're being honest. Most firms that tell me they're "probably fine" haven't done this exercise properly.

Practical resilience at mid-market scale

I want to be realistic here. I'm not going to tell you to build a hot standby environment across three data centres and hire a dedicated site reliability engineering team. That's enterprise thinking applied to mid-market budgets, and it doesn't work.

But there are things you can do that cost relatively little and meaningfully reduce your exposure.

Automated, tested backups. Not just "we have backups" - actually tested backups. Schedule a restore test quarterly. It takes half a day. And while you're at it, make sure your recovery documentation isn't stored on the server that's just gone down. I wish I was joking about that one. I've seen it more than once.
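What does "tested" look like in practice? One possible approach, sketched below, is to record checksums of the live files when the backup is taken, restore into a scratch directory, and compare. The function names are my own and this only covers files on disk. A database restore needs its own check, such as loading the dump and running a few sanity queries.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(manifest, restored_root):
    """Return (path, problem) pairs for files missing or altered after restore."""
    restored_root = Path(restored_root)
    problems = []
    for relative_path, expected in sorted(manifest.items()):
        candidate = restored_root / relative_path
        if not candidate.is_file():
            problems.append((relative_path, "missing"))
        elif sha256_of(candidate) != expected:
            problems.append((relative_path, "checksum mismatch"))
    return problems
```

Build the manifest at backup time, keep it somewhere other than the server being backed up, and run verify_restore against the scratch copy each quarter. An empty list means the files came back intact.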

Basic uptime monitoring costs almost nothing - services like UptimeRobot or Pingdom will tell you when your site goes down before your clients do. You'd be amazed how many firms find out about outages from a client email rather than their own monitoring. Because they don't have any monitoring.
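To show how little is involved, here's a bare-bones version of the check those services run, using only Python's standard library. It's a sketch rather than a replacement for a hosted monitor: it makes a single request from one location, and the function name and timeout are my own choices.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=5.0):
    """Return (is_up, detail) for a single HTTP GET against `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout and similar.
        return False, f"unreachable: {exc}"
```

Called as check_site("https://www.example.com/") (a placeholder URL), it returns a pair such as (True, "HTTP 200"). Run it on a schedule from somewhere other than the server being checked, and send the failures to a channel that doesn't depend on that server either.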

Never push updates straight to production. Test them in a staging environment first, and always have a plan for rolling back if something goes wrong. The CrowdStrike incident was essentially a failure of this principle at massive scale. At your scale, it's much easier to get right.

And review your hosting contract. Understand what "99.9% uptime" actually means in practice - it allows about 8.8 hours of downtime per year, which sounds fine until all of those hours land on the same Wednesday. Understand what compensation you're entitled to and what the provider's obligations are during an incident.
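The arithmetic behind an uptime percentage is worth running for whatever figure is in your own contract. This short sketch assumes a 365-day year and an SLA measured over the whole period; many providers measure monthly and exclude scheduled maintenance, so check the small print.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; ignores leap years


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an uptime percentage over a period."""
    return (1 - uptime_percent / 100) * period_hours


for sla in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% uptime allows about {hours:.2f} hours of downtime a year")
```

Pass a different period_hours (for example 30 * 24 for a monthly SLA) to see what the provider can miss in a single month before they owe you anything.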

None of this is glamorous. None of it will feature in your annual report. But it's the difference between a bad day and a catastrophic one.

When resilience becomes a platform conversation

There's a point where resilience measures stop being about process and start being about architecture. If your platform is old enough that the underlying technology is unsupported, or fragile enough that every update feels like Russian roulette, or so poorly documented that only one person can maintain it - then bolt-on resilience measures are sticking plasters.

At that point, the conversation shifts from "how do we make this more resilient?" to "do we need a different foundation?" That's a legitimate business case conversation, not just a technology one. If a resilience assessment reveals significant risk, the investment case for platform modernisation should include risk reduction alongside capability improvement - and framing it that way tends to land very differently in a board conversation than "we need to upgrade our CMS."

The cost of prevention is almost always a fraction of the cost of recovery. A proper resilience assessment - the kind we run in about two weeks - typically costs less than a single day of unplanned downtime when you factor in emergency contractor fees, lost productivity, client communications, and the quiet damage to your reputation.

"But we've managed fine so far."

You've managed fine so far with the risks you know about. The ones you don't know about are the ones that'll get you.

One last thing

I was chatting to a managing partner at a 150-person law firm a few months back. Smart woman, commercially sharp, runs a tight ship. I asked her what would happen if their client portal went down for 48 hours during a transaction. She thought about it for a moment and said, "Honestly? I don't know. And the fact that I don't know is probably the answer."

That's exactly right. If you can't describe what happens when your systems fail, you don't have a resilience plan. You have an assumption.

If you're not sure how exposed you are, we can run a platform resilience assessment in two weeks. It won't tell you everything is fine - almost nothing ever is - but it'll tell you where the real risks are, what to fix first, and what it'll cost. Which is a significantly better position than finding out the hard way.
