What lessons have been learned following one of 2024’s most memorable security incidents and how should CISOs respond?
On July 19, IT admins across the globe arrived at work to be faced with the dreaded Windows blue screen of death (BSOD). As the morning unfolded, it became clear that the issue lay not with Microsoft's Windows itself, but with a botched update from security vendor CrowdStrike.
It was one of the worst incidents of the year, largely because of the scale of the disruption it caused. No hacking was involved, yet flights were grounded and TV channels, including the UK's Sky News, were unable to broadcast.
CrowdStrike responded to the incident promptly, providing workarounds to fix the issue. But for many organisations the faulty update had already been applied, and some found they had to intervene manually, leaving IT teams working into the weekend and beyond.
While it was a major event that no one, least of all CrowdStrike and Microsoft, wants to see repeated, lessons have been learnt across the board. A few days after the outage, CrowdStrike published its preliminary Post Incident Review (PIR), outlining how an error in the software that tests its Falcon Sensor updates was responsible for the issue.
Since July, CrowdStrike has made changes to the way it delivers updates, giving customers more control over how and when they are applied. At the same time, the firm has improved its quality assurance and testing processes to ensure a similar failure doesn't happen again.
Meanwhile, Microsoft is looking at ways to better manage the kernel-level access given to security products, access that may have contributed to making the CrowdStrike incident more damaging.
So, what lessons can businesses learn following one of 2024’s most memorable security incidents and how should CISOs respond?
Monoculture is Dead
The CrowdStrike incident highlighted some broader issues that all companies face. One example is monoculture, in which firms rely on a single security solution across all servers and clients, as highlighted by SentinelOne CISO Alex Stamos.
The wider lessons from the CrowdStrike incident centre around “how to best reduce single points of failure in case of similar events”, says Jeff Watkins, chief technology officer at CreateFuture.
Modern businesses have become deeply interconnected, so a failure in a single, highly relied-upon system “can cascade across industries”, says Davis DeRodes, a lead data scientist at Fusion Risk Management.
Organisations have now learned that relying on a single provider can leave them exposed to systemic risk, he says. “The widespread disruption forced many businesses to ask, ‘are we too dependent on this single point of failure? Are our vendors also dependent on those same single points of failure?’.”
Contingency Planning
The incident also underscores the need for more resilience in technology environments, says Jitendra Nandwani, SVP and head of cloud, infrastructure and security services at Zensar Technologies. “Organisations must design systems that can withstand failures without significantly impacting operations. This includes implementing failover mechanisms and ensuring that critical services have appropriate backup systems in place.”
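As a loose illustration of the failover principle Nandwani describes, the sketch below checks a primary service and falls back to a backup before declaring an outage. The endpoint URLs, timeout and health-check semantics are hypothetical assumptions for illustration, not any vendor's actual implementation.

```python
# Hypothetical sketch of a failover mechanism: the URLs, timeout and
# health-check semantics are illustrative assumptions, not a real deployment.
import urllib.request
import urllib.error

ENDPOINTS = [
    "https://primary.example.com/health",   # assumed primary service
    "https://backup.example.com/health",    # assumed standby replica
]

def first_healthy_endpoint(endpoints, timeout_seconds=2):
    """Return the first endpoint whose health check succeeds, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                if response.status == 200:
                    return url
        except (urllib.error.URLError, TimeoutError):
            continue  # unreachable or slow: try the next endpoint
    return None  # everything failed: invoke the documented contingency plan

if __name__ == "__main__":
    active = first_healthy_endpoint(ENDPOINTS)
    print(f"Routing traffic to {active}" if active else "All endpoints down")
```

The point is less the code than the design choice: a business-critical service should have a tested alternative path ready before it is ever needed.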
But some firms are still lacking in this area. The CrowdStrike outage exposed a gap in many organisations' scenario testing and contingency planning, says DeRodes. “Too often, organisations plan for predictable, small-scale disruptions, but the CrowdStrike incident showed that even highly reliable systems can fail – and with far-reaching consequences.”
It is “reasonably plausible” that any system could fail at any time, and businesses need to prepare for this, he says. “Firms now understand their resilience strategies must account for this wide breadth of scenarios – in particular the ones where the impact would be devastating.”
IT teams often understand the technical risks, but the operational fallout extends across customer service, logistics, finance and beyond, he points out. “Preparing for a scenario isn’t just about recovery; it's about understanding the implications of the time to recover. You need to ask yourself, what key services will be impacted? Which customers might be impacted?”
Risk Management
The CrowdStrike outage serves as a reminder of the need for businesses to take a more critical approach to risk management, says David Sant, senior commercial technology and data protection solicitor at Harper James.
Sant advises firms to identify the practical and regulatory risks of system failures by assessing each business-critical system, evaluating the impact of any failure, and ensuring the risks are documented and updated in a risk register. “That allows a business to ask itself whether it has appropriate insurance, what mitigation measures it could put in place, and what budget is worth allocating to these measures.”
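To make Sant's checklist concrete, a risk-register entry for a business-critical system might capture something like the following. Every field name and value here is an illustrative assumption, not a prescribed template.

```python
# Hypothetical risk-register entry; all fields and values are illustrative.
risk_register_entry = {
    "system": "Endpoint protection platform",
    "failure_scenario": "Faulty vendor update leaves endpoints unbootable",
    "likelihood": "Low",
    "impact": "Severe - company-wide outage, manual per-device recovery",
    "mitigations": [
        "Stagger vendor updates across deployment rings",
        "Maintain tested manual-recovery runbooks",
    ],
    "insurance_in_place": True,
    "mitigation_budget_owner": "CISO",
    "last_reviewed": "2024-09-01",
}
```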
CrowdStrike has made changes following the incident to strengthen its processes. “We are grateful for the strong support and trust our customers and partners have placed in CrowdStrike,” a CrowdStrike spokesperson says. “Our focus continues to be on using the lessons learned from July 19 to better serve them.”
But these lessons extend to security leaders too. CISOs should also look to implement rigorous testing protocols before deploying updates across their infrastructure, says Gregory Richardson, VP global advisory CISO at BlackBerry Cyber. If a vendor manages this process, it is essential to ask about their remediation plans for problematic updates, he says.
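One way to picture the staged approach Richardson recommends is a ring-based rollout that halts if early rings report problems. The sketch below is a simplified, hypothetical illustration: the ring names, host groups and health check are assumptions, not any vendor's actual update mechanism.

```python
# Hypothetical staged (ring-based) rollout: promote an update only while
# each ring stays below its failure threshold. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    hosts: list              # hostnames in this deployment ring
    max_failure_rate: float  # abort threshold before promoting further

RINGS = [
    Ring("canary", ["test-vm-01", "test-vm-02"], 0.0),
    Ring("early-adopters", [f"branch-{i}" for i in range(20)], 0.05),
    Ring("fleet", [f"host-{i}" for i in range(500)], 0.05),
]

def host_is_healthy(host: str) -> bool:
    # Placeholder: in practice this would poll agent telemetry or a health API.
    return True

def roll_out(update_id: str, rings) -> bool:
    for ring in rings:
        print(f"Applying {update_id} to ring '{ring.name}' ({len(ring.hosts)} hosts)")
        failures = sum(1 for host in ring.hosts if not host_is_healthy(host))
        if failures / len(ring.hosts) > ring.max_failure_rate:
            print(f"Halting rollout: {failures} unhealthy hosts in '{ring.name}'")
            return False  # stop before the update reaches the wider fleet
    return True

if __name__ == "__main__":
    roll_out("content-update-2024-07-19", RINGS)
```

Whether the rings are managed in-house or by the vendor, the question Richardson raises is the same: what is the tested path for halting and remediating a bad update before it reaches the whole estate?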
At the same time, firms should identify any suppliers that could impact the proper functioning of their business-critical systems, Sant says. “The contracts with these suppliers should allocate the risks of system failures fairly, through appropriate warranties, indemnities, liability caps and insurance requirements.”
The wording of the liability clauses really matters, he says. “A high liability cap may seem attractive, but it could be worthless if a different part of the clause excludes loss of profits and wasted expenditure.”
Written by
Kate O'Flaherty
Cybersecurity and privacy journalist