
CrowdStrike: What Happens Now?

How could such a small software change cause an entire operating system to fall over, and should testing and quality assurance at a large security firm have been a lot more thorough?

In July, organisations across the world were left reeling after an error in a CrowdStrike update crashed millions of Microsoft Windows machines. Once the update had been applied and computers were displaying the blue screen of death (BSOD), the damage was almost impossible to undo. IT admins struggled all day and into the weekend to apply the fix, in many cases manually.

The outage hit companies around the world, including airlines and news services such as Sky News, which was so badly affected it was unable to broadcast.

A few days after the outage, CrowdStrike published its preliminary Post Incident Review (PIR) outlining how an error in the software that tests its Falcon Sensor updates was responsible for the issue.

Nearly all affected Windows computers are now back up and running, according to CrowdStrike CEO George Kurtz.

CrowdStrike says it will learn from the incident, with steps outlined in its PIR including an improved, staged rollout process. But the damage has already been done, both to the firm’s reputation and to the companies that lost revenue during the massive outage.

Over the last two weeks, the fallout has continued, with a class-action lawsuit launched by a CrowdStrike investor pension fund, which says the firm failed to test its software properly. Meanwhile, the CEO of Delta Air Lines has condemned the firm and blamed it for losses amounting to $500m.

It's fair to say the outage raises multiple questions. How could such a small software change cause an entire operating system to fall over, are auto-updates really necessary for non-security-related changes, and should testing and quality assurance (QA) at a market-leading firm have been a lot more thorough?

The Updates Question

The CrowdStrike update that triggered the Windows BSOD was a content update intended to keep the monitoring software up to date with the latest threats. While such updates should be delivered promptly, speed has to be balanced against ensuring they do not disrupt operations, says Brian Honan, CEO at BH Consulting.

“A lot of trust is placed in security vendors to ensure appropriate measures, such as robust quality assurance and regression testing, are in place to prevent major compatibility issues within their client base.”

A staged rollout would have helped prevent the CrowdStrike issue from being so wide-reaching. While a staggered rollout would not have completely prevented the issue from occurring, it could have limited the number of systems affected, says Sean Wright, head of application security at Featurespace.

“As soon as issues started arising, further rollout of the update to other devices could have been halted,” Wright says. “Additionally, rollout to a pre-staging area that mimics a production environment would have likely caught the issue before it was issued to production instances.”

Speed doesn’t have to come at the expense of stability. There are many techniques for deploying software and configuration updates quickly and safely, agrees James Kretchmar, SVP and CTO, cloud technology group at Akamai. “Staggered deployment techniques would have helped prevent this incident. With a staggered deployment, you issue the update first to a small pool of devices, observe the effects via telemetry, and then only proceed to wider stages of deployment when it’s clear the effects have been positive.”
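To illustrate the approach Kretchmar describes, here is a minimal Python sketch of a staged rollout loop: each stage pushes the update to a larger slice of the fleet, waits for telemetry, and halts the rollout if the observed crash rate crosses a threshold. The stage sizes, the 0.1% threshold and the deploy_to, crash_rate and rollback helpers are all hypothetical placeholders, not CrowdStrike’s or Akamai’s actual tooling.

    import time

    # Hypothetical stage sizes: the fraction of the fleet that receives the
    # update at each step. Real systems might also partition by region or OS build.
    STAGES = [0.001, 0.01, 0.10, 0.50, 1.00]

    CRASH_RATE_THRESHOLD = 0.001    # halt if more than 0.1% of updated hosts crash
    OBSERVATION_WINDOW_S = 15 * 60  # watch telemetry for 15 minutes per stage

    def deploy_to(fraction: float, update_id: str) -> list[str]:
        """Push the update to this fraction of the fleet (placeholder)."""
        raise NotImplementedError("hook up to the real deployment system")

    def crash_rate(hosts: list[str]) -> float:
        """Return the observed crash rate for these hosts (placeholder)."""
        raise NotImplementedError("hook up to the real telemetry pipeline")

    def rollback(hosts: list[str], update_id: str) -> None:
        """Revert the update on the affected hosts (placeholder)."""
        raise NotImplementedError("hook up to the real deployment system")

    def staged_rollout(update_id: str) -> bool:
        """Roll out an update in stages, stopping as soon as telemetry looks bad."""
        for fraction in STAGES:
            hosts = deploy_to(fraction, update_id)
            time.sleep(OBSERVATION_WINDOW_S)  # let telemetry accumulate

            if crash_rate(hosts) > CRASH_RATE_THRESHOLD:
                # The rest of the fleet never receives the faulty update,
                # and the hosts that did are reverted.
                rollback(hosts, update_id)
                return False

        return True  # the update reached the full fleet without tripping the threshold

The point of this structure is that a faulty update trips the threshold while it is still confined to the first, small stage, which is the containment both Wright and Kretchmar describe.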

Testing and QA will be key for CrowdStrike to ensure an incident like this never happens again. Going forward, ensuring best practices around software development and updates is essential, says Wright. “We have decades of previous failures to lean on, and mechanisms that have been put into place to prevent these very types of scenarios. Testing, especially in a production-like environment, is imperative.”

Preparing for the Worst

The CrowdStrike incident also shows how important it is that businesses have plans for outages of this size. “For me the big lesson to be learnt from the CrowdStrike event is how confident are you with your business continuity and cyber-resilience plans?” says Honan.

He says plans should cover a complete outage of your IT environment, whatever the cause. “Then you need to assess, have you thoroughly tested those plans? What workarounds, manual or otherwise, have you got in place to continue providing services to your clients?”

Ensuring resilience also means working with your third-party vendors and service providers to determine how robust their plans are. For example: “If they were to have an outage, how would that impact your business?” asks Honan.

In the end, preparation is key. The industry will never entirely eliminate service outages, but their likelihood, severity and impact depend on actions taken in advance, says Kretchmar. He advises firms to look across their business and technologies and “think hard about what could go wrong that you haven’t considered”.

This needs to be done regularly: Reliability and business continuity planning should never be a “once and done” task, Kretchmar says. “It requires a persistent effort, where you’re regularly reviewing the different areas of your technology stack, communicating with your partners and vendors, assessing the risks and working to minimise them.”

Kate O'Flaherty, cybersecurity and privacy journalist
