
AWS Outage Fallout: Lessons In Resilience

How bad was the latest AWS outage and what can security leaders learn from it?

In October, Amazon Web Services (AWS) suffered its biggest outage in years, crippling applications and disrupting businesses across the globe. A week later, Microsoft Azure was hit by an outage of its own, affecting many global services.

Both outages were attributed to problems with the domain name system (DNS), the internet’s naming system, which maps websites and services to IP addresses.
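To see why a DNS fault cascades so widely, consider that nearly every call to a cloud service begins with a name lookup. A minimal Python sketch illustrates the point; the endpoint shown is just an illustrative hostname, and any service name would behave the same way:

```python
import socket

# Every call to a cloud service starts by resolving its hostname to an IP.
# If the provider's DNS records are broken or unreachable, this first step
# fails, and everything layered on top of it fails with it.
try:
    for family, _, _, _, sockaddr in socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443):
        print(family.name, sockaddr)
except socket.gaierror as err:
    # During a DNS outage, dependent services see name-resolution errors
    # like this rather than ordinary application-level failures.
    print("DNS resolution failed:", err)
```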

They were followed in November by an outage at web infrastructure provider Cloudflare, which caused widespread disruption for the services that depend on it.

The AWS outage and others like it have prompted warnings about the importance of resilience, as well as the risks of relying on a single cloud provider. How bad was the latest AWS outage and what lessons can security leaders learn from it?

Not As Bad As CrowdStrike

The AWS outage was big, but it wasn’t as bad as the issues that affected CrowdStrike after a botched security update in 2024. The CrowdStrike incident hit 8.5 million devices, shut down airports and hospitals and cost around $10 billion, says Camden Woollven, head of strategy and partnership marketing at GRC International Group. “The AWS outage was smaller – about 70,000 organisations affected, losses between $38 million and $581 million.”

However, the October AWS outage was bigger than the one the company famously suffered in 2021. That earlier outage lasted around eight hours and was “significantly smaller” in terms of global impact, according to Peter Jones, cybersecurity specialist at Conscia UK. “Overall, the October 2025 outage can be viewed as one of the largest service disruptions we’ve observed in the last few years.”

What made the 2025 outage particularly severe was the breadth and speed of the disruption, which highlighted “the cascading effects of dependencies in modern cloud environments,” says Kashif Nazir, senior technical architect at Cloudhouse. 

After a major outage, readiness tends to spike, but as time passes, “resilience fatigue” can set in, he warns. “The Azure outage just a week later demonstrates that these kinds of issues can happen more frequently – and as cloud services become more centralised, they will affect even more people.”

Multi-Cloud Setups

The impact of the AWS outage has prompted multiple warnings about the risks of relying on one cloud provider. But experts caution that moving to multi-cloud can also cause problems.

Multi-cloud is “not the default answer,” says Ryan Gracey, partner and technology lawyer at law firm Gordons. “For a few crown jewel services, splitting across providers can reduce single-supplier risk and satisfy regulators, but it also raises cost and complexity, and opens new ways to fail. Chasing a lowest common denominator setup often means giving up the very features that make cloud attractive.”

For most organisations, a balanced approach is best, according to Gracey: “Selective multi-cloud for the vital few, and deep mastery of one cloud for everything else.”

But this won’t be right for all firms. As with most decisions, the answer depends on the nature of your business and your risk profile, Jones concedes. “For the majority of organisations, considerations such as in-house skills, economies of scale and ease of management provide a significant advantage and therefore offset any potential risk of using a single platform approach.”

For others, where continuous operation is critical, a primary cloud platform with backup to a secondary cloud or on-premises service may be appropriate, says Jones. “This approach adds the complexity of defining handover procedures and can be relatively expensive; however, the resilience achieved should deliver assurance.”

Resilience Lessons

One of the most crucial lessons from outages is the need for resilience. Outages are inevitable, which makes it important to “avoid panic migrations” in their aftermath and instead strengthen business continuity planning and incident response testing.

These elements are “key ISO 27001 and EU Digital Operational Resilience Act principles,” says Chris Newton-Smith, CEO at I/O. The incident also underscores the importance of “clear accountability and visibility across third-party service providers” – a growing requirement under resilience regulations such as the Network and Information Systems Directive 2 (NIS2), he adds.

The takeaway from the latest outage is not just to buy more redundancy, says Gracey. “It’s about designing systems that bend, not break. They should slow down gracefully, drop non-essential features and protect the most important customer tasks when things go wrong. A part of this is running drills so teams know who decides what actions to take, what to say to customers and what to do first.”
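Gracey’s “bend, not break” idea translates into code as graceful degradation: serve the essential customer task no matter what, and quietly drop the extras. Here is a minimal Python sketch with entirely hypothetical feature names:

```python
def render_checkout() -> str:
    # The core customer task that must keep working during an incident.
    return "checkout form"

def render_recommendations() -> str:
    # A nice-to-have feature backed by a separate dependency.
    return "personalised recommendations"

def order_page(recommendations_healthy: bool) -> dict:
    """Protect the essential flow; degrade the non-essential one silently."""
    page = {"checkout": render_checkout()}
    # If the recommendations backend is down, show nothing rather than an error.
    page["recommendations"] = render_recommendations() if recommendations_healthy else ""
    return page

print(order_page(recommendations_healthy=False))
```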

For the cloud service provider, it’s important to recognise where a potential single point of failure – or “race condition” in the case of AWS – may exist, says Jones. “AWS will be looking at its architecture to ensure single points of failure are eliminated and the potential blast radius of any incident is dramatically reduced.”

For customers, it’s important to fully understand the supply chain and how your services are being delivered, Jones adds. “Many third-party cloud applications rely on AWS or one of the other major cloud providers to deliver their services, therefore many of an organisation’s applications may be dependent on one specific platform. For critical systems, it’s essential you understand these dependencies and ensure resilience across multiple providers or regions.”

Keeping Operations Up And Running During Outages 

Outages will continue to happen, but there are a few steps businesses can take to prepare.

Maintaining operations during outages requires “architectural and operational preparation,” says Nazir. “Services should be distributed across multiple regions, with automatic failover and buffering mechanisms to prevent lost data. Monitoring, emergency access, and communication tools should remain independent of affected systems.”
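As a rough illustration of Nazir’s point, a minimal Python sketch of failover with buffering follows; the region names and the write function are hypothetical stand-ins for real regional API calls:

```python
import queue

REGIONS = ["eu-west-1", "eu-central-1"]  # hypothetical primary and secondary regions
buffered_events: "queue.Queue[dict]" = queue.Queue()  # holds writes when all regions fail

def write_to_region(region: str, event: dict) -> None:
    # Stand-in for a real regional API call; raises ConnectionError on failure.
    raise ConnectionError(f"{region} unavailable")  # simulate an outage

def resilient_write(event: dict) -> None:
    """Try each region in turn; buffer locally if none accepts the write."""
    for region in REGIONS:
        try:
            write_to_region(region, event)
            return  # success: no failover needed
        except ConnectionError:
            continue  # automatic failover to the next region
    buffered_events.put(event)  # no region available: buffer to avoid data loss

resilient_write({"order_id": 42})
print("events awaiting replay:", buffered_events.qsize())
```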

Meanwhile, teams should regularly run simulated failure exercises to practice response and recovery, says Nazir.

During an outage, “discipline beats luck,” says Gracey. “Customer-facing flows should be separated from back-office jobs so they don't fight for resources. More than one zone should be used, and failover and recovery tested.”

Automatic retries need to be controlled to prevent a minor hiccup from snowballing, according to Gracey. “Put one person in charge, communicate clearly and often, and have prepared messages ready. Above all, invest in resilience where it protects the moments that matter most to customers, and accept that some risk will remain, but it can be managed with preparation.”
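The standard way to keep retries from snowballing is capped exponential backoff with jitter, plus a hard limit on attempts. A minimal sketch follows; the operation and tuning values are illustrative, not a recommendation for any particular system:

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 5):
    """Retry a flaky operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up: unbounded retries amplify an outage
            # Full jitter spreads clients out so they don't all retry in lockstep.
            time.sleep(random.uniform(0, min(30.0, 0.5 * 2 ** attempt)))

# Example: a call that fails twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_backoff(flaky))  # prints "ok" after two backoffs
```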

 

Kate O'Flaherty, cybersecurity and privacy journalist
