The Cascade Tax: Why 1 in 5 Incidents Is Caused by a Provider You Don't Control

Written by John Jamie | Jun 20, 2026 2:06:40 AM

Your monitoring is green. Your code hasn't changed. Your infrastructure looks fine. And yet your status page is lighting up.

Welcome to the cross-org cascade (FM-01), the most common single failure mode in the StackGen State of Reliability 2026 dataset, at 22% of all classified unplanned incidents across 360+ online services in 2025. It's not your bug. It's your dependency.

We analyzed 178,000+ status page incidents and 1,037 engineering post-mortems to understand how cascades work, how fast they spread, and what separates teams that resolve them in 30 minutes from those that spend four hours investigating the wrong service. Full data: stackgen.com/state-of-reliability.

What Is a Cross-Org Cascade?

A cross-org cascade is a failure inside a shared infrastructure provider that propagates to downstream operators. The downstream operator's own systems are not the cause. The primary lever is recognition speed: the faster your team identifies that the incident is a cascade, the faster you shift to the right posture, which is to communicate, degrade gracefully, and wait.

The Scale: 309-Minute Median MTTR

In 2025, FM-01 accounted for 4,516 incidents. Median recovery time was 309 minutes, about 3.2x the median for an internally caused config failure (~97 minutes). The gap isn't because cascades are technically harder to fix. It's because recognition is slow.
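For readers who want to check the arithmetic, here is a quick sketch in Python. It uses only the two medians quoted above; the variable names are ours, and the ~97-minute figure is approximate in the source, so treat the result as a rounded comparison rather than a precise one.

```python
# Median recovery times quoted in the text (minutes).
cascade_mttr_min = 309  # cross-org cascade (FM-01)
config_mttr_min = 97    # internally caused config failure (approximate)

ratio = cascade_mttr_min / config_mttr_min
gap_min = cascade_mttr_min - config_mttr_min

print(f"MTTR ratio: {ratio:.1f}x")                               # MTTR ratio: 3.2x
print(f"Gap: {gap_min} minutes (~{gap_min / 60:.1f} hours)")     # Gap: 212 minutes (~3.5 hours)
```

Put differently, the median cascade takes roughly three and a half hours longer to resolve than the median internal config failure.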

The Three Major Cascades in the Dataset

AWS us-east-1: October 20, 2025

A DynamoDB DNS automation race condition created an empty DNS record. Within 24 hours, 137 downstream companies posted status page incidents, the largest single-event cascade in the dataset. Most incident titles named "AWS us-east-1" verbatim. Recovery ranged from under an hour (multi-region failover) to 8+ hours (teams that first spent time ruling out internal causes).

CrowdStrike Falcon Sensor Update: July 19, 2024

A faulty content configuration update to the Falcon sensor caused Windows BSODs. Airlines, banks, hospitals, and government agencies all posted separate status page incidents tracing to the same third-party update. It is the clearest illustration of how a security tooling dependency creates cascade exposure at cloud-infrastructure scale.

Azure: October 2024

A networking misconfiguration in Azure's European infrastructure cascaded across dozens of fintech, SaaS, and communications operators using Azure Front Door, most within the same 2-hour window.

The Upstream Dependency Map

Upstream Type	Examples	Why It Cascades
Cloud	AWS, GCP, Azure	Broad dependency across every service tier
CDN	Cloudflare, Fastly	Traffic routing; outage hits end users directly
Identity	Okta, CrowdStrike	Auth-wall; outage blocks user access entirely
AI Provider	OpenAI, Anthropic	Growing as AI APIs embed in product features
Dev-Tooling	GitHub, Docker Hub	Deployment pipeline; blocks releases
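
One way to put a map like this to work is to keep it as data and query it the moment an incident opens. The sketch below is a minimal illustration in Python, not something from the report: the `Upstream` class, the `DEPENDENCIES` inventory, and the `triage` function are hypothetical names, and the health check is passed in as a function so you can plug in whatever status source you actually use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Upstream:
    name: str
    kind: str  # e.g. "Cloud", "CDN", "Identity", "AI Provider", "Dev-Tooling"


# Hypothetical inventory; replace with the providers your services depend on.
DEPENDENCIES: List[Upstream] = [
    Upstream("AWS", "Cloud"),
    Upstream("Cloudflare", "CDN"),
    Upstream("Okta", "Identity"),
    Upstream("OpenAI", "AI Provider"),
    Upstream("GitHub", "Dev-Tooling"),
]


def degraded_upstreams(
    deps: List[Upstream], is_healthy: Callable[[Upstream], bool]
) -> List[Upstream]:
    """Return the upstreams that the supplied health check reports as unhealthy."""
    return [dep for dep in deps if not is_healthy(dep)]


def triage(deps: List[Upstream], is_healthy: Callable[[Upstream], bool]) -> str:
    """Check upstream providers first, then fall back to internal investigation."""
    degraded = degraded_upstreams(deps, is_healthy)
    if degraded:
        names = ", ".join(f"{dep.name} ({dep.kind})" for dep in degraded)
        return f"possible cascade: {names}"
    return "no upstream issue found: investigate internal services"
```

For example, calling `triage(DEPENDENCIES, lambda dep: dep.name != "AWS")` with a stub check that marks AWS as unhealthy returns `"possible cascade: AWS (Cloud)"`. In practice the `is_healthy` function would wrap a provider status feed or a synthetic probe; which one is an assumption left to the reader.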

What Good Teams Do Differently

Cascade detection as first reflex: check upstream health before investigating internal services
Pre-defined degraded-mode behavior: circuit breakers, cached fallbacks, graceful degradation pages built before the incident
Faster user communication: you can't fix the upstream, but you can tell users what's happening
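
To make the degraded-mode idea concrete, here is a minimal sketch of a circuit breaker with a cached fallback in Python. It is an illustration under our own assumptions, not an implementation from the report: the `CircuitBreaker` class, its parameters (`max_failures`, `reset_after_s`), and the injected `clock` are all hypothetical, and a production version would also need thread safety, cache expiry, and metrics.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Stop calling a failing upstream for a while and serve the last good value."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._cached: Any = None
        self._has_cache = False

    def is_open(self) -> bool:
        """True while upstream calls are being skipped."""
        if self._opened_at is None:
            return False
        # Once the cool-down has elapsed, let a trial call through (half-open).
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, fetch: Callable[[], Any], default: Any = None) -> Any:
        """Return fetch() if possible, else the cached value, else default."""
        if self.is_open():
            return self._fallback(default)
        try:
            value = fetch()
        except Exception:
            # Any error counts as an upstream failure in this sketch.
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return self._fallback(default)
        # Success: close the breaker and refresh the cache.
        self._failures = 0
        self._opened_at = None
        self._cached = value
        self._has_cache = True
        return value

    def _fallback(self, default: Any) -> Any:
        return self._cached if self._has_cache else default
```

A caller wraps each upstream request, for example `breaker.call(fetch_prices, default=[])`, where `fetch_prices` is a placeholder for your own function. While the upstream is down, users keep seeing the last good response (or the default) instead of an error, and your service stops adding load to a provider that is already struggling.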

Key Takeaways

22% of classified unplanned incidents in 2025 are cross-org cascades, a share that is growing as cloud and AI API dependencies deepen
Median MTTR is 309 minutes, about 3.2x that of an internal config failure. Most of that gap is recognition time.
AWS Oct 2025, CrowdStrike Jul 2024, and Azure Oct 2024 are the three largest cascade events in the dataset
The long-run lever: multi-vendor failover for your top 5-10 upstream dependencies

stackgen.com/state-of-reliability | LinkedIn webinar
