When a service goes down, the instinct is to ask what broke \u2014 but the more useful question is how did it break? The pattern of failure tells you more about detection speed, remediation path, and prevention strategy than any single root cause.
We analyzed 178,000+ status page incidents across 360+ online services and 1,037 detailed RCA reports to build a structured vocabulary for how systems fail: the StackGen Failure Mode Taxonomy. 30 named patterns, 8 families, grounded in data.
For the full dataset and interactive benchmarks, see the StackGen State of Reliability 2026 report.
Two incidents with identical root causes can look completely different depending on how the failure propagated. The failure mode is the pattern. The root cause is the trigger. Confusing them leads to the wrong remediation.
FM-01 Cross-Org Cascade (4,516 incidents in 2025): AWS us-east-1 Oct 2025 hit 137 downstream companies in 24 hours. CrowdStrike July 2024 hit banking, airlines, and healthcare. Azure Oct 2024 cascaded across European operators. Also: FM-23 Hidden Internal Coupling (579 incidents).
FM-06 Aggregate-Masked Tail Degradation: aggregate metrics look healthy while P99 spikes. FM-35 In-Flight Compatibility Break (481): partial rollouts create version-boundary failures.
FM-09 Deploy-Induced Regression (3,764 incidents): Anthropic sees 73% of incidents map to specific model version regressions. FM-10 Config-Induced Failure (1,763): Cloudflare Nov 2025 global outage \u2014 a ClickHouse permissions change doubled a feature file size.
FM-13 Resource Exhaustion (2,185): connection pool, CPU, memory, disk. FM-25 Autoscaling Pathology (231). Both are Tier C \u2014 highly automatable.
FM-15 External Attack. FM-16 Supply-Chain Breach: Salesloft/Drift OAuth breach Aug 2025 \u2014 70+ day incident.
FM-17 AI Service Output Quality Degradation (250): LLMOps, MLOps, DataOps, DevOps causes. FM-33 GPU / Accelerator Fleet Heterogeneity (Provisional). The 2% share understates reality \u2014 training pipeline failures rarely surface on status pages.
FM-21 Phased Data Recovery (2,315): Datadog June 2025 \u2014 2.1 days. OpenAI Compliance API delays July 2025 \u2014 5.4 days.
55% of incidents fall into AI-Closed modes. 28% need AI-Augmented human judgment. 17% remain Human-Led.
Explore the data at stackgen.com/state-of-reliability. Sign up for the LinkedIn webinar.