
How Online Services Actually Break: A Data-Backed Failure Mode Taxonomy

Author:
John Jamie | Jun 20, 2026

When a service goes down, the instinct is to ask what broke, but the more useful question is how it broke. The pattern of failure tells you more about detection speed, remediation path, and prevention strategy than any single root cause.

We analyzed 178,000+ status page incidents across 360+ online services and 1,037 detailed RCA reports to build a structured vocabulary for how systems fail: the StackGen Failure Mode Taxonomy. 30 named patterns, 8 families, grounded in data.

For the full dataset and interactive benchmarks, see the StackGen State of Reliability 2026 report.

Why Naming Failure Modes Matters

Two incidents with identical root causes can look completely different depending on how the failure propagated. The failure mode is the pattern. The root cause is the trigger. Confusing them leads to the wrong remediation.
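
One way to keep the two apart in incident tooling is to record them as separate fields. The following is a minimal sketch, assuming a hypothetical `Incident` record. The field names and sample incidents are invented for illustration and are not from the StackGen dataset; only the FM identifiers come from the taxonomy.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    title: str
    root_cause: str    # the trigger
    failure_mode: str  # the pattern, as a taxonomy ID


# Invented sample data: two incidents share a root cause
# but propagate through different failure modes.
incidents = [
    Incident("Checkout errors after config push", "bad config value", "FM-10"),
    Incident("Vendor outage spreads to customers", "bad config value", "FM-01"),
    Incident("Latency regression after release", "code change", "FM-09"),
]


def count_by_failure_mode(items):
    """Count incidents per failure mode, ignoring root cause."""
    return Counter(item.failure_mode for item in items)


by_mode = count_by_failure_mode(incidents)
for mode, count in sorted(by_mode.items()):
    print(mode, count)
```

Grouping by `failure_mode` rather than `root_cause` keeps the first two incidents in separate buckets, which is the distinction the remediation choice depends on.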

The 8 Families

Family 1: Propagation Failures (28%)

FM-01 Cross-Org Cascade (4,516 incidents in 2025): AWS us-east-1 Oct 2025 hit 137 downstream companies in 24 hours. CrowdStrike July 2024 hit banking, airlines, and healthcare. Azure Oct 2024 cascaded across European operators. Also: FM-23 Hidden Internal Coupling (579 incidents).

Family 2: Tail / Outlier Failures (3%)

FM-06 Aggregate-Masked Tail Degradation: aggregate metrics look healthy while P99 spikes. FM-35 In-Flight Compatibility Break (481): partial rollouts create version-boundary failures.
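
To see how an aggregate can hide the tail, consider a small sketch with synthetic latencies. The numbers are invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not a function from any monitoring product.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [50] * 98 + [2000] * 2

median_ms = statistics.median(latencies_ms)
mean_ms = statistics.mean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"median={median_ms} mean={mean_ms} p99={p99_ms}")
```

Here the median stays at 50 ms and the mean at 89 ms, while the P99 is 2000 ms. An alert on the median or mean alone would miss the slow requests.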

Family 3: Change-Induced Failures (31%, the largest)

FM-09 Deploy-Induced Regression (3,764 incidents): 73% of Anthropic's incidents map to specific model version regressions. FM-10 Config-Induced Failure (1,763): in the Cloudflare Nov 2025 global outage, a ClickHouse permissions change doubled a feature file size.

Family 4: Capacity and Resource Failures (13%)

FM-13 Resource Exhaustion (2,185): connection pool, CPU, memory, disk. FM-25 Autoscaling Pathology (231). Both are Tier C (highly automatable).

Family 5: External / Adversarial (1%)

FM-15 External Attack. FM-16 Supply-Chain Breach: Salesloft/Drift OAuth breach Aug 2025, a 70+ day incident.

Family 6: AI-Specific Failure Modes (2%)

FM-17 AI Service Output Quality Degradation (250): causes span LLMOps, MLOps, DataOps, and DevOps. FM-33 GPU / Accelerator Fleet Heterogeneity (Provisional). The 2% share understates reality: training pipeline failures rarely surface on status pages.

Family 7: Recovery / Process Failures (11%)

FM-21 Phased Data Recovery (2,315): Datadog June 2025 (2.1 days). OpenAI Compliance API delays July 2025 (5.4 days).

Family 8: Foundational Integrity (12%)

  • FM-26 Silent Data Corruption (859): service up, data wrong
  • FM-30 Control Plane Failure (836): AWS Oct 20, 2025 and GCP Jun 2025
  • FM-27 Monitoring Blind Spot (375): customer reports first
  • FM-31 State Divergence (196)
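
Because the service stays up in these modes, detection usually has to compare the data itself rather than health signals. The following is a minimal sketch of one such check, assuming two hypothetical in-memory copies of the same records (`primary` and `replica`). The record shapes and helper names are invented for illustration and do not describe any vendor's method.

```python
import hashlib
import json


def record_digest(record):
    """Stable SHA-256 digest of a JSON-serializable record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diverged_keys(primary, replica):
    """Keys whose records differ or exist on only one side."""
    all_keys = set(primary) | set(replica)
    return sorted(
        key for key in all_keys
        if key not in primary
        or key not in replica
        or record_digest(primary[key]) != record_digest(replica[key])
    )


# Invented sample data: order-2 has drifted on the replica.
primary = {
    "order-1": {"total": 100, "status": "paid"},
    "order-2": {"total": 40, "status": "open"},
}
replica = {
    "order-1": {"status": "paid", "total": 100},  # same data, different key order
    "order-2": {"total": 45, "status": "open"},
}

mismatches = diverged_keys(primary, replica)
print(mismatches)
```

Serializing with sorted keys means a difference in field order alone does not count as divergence. Only `order-2`, whose total differs, is reported.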

The 55/28/17 Autonomy Split

55% of incidents fall into AI-Closed modes. 28% need AI-Augmented human judgment. 17% remain Human-Led.
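
As a rough illustration of what the split means for a team's own backlog, the percentages can be applied to an incident count. This is a sketch only: the total of 200 incidents is hypothetical, and the function name is invented for this example.

```python
# Shares taken from the split above, in percent.
AUTONOMY_SPLIT = {"AI-Closed": 55, "AI-Augmented": 28, "Human-Led": 17}


def expected_counts(total_incidents, split=AUTONOMY_SPLIT):
    """Apply percentage shares to a total; shares must sum to 100."""
    if sum(split.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {tier: total_incidents * pct / 100 for tier, pct in split.items()}


# Hypothetical yearly incident count for one team.
counts = expected_counts(200)
for tier, count in counts.items():
    print(f"{tier}: {count:.0f}")
```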

Key Takeaways

  • 31% of incidents trace to change-induced patterns: highest automation value
  • 28% are propagation failures, chiefly cross-org cascades. Detection speed is the lever.
  • Foundational integrity (12%) is hardest: it requires human judgment on data truth
  • AI-specific modes are under-counted at 2% due to disclosure gaps

Explore the data at stackgen.com/state-of-reliability. Sign up for the LinkedIn webinar.

About StackGen:

StackGen is the pioneer in Autonomous Infrastructure Platform (AIP) technology, helping enterprises transition from manual Infrastructure-as-Code (IaC) management to fully autonomous operations. Founded by infrastructure automation experts and headquartered in the San Francisco Bay Area, StackGen serves leading companies across technology, financial services, manufacturing, and entertainment industries.
