StackGen Research · 2026

State of Reliability 2026

What's driving incidents, what SREs are doing about them, and how AI is reshaping both — from the largest public status-page dataset assembled.

177,960 status-page entries

390+ companies tracked

13 sectors analysed

2018–2026 coverage window

Five key findings

Finding 01

MTTR clusters into three tiers — your category predicts it more than the year does

Resolution time sorts into three stable bands, each roughly flat since 2023.

0.8h · 1.7h · 3–4h

Finding 02

Your company matters about 3× more than your industry

Company explains ~20 pts of MTTR variance; industry only ~8. The lever is Response Maturity.

≈3×

Finding 03

Most teams fit one of six incident archetypes — sticky, but with a maturation arrow

60.7% of firms stay in the same archetype 2023–2025; movers drift toward longer-tail failure modes.

6 archetypes

Finding 04

More than 1 in 4 incidents is a cascade you don't control

Cross-org cascade resolves about 2.6× slower than internal config failures: 247 min vs 96 min.

>1 in 4

Finding 05

AI-related incidents crossed 10% in 2026 YTD — roughly a 6× rise in three years

AI appears as upstream failure, model-quality issue, and autonomous agents destroying production systems.

10.7%

Chapter 1 · MTTR benchmarks

MTTR clusters into three tiers — and your category predicts it more than the year does

When we mapped resolution times across 390+ companies, sectors didn't spread randomly — they grouped into three distinct bands that have each held roughly flat since 2023.

AI providers: 0.8h · Application: 1.7h · Industry-infrastructure: 2.6h

Median MTTR by category tier, 2023–2026 YTD. Source: StackGen SSOR 2026, dataset v16.43, 234-company same-cohort lens. Excludes maintenance and advisory postings >7 days.

Application tier

~1.7h

Developer tools · consumer internet · SaaS and business software · observability and monitoring

Industry-infrastructure tier

3–4h

Cloud infrastructure · communications · fintech and payments · security and identity

AI Model Providers

0.8h ↓

Fastest sector in the dataset — 49 min median in 2026 YTD, down from 75 min in 2023

The three-tier pattern held in every year from 2023 through 2026 YTD, which means your tier predicts where you land more reliably than any year-over-year trend does. The sectors within each tier share structural features that drive their recovery times, not just their incident rates.

Application-tier sectors (developer tools, consumer internet, SaaS and business software, observability, and e-commerce) resolve at around 1.7 hours. These sectors tend to run fast rollback paths, keep blast radius contained within their own product, and staff teams built for rapid deploy-and-fix cycles. The newly broken-out AI Application sector sits in this tier.

Industry-infrastructure sectors — cloud infrastructure, communications, fintech and payments, and security and identity — take 3 to 4 hours. Three structural features lengthen response here: failures cascade into many downstream systems; resolution typically requires coordination with regulators, carriers, enterprise customers, or partners; and the cost of an incorrect remediation is high, so teams verify before acting. Communications is the slowest sector in the tier, partly because several providers disclose per-route incidents, inflating volume and widening their tail.

AI Model Providers sit in a distinct third band — operationally faster than Application-tier despite being functionally infrastructure. Median MTTR fell from 75 minutes in 2023 to 49 minutes in 2026 YTD. Three features drive this: most outages can be handled by rerouting to a sibling model or cached response; teams are staffed at scale-up pace with fast deployment pipelines; and per-token latency, error-rate, and output-quality telemetry give unusually rich signals that compress time-to-diagnosis.

What this means for SREs

Set your benchmark against your tier, not the dataset-wide average. A 3-hour Cloud Infrastructure median is tier-typical — not under-performing. The more useful question is where you sit within your tier, and what the fastest companies in your tier do differently.
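As a rough illustration of benchmarking against a tier rather than the dataset-wide average, the sketch below encodes the tier bands quoted in this chapter and classifies a team's median MTTR against them. The sector keys, the `benchmark` function, and the 15% tolerance are hypothetical choices made for this example; they are not part of the report's methodology.

```python
# Tier bands in hours, as quoted in this chapter (3-4h kept as a range).
TIER_BANDS_HOURS = {
    "ai_model_providers": (0.8, 0.8),
    "application": (1.7, 1.7),
    "industry_infrastructure": (3.0, 4.0),
}

# Sector-to-tier mapping as listed in the chapter; the key names are hypothetical.
SECTOR_TIER = {
    "ai_model_providers": "ai_model_providers",
    "developer_tools": "application",
    "consumer_internet": "application",
    "saas_business_software": "application",
    "observability_monitoring": "application",
    "ecommerce": "application",
    "ai_application": "application",
    "cloud_infrastructure": "industry_infrastructure",
    "communications": "industry_infrastructure",
    "fintech_payments": "industry_infrastructure",
    "security_identity": "industry_infrastructure",
}

# Assumption: +/-15% around the band still counts as tier-typical.
TOLERANCE = 0.15


def benchmark(sector: str, median_mttr_hours: float, tolerance: float = TOLERANCE) -> str:
    """Classify a median MTTR against the band of the sector's own tier."""
    low, high = TIER_BANDS_HOURS[SECTOR_TIER[sector]]
    if median_mttr_hours < low * (1 - tolerance):
        return "faster than tier"
    if median_mttr_hours > high * (1 + tolerance):
        return "slower than tier"
    return "tier-typical"


# The same 3-hour median reads very differently depending on the tier.
print(benchmark("cloud_infrastructure", 3.0))
print(benchmark("developer_tools", 3.0))
```

The point of the design is that the comparison baseline is looked up from the sector, so a 3-hour median is never judged against a number drawn from a different tier.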

Chapter 2 · Company vs Industry

Your company matters about 3× more than your industry

Industry sets a floor on MTTR, but your team's Response Maturity sets the ceiling. Two SRE teams in the same sector can differ by up to 3× in recovery speed.

Company (~20 pts) · FM + root cause (~7 pts) · Industry (~8 pts) · Unexplained noise (~65 pts)

Share of total MTTR variance explained by each factor. Dataset v16.43. Variance decomposition pinned to findings-core v1.0.

Company explains

~20 pts

of MTTR variance — the largest single lever available

Industry explains

~8 pts

About 3× less than the company itself

Within-tier gap

up to 3×

Two SRE teams in the same industry can differ by up to 3× in recovery speed

Industry membership explains roughly 8 percentage points of MTTR variance. The company itself explains ~20 points, about three times more. Failure mode (FM) and root cause account for a further ~7 points. The remaining ~65 points is incident-level noise that no observable predictor captures (which on-call engineer answered the page, the time of day, whether the right runbook fit).
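To make "share of variance explained" concrete, here is a minimal sketch that computes the between-group share of total variance (often called eta squared) for a single grouping factor. This is one common way to attribute variance to a factor; the report does not state which decomposition method it used, so treat the function, the toy MTTR values, and the labels as assumptions for illustration only.

```python
from collections import defaultdict
from statistics import fmean


def variance_share(values, groups):
    """Share of total variance explained by one grouping factor (eta squared)."""
    grand_mean = fmean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    # Between-group sum of squares: how far each group mean sits from the grand mean,
    # weighted by the number of incidents in that group.
    between_ss = sum(
        len(vs) * (fmean(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    return between_ss / total_ss


# Toy data (invented): MTTR in hours for eight incidents.
mttr_hours = [1.0, 1.2, 3.0, 3.4, 2.0, 2.2, 4.0, 4.4]
company = ["a", "a", "b", "b", "c", "c", "d", "d"]
industry = ["x", "x", "x", "x", "y", "y", "y", "y"]

print(f"company share:  {variance_share(mttr_hours, company):.0%}")
print(f"industry share: {variance_share(mttr_hours, industry):.0%}")
```

Because each company in the toy data sits inside one industry, the company share here can never be lower than the industry share. A decomposition whose parts sum to 100 points, as in the chart above, would need an incremental or nested model rather than two separate calculations.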

The company-level driver is Response Maturity: the combination of Context (observability and signals), Tooling (automation and runbooks), People (on-call practice), and AI (the augmentation layer). Firms with the highest Response Maturity in their tier recover up to 3× faster than firms with the lowest, irrespective of which failure mode they face most often.

What this means for SREs

Industry positioning sets a floor; Response Maturity sets the ceiling. Investment should target the four components of Response Maturity rather than specialising on a single failure mode.

Chapter 3 · Incident archetypes

Most teams fit one of six incident archetypes — sticky, but with a maturation arrow

60.7% of firms stay in the same archetype from 2023 to 2025. Movers drift predictably — away from change-induced failure, toward scale, data, and substrate failure.

Archetypes: blast radius × tail risk

95 companies with ≥50 classified incidents each. Bubble size = incident share. Dataset v16.43.

Stay in same archetype

60.7%

of firms, 2023→2025

Dependency-Driven wait rate

74%

of post-mortems resolve as "wait for upstream fix" — slowest archetype

AI-Quality archetype

Emergent

Barely existed two years ago — now a distinct cluster with ~1.8h median

The six archetypes each pair a failure signature with a recovery fingerprint. Dependency-Driven teams are slow because the fix isn't theirs to push. Velocity-Driven teams break hard but own the rollback. Scale-Driven firms are the most operationally efficient — lowest major-severity share, fastest median.

Movers drift predictably: away from Velocity-Driven (change-induced failure) toward Scale-, Data-Integrity-, and Substrate-Driven (growth, data, and metal failures). The archetype you're heading toward matters as much as the one you're in today.

What this means for SREs

Identify your archetype before choosing your next reliability investment. A Dependency-Driven team should invest first in upstream-failover architecture; a Velocity-Driven team should invest in canary deploys and pre-deploy regression detection.

Chapter 4 · AI in the incident mix

AI-related incidents crossed 10% in 2026 YTD — roughly a 6× rise in three years

AI now appears across three distinct incident categories. The hardest to track is AI agents taking destructive action on production systems, a category structurally invisible to status-page methodology.

AI share of all disclosed incidents

AI as share of unplanned disclosed incidents, 2022–2026 YTD. Source: StackGen SSOR 2026, dataset v16.43.

AI share 2026 YTD

10.7%

Up from 0.5% in 2022 — roughly a 6× rise in three years

Agent-destructive incidents

≥11

Publicly documented since mid-2025. True count unknown — invisible to status pages

AI quality growth

89×

Customer-facing AI quality incidents in 2026 YTD vs prior year, AI-native cohort

AI appears in three incident categories. Category 1: AI providers fail and downstream products cascade. Category 2: the service is up but model output is wrong or degraded — fastest-growing type, 89× in one year inside the AI-native cohort. Category 3: autonomous agents take destructive action — structurally invisible to status-page methodology.

Five of the nine documented agent-destruction events involved the agent autonomously scanning its environment for an over-scoped credential or token. The common pattern: the agent located an API token, then called a destructive mutation against production. "Don't give the agent dangerous permissions" is necessary but not sufficient — the tokens reachable from the runtime need to be inventoried.
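One starting point for the inventory described above is to list which environment variables in an agent's runtime look like credentials. The sketch below does only that: it matches variable names against a hypothetical pattern and returns names, never values. The pattern and the function name are assumptions for illustration; a real audit would also need to cover mounted secret files, cloud metadata endpoints, and CLI config directories reachable from the runtime.

```python
import os
import re

# Hypothetical name pattern; tune it to your own naming conventions.
SENSITIVE_NAME = re.compile(
    r"(TOKEN|SECRET|API_?KEY|PASSWORD|CREDENTIAL)", re.IGNORECASE
)


def inventory_env_credentials(environ=None):
    """Return the names (never the values) of credential-like environment variables."""
    env = os.environ if environ is None else environ
    return sorted(name for name in env if SENSITIVE_NAME.search(name))


if __name__ == "__main__":
    # Run inside the agent's runtime to see what the agent itself could find.
    for name in inventory_env_credentials():
        print(name)
```

Running this from inside the agent's own process or container matters more than the pattern itself, because the question is what is reachable from the runtime, not what the agent was intentionally granted.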

What this means for SREs

Category 1 warrants upstream AI provider monitoring and pre-built multi-provider failover. Category 2 warrants output-quality observability before any AI feature ships. Category 3 warrants a credentials audit of every agent runtime environment — today, not after the first incident.

Chapter 5 · Cross-org cascade

More than 1 in 4 incidents we assess is a cascade you don't control

Cross-organisation cascade is the single largest failure pattern, and it resolves about 2.6× slower than failures the operator controls. The 2026 dip reflects hero-event absence, not structural decline.

Cascade share of assessed incidents

Cross-org cascade as share of assessed incidents by year. ~16% of recent incidents not yet assessed; unattributed = not-cascade, so share is a floor. Dataset v16.43.

Cascade median MTTR

247 min

About 2.6× slower than internally caused config failure (96 min), 2024–2026 YTD

Largest 2025 event

223 firms

AWS us-east-1 DynamoDB DNS, 20 Oct 2025 — downstream company count

AI provider cascades

~2.7× rise

From 1.3% to 3.5% share of cascade incidents, 2025→2026 YTD

Cross-organisation cascade is the largest single failure pattern in every year from 2023 through 2026 YTD. The 2026 share dip from 35% to 26% reflects a single factor, the absence of a hero event: no 2026 incident has reached the downstream footprint of AWS October 2025 (223 firms) or Cloudflare November 2025 (127 firms). The modal cascade affects just one disclosing company.

Azure ran against the broader 2026 trend — cascade volume up ~1.9× Jan–May 2026, while AWS, GCP, and Cloudflare were all down. A +300% spike in Microsoft 365/Outlook cascades (mostly the 22 January Outlook outage) drove much of the Azure rise.

What this means for SREs

The architectural investment that addresses the largest share of incidents is pre-built failover for the top 5–10 upstream dependencies. Industry-infrastructure operators running 8–12 cascade incidents per quarter face ~50 hours per quarter of unbudgeted exposure that multi-vendor failover directly compresses.
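The exposure figure can be sanity-checked with back-of-envelope arithmetic. The sketch below multiplies the incident count by the 247-minute cascade median from this chapter; using the median as the duration of every incident is a simplifying assumption, and the function name is hypothetical.

```python
CASCADE_MEDIAN_MIN = 247  # median cascade MTTR quoted in this chapter


def quarterly_exposure_hours(incidents_per_quarter, median_minutes=CASCADE_MEDIAN_MIN):
    """Rough hours of cascade exposure per quarter, treating every incident as median-length."""
    return incidents_per_quarter * median_minutes / 60


for count in (8, 12):
    print(f"{count} incidents/quarter -> {quarterly_exposure_hours(count):.1f} h")
# 8 incidents/quarter -> 32.9 h
# 12 incidents/quarter -> 49.4 h
```

Under this assumption the ~50-hour figure corresponds to the top of the 8–12 range. Long-tail cascades would push real exposure higher than a median-based estimate suggests.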

Chapter 6 · Response maturity

The most common fix for a production incident is to wait — but it doesn't have to be

13.6% of primary remediations are "wait for upstream fix." The gap between that and automated dual-provider failover is the gap between Weak and Advanced Response Maturity.

Top primary remediations (n=361 post-mortems)

Primary remediation patterns, StackGen Post-Mortem Corpus v0.6.9, n=361 post-mortems 2023–2026.

Most common remediation

Wait

13.6% of primary remediations — ahead of restart, rollback, and bug fix

Response Maturity components

4

Context · Tooling · People · AI — AI is the most under-invested today

Best single move

Step 2 → 3

Convert pre-built failover to dual-provider hot-swap for top 5 upstream dependencies

Response Maturity has four components: Context (observability signals), Tooling (automation and runbooks), People (on-call practice), and AI (the augmentation layer). Most firms sit between Weak and Typical on at least two — the AI component is consistently the most under-invested.

The cascade escalation ladder has four steps: (1) detect and wait; (2) pre-built failover with lead time; (3) automatic dual-provider hot-swap; (4) full vendor migration. AI augments Steps 2–3, but only after the Tooling pre-investment is in place. That architectural work is what converts a low-AI-applicability step into a high one.
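To show what separates Step 1 from Step 3 in code terms, here is a minimal sketch of an automatic failover wrapper: the request goes to each provider in order, and the first success wins. The provider functions and all names are hypothetical stand-ins. A production hot-swap would also need timeouts, health checks, and response-compatibility handling between vendors, none of which is modelled here.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed for a request."""


def call_with_failover(providers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each provider in order and return the first successful response."""
    errors = []
    for provider in providers:
        try:
            return provider(request)
        except Exception as exc:  # broad on purpose: any failure triggers the swap
            errors.append(exc)
    raise AllProvidersFailed(errors)


# Hypothetical providers: the primary is down, the secondary answers.
def primary(request: str) -> str:
    raise ConnectionError("primary provider unavailable")


def secondary(request: str) -> str:
    return f"secondary:{request}"


print(call_with_failover([primary, secondary], "ping"))  # prints "secondary:ping"
```

With only one provider in the list, the same wrapper degrades to Step 1: the caller gets `AllProvidersFailed` and has nothing left to do but wait.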

Broad MTTR has stayed flat because most operators still apply wait, restart, and rollback patterns. The catalog rates AI-applicability at 70%+ non-trivial potential, yet observed improvement lags behind it: the architectural readiness that AI depends on is not yet prevalent.

What this means for SREs

Target two moves: convert Step 2 to Step 3 on the cascade escalation ladder for the top 5 upstream dependencies, and invest in predictive degradation modelling (RM-46) on those same dependencies. Both lift the Tooling and AI components in tandem.

Chapter 7 · Looking ahead to 2027

Five predictions for where the industry sits at end-2027

The 2023–2026 patterns support five concrete predictions — and each has an implication for what to pre-invest in now.

01 · AI-Quality archetype

Becomes double-digit by end-2027

AI output-quality degradation rose from 1.3% to 6.3% of classified incidents in two years. It will likely exceed 10% by end-2027.

→ Invest in output-quality observability now

02 · Agent incidents

Documented count passes 25

Counts: 1, 3, 8, 7 across 2023–2026 YTD with half the year remaining. A seventh archetype (Agent-Induced) may emerge if disclosure practices catch up.

→ Inventory credentials reachable from every agent runtime

03 · Dependency-Driven

Remains largest and slowest archetype

Cascade share: 23%, 32%, 35%, 26% across 2023–2026 YTD. The 2026 dip is hero-event absence, not structural decline.

→ Multi-vendor failover investment thesis holds

04 · Long tails

Substrate and Data-Integrity tails won't compress

Substrate-Driven P90 ~92h; Data-Integrity ~74h. Hardware and data reconciliation are time-bounded by their underlying processes.

→ Invest in rehearsal, not MTTR targets

05 · Wait-share

Grows unless upstream architectural investment grows faster

"Wait for upstream fix" is already 13.6% of primary remediations. Without upstream redundancy investment keeping pace, the wait-share rises.

→ Multi-provider hot-swap + predictive degradation modelling

The bottom line

Your Incident Profile is partly your inheritance — industry tier and archetype are sticky. Your Response Maturity is entirely your choice. The firms that lift it fastest, particularly on the AI component, will see compressed effective MTTR even when their Incident Profile stays the same.