
State of Reliability 2026

What 174,000 incidents reveal about how online services break

The largest public analysis of online-service reliability ever assembled. We studied 174,000 real incidents across 364 companies — drawn directly from public status pages, not surveys or self-reported data — to answer one question: is online reliability getting better or worse? Download the full report for the complete findings, methodology, and industry benchmarks.

In this report
  • The most common failure modes, ranked — and why cross-company cascades are now the single biggest category of incident
  • The root causes behind the outages — and why third-party dependencies, not your own code, top the list
  • How AI-related incidents grew 6x in three years, now spanning failed AI services, model-quality issues, and autonomous agents taking destructive actions
  • How recovery time varies by service type: apps ~1.7 hrs, infrastructure 3–4 hrs, AI providers under 1 hr
  • The six incident archetypes — find out which pattern your team matches, and why it predicts your reliability better than your industry does

Findings by Role

1

52%

of classified incidents are cross-company cascades, the most common pattern SREs face

2

74%

of the time, dependency-heavy teams' only fix is to wait for an upstream provider to recover

1

21%

of all incidents trace to an upstream provider you don't control

2

14%

of all incident fixes are simply "wait for upstream" — the single most common remediation

1

5%

of all incidents are now AI-related, up 6x in three years

2

61%

of teams stay in the same incident archetype year over year: reliability patterns are predictable and improvable

Site Reliability Engineers
Platform Engineers
Engineering Leaders

Next Steps

Get the complete State of Reliability 2026 report — 174,000 incidents across 360+ companies, the six failure archetypes, and the architectural playbook for cutting your slowest incidents.
