The SRE Incident That Won't Close: Understanding Phased Data Recovery

Written by John Jamie | Jun 24, 2026 9:32:51 PM

The technical fix landed two hours ago. The service is running. Error rates are back to baseline. And yet the SRE incident is still open — because somewhere in the background, a queue is draining, a data backfill is running, or a consistency check is working through millions of rows.

This is phased data recovery (FM-21), the third most common failure mode at 11% of classified unplanned incidents in 2025. Duration isn't driven by how hard the technical problem is. It's driven by how much data needs to be processed after the fix.

Full data: stackgen.com/state-of-reliability-2026.

What Is Phased Data Recovery?

In a phased data recovery incident, the technical fix is already in place, but the operator keeps the incident open while working through the data-state consequences: queue backfill, data replay, consistency reconciliation, cache warming. From the user's perspective, the service is degraded the whole time, from the original failure until the recovery job completes.

The Data: Multi-Day Duration Examples

Datadog — June 2025: underlying infrastructure issue resolved in hours. Incident ran 2.1 days while processing the metrics backlog.
OpenAI — July 2025: Compliance API Data Delays. API recovered quickly. Incident ran 5.4 days ensuring data completeness across the affected time window.
Dropbox — September 2024: mobile rollout state inconsistency. Rollout halted and reversed quickly. Reconciling state took 14.8 days working through affected accounts.

Why SRE Teams Underestimate This Failure Mode

Duration metrics are misleading. If a dashboard reports an MTTR of 3 hours, measured to the moment the technical fix landed, while actual user impact ran 48 hours, the cost of FM-21 is systematically undercounted.
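
To make the gap concrete, here is a minimal Python sketch that computes both numbers from the same incident record. The field names (`started_at`, `fix_landed_at`, `recovery_done_at`) and the timestamps are hypothetical, chosen only to reproduce the 3-hour and 48-hour figures above; they are not taken from any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    # Hypothetical field names; real incident trackers label these differently.
    started_at: datetime        # user impact begins
    fix_landed_at: datetime     # technical fix deployed
    recovery_done_at: datetime  # backfill / reconciliation finished

    def time_to_fix(self) -> timedelta:
        """Duration measured to the technical fix (what the dashboard shows)."""
        return self.fix_landed_at - self.started_at

    def user_impact(self) -> timedelta:
        """Duration measured to the end of recovery (what users experienced)."""
        return self.recovery_done_at - self.started_at

    def uncounted(self) -> timedelta:
        """Recovery time that the fix-based metric leaves out."""
        return self.user_impact() - self.time_to_fix()


def hours(td: timedelta) -> float:
    return td.total_seconds() / 3600


# Illustrative timestamps only.
start = datetime(2025, 6, 1, 9, 0)
incident = IncidentTimeline(
    started_at=start,
    fix_landed_at=start + timedelta(hours=3),
    recovery_done_at=start + timedelta(hours=48),
)

print(f"time to fix: {hours(incident.time_to_fix()):.0f} h")   # 3 h
print(f"user impact: {hours(incident.user_impact()):.0f} h")   # 48 h
print(f"uncounted:   {hours(incident.uncounted()):.0f} h")     # 45 h
```

Reporting the two durations side by side, rather than replacing one with the other, keeps the fix-time metric useful while making the recovery tail visible.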

The recovery phase feels like cleanup. Once the acute phase ends, pressure drops. This is when FM-21 incidents silently extend — because nobody is driving recovery to completion with the same urgency as the technical fix.

What Good Recovery Looks Like

Recovery job monitoring as part of the incident: queue depth, backfill progress, consistency check completion — treated as first-class incident metrics, not background noise
Explicit closure criteria upfront: “incident closes when: error rates at baseline AND queue depth < 1,000 AND replication lag < 5 minutes”
Communicate the recovery timeline to users: FM-21 incidents are unusually predictable once recovery starts. If you have 4M events to backfill at 200K/hour, publish that ETA.
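
The closure criteria and the ETA arithmetic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production check: the `RecoveryStatus` fields, the `baseline_error_rate` parameter, and the sample metric values are hypothetical. The thresholds (queue depth < 1,000, replication lag < 5 minutes) and the 4M-events-at-200K/hour figures come from the examples above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStatus:
    # Hypothetical snapshot of recovery metrics; names are illustrative.
    error_rate: float         # current error rate (fraction of requests)
    queue_depth: int          # messages still waiting in the queue
    replication_lag_s: float  # replication lag in seconds
    events_remaining: int     # events left to backfill
    events_per_hour: float    # observed backfill throughput


def can_close(
    status: RecoveryStatus,
    baseline_error_rate: float,
    max_queue_depth: int = 1_000,
    max_lag_s: float = 5 * 60,
) -> bool:
    """All criteria must hold at once (logical AND), as in the example rule."""
    return (
        status.error_rate <= baseline_error_rate
        and status.queue_depth < max_queue_depth
        and status.replication_lag_s < max_lag_s
    )


def eta_hours(status: RecoveryStatus) -> float:
    """Remaining backfill time, assuming throughput stays constant."""
    if status.events_per_hour <= 0:
        raise ValueError("throughput must be positive to estimate an ETA")
    return status.events_remaining / status.events_per_hour


# Mid-recovery snapshot: errors are at baseline, but the backlog is not drained.
status = RecoveryStatus(
    error_rate=0.001,
    queue_depth=250_000,
    replication_lag_s=1_200.0,
    events_remaining=4_000_000,
    events_per_hour=200_000.0,
)

print("ready to close:", can_close(status, baseline_error_rate=0.001))  # False
print(f"backfill ETA: {eta_hours(status):.0f} hours")                    # 20 hours
```

The ETA assumes a steady backfill rate; if throughput varies, recomputing it from recent measurements before each status update keeps the published estimate honest.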

Key Takeaways

FM-21 accounted for 11% of classified unplanned incidents in 2025 and is the dominant driver of multi-day incident duration
Duration is driven by data volume, not technical complexity
Datadog (2.1d), OpenAI (5.4d), Dropbox (14.8d) show how FM-21 duration scales with the amount of data to process and reconcile
Recovery job monitoring is the biggest MTTR lever — treat backfill progress as a first-class incident metric

