The Incident That Won't Close: Understanding Phased Data Recovery

Written by John Jamie | Jun 20, 2026 2:02:40 AM

The technical fix landed two hours ago. The service is running. Error rates are back to baseline. And yet the incident is still open, because somewhere in the background a queue is draining, a data backfill is running, or a consistency check is working through millions of rows.

This is phased data recovery (FM-21), the third most common failure mode at 11% of classified unplanned incidents in 2025. Duration isn't driven by how hard the technical problem is. It's driven by how much data needs to be processed after the fix.

Full data: stackgen.com/state-of-reliability.

What Is Phased Data Recovery?

The operator keeps the incident open while working through data-state consequences: queue backfill, data replay, consistency reconciliation, cache warming. From the user's perspective, the service is degraded throughout, from the original failure to completion of the recovery job.

The Data: Multi-Day Duration Examples

Datadog (June 2025): underlying infrastructure issue resolved in hours. Incident ran 2.1 days while processing the metrics backlog.
OpenAI (July 2025): Compliance API Data Delays. API recovered quickly. Incident ran 5.4 days ensuring data completeness across the affected time window.
Dropbox (September 2024): mobile rollout state inconsistency. Rollout halted and reversed quickly. Reconciling state took 14.8 days working through affected accounts.

Why This Failure Mode Is Underestimated

Duration metrics are misleading. A dashboard showing MTTR of 3 hours (when the technical fix landed) while actual user impact ran 48 hours means FM-21 cost is systematically undercounted.

The recovery phase feels like cleanup. Once the acute phase ends, pressure drops. This is when FM-21 incidents silently extend, because nobody is driving recovery to completion with the same urgency as the technical fix.
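The gap between the two duration figures can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the timestamps are hypothetical, chosen to match the 3-hour versus 48-hour example, and it assumes the incident record stores when the failure began, when the technical fix landed, and when the recovery job finished.

```python
from datetime import datetime, timedelta


def incident_durations(started, fix_landed, recovery_done):
    """Return (technical_mttr, user_impact) as timedeltas.

    technical_mttr: failure start until the technical fix landed.
    user_impact:    failure start until the recovery job completed.
    """
    return fix_landed - started, recovery_done - started


# Hypothetical timestamps matching the 3-hour vs 48-hour example.
started = datetime(2025, 6, 1, 9, 0)
fix_landed = started + timedelta(hours=3)
recovery_done = started + timedelta(hours=48)

technical, impact = incident_durations(started, fix_landed, recovery_done)
undercount = impact - technical  # degraded time the dashboard never reports

for label, value in [("technical MTTR", technical),
                     ("user impact", impact),
                     ("unreported", undercount)]:
    print(f"{label}: {value.total_seconds() / 3600:.0f}h")
```

With these inputs the dashboard would report 3 hours while 45 further hours of degraded service go unrecorded. Tracking both figures side by side is one way to keep the recovery phase visible.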

What Good Recovery Looks Like

Recovery job monitoring as part of the incident: queue depth, backfill progress, and consistency check completion, treated as first-class incident metrics, not background noise
Explicit closure criteria upfront: "incident closes when: error rates at baseline AND queue depth < 1,000 AND replication lag < 5 minutes"
Communicate the recovery timeline to users: FM-21 incidents are unusually predictable once recovery starts. If you have 4M events to backfill at 200K/hour, that is roughly 20 hours of work, so publish that ETA.
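The closure criteria and the ETA arithmetic above can both be expressed in a few lines. This is a minimal sketch, not a reference implementation: `RecoveryStatus`, `can_close`, and `backfill_eta_hours` are hypothetical names, the thresholds come from the example criteria, and "error rates at baseline" is assumed to mean at or below a known baseline value.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStatus:
    """Snapshot of the metrics named in the example closure criteria."""
    error_rate: float           # current error rate, as a fraction of requests
    baseline_error_rate: float  # assumed known pre-incident baseline
    queue_depth: int            # items still waiting in the recovery queue
    replication_lag_s: float    # replication lag in seconds


def can_close(status, max_queue_depth=1_000, max_lag_s=5 * 60):
    """True only when every closure criterion holds at the same time."""
    return (
        status.error_rate <= status.baseline_error_rate
        and status.queue_depth < max_queue_depth
        and status.replication_lag_s < max_lag_s
    )


def backfill_eta_hours(remaining_events, events_per_hour):
    """Hours left at a steady processing rate (assumes the rate holds)."""
    if events_per_hour <= 0:
        raise ValueError("events_per_hour must be positive")
    return remaining_events / events_per_hour


# The service looks healthy, but the queue is still draining: stay open.
draining = RecoveryStatus(0.001, 0.001, queue_depth=250_000, replication_lag_s=40.0)
print(can_close(draining))

# 4M events at 200K/hour, the example from the text.
print(backfill_eta_hours(4_000_000, 200_000))
```

Evaluating the criteria as a single conjunction keeps a healthy error rate from closing the incident while the queue is still deep. The ETA assumes a constant processing rate, so it is worth recomputing as the observed rate changes.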

Key Takeaways

11% of classified unplanned incidents in 2025, and the dominant driver of multi-day incident duration
Duration is driven by data volume, not technical complexity
Datadog (2.1d), OpenAI (5.4d), Dropbox (14.8d) show how FM-21 scales with data complexity
Recovery job monitoring is the biggest MTTR lever: treat backfill progress as a first-class incident metric

