The technical fix landed two hours ago. The service is running. Error rates are back to baseline. And yet the incident is still open \u2014 because somewhere in the background, a queue is draining, a data backfill is running, or a consistency check is working through millions of rows.
This is phased data recovery (FM-21), the third most common failure mode at 11% of classified unplanned incidents in 2025. Duration isn't driven by how hard the technical problem is. It's driven by how much data needs to be processed after the fix.
Full data: stackgen.com/state-of-reliability.
What Is Phased Data Recovery?
The operator keeps the incident open while working through data-state consequences: queue backfill, data replay, consistency reconciliation, cache warming. From the user's perspective, the service is degraded throughout \u2014 from the original failure to completion of the recovery job.
The Data: Multi-Day Duration Examples
- Datadog \u2014 June 2025: underlying infrastructure issue resolved in hours. Incident ran 2.1 days while processing the metrics backlog.
- OpenAI \u2014 July 2025: Compliance API Data Delays. API recovered quickly. Incident ran 5.4 days ensuring data completeness across the affected time window.
- Dropbox \u2014 September 2024: mobile rollout state inconsistency. Rollout halted and reversed quickly. Reconciling state took 14.8 days working through affected accounts.
Why This Failure Mode Is Underestimated
Duration metrics are misleading. A dashboard showing MTTR of 3 hours (when the technical fix landed) while actual user impact ran 48 hours means FM-21 cost is systematically undercounted.
The recovery phase feels like cleanup. Once the acute phase ends, pressure drops. This is when FM-21 incidents silently extend \u2014 because nobody is driving recovery to completion with the same urgency as the technical fix.
What Good Recovery Looks Like
- Recovery job monitoring as part of the incident: queue depth, backfill progress, consistency check completion \u2014 treated as first-class incident metrics, not background noise
- Explicit closure criteria upfront: \u201cincident closes when: error rates at baseline AND queue depth < 1,000 AND replication lag < 5 minutes\u201d
- Communicate the recovery timeline to users: FM-21 incidents are unusually predictable once recovery starts. If you have 4M events to backfill at 200K/hour, publish that ETA.
Key Takeaways
- 11% of 2025 incidents \u2014 the dominant driver of multi-day incident duration
- Duration is driven by data volume, not technical complexity
- Datadog (2.1d), OpenAI (5.4d), Dropbox (14.8d) show how FM-21 scales with data complexity
- Recovery job monitoring is the biggest MTTR lever \u2014 treat backfill progress as a first-class incident metric
stackgen.com/state-of-reliability | LinkedIn webinar