The technical fix landed two hours ago. The service is running. Error rates are back to baseline. And yet the incident is still open \u2014 because somewhere in the background, a queue is draining, a data backfill is running, or a consistency check is working through millions of rows.
This is phased data recovery (FM-21), the third most common failure mode at 11% of classified unplanned incidents in 2025. Duration isn't driven by how hard the technical problem is. It's driven by how much data needs to be processed after the fix.
Full data: stackgen.com/state-of-reliability.
What Is Phased Data Recovery?
The operator keeps the incident open while working through data-state consequences: queue backfill, data replay, consistency reconciliation, cache warming. From the user's perspective, the service is degraded throughout \u2014 from the original failure to completion of the recovery job.
The Data: Multi-Day Duration Examples
- Datadog \u2014 June 2025: underlying infrastructure issue resolved in hours. Incident ran 2.1 days while processing the metrics backlog.
- OpenAI \u2014 July 2025: Compliance API Data Delays. API recovered quickly. Incident ran 5.4 days ensuring data completeness across the affected time window.
- Dropbox \u2014 September 2024: mobile rollout state inconsistency. Rollout halted and reversed quickly. Reconciling state took 14.8 days working through affected accounts.
Why This Failure Mode Is Underestimated
Duration metrics are misleading. A dashboard showing MTTR of 3 hours (when the technical fix landed) while actual user impact ran 48 hours means FM-21 cost is systematically undercounted.
The recovery phase feels like cleanup. Once the acute phase ends, pressure drops. This is when FM-21 incidents silently extend \u2014 because nobody is driving recovery to completion with the same urgency as the technical fix.
What Good Recovery Looks Like
- Recovery job monitoring as part of the incident: queue depth, backfill progress, consistency check completion \u2014 treated as first-class incident metrics, not background noise
- Explicit closure criteria upfront: \u201cincident closes when: error rates at baseline AND queue depth < 1,000 AND replication lag < 5 minutes\u201d
- Communicate the recovery timeline to users: FM-21 incidents are unusually predictable once recovery starts. If you have 4M events to backfill at 200K/hour, publish that ETA.
Key Takeaways
- 11% of 2025 incidents \u2014 the dominant driver of multi-day incident duration
- Duration is driven by data volume, not technical complexity
- Datadog (2.1d), OpenAI (5.4d), Dropbox (14.8d) show how FM-21 scales with data complexity
- Recovery job monitoring is the biggest MTTR lever \u2014 treat backfill progress as a first-class incident metric
About StackGen:
StackGen is the pioneer in Autonomous Infrastructure Platform (AIP) technology, helping enterprises transition from manual Infrastructure-as-Code (IaC) management to fully autonomous operations. Founded by infrastructure automation experts and headquartered in the San Francisco Bay Area, StackGen serves leading companies across technology, financial services, manufacturing, and entertainment industries.