Negative MTTD: The Most Important SRE Metric for the Next 36 Months

Written by Nikhil Ravindran | May 26, 2026 10:31:41 AM

The 3 AM Problem No One Is Actually Solving

You know the pattern. It's 3 AM and your pager fires. A service is degrading: latency trending up, error rate creeping past threshold. You open a Slack thread, assemble the people who know the affected system, and start the archaeology: logs, traces, dashboards, the last three deploys. Forty-five minutes later you've found the root cause. Another thirty to remediate. You write the incident report the next morning, add a postmortem action item that may or may not get prioritised, and wait for the next one.

This is toil in its purest form: not the work of building but the work of fixing, repeatedly, under pressure, at hours that erode the people doing it. According to Google's SRE research, toil is any work that is manual, repetitive, automatable, tactical, and devoid of enduring value. Incident response, as most teams practise it today, scores five for five.

The SRE community has spent years getting better at the recovery half of this loop. AI-assisted triage, automated runbooks, intelligent alert correlation: these have genuinely helped. The SolarWinds 2025 State of ITSM report, drawn from over 60,000 incidents, found that organisations using GenAI tooling resolve incidents roughly 30% faster. That's real progress.

But it is optimising the wrong half of the problem. Faster recovery means faster resolution of incidents that still happened. The 3 AM page still fires. The Slack thread still assembles. The archaeology still runs, just slightly less of it. The toil is reduced at the margins, not at the source.

The source is the incident itself. The question the industry hasn't seriously asked is: how many of these incidents needed to happen at all?

Why MTTR Alone Is Not Enough

Gartner put the structural problem plainly in their AI SRE Tooling Market Guide: organisations that select AI tooling focused only on operations will get better at reactively fixing incidents, but not at improving system reliability.

The practical implication: you can invest heavily in AI-assisted incident response and come out the other side with a team that is extremely good at reacting and still not reduce your incident count, your on-call burden, or your toil. Because reliability isn't just response velocity. It's the frequency and severity of failures in the first place.

The industry has been measuring Mean Time to Detection as a positive number: how long after an incident begins before you detect it. For most teams, the answer is minutes to hours after the failure has started affecting users. Detection is trailing the incident. By definition, you're already in the war room before you know it's a war room.

What if your team routinely acted before the incident clock even started?

Introducing Negative MTTD

Negative MTTD is not a redefinition of a formula. It's a description of a regime and a different philosophy of what SRE work should look like.

Classical MTTD, the average time between incident start and detection, will always be positive, because it only applies to incidents that actually occurred. That formula is correct and should stay unchanged.

Negative MTTD describes the operating state where your team is routinely detecting and acting on failure conditions before they cross the impact threshold, before t_incident_start would have existed at all. The 3 AM page doesn't fire, not because your on-call response got faster, but because the condition that would have caused it was intercepted at 11 PM by a system that saw it coming.

This is shift-left applied to incident management, the same principle that transformed software quality when defect detection moved from production to development. Defects caught earlier cost dramatically less. Incidents intercepted before they become incidents cost nothing: no war room, no postmortem, no erosion of on-call resilience.

The Two Distributions: Visualising the Shift

Imagine two timeline charts. The x-axis runs from "before the incident" on the left to "after the incident" on the right, with the failure threshold roughly a third of the way along. The y-axis shows how many detections occur at each point on that timeline, across all incidents in a period.

Today's distribution is skewed to the right. Detection happens after the incident is underway, sometimes seconds after, sometimes minutes, sometimes when a customer tweets. Detecting before the failure threshold is crossed is the exception, not the norm.

The target distribution is shifted left. Most detections happen before the failure threshold is crossed. You're acting on early warning signals (degrading resource trajectories, anomalous traffic shapes, dependency health patterns that historically precede cascades) before they become pages.

Every detection on the left side of the chart is a potential incident removed from your count entirely. Not recovered from faster: never occurred at all.

Defining True Avoidance, and Why the Bar Should Be High

Before quantifying avoidance, precision matters. Trivial cases don't count.

A system that would eventually exhaust disk space given infinite time, and which gets routine quarterly maintenance, hasn't "avoided an incident" in any meaningful sense. True avoidance requires three things:

1. A leading indicator causally linked to a specific failure mode, not merely correlated with it
2. A detection event that fires within an actionable intervention window before the failure threshold is crossed
3. An intervention that genuinely prevents the failure, not merely delays it to the next cycle

This is a deliberately high bar. It prevents gaming the metric by counting scheduled maintenance as proactive detection. It's also what makes the metric meaningful: when your team is achieving true avoidance at scale, the 3 AM page count drops, on-call burden drops, and toil drops at the source rather than at the resolution stage.
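The three criteria can be expressed as a simple gate applied to each candidate event during review. This is a minimal sketch: the record shape, the field names, and the 15-minute default window are assumptions made for illustration, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvoidanceCandidate:
    """One predicted-and-handled event under review (hypothetical record shape)."""
    causal_link_established: bool  # leading indicator is causally tied to a failure mode
    lead_minutes: float            # minutes between detection and projected threshold crossing
    failure_prevented: bool        # intervention removed the failure, not merely deferred it
    scheduled_maintenance: bool    # routine work never counts as avoidance


def is_true_avoidance(c: AvoidanceCandidate, min_lead_minutes: float = 15.0) -> bool:
    """Return True only when all three criteria hold and the work was not routine."""
    if c.scheduled_maintenance:
        return False
    return (
        c.causal_link_established
        and c.lead_minutes >= min_lead_minutes
        and c.failure_prevented
    )
```

Keeping the scheduled-maintenance check explicit is a deliberate choice here: it makes the most obvious way of gaming the count visible in the code path rather than leaving it to reviewer discipline.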

How Much Is Actually Preventable? What the Evidence Shows

No large-scale empirical study has directly measured what percentage of IT/software incidents are caught before failure thresholds today. That absence is itself the finding; the industry has never rigorously measured its own proactive detection rate, which is partly why so few teams optimise for it.

The best available evidence comes from adjacent domains. The Uptime Institute's 2025 Annual Outage Analysis found that 80% of operators believed their most recent downtime incident could have been prevented with better management, processes, or configuration. Of human-error-related outages, which represent 66–80% of all downtime incidents, 58% stemmed from staff failing to follow established procedures. Uptime's own characterisation: "completely preventable."

Healthcare systematic reviews find 43–55% of adverse events are preventable, replicated across studies covering tens of thousands of patient records. Industrial predictive maintenance programmes achieve 70–75% elimination of unexpected breakdowns through mature AI-driven prediction. These domains are generally considered harder to predict than software failures, which makes the software case stronger, not weaker.

Combining these analogues with our own State of Enterprise Reliability dataset analysis:

These are estimates, not measurements. The honest position is that the SRE industry lacks a rigorous baseline for this number — and that gap is exactly what the Negative MTTD framework is designed to close.

A Rigorous Framework: Three Metrics

"Negative MTTD" is the narrative concept. The metrics underneath it are what you actually instrument.

Metric 1: Lead Time to Incident (LTI)

Where t_predict is when the predictive signal fired and t_impact is the estimated time the failure would have crossed the impact threshold. A 41-minute average LTI means your team is acting 41 minutes before the failure would have started generating toil.
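Reading the definitions above as LTI = t_impact - t_predict, the per-event and average values can be computed as in the sketch below. The function name and the two sample events are illustrative assumptions, chosen so the average lands on the 41-minute figure.

```python
from datetime import datetime
from statistics import mean


def lead_time_minutes(t_predict: datetime, t_impact: datetime) -> float:
    """LTI for one event: minutes from the predictive signal to the estimated impact time."""
    return (t_impact - t_predict).total_seconds() / 60.0


# Illustrative (made-up) events as (t_predict, t_impact) pairs.
events = [
    (datetime(2026, 5, 1, 23, 0), datetime(2026, 5, 1, 23, 35)),   # 35 minutes of lead
    (datetime(2026, 5, 3, 14, 10), datetime(2026, 5, 3, 14, 57)),  # 47 minutes of lead
]
average_lti = mean(lead_time_minutes(p, i) for p, i in events)
print(f"Average LTI: {average_lti:.0f} minutes")
```

Note that t_impact is an estimate for avoided incidents, so the quality of the LTI figure depends on how that projection is made.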

Metric 2: Prediction Coverage

Where T is your minimum actionable lead time — typically 15 minutes for automated responses, 30 minutes for human-in-the-loop. Coverage tells you how broadly your predictive capability applies across your incident portfolio, not just how good it is on the cases it catches.
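One way to read this definition is as the fraction of all incidents, actual plus avoided, whose predictive signal fired at least T minutes ahead. The sketch below assumes that reading; the function name and the sample portfolio are hypothetical.

```python
from typing import Optional, Sequence


def prediction_coverage(lead_minutes: Sequence[Optional[float]], min_lead: float) -> float:
    """Fraction of incidents (actual + avoided) predicted at least `min_lead` minutes ahead.

    `None` marks an incident that had no predictive signal at all.
    """
    if not lead_minutes:
        raise ValueError("need at least one incident")
    covered = sum(1 for lti in lead_minutes if lti is not None and lti >= min_lead)
    return covered / len(lead_minutes)


# Illustrative portfolio: lead time per incident in minutes, None = not predicted.
portfolio = [41.0, 20.0, None, 55.0, None, 30.0, 8.0, None]
coverage_auto = prediction_coverage(portfolio, min_lead=15.0)   # automated responses
coverage_human = prediction_coverage(portfolio, min_lead=30.0)  # human-in-the-loop
```

Computing coverage at both thresholds shows how much of the portfolio is reachable only by automation: predictions with 15 to 30 minutes of lead count for the first figure but not the second.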

Metric 3: Classical MTTD (unchanged)

Classical MTTD applies to incidents that were not predicted, or where prediction failed. It measures your reactive floor. It does not compete with LTI — it complements it.

The Breakeven Formula: When Does Negative MTTD Actually Kick In?

The naive version of the Negative MTTD condition would be: LTI > MTTD. But this is wrong: LTI covers both actual and avoided incidents, while MTTD only covers actual incidents. The correct framing weights across both classes:
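One form consistent with the zero-MTTD boundary p* = 1/(1+PLR) used later in this piece, offered here as an assumed reconstruction: incidents that still occur contribute their classical MTTD with weight (1 - p), and prevented incidents contribute their lead time as a negative detection time with weight p.

```latex
\mathrm{MTTD}_{\mathrm{eff}} = (1 - p)\,\mathrm{MTTD} \;-\; p\,\mathrm{LTI}
```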

Where p is the prevention rate. Setting this to zero gives the breakeven:
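Assuming the effective MTTD takes the weighted form (1 - p)·MTTD - p·LTI, an assumption chosen to agree with the boundary p* = 1/(1+PLR) cited in the landscape discussion, the zero crossing works out as:

```latex
(1 - p^{*})\,\mathrm{MTTD} - p^{*}\,\mathrm{LTI} = 0
\;\Longrightarrow\;
p^{*} = \frac{\mathrm{MTTD}}{\mathrm{MTTD} + \mathrm{LTI}} = \frac{1}{1 + \mathrm{PLR}}
```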

Where PLR (Prediction Leverage Ratio) = LTI / MTTD. The higher your PLR, the less prevention coverage you need to cross the zero line:

The last row is the cautionary case: if your prediction lead times are shorter than your MTTD, you need to prevent the vast majority of incidents just to break even. Improving LTI is almost always the higher-leverage investment, and it directly reduces the toil generated by each incident that LTI displaces.
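The relationship p* = 1/(1+PLR) is easy to tabulate. In the sketch below the PLR values are illustrative choices, not figures from any dataset, and the function name is hypothetical.

```python
def breakeven_prevention_rate(plr: float) -> float:
    """Prevention rate p* at which effective MTTD crosses zero: p* = 1 / (1 + PLR)."""
    if plr < 0:
        raise ValueError("PLR must be non-negative")
    return 1.0 / (1.0 + plr)


# Illustrative PLR values (LTI / MTTD), highest leverage first.
for plr in (4.0, 2.0, 1.0, 0.5, 0.25):
    rate = breakeven_prevention_rate(plr)
    print(f"PLR {plr:>4}: breakeven prevention rate {rate:.0%}")
```

At a PLR of 1, where lead time equals MTTD, half of all incidents must be prevented to break even. At a PLR of 0.25 the requirement rises to 80%.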

The Negative MTTD Landscape

The landscape makes the formula visual. The white zero-MTTD boundary is the hyperbola p* = 1/(1+PLR), curving steeply at low LTI and flattening at high LTI. Moving up (longer LTI) or right (higher prevention) always improves position. The three journey dots show the path from where most teams sit today (deep in the red, reactive, generating significant 3 AM toil) to a target state where the majority of incidents are intercepted before they become incidents at all.

Why Your Observability Stack Is Working Against You

No major observability platform natively supports logging pre-incident detection lead times as a first-class metric. Not Datadog. Not Dynatrace. Not New Relic, Prometheus, Grafana, or PagerDuty. Every platform was architected around one assumption: something has already gone wrong.

This is a structural bias, not a configuration gap. Datadog's incident schema contains detected, customer_impact_start, and resolved, all past-tense. There is no predicted_failure_time field. Dynatrace's problem record stores startTime and endTime with nothing distinguishing predictive from reactive detection.

The deeper irony: predictive alerting already exists in most of these platforms. Alerting rules built on Prometheus's predict_linear() fire when disk space is projected to exhaust within a defined window. New Relic's Predictive Alerts forecast metric trajectories. Splunk ITSI trains ML models to predict service health degradation 30 minutes ahead. But when any of these alerts fires, it enters the incident pipeline as a binary notification: the predicted breach time, the model confidence, and the lead time calculation are all discarded.

Dynatrace's "Preventive Operations" announcement in early 2025 claimed the goal was "preventing problems from occurring in the first place." The underlying capability — Davis Forecast — does produce forward-looking time series, but runs as a manual scheduled step rather than an integrated component of the detection pipeline. Events generated from forecasts carry no predicted failure timestamp and are indistinguishable in the data model from any standard reactive alert.

Industrial predictive maintenance has measured pre-failure detection lead times as a primary output metric for decades under the concept of "Remaining Useful Life." Software SRE has borrowed almost nothing from this tradition.

Instrumenting LTI Today: A Starting Point

You don't need to wait for the observability vendors to catch up. Here is a minimal working approach using Prometheus and a custom metrics layer.

Step 1 — Capture the predicted breach timestamp when the alert fires

Prometheus's predict_linear() returns a forecast value but not a timestamp. Compute the breach time in your alerting rule annotations:
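The arithmetic behind a breach timestamp can be sketched outside Prometheus. predict_linear() uses simple linear regression, so the same fit can be solved for the time at which the trend reaches a threshold. The Python below is an illustration of that calculation, not a Prometheus rule. The function name and sample data are hypothetical, and it assumes samples arrive as (unix_seconds, value) pairs in time order.

```python
from typing import Optional, Sequence, Tuple


def predicted_breach_time(
    samples: Sequence[Tuple[float, float]], threshold: float = 0.0
) -> Optional[float]:
    """Fit value = a + b*t by least squares; return the Unix time where value == threshold.

    Returns None when there is no usable trend or the crossing is not in the future.
    """
    n = len(samples)
    if n < 2:
        return None
    t_mean = sum(t for t, _ in samples) / n
    v_mean = sum(v for _, v in samples) / n
    var_t = sum((t - t_mean) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / var_t
    if slope == 0:
        return None
    intercept = v_mean - slope * t_mean
    t_breach = (threshold - intercept) / slope
    t_last = samples[-1][0]
    return t_breach if t_breach > t_last else None


# Illustrative data: free disk space (GiB) falling by 1 GiB per minute.
free_gib = [(0.0, 100.0), (60.0, 99.0), (120.0, 98.0), (180.0, 97.0)]
breach_at = predicted_breach_time(free_gib, threshold=0.0)
```

Returning None for flat, receding, or already-crossed trends keeps the caller from recording a prediction that has no lead time to measure.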

Step 2 — Write the prediction metadata to a trackable store
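A minimal sketch of such a store, using SQLite from the Python standard library. The table layout, column names, and outcome labels are assumptions for illustration; any durable store that keeps the predicted breach time alongside the alert would serve the same purpose.

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS predictions (
    alert_id      TEXT PRIMARY KEY,
    failure_mode  TEXT NOT NULL,
    t_predict     REAL NOT NULL,  -- Unix time the predictive alert fired
    t_impact_est  REAL NOT NULL,  -- Unix time the breach was projected to occur
    confidence    REAL,           -- model confidence, if the source provides one
    outcome       TEXT            -- set later: 'avoided', 'occurred', 'false_positive'
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the prediction store and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_prediction(
    conn: sqlite3.Connection,
    alert_id: str,
    failure_mode: str,
    t_impact_est: float,
    confidence: Optional[float] = None,
    t_predict: Optional[float] = None,
) -> None:
    """Persist the fields a plain alert notification would otherwise discard."""
    fired_at = time.time() if t_predict is None else t_predict
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?, ?, ?, NULL)",
        (alert_id, failure_mode, fired_at, t_impact_est, confidence),
    )
    conn.commit()
```

The outcome column is left empty at write time on purpose: whether the event was avoided, occurred anyway, or was a false positive is only known after the intervention window closes.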

Step 3 — Compute and track signed LTI
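A minimal sketch of the calculation, assuming each event carries a detection time and an impact time in Unix seconds. For avoided incidents the impact time is the projected breach time; for incidents that occurred it is the observed start. Function names and sample events are hypothetical.

```python
from statistics import median
from typing import Iterable, List, Tuple


def signed_lti_minutes(t_detect: float, t_impact: float) -> float:
    """Signed lead time in minutes: positive if detection preceded impact, negative if it trailed."""
    return (t_impact - t_detect) / 60.0


def signed_lti_distribution(events: Iterable[Tuple[float, float]]) -> List[float]:
    """Map (t_detect, t_impact) pairs to a sorted list of signed LTI values."""
    return sorted(signed_lti_minutes(d, i) for d, i in events)


# Illustrative events: two predictions that fired early, two reactive detections.
events = [
    (1_000.0, 3_400.0),    # predicted 40 minutes before estimated impact
    (5_000.0, 6_200.0),    # predicted 20 minutes ahead
    (9_600.0, 9_000.0),    # detected 10 minutes after impact began
    (20_000.0, 18_200.0),  # detected 30 minutes after impact began
]
distribution = signed_lti_distribution(events)
median_signed_lti = median(distribution)
```
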

This gives you a signed LTI distribution in your existing metrics backend: positive values are predictions that fired ahead of failure, negative values are reactive detections. Because positive lead time corresponds to negative detection time, the negated median of this distribution is your effective system MTTD. Plotted as detection time relative to impact, the distribution shifts left as your predictive coverage improves.

The same pattern applies to any predictive alerting source: New Relic Predictive Alerts carry a lookAheadSeconds parameter; Splunk ITSI's predicted events carry a predicted_health_score timeline; Dynatrace Davis Forecast outputs can be captured via the Workflow API.

Aiden for SRE: Built for the Left Side of the Timeline

The three layers of a Negative MTTD programme (instrumentation quality, predictive signal models, and pre-incident automation) require each other to work. Comprehensive telemetry without prediction models is just better dashboards. Prediction models without automated response are early warnings that still require a 3 AM human. The value is in the combination.

Aiden for SRE was built around this architecture. On resource exhaustion signals (capacity, memory, connection pools, disk), Aiden's forecasting models continuously project breach trajectories rather than comparing current state against a static threshold. This is the highest-confidence predictable incident category: because resource depletion follows observable trends over minutes to hours, LTI on these signals is measured in that same range.

The prediction doesn't need to be perfect. It needs to be early enough, and confident enough, to act on. That's what distinguishes LTI as a metric from a threshold breach: it carries both a lead time and a confidence score, and both matter for deciding whether to trigger automation. When both exceed their minimum thresholds, Aiden triggers the remediation playbook. The 3 AM page never fires, because the intervention ran at 11 PM.
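The two-condition gate described above can be sketched generically. This is not Aiden's API: the class, the function, and both default thresholds are hypothetical, shown only to make the decision rule concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    lead_minutes: float  # estimated minutes until the failure threshold is crossed
    confidence: float    # model confidence in the range [0, 1]


def should_trigger_playbook(
    pred: Prediction,
    min_lead_minutes: float = 15.0,  # assumed minimum actionable lead time
    min_confidence: float = 0.8,     # assumed confidence floor
) -> bool:
    """Automate only when the prediction is both early enough and confident enough."""
    return pred.lead_minutes >= min_lead_minutes and pred.confidence >= min_confidence
```

Requiring both conditions means a confident prediction that arrives too late falls through to human paging, as does an early prediction the model is unsure about.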

This is where most teams get stuck, and not because of the models. Charity Majors, co-founder and CTO of Honeycomb, named the root problem precisely: "The quality ceiling of your automation is set by the quality of the data going in, not the intelligence of the model. AI agents turn up their noses at three-pillars-style telemetry." The prerequisite for a Negative MTTD programme isn't a better prediction model; it's instrumentation that gives any model something worth predicting from.

Conclusion

The SRE community has made genuine progress on the recovery half of the incident loop. MTTR is down. AI-assisted triage works. Detection has gotten faster.

But the 3 AM page still fires. The Slack thread still assembles. The toil is still being generated at the source, with the same relentlessness.

The next performance frontier isn't faster recovery. It's collapsing the incident queue itself: shifting the detection distribution far enough to the left that a significant fraction of what would have been incidents never become incidents at all. That's what Negative MTTD looks like in practice. That's what Lead Time to Incident measures. And that's what the breakeven formula makes precise: the combination of prediction quality and prevention coverage that takes a team from net-reactive to net-predictive.

The metrics don't yet appear on any standard SRE dashboard. The observability vendors haven't built the data model to support them. No industry benchmark exists for what the typical proactive detection rate is today.

All of that is a gap, and a starting point.

The question isn't just: how fast do we fix it?
The question is: how often do we see it coming?
