SRE Config-Induced Failures: The Incident That Starts With "Nothing Changed"

Written by John Jamie | Jun 24, 2026 9:32:41 PM

“We didn't deploy anything.” It's one of the most common things an on-call SRE says at the start of an incident, and one of the most misleading: no code was deployed, but something almost certainly changed.

Config-induced failure (FM-10) accounted for 9% of classified unplanned incidents in 2025. It is structurally similar to deploy-induced regression (FM-09) but harder to detect: the change management trail is weaker, and modern IaC tooling has a dramatically larger blast radius when it goes wrong.

Full data: stackgen.com/state-of-reliability-2026.

What Is Config-Induced Failure in SRE?

A config-induced failure is any incident triggered by a non-code change: a configuration value, feature flag, environment variable, IAM policy, ACL, network rule, DNS record, quota adjustment, or capacity policy update. It differs from FM-09 in one key way: the change often does not appear in the same audit trail as code deploys.

The Two High-Profile Cases

Cloudflare — November 18, 2025

A ClickHouse permissions update caused a Bot Management feature file to double in size. When the main Cloudflare proxy received the updated config, it crashed, affecting 56 downstream companies in the SSOR dataset. The change seemed innocuous: a database permissions update had an unexpected effect on file size, which in turn had an unexpected effect on proxy behavior. Three hops, each fine in isolation.
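
One generic defense against this shape of failure is to validate a generated config artifact against its previous version before it propagates. The sketch below is illustrative only and is not a description of Cloudflare's pipeline or fix; `ArtifactStats`, `check_artifact`, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ArtifactStats:
    """Basic facts about a generated config file (hypothetical structure)."""
    size_bytes: int
    entry_count: int


def check_artifact(previous: ArtifactStats, candidate: ArtifactStats,
                   max_growth_ratio: float = 1.5,
                   max_entries: int = 200) -> list[str]:
    """Return reasons to block propagation; an empty list means the candidate passes.

    Both limits are assumptions for illustration. In practice max_entries
    should mirror whatever hard limit the consuming service enforces.
    """
    problems = []
    if candidate.entry_count == 0:
        problems.append("candidate has no entries")
    if candidate.entry_count > max_entries:
        problems.append(
            f"entry count {candidate.entry_count} exceeds consumer limit {max_entries}"
        )
    if previous.size_bytes > 0:
        ratio = candidate.size_bytes / previous.size_bytes
        if ratio > max_growth_ratio:
            problems.append(f"size grew {ratio:.2f}x (limit {max_growth_ratio}x)")
    return problems
```

A check like this sits at the second hop: it does not need to know why the file changed, only that it changed more than a consumer can be assumed to tolerate.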

AWS us-east-1 — October 20, 2025

An automated config write, not a human clicking in the console, triggered a race condition in DNS automation that produced an empty DNS record. This is the IaC-era shape of FM-10: lower frequency than manual config changes, but dramatically higher blast radius.

The IaC Paradox

Before IaC: Config changes were frequent, manual, often undocumented. High frequency, low blast radius, weak audit trail.

With IaC: Config changes are less frequent, version-controlled, reviewable. Lower frequency, much higher blast radius: a single Terraform apply can modify dozens of security groups, IAM policies, DNS records, and routing rules in one run.

The answer isn't less IaC. It's more rigorous change review and blast-radius-aware deployment strategy for IaC changes.
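
As a rough illustration of what blast-radius-aware review could look like, the sketch below counts planned changes per resource type from Terraform's JSON plan output (produced with `terraform show -json` on a saved plan) and flags plans that touch many sensitive resources. The `SENSITIVE_TYPES` set and the `max_sensitive` threshold are assumptions made for the example, not recommendations from the report.

```python
import json
from collections import Counter

# Hypothetical policy: resource types treated as high blast radius.
SENSITIVE_TYPES = {
    "aws_security_group",
    "aws_iam_policy",
    "aws_route53_record",
    "aws_route",
}


def summarize_plan(plan: dict) -> Counter:
    """Count planned changes per resource type, skipping no-op and read entries."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if set(actions) <= {"no-op", "read"}:
            continue  # nothing will actually be modified
        counts[rc["type"]] += 1
    return counts


def needs_extra_review(plan: dict, max_sensitive: int = 5) -> bool:
    """True when the plan changes more sensitive resources than the threshold."""
    counts = summarize_plan(plan)
    sensitive = sum(n for t, n in counts.items() if t in SENSITIVE_TYPES)
    return sensitive > max_sensitive


def load_plan(path: str) -> dict:
    """Load a plan previously written with: terraform show -json plan.out > plan.json"""
    with open(path) as f:
        return json.load(f)
```

A CI step could call `needs_extra_review(load_plan("plan.json"))` and require a second approver or a staged rollout when it returns true.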

Why “Nothing Changed” Is Almost Never True

Automated config writes: infrastructure automation and self-healing systems write config values constantly
Third-party vendor changes: a vendor updated its API behavior, changed a default, or deprecated an endpoint
Certificate expirations: a certificate is time-bound config, and it lapses without anyone touching it
Quota / limit adjustments: cloud provider changes that don't show up in your deployment tooling
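
Of the sources above, certificate expiry is the easiest to check mechanically. Here is a minimal sketch using Python's standard `ssl` module; the helper name `days_until_expiry` and the idea of alerting on a fixed threshold are assumptions for illustration. The `not_after` string is expected in the format that `ssl.SSLSocket.getpeercert()` returns for the `notAfter` field.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's notAfter time.

    `not_after` looks like "Jun  1 12:00:00 2026 GMT"; `now` must be
    timezone-aware so the subtraction is well defined.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400
```

Emitting this number as a metric puts the expiry on the same dashboards as everything else, so it stops being a change that nobody made.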

Key Takeaways

FM-10 is 9% of classified unplanned incidents in 2025, and it is systematically harder to detect than deploy regressions because the change trail is fragmented
Cloudflare Nov 2025 and AWS Oct 2025 are the clearest high-impact FM-10 examples: both trace to config changes with unexpected cascading consequences
IaC expands blast radius: the same rigorous rollout discipline you apply to code deploys should apply to IaC changes
The highest-leverage investment is change-data integration: surfacing all config change signals in the same telemetry stream as your alerts and metrics
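
To make change-data integration concrete, here is a minimal sketch that answers the first question of most incidents, "what changed just before this alert?", across every change source at once. The `ChangeEvent` shape, the source labels, and the two-hour lookback are all hypothetical; a real pipeline would populate the events from its own deploy, flag, DNS, and IaC systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    source: str   # e.g. "terraform", "feature-flags", "dns" (hypothetical labels)
    target: str   # what was changed
    at: datetime  # when the change landed


def changes_before_alert(events, alert_at, lookback=timedelta(hours=2)):
    """Return events in [alert_at - lookback, alert_at], newest first."""
    start = alert_at - lookback
    hits = [e for e in events if start <= e.at <= alert_at]
    return sorted(hits, key=lambda e: e.at, reverse=True)
```

The design choice that matters is the single event stream: once flag flips, automated writes, and IaC applies share one shape, "nothing changed" becomes a query rather than a guess.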
