
SRE Config-Induced Failures: The Incident That Starts With "Nothing Changed"

Author:
John Jamie | Jun 24, 2026

“We didn't deploy anything.” It's one of the most common things an on-call SRE says at the start of an incident, and one of the most misleading. No code may have been deployed, but something almost certainly changed.

Config-induced failure (FM-10) accounted for 9% of classified unplanned incidents in 2025. It is structurally similar to deploy-induced regression (FM-09) but harder to detect: the change management trail is weaker, and modern IaC tooling has a dramatically larger blast radius when it goes wrong.

Full data: stackgen.com/state-of-reliability-2026.

What Is Config-Induced Failure in SRE?

A config-induced failure is any incident triggered by a non-code change: a configuration value, feature flag, environment variable, IAM policy, ACL, network rule, DNS record, quota adjustment, or capacity policy update. It differs from FM-09 in one key way: the change is often not recorded in the same audit trail as code deploys.

The Two High-Profile Cases

Cloudflare — November 18, 2025

A ClickHouse permissions update caused a Bot Management feature file to double in size. When the main Cloudflare proxy received the updated config, it crashed, affecting 56 downstream companies in the SSOR dataset. The change seemed innocuous: a database permissions update had an unexpected effect on file size, which in turn had an unexpected effect on proxy behavior. Three hops, each fine in isolation.
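
One way to picture a defense against this shape of failure is a guard that inspects a generated config artifact before it is propagated. The sketch below is illustrative only and is not a description of Cloudflare's systems: the function name, the entry-count model, and both thresholds (`hard_limit`, `max_growth_ratio`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GuardResult:
    ok: bool
    reason: str


def check_generated_file(
    new_entries: list[str],
    previous_count: int,
    hard_limit: int = 200,          # assumed capacity of the consuming service
    max_growth_ratio: float = 1.5,  # assumed tolerance for growth between versions
) -> GuardResult:
    """Decide whether a freshly generated config file looks safe to propagate."""
    count = len(new_entries)
    if count == 0:
        return GuardResult(False, "generated file is empty")
    if count > hard_limit:
        return GuardResult(False, f"{count} entries exceeds consumer limit {hard_limit}")
    # A sudden jump in size is suspicious even when it stays under the hard limit.
    if previous_count > 0 and count > previous_count * max_growth_ratio:
        return GuardResult(False, f"grew from {previous_count} to {count} entries")
    return GuardResult(True, "within limits")
```

With these assumed defaults, a file that doubles from 60 to 120 entries would be rejected by the growth check even though it is still below the hard limit, which is the point: the guard catches the second hop before the third one happens.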

AWS us-east-1 — October 20, 2025

A DNS automation race condition was triggered by an automated config write — not a human clicking in the console. An infrastructure automation process writing a config value created a race condition that produced an empty DNS record. This is the IaC-era shape of FM-10: lower frequency than manual config changes, but dramatically higher blast radius.
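
As a rough illustration of how automated writers can be made safer, the sketch below refuses two kinds of write: an empty record set, and a plan older than the one already applied. It is a simplified model, not a reconstruction of the AWS automation; `RecordStore`, `apply_plan`, and the generation counter are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RecordStore:
    """Hypothetical in-memory stand-in for a DNS record store."""
    generation: int = 0
    addresses: list[str] = field(default_factory=list)


def apply_plan(store: RecordStore, plan_generation: int, addresses: list[str]) -> bool:
    """Apply a plan only if it is non-empty and newer than what is stored."""
    if not addresses:
        return False  # never replace a live record set with nothing
    if plan_generation <= store.generation:
        return False  # stale plan from a slower writer; ignore it
    store.generation = plan_generation
    store.addresses = list(addresses)
    return True
```

In a real system the check and the write would need to happen as one conditional operation (a lock or a compare-and-set), otherwise the guard itself can race.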

The IaC Paradox

Before IaC: Config changes were frequent, manual, often undocumented. High frequency, low blast radius, weak audit trail.

With IaC: Config changes are less frequent, version-controlled, and reviewable. Lower frequency, much higher blast radius: one Terraform apply can modify dozens of security groups, IAM policies, DNS records, and routing rules in a single run.

The answer isn't less IaC. It's more rigorous change review and a blast-radius-aware deployment strategy for IaC changes.
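
A blast-radius check can be sketched as a small gate over a Terraform plan. The example assumes the plan has been exported with `terraform show -json`, whose output lists each resource under `resource_changes` with a `change.actions` array. The set of "sensitive" resource types and both thresholds are assumptions chosen for illustration, not recommended values.

```python
import json

# Resource types treated as high-risk here; this list is an assumption.
SENSITIVE_TYPES = {
    "aws_security_group",
    "aws_iam_policy",
    "aws_route53_record",
    "aws_route",
}


def blast_radius(plan_json: str) -> dict:
    """Summarise which resources a plan would actually change."""
    plan = json.loads(plan_json)
    changed, sensitive = [], []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "no-op" and "read" do not modify infrastructure, so skip them.
        if set(actions) <= {"no-op", "read"}:
            continue
        changed.append(rc["address"])
        if rc.get("type") in SENSITIVE_TYPES:
            sensitive.append(rc["address"])
    return {"changed": changed, "sensitive": sensitive}


def needs_staged_rollout(summary: dict, max_changed: int = 10, max_sensitive: int = 3) -> bool:
    """Flag plans that touch too much at once (thresholds are assumptions)."""
    return len(summary["changed"]) > max_changed or len(summary["sensitive"]) > max_sensitive
```

A CI step could run this against every plan and require extra review, or a split into smaller applies, whenever `needs_staged_rollout` returns true.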

Why “Nothing Changed” Is Almost Never True

  • Automated config writes: infrastructure automation and self-healing systems write config values constantly
  • Third-party vendor changes: your vendor updated their API behavior, changed a default, or deprecated an endpoint
  • Certificate expirations: a certificate is config with a built-in expiry date, so it can break a system with no change on your side
  • Quota / limit adjustments: cloud provider changes that don't show up in your deployment tooling
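
Certificate expiry is the easiest of these silent changes to watch for, because the expiry time is known in advance. The sketch below assumes you already have the certificate's `notAfter` string, for example from the dictionary returned by `SSLSocket.getpeercert()` in Python's standard library. The function name and the 30-day threshold are illustrative assumptions.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in getpeercert() output,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400


def expiring_soon(not_after: str, now: datetime, threshold_days: float = 30) -> bool:
    """True when the certificate expires within the (assumed) threshold."""
    return days_until_expiry(not_after, now) <= threshold_days


if __name__ == "__main__":
    remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", datetime.now(timezone.utc))
    print(f"{remaining:.1f} days remaining")
```

Feeding the result into the same alerting pipeline as other change signals turns an expiry from a surprise into a scheduled event.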

Key Takeaways

  • FM-10 accounted for 9% of classified unplanned incidents in 2025 and is systematically harder to detect than deploy regressions because the change trail is fragmented
  • Cloudflare Nov 2025 and AWS Oct 2025 are the clearest high-impact FM-10 examples: both trace to config changes with unexpected cascading consequences
  • IaC expands blast radius: the same rigorous rollout discipline you apply to code deploys should apply to IaC changes
  • The highest-leverage investment: change-data integration — surfacing all config change signals in the same telemetry stream as your alerts and metrics
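
Change-data integration can be pictured as a single stream of change events that is queried whenever an alert fires. The sketch below is a minimal in-memory version under stated assumptions: `ChangeEvent`, its fields, the source labels, and the two-hour lookback window are all hypothetical and stand in for whatever your telemetry pipeline provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    source: str   # e.g. "terraform", "feature-flags", "vendor-status" (assumed labels)
    target: str   # what was changed
    at: datetime  # when the change was recorded


def changes_before(
    alert_at: datetime,
    events: list[ChangeEvent],
    lookback: timedelta = timedelta(hours=2),  # assumed window
) -> list[ChangeEvent]:
    """Return changes recorded in the window before an alert, newest first."""
    window_start = alert_at - lookback
    recent = [e for e in events if window_start <= e.at <= alert_at]
    return sorted(recent, key=lambda e: e.at, reverse=True)
```

The value comes from the inputs rather than the query: the more non-deploy sources (automation writes, flag flips, vendor notices, quota changes) that emit into the same stream, the less often an incident starts with "nothing changed".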

stackgen.com/state-of-reliability-2026 | LinkedIn webinar

About StackGen:

StackGen is the pioneer in Autonomous Infrastructure Platform (AIP) technology, helping enterprises transition from manual Infrastructure-as-Code (IaC) management to fully autonomous operations. Founded by infrastructure automation experts and headquartered in the San Francisco Bay Area, StackGen serves leading companies across technology, financial services, manufacturing, and entertainment industries.
