SRE Resource Exhaustion: The Incident Pattern That Looks Different Every Time
Connection pool hit zero. Memory leaked to the ceiling. Disk filled with logs overnight. GPU capacity ran out under inference load.
Resource exhaustion (FM-13) looks different on the surface every time (CPU, memory, disk, connections, tokens, GPU compute), but for SRE teams the shape is always the same: a finite resource consumed faster than it can be replenished. It accounted for 11% of classified unplanned incidents in 2025, and it is one of the most automatable patterns in the taxonomy.
Full data: stackgen.com/state-of-reliability-2026.
Two Sub-Patterns, Different Responses
Demand Saturation: traffic spike exceeds capacity. The resource is correctly sized; the load grew unexpectedly. Response: scale out, expand capacity, increase quota.
Resource Leak / Runaway Consumption: pathological growth uncorrelated with traffic, such as a memory leak, queue runaway, or log disk fill. Response: restart and fix.
Confusing the two means adding capacity to a leaking system and then watching it fill up again.
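To make the cost of that confusion concrete, here is a minimal arithmetic sketch. The function name `hours_until_full` and all of the numbers are illustrative assumptions, not figures from the incident data; it also assumes a constant leak rate, which real leaks rarely have.

```python
def hours_until_full(capacity_mb: float, used_mb: float, leak_mb_per_hour: float) -> float:
    """Hours until a resource is exhausted, assuming a constant leak rate."""
    if leak_mb_per_hour <= 0:
        return float("inf")  # no growth, so the resource never fills
    return max(0.0, (capacity_mb - used_mb) / leak_mb_per_hour)


# Hypothetical service: 8 GiB limit, 2 GiB in use, leaking 256 MiB per hour.
before = hours_until_full(8192, 2048, 256)

# Doubling the limit does not touch the leak; it only moves the deadline.
after = hours_until_full(16384, 2048, 256)

print(before, after)
```

Under these assumed numbers, doubling capacity buys extra hours but the resource still fills, which is why a leak calls for a restart and a fix rather than a bigger limit.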
Where FM-13 Appears in the Data
- Communications / infrastructure services: high connection-count services (Twilio, Bandwidth, Sinch) show FM-13 frequently, particularly connection pool exhaustion under traffic spikes
- AI model providers: GPU capacity exhaustion under inference load is an emerging FM-13 variant. When an AI service can't handle inference load and returns degraded quality, that crosses into FM-17. When it simply fails requests, it stays FM-13.
- Crypto operators: queue exhaustion during high transaction volume periods (major market moves, new chain launches) appears repeatedly in Kraken, Coinbase, and Luno incident histories
Why FM-13 Is SRE’s Highest-Automation Pattern
- Detection: a resource utilization chart hits its ceiling (CPU at 100%, available connections at 0, disk at 100%, OOM events)
- Diagnosis: demand saturation or leak? (monotonic growth vs. traffic-correlated spike)
- Remediation: demand → scale out. Leak → restart + fix.
All three steps are automatable from standard telemetry. No business-logic judgment required for the initial response. This is why FM-13 is Tier C — AI-Closed — in the autonomy framework.
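The diagnosis and remediation steps can be sketched roughly as follows. This is a simplified illustration, not StackGen's implementation: the function names, the 0.7 correlation threshold, the 0.9 rising-step threshold, and the "escalate_to_human" fallback for ambiguous cases are all assumptions chosen for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rising_fraction(usage):
    """Fraction of consecutive samples where usage strictly increased."""
    steps = list(zip(usage, usage[1:]))
    if not steps:
        return 0.0
    return sum(1 for a, b in steps if b > a) / len(steps)


def diagnose(usage, traffic, corr_threshold=0.7, rising_threshold=0.9):
    """Classify an exhaustion event from usage and traffic telemetry."""
    if pearson(usage, traffic) >= corr_threshold:
        return "demand_saturation"  # usage tracks traffic
    if rising_fraction(usage) >= rising_threshold:
        return "leak"  # near-monotonic growth, independent of traffic
    return "unknown"


# Assumed mapping from diagnosis to first response.
REMEDIATION = {
    "demand_saturation": "scale_out",
    "leak": "restart_and_fix",
    "unknown": "escalate_to_human",
}

spike = diagnose([40, 42, 70, 95, 72, 45], [100, 110, 300, 480, 310, 120])
leak = diagnose([40, 45, 51, 58, 66, 75, 85, 96],
                [100, 120, 90, 110, 95, 115, 100, 105])
print(spike, REMEDIATION[spike])
print(leak, REMEDIATION[leak])
```

A leak that happens to coincide with rising traffic would be misread as demand saturation by this simple check, so a production version would likely need longer windows or a per-request usage baseline.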
Prevention: Alert Early
- Alert at 75% utilization, not 95%: this leaves time to respond before the resource is exhausted
- Track P99 resource consumption, not just mean: peak consumption is what causes exhaustion
- Separate demand saturation and leak alerting: monotonic growth over 24 hours is a different signal from a traffic-correlated spike
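As a rough illustration of the threshold and percentile bullets, the sketch below compares a mean-based check with a P99-based check against a 75% threshold. The helper names, the nearest-rank percentile method, and the sample data are assumptions made for this example; a real monitoring system would normally compute percentiles for you.

```python
def p99(samples):
    """99th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def early_warning(samples, capacity, threshold=0.75):
    """True when P99 consumption reaches the alert threshold."""
    return p99(samples) / capacity >= threshold


# Hypothetical window: 98 quiet samples and 2 short bursts, capacity 100 units.
window = [40] * 98 + [90] * 2
capacity = 100

mean_utilization = sum(window) / len(window) / capacity
print(mean_utilization >= 0.75)          # mean-based check stays silent
print(early_warning(window, capacity))   # P99-based check fires
```

In this made-up window the mean sits far below the threshold while the P99 is already past it, which is the gap the second bullet warns about.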
Key Takeaways
- 11% of 2025 incidents — high frequency, high automation potential
- Two sub-patterns require different responses: confusing demand saturation with leaks makes incidents worse
- AI inference is an emerging FM-13 vector: GPU capacity exhaustion is growing as AI features are embedded in products
- FM-13 is Tier C (AI-Closed): one of the most automatable patterns in the taxonomy