INTERNAL CASE STUDY | Aiden for SRE · ObserveNow
How StackGen's Own SRE Team Runs Aiden on Production
Reducing RCA time by 80%, eliminating manual documentation, and catching capacity crises 2 hours before they become incidents — without adding headcount.
80%
Faster RCA
10+ min → under 2 min
2 hrs
Earlier Warning
Capacity crisis caught before alerting would trigger
Zero
Separate RCA Docs
Shared Chat = built-in audit trail
4
SRE Team Size
Managing multi-cloud + growing enterprise customer base
Background
StackGen is an Autonomous Operations Platform that helps enterprises manage infrastructure, DevOps, and SRE workflows through AI-powered agents. The platform includes Aiden — an agentic AI assistant — and ObserveNow, a centralised observability solution for metrics, logs, and traces.
As StackGen’s customer base grew, the internal DevOps/SRE team faced a compounding challenge: a multi-cloud Kubernetes estate (AWS, GCP, and Azure) serving a growing roster of enterprise customers — including SAP NS2, Chamberlain, Corcentric, and Piramal — plus three internal development clusters. Counting those internal clusters, the team monitors roughly 50% more environments than customer tenants alone.
We don’t sell Aiden to SRE teams and then manage our own production with spreadsheets and tribal knowledge. We run the same platform on our own customer environments — with the same consequences when something breaks.
Challenges
During Incidents:
Getting to the right Kubernetes cluster and establishing the correct context took 5–10 minutes per incident. Combined with log retrieval, it took a minimum of 10 minutes before any real root cause analysis could begin.
With Growing Customers:
With a growing base of customers and environments to monitor, the 4-person team was stretched thin. The team also had to support application developers who needed faster CI/CD and deployment support, leaving little time for proactive reliability improvements.
After Incidents:
After resolving incidents, engineers had to manually create separate documents to share root cause analysis with the team. This added overhead and slowed knowledge transfer across the small team.
Solution
Common Alert Patterns:
For the most frequently occurring alert patterns, the team created specific Aiden skills—step-by-step runbooks written in plain English that Aiden and its sub-agents execute automatically. When a skill is invoked, Aiden follows all the defined steps and returns a final answer without requiring multiple back-and-forth conversations.
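Conceptually, a skill pairs an alert pattern with an ordered list of steps that run to completion without further prompting. A minimal sketch in Python; `Skill`, `run_skill`, and the CrashLoopBackOff steps below are illustrative assumptions, not Aiden's actual API:

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """A runbook: an alert pattern plus ordered investigation steps."""
    alert_pattern: str
    steps: list  # each step is a (description, action-callable) pair


def run_skill(skill: Skill, alert: dict) -> str:
    """Execute every step in order and return one consolidated answer,
    rather than surfacing each step as a separate chat turn."""
    findings = []
    for description, action in skill.steps:
        findings.append(f"{description}: {action(alert)}")
    return "\n".join(findings)


# Hypothetical runbook for a frequently occurring alert pattern
crashloop = Skill(
    alert_pattern="CrashLoopBackOff",
    steps=[
        ("Check restart count", lambda a: a["restarts"]),
        ("Fetch last termination reason", lambda a: a["last_state"]),
    ],
)

report = run_skill(crashloop, {"restarts": 12, "last_state": "OOMKilled"})
```

The key design property is that the steps are data, not code paths: adding a new runbook means appending steps, not re-prompting the agent.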
Kubernetes Integration via Remote Runner:
Aiden was integrated with Kubernetes clusters running StackGen workloads via a remote runner that operates inside the cluster. This sends results back to Aiden while maintaining controlled permissions and query restrictions. When metrics and traces alone can’t pinpoint the issue, Aiden agentically fetches deployment, service, and ingress-level information directly from K8s.
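The "controlled permissions and query restrictions" idea can be sketched as an explicit allowlist of read-only verb/resource pairs that the in-cluster runner checks before executing anything. All names below are hypothetical, not StackGen's implementation:

```python
# Only pre-approved read-only queries may leave the runner.
ALLOWED = {
    ("get", "deployments"),
    ("get", "services"),
    ("get", "ingresses"),
    ("list", "pods"),
}


def authorize(verb: str, resource: str) -> bool:
    """Return True only for verb/resource pairs on the allowlist."""
    return (verb, resource) in ALLOWED


def run_query(verb: str, resource: str, fetch) -> object:
    """Gate every query through the allowlist before touching the cluster.
    `fetch` stands in for whatever client actually talks to the API server."""
    if not authorize(verb, resource):
        raise PermissionError(f"{verb} {resource} is not permitted by the runner")
    return fetch(verb, resource)
```

Because the runner sits inside the cluster and pushes results out, Aiden never needs inbound cluster credentials; it only receives what the allowlisted queries return.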
Shared Chat:
Once root cause analysis is complete, engineers use Aiden’s Shared Chat feature to share the full investigation with other team members. The chat serves as a complete audit trail—no need to create a separate RCA document.
What This Looks Like in Practice
Three real examples from the StackGen SRE team, run on production environments:
A “Node Not Ready” alert that would have been a rabbit hole
Normally this triggers manual investigation — kubectl commands, log checks, cross-referencing recent deployments. With Aiden, the investigation ran automatically. Within minutes, Aiden determined the node no longer existed in the cluster — it had been replaced by a healthy node. No action required.
The team could also query current node status and utilisation percentages directly in the Aiden chat, without navigating to an external dashboard. The same interface that surfaced the alert answered the follow-up questions.
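The core of this automated check reduces to a membership test: is the alerting node still in the cluster's current node list? A minimal sketch (the function name and return strings are illustrative, not Aiden's output):

```python
def triage_node_not_ready(alerted_node: str, current_nodes: set) -> str:
    """Distinguish a stale alert for a replaced node from a live problem."""
    if alerted_node not in current_nodes:
        # The autoscaler or node pool already swapped the node out.
        return "node replaced; no action required"
    return "node present but NotReady; investigate further"
```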
Pod stuck in CrashLoopBackOff: from alert to root cause in under 2 minutes
Previously this required manually connecting to the cluster, running kubectl commands, cross-referencing logs, and often escalating to a second engineer for a second pair of eyes. The full sequence routinely took 10–15 minutes before anyone had a hypothesis.
With Aiden, the team had a diagnosis before they’d finished reading the alert.
A capacity crisis caught 2 hours early — before any alert would have fired
Aiden observed a persistent volume claim filling steadily and calculated that, at the current growth rate, it would be completely full within 2 hours, at which point the service would have been disrupted.
If the team had relied on traditional alerting, they would have received a warning with approximately 15 minutes to act. Aiden’s proactive analysis gave them a 2-hour window. The team investigated, identified the root cause, and resolved it before any impact occurred.
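The underlying arithmetic is a linear extrapolation: remaining capacity divided by the observed growth rate gives the time-to-full window. A minimal sketch with hypothetical numbers, not the actual PVC figures from this incident:

```python
def hours_until_full(used_gib: float, capacity_gib: float,
                     growth_gib_per_hour: float) -> float:
    """Linearly extrapolate volume usage to estimate hours until full."""
    if growth_gib_per_hour <= 0:
        return float("inf")  # usage flat or shrinking: no projected fill
    return (capacity_gib - used_gib) / growth_gib_per_hour


# Hypothetical: 90 GiB used of a 100 GiB claim, growing 5 GiB per hour,
# leaves a 2-hour window before the volume fills completely.
headroom = hours_until_full(90, 100, 5)
```

A threshold alert fires only when usage crosses a fixed percentage; extrapolating the growth rate surfaces the same endpoint much earlier, which is where the 2-hour window versus a last-minute warning comes from.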
This is the bigger story.
Pure reactive RCA tools would not have caught this. The value wasn’t faster incident response — it was no incident at all.
Pre-built runbooks for common alert patterns. Aiden executes multi-step investigations automatically without back-and-forth.
Secure K8s integration that lets Aiden fetch deployment, service, and ingress info directly from clusters with controlled permissions.
One-click sharing of complete RCA investigations with team members. Also saved for audit compliance.
When metrics/traces are insufficient, Aiden autonomously pivots to K8s integration to gather additional context for root cause.
"The prevention story is the one that matters most. We didn’t respond faster — we didn’t respond at all, because Aiden found the problem before it became one."