INTERNAL CASE STUDY | Aiden for SRE · ObserveNow

How StackGen's Own SRE Team Runs Aiden on Production

Reducing RCA time by 80%, eliminating manual documentation, and catching capacity crises 2 hours before they become incidents — without adding headcount.

Results

80%

Faster RCA
10+ min → under 2 min

2 hrs

Earlier Warning
Capacity crisis caught before alerting would trigger

Zero

Separate RCA Docs
Shared Chat = built-in audit trail

4

SRE Team Size
Managing multi-cloud + growing enterprise customer base

4 SREs. 3 Clouds. A Growing Enterprise Customer List.

Background

StackGen is an Autonomous Operations Platform that helps enterprises manage infrastructure, DevOps, and SRE workflows through AI-powered agents. The platform includes Aiden — an agentic AI assistant — and ObserveNow, a centralised observability solution for metrics, logs, and traces.

As StackGen’s customer base grew, the internal DevOps/SRE team faced a compounding challenge: a multi-cloud Kubernetes estate (AWS, GCP, and Azure), a growing roster of enterprise customers — including SAP NS2, Chamberlain, Corcentric, and Piramal — plus 3 internal development clusters. Total environments monitored exceed the customer tenant base by ~50% when internal clusters are included.

We don’t sell Aiden to SRE teams and then manage our own production with spreadsheets and tribal knowledge. We run the same platform on our own customer environments — with the same consequences when something breaks.

Challenges

Slow Time-to-Environment During Incidents:

Getting to the right Kubernetes cluster and establishing the correct context took 5–10 minutes per incident. Combined with log retrieval, it took a minimum of 10 minutes before any real root cause analysis could begin.

High Operational Load With Growing Customers:

With a growing base of customers and environments to monitor, the 4-person team was stretched thin. DevOps also had to support application developers who needed faster CI/CD and deployments, leaving little time for proactive reliability improvements.

Manual RCA Documentation:

After resolving incidents, engineers had to manually create separate documents to share root cause analysis with the team. This added overhead and slowed knowledge transfer across the small team.

Solution

Pre-Built Skills for Common Alert Patterns:

For the most frequently occurring alert patterns, the team created specific Aiden skills—step-by-step runbooks written in plain English that Aiden and its sub-agents execute automatically. When a skill is invoked, Aiden follows all the defined steps and returns a final answer without requiring multiple back-and-forth conversations.

Direct Kubernetes Integration via Remote Runner:

Aiden was integrated with the Kubernetes clusters running StackGen workloads via a remote runner that operates inside each cluster. The runner executes queries locally and sends results back to Aiden while enforcing controlled permissions and query restrictions. When metrics and traces alone can’t pinpoint the issue, Aiden agentically fetches deployment, service, and ingress-level information directly from K8s.

Shareable RCA via Shared Chat:

Once root cause analysis is complete, engineers use Aiden’s Shared Chat feature to share the full investigation with other team members. The chat serves as a complete audit trail—no need to create a separate RCA document.

What This Looks Like in Practice

Three real examples from the StackGen SRE team, run on production environments:

SCENARIO 1 · ALERT TRIAGE

A “Node Not Ready” alert that would have been a rabbit hole

The team received an alert: “Inside cluster developer-eks, node ip-10-0-1-104.us-west-2.compute.internal: node not ready for more than 10 minutes.”

Normally this triggers manual investigation — kubectl commands, log checks, cross-referencing recent deployments. With Aiden, the investigation ran automatically. Within minutes, Aiden determined the node no longer existed in the cluster — it had been replaced by a healthy node. No action required.

The team could also query current node status and utilisation percentages directly in the Aiden chat, without navigating to an external dashboard. The same interface that surfaced the alert answered the follow-up questions.

SCENARIO 2 · INCIDENT RESPONSE

Pod stuck in CrashLoopBackOff: from alert to root cause in under 2 minutes

A pod entered CrashLoopBackOff. Aiden immediately pulled recent Kubernetes events for the pod, retrieved pod logs for error details, and identified the likely cause — a database migration script — in under 2 minutes.

Previously this required manually connecting to the cluster, running kubectl commands, cross-referencing logs, and often escalating to a second engineer for a second pair of eyes. The full sequence routinely took 10–15 minutes before anyone had a hypothesis.

With Aiden, the team had a diagnosis before they’d finished reading the alert.

SCENARIO 3 · INCIDENT PREVENTION

A capacity crisis caught 2 hours early — before any alert would have fired

While debugging a separate issue, Aiden surfaced a dashboard showing the combined log-ingestion volume from a service. The ingestion rate was growing fast.

Aiden calculated that at the current rate, the persistent volume claim would be completely filled within 2 hours — causing a service disruption.
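The projection behind that warning is simple linear extrapolation. A minimal sketch, using hypothetical capacity and growth numbers (the case study does not give the actual values):

```python
# Linear time-to-full projection of the kind Aiden's capacity warning implies.
# All numbers below are hypothetical; they are not from the incident.
def hours_until_full(capacity_gb: float, used_gb: float, growth_gb_per_hour: float) -> float:
    """Hours until the volume fills, assuming the current growth rate holds."""
    return (capacity_gb - used_gb) / growth_gb_per_hour

# e.g. a 100 GB PVC at 76 GB used, with logs growing 12 GB/hour:
print(hours_until_full(100, 76, 12))  # → 2.0
```

The same projection is what a PromQL `predict_linear()` rule would encode, but run ad hoc during an investigation rather than as a standing alert.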

If the team had relied on traditional alerting, they would have received a warning with approximately 15 minutes to act. Aiden’s proactive analysis gave them a 2-hour window. The team investigated, identified the root cause, and resolved it before any impact occurred.

This is the bigger story.
Purely reactive RCA tools would not have caught this. The value wasn’t faster incident response — it was no incident at all.

BEFORE vs AFTER WITH AIDEN + OBSERVENOW

Before: 10–15 min to reach root cause on a typical alert
After: Under 2 min to root cause — Aiden fetches K8s events, logs, and deployment data automatically

Before: Manual context-switching across 3 clouds to find the right cluster
After: Direct K8s integration via Remote Runner — no manual cluster navigation

Before: Post-incident RCA docs created manually after every incident
After: Shared Chat history = complete, shareable RCA with zero extra effort

Before: Reactive — alerts arrive, team scrambles, archaeology begins
After: Agentic — Aiden pivots from metrics to K8s context autonomously

Before: Capacity risks only visible after alerting threshold crossed (~15 min warning)
After: 2-hour advance warning on capacity crises — before any alert would have fired

Before: Alert noise with no automatic triage for node-not-ready false positives
After: False positives auto-resolved — Aiden confirmed a “node not ready” alert was a replaced (healthy) node
Platform Capabilities Used
Skills

Pre-built runbooks for common alert patterns. Aiden executes multi-step investigations automatically without back-and-forth.

Remote Runner

Secure K8s integration that lets Aiden fetch deployment, service, and ingress info directly from clusters with controlled permissions.

Shared Chat

One-click sharing of complete RCA investigations with team members. Chats are also retained for audit compliance.

Agentic Behavior

When metrics/traces are insufficient, Aiden autonomously pivots to K8s integration to gather additional context for root cause.

"The prevention story is the one that matters most. We didn’t respond faster — we didn’t respond at all, because Aiden found the problem before it became one."

— Vishwajeet, Team Lead, StackGen DevOps/SRE