INTERNAL CASE STUDY | Aiden for SRE · ObserveNow
How StackGen's Own SRE Team Runs Aiden on Production
Reducing RCA time by 80%, eliminating manual documentation, and catching capacity crises 2 hours before they become incidents — without adding headcount.
80%
Faster RCA
10+ min → under 2 min
2 hrs
Earlier Warning
Capacity crisis caught before alerting would trigger
Zero
Separate RCA Docs
Shared Chat = built-in audit trail
4
SRE Team Size
Managing multi-cloud + growing enterprise customer base
Background
StackGen is an Autonomous Operations Platform that helps enterprises manage infrastructure, DevOps, and SRE workflows through AI-powered agents. The platform includes Aiden — an agentic AI assistant — and ObserveNow, a centralised observability solution for metrics, logs, and traces.
As StackGen’s customer base grew, the internal DevOps/SRE team faced a compounding challenge: a multi-cloud Kubernetes estate (AWS, GCP, and Azure) serving a growing roster of enterprise customers — including SAP NS2, Chamberlain, Corcentric, and Piramal — plus three internal development clusters. Counting those internal clusters, the team monitors roughly 50% more environments than customer tenants alone.
We don’t sell Aiden to SRE teams and then manage our own production with spreadsheets and tribal knowledge. We run the same platform on our own customer environments — with the same consequences when something breaks.
Challenges
During Incidents:
Getting to the right Kubernetes cluster and establishing the correct context took 5–10 minutes per incident. Combined with log retrieval, it took a minimum of 10 minutes before any real root cause analysis could begin.
With Growing Customers:
With a growing base of customers and environments to monitor, the 4-person team was stretched thin. The team also had to support application developers who needed faster CI/CD and deployment support, leaving little time for proactive reliability improvements.
After Incidents:
After resolving incidents, engineers had to manually create separate documents to share root cause analysis with the team. This added overhead and slowed knowledge transfer across the small team.
Solution
Common Alert Patterns:
For the most frequently occurring alert patterns, the team created specific Aiden skills—step-by-step runbooks written in plain English that Aiden and its sub-agents execute automatically. When a skill is invoked, Aiden follows all the defined steps and returns a final answer without requiring multiple back-and-forth conversations.
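Conceptually, a skill pairs an alert pattern with an ordered list of steps that run to completion without further prompting. A minimal sketch in Python; `Skill`, `run_skill`, and the CrashLoopBackOff steps below are illustrative assumptions, not Aiden's actual API:

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """A runbook: an alert pattern plus ordered investigation steps."""
    alert_pattern: str
    steps: list  # each step is a (description, action-callable) pair


def run_skill(skill: Skill, alert: dict) -> str:
    """Execute every step in order and return one consolidated answer,
    rather than surfacing each step as a separate chat turn."""
    findings = []
    for description, action in skill.steps:
        findings.append(f"{description}: {action(alert)}")
    return "\n".join(findings)


# Hypothetical runbook for a frequently occurring alert pattern
crashloop = Skill(
    alert_pattern="CrashLoopBackOff",
    steps=[
        ("Check restart count", lambda a: a["restarts"]),
        ("Fetch last termination reason", lambda a: a["last_state"]),
    ],
)

report = run_skill(crashloop, {"restarts": 12, "last_state": "OOMKilled"})
```

The key design property is that the steps are data, not code paths: adding a new runbook means appending steps, not re-prompting the agent.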
Kubernetes Integration via Remote Runner:
Aiden was integrated with Kubernetes clusters running StackGen workloads via a remote runner that operates inside the cluster. This sends results back to Aiden while maintaining controlled permissions and query restrictions. When metrics and traces alone can’t pinpoint the issue, Aiden agentically fetches deployment, service, and ingress-level information directly from K8s.
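The "controlled permissions and query restrictions" idea can be sketched as an explicit allowlist of read-only verb/resource pairs that the in-cluster runner checks before executing anything. All names below are hypothetical, not StackGen's implementation:

```python
# Only pre-approved read-only queries may leave the runner.
ALLOWED = {
    ("get", "deployments"),
    ("get", "services"),
    ("get", "ingresses"),
    ("list", "pods"),
}


def authorize(verb: str, resource: str) -> bool:
    """Return True only for verb/resource pairs on the allowlist."""
    return (verb, resource) in ALLOWED


def run_query(verb: str, resource: str, fetch) -> object:
    """Gate every query through the allowlist before touching the cluster.
    `fetch` stands in for whatever client actually talks to the API server."""
    if not authorize(verb, resource):
        raise PermissionError(f"{verb} {resource} is not permitted by the runner")
    return fetch(verb, resource)
```

Because the runner sits inside the cluster and pushes results out, Aiden never needs inbound cluster credentials; it only receives what the allowlisted queries return.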
Shared Chat:
Once root cause analysis is complete, engineers use Aiden’s Shared Chat feature to share the full investigation with other team members. The chat serves as a complete audit trail—no need to create a separate RCA document.
What This Looks Like in Practice
Three real examples from the StackGen SRE team, run on production environments:
A “Node Not Ready” alert that would have been a rabbit hole
Normally this triggers manual investigation — kubectl commands, log checks, cross-referencing recent deployments. With Aiden, the investigation ran automatically. Within minutes, Aiden determined the node no longer existed in the cluster — it had been replaced by a healthy node. No action required.
The team could also query current node status and utilisation percentages directly in the Aiden chat, without navigating to an external dashboard. The same interface that surfaced the alert answered the follow-up questions.
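The core of this automated check reduces to a membership test: is the alerting node still in the cluster's current node list? A minimal sketch (the function name and return strings are illustrative, not Aiden's output):

```python
def triage_node_not_ready(alerted_node: str, current_nodes: set) -> str:
    """Distinguish a stale alert for a replaced node from a live problem."""
    if alerted_node not in current_nodes:
        # The autoscaler or node pool already swapped the node out.
        return "node replaced; no action required"
    return "node present but NotReady; investigate further"
```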
Pod stuck in CrashLoopBackOff: from alert to root cause in under 2 minutes
Previously this required manually connecting to the cluster, running kubectl commands, cross-referencing logs, and often escalating to a second engineer for a second pair of eyes. The full sequence routinely took 10–15 minutes before anyone had a hypothesis.
With Aiden, the team had a diagnosis before they’d finished reading the alert.
A capacity crisis caught 2 hours early — before any alert would have fired
Aiden observed a persistent volume claim filling steadily and calculated that, at the current growth rate, it would be completely full within 2 hours, at which point the service would have been disrupted.
If the team had relied on traditional alerting, they would have received a warning with approximately 15 minutes to act. Aiden’s proactive analysis gave them a 2-hour window. The team investigated, identified the root cause, and resolved it before any impact occurred.
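The underlying arithmetic is a linear extrapolation: remaining capacity divided by the observed growth rate gives the time-to-full window. A minimal sketch with hypothetical numbers, not the actual PVC figures from this incident:

```python
def hours_until_full(used_gib: float, capacity_gib: float,
                     growth_gib_per_hour: float) -> float:
    """Linearly extrapolate volume usage to estimate hours until full."""
    if growth_gib_per_hour <= 0:
        return float("inf")  # usage flat or shrinking: no projected fill
    return (capacity_gib - used_gib) / growth_gib_per_hour


# Hypothetical: 90 GiB used of a 100 GiB claim, growing 5 GiB per hour,
# leaves a 2-hour window before the volume fills completely.
headroom = hours_until_full(90, 100, 5)
```

A threshold alert fires only when usage crosses a fixed percentage; extrapolating the growth rate surfaces the same endpoint much earlier, which is where the 2-hour window versus a last-minute warning comes from.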
This is the bigger story.
Pure reactive RCA tools would not have caught this. The value wasn’t faster incident response — it was no incident at all.
Pre-built runbooks for common alert patterns. Aiden executes multi-step investigations automatically without back-and-forth.
Secure K8s integration that lets Aiden fetch deployment, service, and ingress info directly from clusters with controlled permissions.
One-click sharing of complete RCA investigations with team members. Also saved for audit compliance.
When metrics/traces are insufficient, Aiden autonomously pivots to K8s integration to gather additional context for root cause.
"The prevention story is the one that matters most. We didn’t respond faster — we didn’t respond at all, because Aiden found the problem before it became one."