INTERNAL CASE STUDY | Aiden for Observability

StackGen Manages Environments Across Three Clouds With Aiden for Observability

How a 4-person DevOps team eliminated multi-cloud log sprawl, cut query time by 80%, and started catching incidents before they happened — all without leaving a single chat window.

Results at a Glance

3→1

Observability tools consolidated

80%

Reduction in query time

2 min

AI root-cause analysis


Background

StackGen is an Autonomous Operations Platform that helps enterprises manage infrastructure, DevOps, and SRE workflows through AI-powered agents. The platform includes Aiden, an agentic AI assistant, and ObserveNow, a centralized observability solution for metrics, logs, and traces.

As StackGen's own customer base grew, the internal DevOps/SRE team faced the same observability challenges its platform is designed to solve. This case study documents how the team adopted Aiden for Observability to manage their own production and development environments — a true "eat your own cooking" story.

Tech stack: Multi-cloud Kubernetes across AWS, GCP, and Azure. The team manages a growing number of customer environments, including those of enterprises like Autodesk and Nielsen, and fast-growing companies like Innovaccer and InMobi, alongside 3 internal development clusters. Total environments monitored exceed the customer tenant base by roughly 50% when internal clusters are included. All managed by a 4-person DevOps/SRE team.

Challenges

Multi-Cloud Log Sprawl

Clusters and deployments were scattered across AWS, GCP, and Azure. Each cloud provider had its own default logging solution. When an alert fired, engineers spent 5+ minutes just opening the correct cloud console and another 5–10 minutes crafting the right query to retrieve relevant logs.

Alert Noise and False Positives

The team received a steady stream of alerts from Kubernetes clusters across three clouds. Many were outdated or no longer relevant, such as alerts for nodes that had already been replaced by the autoscaler. Without a way to quickly validate alert context, every alert required manual investigation, and the cognitive load kept mounting.

Maintaining the Monitoring Stack Itself

Beyond the alert-response workflow, the team also carried the ongoing burden of managing the observability infrastructure as a service in its own right. Keeping Prometheus healthy across three cloud environments meant continuous work on storage tuning, retention policy management, and federation configuration, none of which is related to actual application reliability. Upgrades to the monitoring stack required the same care as a production deployment. When cardinality spiked during large deploys, Prometheus would come under memory pressure at the worst possible moment — the monitoring would become unreliable precisely when monitoring mattered most.

This was toil that had nothing to do with serving customers. It was engineering time spent keeping the lights on for the tools that were supposed to keep the lights on.

No Centralized Observability

Metrics, logs, and traces were siloed across different tools and cloud-specific dashboards. During an incident, engineers had to visit at least three different places to piece together the full picture: a logging backend (queried with LogQL), per-cloud metric dashboards (PromQL), and the Kubernetes clusters themselves.

Solution

Centralized Logging, Metrics, and Traces with ObserveNow

All telemetry data from every cloud provider and workload was funneled into ObserveNow, eliminating the need to jump between AWS CloudWatch, GCP Cloud Logging, and Azure Monitor. Engineers now have a single pane of glass across all environments and can check live node utilization directly in the Aiden chat window without navigating to any external dashboard.

Aiden Integration with ObserveNow

Aiden was integrated with ObserveNow so it can fetch dashboards and run log queries on behalf of the user. Instead of manually crafting PromQL or LogQL queries, engineers ask Aiden in natural language and get instant, correlated results across all clouds.
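The idea of translating a natural-language ask into a backend query can be sketched as a template lookup filled with cluster context. This is a minimal illustration of the concept, not Aiden's actual implementation; the intents and templates are hypothetical.

```python
# Hypothetical sketch: map an engineer's intent to a query template and fill
# it with workspace context. Real translation is done by the Aiden agent.

QUERY_TEMPLATES = {
    # LogQL: stream selector plus a line filter for "error"
    "errors": '{{namespace="{ns}", app="{app}"}} |= "error"',
    # PromQL: container restart counter scoped to a namespace
    "restarts": 'kube_pod_container_status_restarts_total{{namespace="{ns}"}}',
}

def build_query(intent: str, ns: str, app: str = "") -> str:
    """Fill the template matching the engineer's intent with cluster context."""
    return QUERY_TEMPLATES[intent].format(ns=ns, app=app)

print(build_query("errors", ns="payments", app="api"))
# {namespace="payments", app="api"} |= "error"
```

In practice the agent also chooses which backend to run the query against, which is exactly the step that used to cost engineers 5–10 minutes per alert.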

Custom Knowledge Bases Per Workspace

The team used Aiden's knowledge base feature to teach the agent about correct namespaces, deployment names, and special configuration details for each workload. Each workspace has its own knowledge base, so Aiden queries the right data automatically without engineers needing to specify context on every request.
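Conceptually, a per-workspace knowledge base is a context bundle attached to every query so the engineer never has to restate it. The structure and field names below are illustrative assumptions, not Aiden's actual schema.

```python
# Hypothetical per-workspace knowledge bases: each workspace carries its own
# namespaces, deployment names, and workload quirks.

KNOWLEDGE_BASES = {
    "production": {
        "namespace": "prod-apps",
        "deployment": "api-server",
        "notes": "logs ship via a sidecar collector",
    },
    "development": {
        "namespace": "dev",
        "deployment": "api-server-dev",
        "notes": "debug logging enabled",
    },
}

def resolve_context(workspace: str) -> dict:
    """Return the context that gets attached to every query in a workspace."""
    return KNOWLEDGE_BASES[workspace]

print(resolve_context("production")["namespace"])
# prod-apps
```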

Workspace Segregation for Governance

The team used Aiden's workspace concept to separate development and production environments, ensuring data isolation and role-based access control (RBAC). Development workloads and their telemetry data are completely segregated from production.
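The governance model reduces to a role-to-workspace grant check. This is a minimal sketch under that assumption; the roles and grant structure are hypothetical, not StackGen's actual access model.

```python
# Hypothetical workspace RBAC: a role is granted a set of workspaces, and any
# query outside that set is denied, keeping dev and prod telemetry isolated.

ROLE_GRANTS = {
    "sre": {"production", "development"},
    "developer": {"development"},
}

def can_access(role: str, workspace: str) -> bool:
    """True only if the role is explicitly granted the workspace."""
    return workspace in ROLE_GRANTS.get(role, set())

print(can_access("developer", "production"))
# False
```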

Results
Before → After

3+ tools to check during incidents → 1 unified platform: ObserveNow + Aiden
5 min just to open the right cloud console → Instant access via Aiden natural language
5–10 min to manually craft log queries → Aiden auto-runs queries using the Knowledge Base
No separation of dev/prod observability → Workspace-based RBAC and data isolation
Manual context-switching across AWS, GCP, Azure → Single pane of glass across all clouds
Alert noise without automated triage → Aiden filters false alerts in seconds

Real Incidents: Aiden in Action

Incident 1: False Alert Resolved in Seconds

The team received the following alert from their AWS EKS cluster:

"Inside cluster developer-eks, node ip-10-0-1-104.us-west-2.compute.internal: node not ready for more than 10 minutes"

Rather than manually querying CloudWatch or the Kubernetes API, the engineer asked Aiden to investigate. Aiden determined that the node in question no longer existed; it had already been replaced by a healthy node as part of normal autoscaling. The alert was a ghost: valid when it fired, irrelevant by the time the team saw it. The engineer got a clear answer in under 30 seconds and moved on.
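The check Aiden effectively performed can be sketched as comparing the node named in the alert against the cluster's current node list. The node names here are illustrative; a real check would query the Kubernetes API.

```python
# Sketch of stale-alert validation: if the flagged node is no longer in the
# cluster, the alert can be closed without investigation.

def is_stale_node_alert(alert_node: str, current_nodes: set) -> bool:
    """An alert about a node that no longer exists is stale."""
    return alert_node not in current_nodes

# Hypothetical current node list after the autoscaler replaced the node
current = {"ip-10-0-1-222.us-west-2.compute.internal"}
print(is_stale_node_alert("ip-10-0-1-104.us-west-2.compute.internal", current))
# True — the flagged node was already replaced
```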

Figure 1: Aiden determining that the flagged node no longer exists — alert resolved without manual investigation.

Figure 2: Aiden surfacing live node utilization directly in the chat window — no external dashboard required.

Incident 2: CrashLoopBackOff — Root Cause Found in Under 2 Minutes

A service entered a CrashLoopBackOff state. An engineer asked Aiden to investigate. Within seconds, Aiden pulled the recent Kubernetes events for the affected pod, reviewed the pod logs for error details, and identified the likely root cause: a database migration script that had failed to complete cleanly. The engineer was pointed directly at the migration script for review, skipping the usual 10–15 minute archaeology through logs across multiple tools.

Total Aiden analysis time: under 2 minutes.
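The triage pattern in this incident — scan recent pod logs for known failure signatures and suggest a likely root cause — can be sketched as follows. The signatures and log lines are illustrative assumptions, not Aiden's actual heuristics.

```python
# Hedged sketch of signature-based triage for a CrashLoopBackOff pod.

SIGNATURES = [
    ("migration", "database migration script failed to complete"),
    ("OOMKilled", "container exceeded its memory limit"),
    ("connection refused", "dependency unreachable at startup"),
]

def suggest_root_cause(log_lines: list) -> str:
    """Return the first matching diagnosis, or fall back to event inspection."""
    for needle, diagnosis in SIGNATURES:
        if any(needle in line for line in log_lines):
            return diagnosis
    return "no known signature; inspect Kubernetes events"

logs = ["alembic: migration 0042 aborted", "panic: exiting"]
print(suggest_root_cause(logs))
# database migration script failed to complete
```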

Figure 3: Aiden pulling Kubernetes events and pod logs during CrashLoopBackOff investigation.

Figure 4: Aiden pinpointing the database migration script as the root cause.

Incident 3: Capacity Crisis Caught 2 Hours Before Impact

This is the story the team is most proud of and the one that best illustrates what Aiden for Observability does that traditional alerting cannot.

While Aiden was investigating an unrelated issue, it noticed a dashboard showing a service's combined log ingestion volume growing at an unusually fast rate. Aiden calculated that, at the current ingestion rate, the persistent volume claim would be full within 2 hours, at which point the service would be disrupted.

Under the traditional alerting model, the team would have received an alert with roughly 15 minutes to act once the volume threshold was crossed. Instead, Aiden surfaced the issue proactively, giving the team a 2-hour remediation window. They investigated, identified the cause, and resolved it before any customer impact.
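The capacity math behind this catch is simple headroom arithmetic: remaining space divided by ingestion rate. The numbers below are illustrative, not the actual incident values.

```python
# Worked sketch of the time-to-full projection for a persistent volume claim.

def hours_until_full(used_gib: float, capacity_gib: float,
                     ingest_gib_per_hour: float) -> float:
    """Hours of headroom left at the current ingestion rate."""
    return (capacity_gib - used_gib) / ingest_gib_per_hour

# e.g. 92 GiB used of a 100 GiB volume, filling at 4 GiB/hour
print(hours_until_full(used_gib=92, capacity_gib=100, ingest_gib_per_hour=4))
# 2.0
```

A threshold alert only fires once usage crosses a fixed line; projecting the rate forward is what turns a 15-minute scramble into a 2-hour window.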

Traditional Alerting: ~15 min to act after threshold breach
Aiden for Observability: ~2 hours of advance warning

Platform Capabilities Used
ObserveNow

Centralized logs, metrics, and traces from all cloud providers into a single observability layer — the foundation that makes every Aiden query possible.

Aiden Natural Language

Engineers query any log, metric, or trace in plain English. Aiden translates to PromQL/LogQL, runs the query, and returns correlated results — no manual query crafting required.

Knowledge Base

Custom per-workspace configuration guides Aiden to correct namespaces, deployments, and workload-specific details without requiring context from the engineer on every request.

Workspaces

Segregated dev and production environments with RBAC for data isolation and governance. Each environment has its own Aiden context, knowledge base, and access controls.

Proactive Analysis

Aiden doesn't wait for alerts — it surfaces anomalies and capacity risks as a byproduct of normal investigation, converting reactive on-call into proactive reliability management.
