Skip to content

The 4-Body Problem of SRE: Building an Agentic OS for Autonomous Operations

Architecting Agentic Systems – the Operations edition

Author:
Sanjeev Sharma | Jun 09, 2026
The 4-Body Problem of SRE: Building an Agentic OS for Autonomous Operations
Topics

Share This:

A few years ago, I sat through an incident bridge call that I still cannot forget. A large global enterprise was running a business-critical application across two public clouds, with a multi-cloud management platform layered on top, application development outsourced to one vendor, managed services outsourced to another, network handled by yet another, and just to make life interesting, a possible security breach in scope. Performance had degraded. SLOs were burning. Customers were unhappy. Lawyers were lurking. The bridge convened with eight (yes, eight) parties on the line.

Someone, somewhere in the room, pulled out a 200-row RACI spreadsheet to figure out who was supposed to do what. I will spare you the suspense: the RACI was useless. Every vendor showed green dashboards from their own slice of the world. The two cloud vendors blamed each other. The multi-cloud orchestration vendor walked off the call the moment it heard a competitor's cloud was involved. Hours of "let me check with my engineering team and get back to you" went by. The actual root cause, when we eventually found it, was hiding in a subcontracted network telemetry stream that no one had bothered to integrate into anyone's observability stack.

That night was the most expensive demonstration I have ever seen of two truths about modern operations:

  1. No single party has the whole picture.
  2. There is no human, no matter how experienced, who can hold the whole picture in their head.

This is what I have come to call SRE's 4-Body Problem, and it is the reason I have been spending most of my time recently architecting what I believe is the only sustainable answer to it: an Agentic OS for autonomous operations.

People Putty: How We've Been Holding the Stack Together

Before we get to agents and Agentic OSes, I want to be honest about how Site Reliability Engineering has actually been done in most enterprises I've worked with, including the very mature ones.

It runs on what I have heard called as “people putty”. Critical context lives, in fragments, in the heads of three or four senior engineers who have been around long enough to know which Terraform module actually owns the VPC peering, why that one Kubernetes namespace can never be drained on a Friday, and which alert rule was tuned down two years ago because it kept paging at 3 a.m. for a reason no one wrote down.

Runbooks? Sure. We have runbooks. Hundreds of them. Most of them were last accurate at the moment they were written, which was usually right before the next infrastructure change drifted them out of relevance. Stale runbooks are arguably worse than no runbooks, because they invite confident, wrong action.

People putty is what we use when we don't have a system. It is the human glue between siloed tools, siloed teams, and siloed signals. It is also the single biggest blocker to autonomous operations.

Every story like the 8-vendor bridge call I opened with is, at its heart, a story about people putty failing under load. When the system gets complex enough, no human can be the integration point any longer. We can plead for "better collaboration" all we want, but the math of human attention, working memory, and 2 a.m. cognition will eventually win.

The 4-Body Problem

So why is operations specifically so hard? Why do we keep ending up on 11 p.m. bridge calls with 8 vendors and a 200-row RACI?

The First Principles answer, I think, is this: an SRE, human or an agent, has to reason simultaneously over four tightly coupled bodies of truth.

  • Code: every commit, PR, branch, build artifact, version, and configuration change. What did we deploy, when, and what was different from yesterday?
  • Infrastructure State: the actual, current shape of the cloud accounts, networks, clusters, queues, databases, and IAM policies. What does Terraform say should be there, and what is actually there, right now?
  • Runtime Signals: metrics, logs, traces, events, error budgets, SLOs, customer-impacting alerts. What is the system doing, right this second, how is that different from what it normally does, and when did it change behaviour?
  • Operational Knowledge: the tribal wisdom, the post-mortems, the architectural decision records, the "we tried that in 2022 and it took the whole region down," the runbooks, the on-call playbooks, the Slack threads that explained why a thing is the way it is and not the way is was designed.

Each of these four bodies, in isolation, is something we mostly know how to manage. We have Git for code. We have Terraform and the cloud provider APIs for infra state. We have Prometheus, ObserveNow, Datadog, New Relic, Splunk, and forty other tools for runtime signals. We have Confluence, Notion, and Slack for operational knowledge. On second thought, "manage" is a strong word in this context.

The problem is not any one of the four. The problem is that every meaningful operational decision sits at the intersection of all four. And the intersection is where we have, historically, had no system at all.

A useful analogy: in physics, the 3-body problem describes a system of three gravitating masses whose long-term motion has no closed-form solution. It is famously chaotic. Adding a fourth body does not just make it 33% harder – it makes the dynamics qualitatively wilder. (I know, I know, the Liu Cixin novel is having a moment. I am borrowing the framing, not, spoiler alert, the alien invasion.) Modern SRE is in the same neighborhood. Code, Infra State, Runtime Signals, and Operational Knowledge are not four independent systems. They are four masses whose orbits affect each other constantly, and our job is to predict where the system is going to be in 30 seconds without losing customer trust in the meantime.

Code, Infra State, Runtime Signals, and Operational Knowledge are four masses whose orbits affect each other constantly. SRE is the practice of predicting where the system will be in 30 seconds without losing customer trust in the meantime.

The only entities I know of that have historically navigated this 4-body problem successfully are people. Specifically, the senior engineers and SREs whose brains have, painstakingly, over years, built up an internal model that simultaneously tracks all four bodies. They are precious, expensive, and they leave the company every two years. And they cannot scale to the cloud-native, multi-cloud, multi-vendor, multi-region surface area of the modern enterprise.

So either we accept that operations is a permanently human-bottlenecked function, and live with the bridge calls, or we build a system that can hold the 4-body problem the way our best engineers do.

That system is what we at StackGen have been calling an Agentic OS for Ops.

From Artifacts to Knowledge Graph to a Live Context Graph

To run agents on the 4-body problem, the first thing you need is a substrate they can read, write, and reason against. Today, we don't have one. Git is over here. Terraform state is over there. Traces are in a SaaS tool. Runbooks are on a Confluence page that hasn't been updated since the last platform team reorg. Each tool is excellent at its own job and almost completely unaware of the other three.

You cannot put an AI agent on top of that and expect it to behave reliably. An agent is only as good as the context it can reason over, validated by the evals you can run against it. If the context is fragmented, the agent will hallucinate, or worse, plausibly hallucinate, which is the kind of failure that is hard to catch in review.

1. From Artifacts to Knowledge Graph to a Live Context Graph

What we need instead is a Live Context Graph: a single, real-time, queryable, observable, and durable representation of all four bodies and the relationships between them, derived from the Knowledge Graph .

  • Live because stale context is worse than no context. Yesterday's view of the world is not the world.
  • Cobtext graph because the value is not in any individual node (a commit, a Terraform resource, a span, a runbook). The value is in the edges – this commit changed that module that provisioned this resource that emitted this latency anomaly that matches this past incident, which had this fix.
  • Queryable by both humans and agents, with consistent semantics across both.
  • Observable because the graph itself has to be a first-class telemetry source. We need to know when it stops being live.
  • Durable because every decision an agent makes will be made against a specific version of the graph, and we need to be able to replay it, for validation and auditability.

Once you have a Live Context Graph, agents become possible. Without it, they are (slick) demos.

Where the Agentic OS Plugs In

If the Live Context Graph is the substrate, the Agentic OS is the runtime on top of it. It is the place where reasoning agents, policy, guardrails, and human-in-the-loop approvals all live. (For the principles behind why an agent needs Autonomy, Agency, Guardrails, and Trust, see this post – I'm not going to relitigate them here.)

The three places where, in my experience, an Agentic OS first starts to pay for itself are also the three places people putty is thickest:

CI/CD pipelines. Every commit, build, test,deploy and learning from past incidents becomes part of the agent's working memory. When an SLO breach happens at 14:07, the agent doesn't have to ask "did anything change?" – it already knows. The deploy that touched the connection-pool config at 13:54 is right there, with its diff, its author, and its blast radius. That is a very different starting point than "let's go check the deploy channel in Slack”. Future state: The Agent will be able to review the commit and based on its memory/knowledge identify how it will affect the system in future or recommend optimizations based on its knowledge of the system. This prevents future incidents even before the code is committed.

Infrastructure drift. Terraform tells you what should be there. The cloud tells you what is there. The Agentic OS continuously reconciles the two and flags drift before it becomes an outage. More interestingly, it can flag intentional drift – the kind that happened because someone with prod access made a break-glass fix on a Friday night and meant to PR it later but didn't. We can stop pretending those don't happen and start treating them as observable events.

SLO violations. Today, an SLO breach mostly results in a page. With an Agentic OS, the breach triggers a traceable agent decision: form a hypothesis, rank it against historical patterns, run a bounded experiment (canary rollback, throttle, scale-out, isolate the bad node) inside policy guardrails, and log every step. The page still happens for the human, but now the human walks into a war room where the first 12 minutes of investigation work has already been done, with citations.

The pattern across all three is the same: agents read from the Live Context Graph, reason about hypotheses, act within policy, and write the result back to the graph. Every loop makes the graph richer, which makes the next loop better. Compounding.

Decision Traces, Not One-Off Automations

Here is where I want to be very careful, because the failure mode in this space is real and I have seen it.

You can build a lot of impressive-looking automation that is, fundamentally, just a script with an LLM in the middle. It runs. It does a thing. Sometimes it does the right thing. When it does the wrong thing, no one really knows why, because the prompt was different last Tuesday and the model version got bumped on Thursday and the input context wasn't captured anywhere.

That is not autonomy. That is a Rube Goldberg machine with extra steps.

Scripts and alerts are opaque and have no memory. Decision traces are durable, queryable, and immutable. Autonomy you cannot defend is autonomy you do not actually have.

The non-negotiable architectural commitment of an Agentic OS, in my view, is the decision trace. Every action an agent takes produces a durable record of:

  • the inputs it saw (which snapshot of the Live Context Graph),
  • the policies that were in effect,
  • the model/version it used,
  • the hypotheses it considered and rejected,
  • the action it took, and
  • the outcome it observed.

This record has to be auditable (every input logged), reproducible (we can replay it against the same or new state), and trustable (we can answer the "why" before the "what"). Without this, you do not have autonomous operations. You have probabilistic ops with a marketing budget.

This is also, not coincidentally, what your auditor, your CISO, your risk officer, and your regulator are going to ask for the first time an agent does something interesting in production. You may as well bake it in now.

The Future

In the Future state an agent will be able to review the commit and based on its memory/knowledge identify how it will affect the system in future or recommend optimizations based on its knowledge of the system. This way prevent future incidents even before the code is committed.

The Path

So how do we actually get from where most of us are today – the proverbial 8-vendor bridge calls and 200-row RACIs – to autonomous operations?

I would offer two principles, and then I will get out of your way.

First, treat operations as data. Stop letting Code, Infra State, Runtime Signals, and Operational Knowledge live as four separate, mutually-suspicious silos. Unify them into a Live Context Graph. The Context graph is more important than the agents. Without it, no agent will ever be trustworthy enough to put in the critical path.

Second, embed agents in the path, not after it. The pattern of "human does the work, agent writes the post-mortem" is not autonomy. It is dictation. The agents have to be reasoning over Code, IaC, Telemetry, and Knowledge continuously – not after the incident, but as a continuous loop that makes incidents rarer in the first place. Every decision they take adds a trace. Every trace makes the next decision better.

Will we get all the way to fully autonomous operations in 2026? No. Will every enterprise be ready for it on the same timeline? Definitely not. But the trajectory is, I think, clear. The 4-body problem of SRE is not a problem that scales with more humans. It is a problem that scales with better context and trustworthy agents acting on it.

If you are an SRE leader, a platform engineering leader, or a CIO/CTO trying to figure out where to spend your AI dollars in the operations space, my suggestion is simple: don't start with the agents. Start with the graph. Start with the four bodies. Start with making your operational truth queryable. The agents will follow, and when they do, they will actually work.


 

This post is based on a keynote I have been giving recently titled The 4-Body Problem: Building an Agentic OS for Autonomous SRE. If you want me to come run it for your engineering org or leadership team, drop me a line. And as always, push back in the comments below or on my socials. This is very much a work in progress and I want to hear where you disagree.

 

Disclaimer: No bridge call was harmed in the writing of this post. Eight vendors were, however, very mildly inconvenienced.

 

About StackGen:

StackGen is the pioneer in Autonomous Infrastructure Platform (AIP) technology, helping enterprises transition from manual Infrastructure-as-Code (IaC) management to fully autonomous operations. Founded by infrastructure automation experts and headquartered in the San Francisco Bay Area, StackGen serves leading companies across technology, financial services, manufacturing, and entertainment industries.

All

Start typing to search...