You've done everything right.
You standardized on Terraform. You built the internal developer portal. You wrote the runbooks, defined the golden paths, and onboarded every team to your platform. And yet, your Slack DMs are still full of "hey, can you just quickly provision this?" tickets. State file conflicts are still blocking deployments on Friday afternoons. New engineers are still taking three weeks to make their first infrastructure change.
The problem isn't your platform. It's the interface layer between your platform and everyone who needs to use it.
Model Context Protocol (MCP) is changing that interface layer. And for platform engineers specifically, it's the most operationally significant development since Terraform modules became the standard unit of infrastructure composition.
This post breaks down exactly what MCPs enable for platform engineers, why the timing matters now, and how teams at hyperscaler scale are using AI agents + MCPs to eliminate the toil that's been eating your sprints for years.
MCP (Model Context Protocol) is an open standard that lets AI agents connect to external tools, APIs, and systems in a structured, secure way. Think of it as a universal adapter layer: instead of building one-off integrations between your AI tooling and your infrastructure systems, MCP provides a standard protocol that any compliant agent can use to read from and write to your stack.
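Concretely, the "standard protocol" is JSON-RPC 2.0 on the wire. Here is a minimal sketch of what a tool invocation looks like; the tool name and arguments are invented for illustration, while the envelope (method `tools/call`, params with `name` and `arguments`) follows the MCP specification:

```python
import json

# Sketch of an MCP tool-call request. The "terraform_plan" tool and its
# "workspace" argument are hypothetical; the JSON-RPC envelope is standard.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "terraform_plan",                  # hypothetical tool name
        "arguments": {"workspace": "payments-prod"},
    },
}
print(json.dumps(request, indent=2))
```

Any compliant agent can emit this shape against any compliant server, which is what makes the "universal adapter" framing more than a metaphor.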
For platform engineers, this means your AI agent, whether that's Aiden, Cursor, or any other MCP-compatible tool, can directly interact with your Terraform state, your Kubernetes API, your CI/CD pipeline, your IAM systems, and your cost data without requiring bespoke glue code for every integration.
The practical upshot: instead of a developer opening a ticket that sits in your queue for three days, they open a chat window, describe what they need in natural language, and an AI agent uses MCPs to provision the resource with all your guardrails, policies, and approval gates enforced in the background.
That's not a future state. Teams are shipping this today.
Let's be precise about the pain, because the solution only makes sense if you understand what it's solving at the operational level.
Most platform teams are running a hidden tax on engineering velocity. Every infrastructure change that requires a platform engineer to hand-hold it through, whether because the developer doesn't know the IaC syntax, the PR review process is opaque, or the approval chain requires three different people, is burning hours your team doesn't have.
The math is ugly: if your team handles 40 infrastructure requests per week, and each one requires an average of 45 minutes of platform engineer time (intake, context-switching, review, apply, verification), that's 30 hours per week of toil.
MCPs let AI agents handle the well-understood parts of that workflow end-to-end (resource provisioning from approved modules, environment cloning, IAM role assignment from predefined templates, cost estimation before applying) and escalate only the genuinely novel decisions to your team.
State file locking conflicts remain the #1 cause of blocked deployments in multi-team Terraform environments. When terraform apply fails with a lock conflict at 4:45 PM, someone has to manually inspect the state, determine whether the lock is stale, and decide whether it's safe to break. That someone is usually you.
MCP-enabled agents can be given read/write access to your Terraform state backend with scoped permissions. They can detect stale locks, surface the context of the last operation that acquired the lock, and, with the right guardrails, resolve conflicts according to your team's defined policies without paging anyone. The same pattern applies to drift detection: instead of a weekly terraform plan review buried in a Notion doc nobody reads, an agent continuously monitors for drift between your declared state and actual deployed infrastructure, surfaces it in context (in the same tool the developer is already using), and proposes the remediation.
Your OPA policies are solid. Your Sentinel rules are solid. The problem is that enforcement happens at plan time, which means developers don't find out their resource request violates a policy until after they've written the Terraform, opened the PR, waited for CI, and gotten to the apply stage. That's a 30-minute feedback loop on what should be a 30-second check.
MCP servers that expose your policy engine as a queryable tool change the feedback loop. An AI agent with access to your OPA endpoint can validate a proposed resource configuration before the developer writes a single line of HCL: the check happens at the point of natural language intent, not at the point of code submission. Failed audits caused by undocumented infrastructure changes become a non-issue when every change is brokered through an agent that enforces policy as a precondition, not an afterthought.
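A rough sketch of that pre-HCL check: the `{"input": ...}` request and `{"result": ...}` response envelope follows OPA's v1 Data API, but the `infra/authz` policy package and the intent fields are invented for illustration:

```python
import json

def build_opa_request(resource: dict) -> tuple[str, bytes]:
    """Return the (URL path, JSON body) an agent would POST before any HCL exists."""
    path = "/v1/data/infra/authz/allow"          # assumed policy package
    body = json.dumps({"input": resource}).encode()
    return path, body

def interpret_decision(response: dict) -> tuple[bool, str]:
    """Turn an OPA response into an allow/deny plus a message for the developer."""
    allowed = response.get("result", False)
    msg = "ok" if allowed else "request violates policy; see policy docs"
    return allowed, msg

# A natural-language intent, already normalized to structured fields.
intent = {"action": "create", "type": "aws_s3_bucket", "encryption": "none"}
path, body = build_opa_request(intent)
print(path, json.loads(body))

# The denial arrives in the 30-second loop, not the 30-minute one.
print(interpret_decision({"result": False}))
```

The same Rego policies you already enforce at plan time answer this query; nothing about the policy layer changes, only when it is consulted.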
Here's the strategic context your engineering leadership is already sitting with: every hyperscaler that tried to solve this problem internally has now open-sourced or commercialized its approach.
Uber's Genie system, their internal AI infrastructure agent, saves an estimated 13,000 engineering hours per year. Microsoft's Azure SRE Agent, now generally available, is reported to save 20,000 hours annually across their internal teams. Meta, Google, eBay, Instacart, Slack, and AWS have all built and deployed internal GenAI systems for infrastructure and operations workflows.
The executive question is no longer "should we explore AI for infrastructure?" It's "why don't we have this yet, and what's it costing us to wait?"
MCPs are the protocol layer that makes this accessible to teams that aren't hyperscalers. You don't need 50 ML engineers to build Genie from scratch. You need MCP-compatible AI tooling, a platform team that understands how to scope tool permissions correctly, and the right agent infrastructure to connect them.
That's a platform engineering problem, and it's yours to own.
The golden path problem: you've built golden path Terraform modules, but adoption is still lower than you'd like because the developer experience of using them is clunky. Developers have to find the right module version, understand the required variables, write the configuration, run terraform plan, interpret the output, and open a PR, all before they get feedback on whether what they're doing is even allowed.
With MCPs, the workflow becomes: developer describes the resource they need in natural language to an AI agent → agent uses the MCP connection to your module registry to find the correct module → agent generates the configuration → agent queries your policy MCP to validate compliance → agent runs terraform plan via your CI MCP and surfaces the plan output in context → developer reviews and approves with a single confirmation.
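That chain can be sketched as an agent-side pipeline. Every step below is a stub standing in for an MCP tool call; the function names, the registry path, and the toy policy are all invented for illustration:

```python
def find_module(intent: str) -> str:
    return "registry/app-service/aws@3.2.1"          # MCP: module registry lookup

def generate_config(module: str, intent: str) -> dict:
    return {"module": module, "intent": intent}      # agent: render variables

def policy_ok(config: dict) -> bool:
    return "prod" not in config["intent"]            # MCP: policy engine query

def run_plan(config: dict) -> str:
    return f"plan for {config['module']}: 3 to add"  # MCP: CI runs terraform plan

def human_approves(plan: str) -> bool:
    return True                                      # surfaced in the dev's own tool

def provision(intent: str) -> str:
    config = generate_config(find_module(intent), intent)
    if not policy_ok(config):
        return "rejected: policy violation"          # fails before any HCL is written
    plan = run_plan(config)
    return "applied" if human_approves(plan) else "pending approval"

print(provision("small postgres for dev"))
```

Note where the policy gate sits: before plan, before PR, before any reviewer's time is spent.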
The entire workflow happens in the same tool the developer is already using. Your guardrails are enforced at every step. Your platform team isn't in the loop until the approval step, and for routine provisioning from approved modules, you can define policies that eliminate manual approval entirely.
Aiden for Infrastructure is built on exactly this pattern: natural language intent to infrastructure code, with policy enforcement and module governance built into the agent workflow.
terraform plan on a production workspace showing 47 unexpected changes is one of the worst feelings in platform engineering. You know something drifted. You don't know what caused it, when it happened, or whether the drift is benign (a tag change) or critical (a security group rule was manually edited in the console).
MCP servers that connect your agent to AWS Config, your Terraform state backend, your CloudTrail logs, and your ticketing system let you answer all four questions in seconds rather than hours. The agent correlates the drift with CloudTrail events to find the timestamp and identity of the change, cross-references with your ticketing system to determine whether the change was authorized, classifies the drift by severity according to your policy definitions, and proposes the remediation plan either to codify the legitimate drift or to revert the unauthorized change.
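A hedged sketch of the correlation step, with CloudTrail-shaped event dicts and an invented severity map standing in for your policy definitions:

```python
SEVERITY = {  # assumption: team-defined drift classification
    "aws_security_group_rule": "critical",
    "tags": "benign",
}

def correlate(drifts: list[dict], events: list[dict]) -> list[dict]:
    """Attach the most recent change event and a severity to each drift item."""
    report = []
    for d in drifts:
        matches = [e for e in events if e["resource"] == d["resource"]]
        last = max(matches, key=lambda e: e["time"]) if matches else None
        report.append({
            "resource": d["resource"],
            "severity": SEVERITY.get(d["kind"], "review"),
            "changed_by": last["identity"] if last else "unknown",
            "when": last["time"] if last else "unknown",
        })
    return report

drifts = [{"resource": "sg-123/ingress-22", "kind": "aws_security_group_rule"}]
events = [{"resource": "sg-123/ingress-22", "identity": "alice@console",
           "time": "2025-01-09T21:14:00Z"}]
print(correlate(drifts, events))
```

The "who" and "when" come straight from the event log; the "how bad" comes from your own severity policy, so the classification stays auditable.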
That's not just faster. It's a fundamentally different posture: drift becomes a detected, classified, and resolved event rather than a ticking compliance time bomb.
Your FinOps team is spending three people's full-time effort reading Cost Explorer dashboards and sending Slack messages asking teams to right-size their resources. Cloud spend grows 30% year-over-year while traffic grows 10%, and nobody can fully explain the gap.
MCP connections to your cloud cost APIs, your Kubernetes metrics server (for HPA utilization data), and your infrastructure state let an AI agent do the correlation work automatically. Instead of a quarterly right-sizing exercise that nobody has bandwidth to action, the agent continuously monitors utilization across your ECS tasks, RDS instances, and EC2 fleets, identifies candidates for right-sizing based on configurable thresholds, and generates the Terraform diff for the change, including a cost impact projection.
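As an illustration of the threshold logic only: the 40% CPU and 50% memory cutoffs below are assumptions standing in for the "configurable thresholds" above, and the fleet data is invented:

```python
CPU_MAX, MEM_MAX = 0.40, 0.50  # flag anything peaking below both thresholds

def rightsizing_candidates(fleet: list[dict]) -> list[dict]:
    """Return instances whose p95 utilization sits under both thresholds."""
    return [
        {"id": i["id"], "suggestion": f"downsize from {i['type']}"}
        for i in fleet
        if i["cpu_p95"] < CPU_MAX and i["mem_p95"] < MEM_MAX
    ]

fleet = [
    {"id": "i-0abc", "type": "m5.2xlarge", "cpu_p95": 0.18, "mem_p95": 0.31},
    {"id": "i-0def", "type": "m5.large",   "cpu_p95": 0.72, "mem_p95": 0.64},
]
print(rightsizing_candidates(fleet))  # only i-0abc qualifies
```

The agent's real value add is what happens after this filter: generating the Terraform diff and the cost projection so the change is one review away, not one quarterly exercise away.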
Teams using this pattern typically find 15–25% in recoverable cloud spend in the first 90 days. More importantly, they stop the cycle of over-provisioning that happens when developers request resources without real-time cost feedback.
Your on-call rotation is the #1 reason engineers leave. It's not the hours; it's the chaos of receiving a PagerDuty alert at 2 AM, not knowing which service is affected, and spending the first 45 minutes assembling context from six different tools before you can even begin diagnosing the issue.
MCPs that connect your AI agent to your observability stack, your deployment history, your service dependency graph, and your past incident data let you change the first-response experience. When an alert fires, the agent automatically assembles the relevant context: which services are affected, what changed in the last 24 hours (deployments, config changes, traffic shifts), what similar incidents have looked like in the past, and what the runbook says to check first.
That context assembly, which normally takes 45–90 minutes of manual work, happens in under 60 seconds. Your on-call engineer arrives at the problem already oriented, with hypotheses to test rather than a blank screen and a Slack thread to dig through.
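A minimal sketch of that assembly step, with stub functions standing in for the MCP tool calls to each source (all names and return shapes are invented):

```python
def recent_deploys(service: str) -> list[str]:
    return ["api v2.14.0 at 01:12 UTC"]            # MCP: deployment history

def similar_incidents(service: str) -> list[str]:
    return ["INC-2291: conn-pool exhaustion"]      # MCP: past incident data

def runbook_first_steps(service: str) -> list[str]:
    return ["check RDS connection count"]          # MCP: runbook store

def dependency_blast_radius(service: str) -> list[str]:
    return ["checkout", "billing"]                 # MCP: service dependency graph

def assemble_context(alert: dict) -> dict:
    """Fan out to each source and return one briefing for the on-call engineer."""
    svc = alert["service"]
    return {
        "service": svc,
        "affected_downstream": dependency_blast_radius(svc),
        "changed_last_24h": recent_deploys(svc),
        "similar_past_incidents": similar_incidents(svc),
        "runbook_next": runbook_first_steps(svc),
    }

briefing = assemble_context({"service": "api", "alert": "p99 latency > 2s"})
print(briefing["runbook_next"])
```

Each stub would be an independent MCP tool call in practice, so the fan-out can run in parallel while the page is still ringing.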
This is where Aiden for DevOps and the broader StackGen platform come into play: connecting AI agents to your full operational context so that incident response starts with intelligence, not archaeology.
Deploying MCPs well requires intentional design. A few principles that matter in practice:
Scope permissions narrowly. Each MCP server should expose the minimum necessary capabilities. Your Terraform state MCP should not give an agent the ability to delete workspaces. Your cost MCP should be read-only unless you've explicitly designed and tested write operations. Treat MCP tool permissions with the same rigor you apply to IAM role ARNs; least privilege is not optional.
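A sketch of what "minimum necessary capabilities" can look like in code; the server and tool names are invented, and a real implementation would live in the MCP server's dispatch path:

```python
ALLOWED_TOOLS = {  # assumption: per-server capability scope, reviewed like IAM
    "tf-state": {"read_state", "list_locks", "unlock_stale"},  # no delete_workspace
    "cost":     {"get_spend", "forecast"},                     # read-only
}

def authorize(server: str, tool: str) -> None:
    """Raise before dispatch if the tool is outside the server's declared scope."""
    if tool not in ALLOWED_TOOLS.get(server, set()):
        raise PermissionError(f"{server} does not expose '{tool}'")

authorize("tf-state", "read_state")          # within scope, no error
try:
    authorize("tf-state", "delete_workspace")
except PermissionError as e:
    print(e)
```

Deny-by-default matters here: an unknown server or tool falls through to the empty set and is refused.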
Version-control your MCP server configurations. If your MCP server definitions aren't in code, they will drift. Store them in the same repositories as your platform tooling, apply the same review process, and enforce the same change management discipline.
Design for auditability from day one. Every operation an AI agent performs via MCP should be logged — what tool was called, with what parameters, by which agent, on behalf of which user, with what result. This isn't just good hygiene; it's the audit trail you need for SOC 2 and your security team's inevitable questions about "what is the AI doing to our infrastructure."
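A sketch of one audit record carrying exactly the fields listed above; the schema and field names are illustrative, not a standard:

```python
import json
from datetime import datetime, timezone

def audit_entry(tool: str, params: dict, agent: str, user: str, result: str) -> str:
    """Serialize one agent tool call as a structured, append-only log line."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "params": params,            # the full call, not a summary
        "agent": agent,
        "on_behalf_of": user,        # the human the agent acted for
        "result": result,
    })

line = audit_entry("tf-state.unlock_stale", {"lock_id": "7d3f9c2a"},
                   "aiden", "dev@example.com", "unlocked")
print(line)
```

Logging `on_behalf_of` separately from `agent` is the detail auditors ask about: the agent is the mechanism, the user is the accountable identity.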
Build human-in-the-loop checkpoints for high-blast-radius operations. Automated provisioning of a t3.micro for a dev environment? Fine to approve automatically. Automated changes to a production security group or an IAM policy? Require explicit human confirmation, with the full proposed change surfaced in context before approval. The goal is not to remove humans from the loop; it's to remove humans from the parts of the loop that don't require human judgment.
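A minimal sketch of such a gate, keyed on resource type and environment; the high-blast-radius list is an assumption a team would tune to its own risk model:

```python
HIGH_BLAST_RADIUS = {"aws_security_group", "aws_iam_policy", "aws_iam_role"}

def approval_required(change: dict) -> bool:
    """Require explicit human confirmation for risky types or any prod change."""
    return change["type"] in HIGH_BLAST_RADIUS or change.get("env") == "prod"

print(approval_required({"type": "aws_instance", "env": "dev"}))        # auto-approve
print(approval_required({"type": "aws_security_group", "env": "dev"}))  # human gate
```

The t3.micro-in-dev case falls through both conditions; the security group change trips the first one regardless of environment.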
Here's the honest framing: the platform engineers who understand MCPs and can design, deploy, and govern MCP-based agent systems in the next 12–18 months are going to be operating at a different level of leverage than those who don't.
Your job has always been to build the platform that makes everyone else more productive. MCPs extend that mission to the AI layer. Instead of your developers using AI coding tools that have no awareness of your infrastructure, your policies, or your operational context, you can build an AI-native platform layer where every agent interaction is grounded in your actual systems.
That's not a moonshot project. It's the next evolution of the work you're already doing.
Teams using StackGen's AI-powered infrastructure automation are cutting infrastructure provisioning time from days to minutes, reducing state file incidents by over 60%, and reclaiming 8–12 hours per engineer per week that was previously spent on toil. The pattern is repeatable, the tooling is mature, and the ROI is measurable.
If you're evaluating how to bring MCPs into your platform engineering workflow, the practical starting point is identifying the two or three workflows in your current operation that consume the most platform team time and have the clearest automation boundary.
State file conflict resolution. Routine resource provisioning from approved modules. Cost anomaly detection. Policy validation pre-PR. Pick the one with the highest toil-to-value ratio and design an MCP-enabled agent workflow around it. Instrument it well. Measure the before and after.
The teams winning with this approach aren't doing it all at once. They're finding the 20% of workflows that create 80% of the tickets and automating those first.
If you want to see how StackGen's Aiden AI Agent approaches this for platform engineering teams specifically, including how we handle policy enforcement, state management, and human-in-the-loop design, book a demo and we'll walk through it with your actual stack in mind.
For engineering leadership context on building the business case for AI-native infrastructure tooling, see our Engineering Leaders solutions page.