When Infrastructure Learns to Fix Itself: Inside the Rise of AgentOps

October 27, 2025

Imagine a night with no one on-call. An incident hits… and the system heals itself.
Now imagine that same system – confidently making the wrong move.
That’s the frontier of AIOps today — and the rise of what researchers call “AgentOps.”

We talk a lot about DevOps automation, much less about automation that thinks – and doesn't ask for permission. We're heading towards an era where infrastructure can self-repair. The question is: what happens the day it starts mis-repairing?

Recently, I dug into a research paper (using NotebookLM obviously) from Cornell University and Microsoft: “AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds.” They stood up live microservice systems, injected real faults, drove workload, and measured agent behavior end-to-end across the incident lifecycle. They call this direction AgentOps – agents that own detection, localization, root cause, mitigation, and learning loops.  

How Researchers Test Whether Agents Can Think (P = ⟨T, C, S⟩)

The authors formalize every evaluation as a tuple P = ⟨T, C, S⟩ (sketched in code after this list):

  • T (Task) – one of four stages in the incident lifecycle: Detection, Localization, Root Cause Analysis (RCA), Mitigation. Each task has success criteria and metrics (e.g., Time-to-Detect for detection).  
  • C (Context) = ⟨E, I⟩ – E is the operational environment (cloud service, injected fault model, workload), not shared with the agent; I is the problem information available to the agent (service docs, task description, APIs, and queryable telemetry like logs/metrics/traces).
  • S (Expected solution/oracle) – task-specific ground truth (e.g., the faulty microservice for localization), used for evaluation.  
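
To make the formalization concrete, here's a minimal Python sketch of that tuple. The class and field names are my own illustration of the structure, not AIOpsLab's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """T: the four incident-lifecycle stages used as task levels."""
    DETECTION = "detection"
    LOCALIZATION = "localization"
    RCA = "root_cause_analysis"
    MITIGATION = "mitigation"


@dataclass
class Context:
    """C = <E, I>: hidden environment plus the information the agent actually sees."""
    environment: dict   # E: deployed service, injected fault, workload (not shared with the agent)
    agent_info: dict    # I: task description, allowed APIs, docs, queryable telemetry endpoints


@dataclass
class Problem:
    """P = <T, C, S>: one evaluation instance."""
    task: Task          # T: which lifecycle stage is being tested
    context: Context    # C: environment + information given to the agent
    oracle: dict        # S: ground truth, e.g. {"faulty_service": "geo"} for localization
```

A localization problem, for example, pairs task=Task.LOCALIZATION with an oracle naming the faulty service, and the evaluator never shows that oracle to the agent.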

To keep things real, AIOpsLab deploys DeathStarBench microservices (“SocialNetwork,” “HotelReservation”), drives workload, and uses a fault library wired through ChaosMesh, with full observability via Prometheus, Jaeger, Filebeat/Logstash.  
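
For a taste of the fault-injection side: ChaosMesh faults are declared as Kubernetes custom resources, and the sketch below builds one such PodChaos manifest in Python. The namespace, labels, and duration here are made-up placeholders, not entries from the paper's actual fault library.

```python
import yaml  # pip install pyyaml

# A ChaosMesh PodChaos resource that makes one pod of a target service unavailable.
# The namespace and label selector are illustrative, not AIOpsLab's real targets.
pod_failure = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "demo-pod-failure"},
    "spec": {
        "action": "pod-failure",   # keep the pod unavailable for `duration`
        "mode": "one",             # pick a single matching pod
        "duration": "60s",
        "selector": {
            "namespaces": ["social-network"],
            "labelSelectors": {"app": "user-service"},
        },
    },
}

# Print the manifest; it could then be applied with `kubectl apply -f -`.
print(yaml.safe_dump(pod_failure, sort_keys=False))
```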

The 4 Tasks Every AI Operator Must Master

They define a simple, useful task ladder:

  1. Detection – did the agent detect an anomaly?
  2. Localization – can it pinpoint the faulty component (e.g., service/pod)?
  3. RCA – can it identify the underlying cause (e.g., misconfig vs. operational error)?
  4. Mitigation – can it propose/execute an effective recovery?

The higher the level on the ladder, the harder and more impactful it becomes.  

Faults span symptomatic (degradation, crashes) and functional (e.g., config/privilege/release bugs) across application and infrastructure layers. This lets them grade not just “did you notice something is wrong?” but “can you fix the real cause without collateral damage?”  

Why Guardrails Matter

The Orchestrator is the middle layer that keeps agents honest, keeps them on the rails, and keeps the system safe:

  • Provides an Agent-Cloud Interface (ACI) – a tight, documented API surface designed for LLMs (not humans), so agents don’t drown in dashboards and irrelevant noise. Core actions include get_logs, get_metrics, get_traces, exec_shell (policy-filtered). It also defines observations: how the system state/telemetry is reflected back after each action.  
  • Runs session-based control – each problem is a session. The Orchestrator gives the agent the problem prompt, instructions, and the allowed APIs, and expects the agent to implement a get_action(state) -> str loop.  
  • Handles problem initializers (deploy service + workload + fault injection) and evaluators (compare the agent’s submitted solution to the oracle, compute metrics, and log every action/state for replay and analysis).  

Think of it as: tight action space + observable feedback + reproducible sessions + automatic scoring. Exactly the constraints we’d want before letting agents press buttons in prod.
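
To picture that contract, here's a minimal sketch of the session loop, assuming a hypothetical ACI client object that exposes the documented actions; the real framework's classes and wiring will differ.

```python
# Minimal sketch of the session contract: the Orchestrator feeds observations,
# the agent returns its next action as text, until it submits or hits the cap.
# `llm` and `aci` are hypothetical stand-ins, not the framework's real objects.

MAX_STEPS = 20  # step budget: the paper shows extra steps eventually stop paying off


class ShellAgent:
    def __init__(self, llm):
        self.llm = llm
        self.history = []

    def get_action(self, state: str) -> str:
        """Core loop contract: observation in, next action (as text) out."""
        self.history.append({"role": "user", "content": state})
        action = self.llm.complete(self.history)   # e.g. "get_logs(service='geo')"
        self.history.append({"role": "assistant", "content": action})
        return action


def run_session(agent, aci, problem_prompt: str) -> str:
    state = problem_prompt                          # task description + allowed APIs
    for _ in range(MAX_STEPS):
        action = agent.get_action(state)
        if action.startswith("submit"):             # agent proposes its solution
            return action
        state = aci.execute(action)                 # get_logs / get_metrics /
                                                    # get_traces / exec_shell (policy-filtered)
    return "submit(no_solution)"                    # budget exhausted
```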

Inside the Experiment: Agents, Faults, Metrics

Agents

  • FLASH – a compound “workflow” agent that breaks problems into conditional segments and leverages chain-of-thought + tools; highest accuracy in aggregate.  
  • REACT – classic Reason-Act loop agent.
  • GPT-4-with-Shell and GPT-3.5-with-Shell – direct LLM agents with shell/tool access (naive registration, no fine-tuning).  

Problem pool

  • The problem pool included 48 problems across the four task levels, created from 10 fault types (auth missing, revoke auth, target pod misconfig, buggy app logic, pod failure, network loss, etc.). Each fault can be injected at different target components to vary the blast radius and propagation topology, for a total of 288 test cases.

Metrics (how they score)

  • Correctness – success against the task oracle (did you detect/localize/analyze/resolve?).
  • Time/Steps – e.g., Time-to-Detect (TTD), Time-to-Mitigate (TTM); plus number of actions/steps.
  • Cost – input/output token counts for the agent+environment loop, a practical proxy for dollar cost and scalability.
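
As a rough illustration of the scoring side (the field and function names are mine, not the framework's), the evaluator only needs the logged session to compute all three families of metrics:

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    """What the Orchestrator records for one problem session (illustrative fields)."""
    detected_at: float | None    # seconds since fault injection, None if missed
    mitigated_at: float | None   # seconds since fault injection, None if never fixed
    steps: int                   # number of agent actions taken
    prompt_tokens: int
    completion_tokens: int
    submitted: str               # the agent's final answer
    oracle: str                  # ground-truth answer


def score(log: SessionLog, usd_per_1k_tokens: float = 0.01) -> dict:
    """Correctness against the oracle, time/step metrics, and a token-cost proxy.
    The price per 1k tokens is a placeholder, not a figure from the paper."""
    return {
        "correct": log.submitted == log.oracle,
        "ttd_seconds": log.detected_at,
        "ttm_seconds": log.mitigated_at,
        "steps": log.steps,
        "est_cost_usd": (log.prompt_tokens + log.completion_tokens) / 1000 * usd_per_1k_tokens,
    }
```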

Results (the good, the bad, the “not yet”)

Although the results were published at the beginning of 2025, I still find them disappointing. Then again, it depends on which AI camp you're in – are you among those who already use AI in your work and are just waiting for an "AgentOps" like this to reach your environment (as long as it proves trustworthy)? Or are you in the group that's afraid to let such an agent anywhere near your environment?

FLASH has the highest accuracy overall but is the slowest; GPT-3.5-with-Shell is the fastest but least accurate; REACT is close to FLASH in accuracy with moderate latency.  

By task:

  • Detection is “easy mode” (binary) – FLASH aces it; others vary.  
  • Localization – accuracy drops for most agents once you require a specific service/pod; using top-k answers helps a bit.  
  • RCA and Mitigation are where things break… accuracy declines sharply; mitigation demands multi-step interaction with the environment (e.g., rollback, config patching), and that’s where the agents stumble.  

Step limits matter: Allowing more steps initially boosts accuracy (agents can explore and verify), but beyond a point the returns vanish – extra steps just burn tokens without better outcomes.  

API behavior: Agents disproportionately rely on get_logs; get_metrics and get_traces are under-used or misused. FLASH interestingly doesn’t call get_traces at all.  

Why Agents Still Fail in the Real World

  1. Wasting steps on unnecessary actions
    Repeating the same API call, inventing non-existent APIs, or chattering between tools. Example: GPT-3.5-with-Shell looping the same erroneous commands or over-communicating between tool calls.  
  2. Over-consuming raw telemetry
    Agents sometimes cat huge logs/metrics dumps directly, bloating the context and distracting the model; they rarely use structured queries or filters, which hurts reasoning. The paper argues for better telemetry abstractions/filters in the interface (see the sketch after this list).  
  3. Invalid API usage & brittle command formats
    Wrong parameters or namespaces; some agents apologize and repeat the same mistake; others recover after trial-and-error.  
  4. False positives & misinterpretation
    In zero-fault scenarios, some agents mislabel normal activity as failures or misattribute the cause, highlighting the need for context-aware scoring and LLM-as-Judge traces to debug reasoning.  
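
Failure mode #2 is the easiest one to design around: hand the agent a filtered, structured view instead of a raw dump. A hypothetical helper might look like this (the log format and field names are assumptions, not AIOpsLab's interface):

```python
import json
from collections import Counter


def summarize_logs(raw_lines: list[str], max_examples: int = 3) -> dict:
    """Compress raw JSON log lines into an error summary the LLM can reason over,
    instead of pasting thousands of lines into the prompt."""
    errors = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                                 # skip unparseable lines
        if record.get("level") in ("ERROR", "FATAL"):
            errors.append(record.get("message", ""))

    counts = Counter(errors)
    return {
        "total_lines": len(raw_lines),
        "error_lines": len(errors),
        "top_errors": counts.most_common(max_examples),  # (message, count) pairs
    }
```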

Lessons DevOps Can Steal from AIOpsLab

  • Guard-railed ACI: a small, LLM-friendly action set + clear observations beats dumping Grafana into a prompt.
  • Session loops with automatic evaluation: every action, state delta, error, and token cost is logged; success is judged against an oracle, plus qualitative review with LLM-as-Judge when needed.
  • Composable problem initializers: deploy → inject fault → drive workload → observe → score – repeatable and extensible (sketched below).  
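
Stitched together, such a loop can stay very small. This is only a hedged sketch; every callable is a placeholder for whatever your own tooling provides (Helm or kubectl for deploy, ChaosMesh for faults, a load generator, an oracle-based evaluator):

```python
def run_benchmark(problem, agent_session, deploy, inject_fault, drive_workload, evaluate):
    """One repeatable evaluation round: deploy -> inject -> load -> observe -> score.
    All callables are placeholders for your own tooling; nothing here is the paper's code."""
    service = deploy(problem["environment"])          # 1. stand up the microservice app
    inject_fault(service, problem["fault"])           # 2. inject the chosen fault
    drive_workload(service)                           # 3. generate realistic traffic
    submission = agent_session(service)               # 4. let the agent observe and act
    return evaluate(submission, problem["oracle"])    # 5. score against the ground truth
```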

This is the right design pattern for letting agents touch real systems without chaos.

The Human Role in a Self-Managing Future

  • AgentOps isn't a future slide… It's here, but it's uneven and not yet mature enough for production. Agents still struggle to pin down root causes and execute safe mitigations, especially when issues are obscured by noise or partial observability.  
  • Interfaces beat prompts. Shrinking the action surface, adding structure (filtered telemetry, typed APIs), and enforcing sessions improves outcomes more than "bigger model, bigger context."  
  • Steps are not a strategy. More steps help… until they don’t. We need to incorporate planning, verification, and rollback disciplines into the agent loop.  

And our role, as DevOps Engineers, AI Engineers, and leaders, doesn't disappear – it evolves. Our responsibility shifts to a new zone, where we're teaching systems how to think like engineers: investigate, hypothesize, test, recover, and learn.

For DevOps engineers, this is more than a research milestone. It’s a glimpse of how our daily workflows will evolve and why human judgment remains the most valuable component in any automated system.

I'm optimistic. Today's agents aren't yet your SRE on night shift, but… with the right orchestrator, interface, scoring, and continuous learning – they'll become very capable operators that get better each week.
