Agentic Ops · 8 min · Apr 2026

The Reflexion Engine: How Actor/Critic Agents Cut MTTR by 77%

We built a two-agent system on Vertex AI where the Actor proposes remediation hypotheses and the Critic validates them against live SLO baselines. Here's exactly how it works — and the guardrails that make it safe.

A single-agent LLM asked to fix a production incident has exactly one failure mode: confident nonsense. It'll reason, generate a plan, and propose kubectl delete deploy/payment-api before you've finished your coffee. No amount of prompt engineering removes this — the model has no ground truth to check itself against, because you asked it to be the ground truth.

The Reflexion pattern fixes this with a second agent whose only job is to disagree. It's not a new idea — the original Reflexion paper (Shinn et al., 2023) applied it to code generation and decision-making benchmarks — but porting it to production SRE turns out to be a useful forcing function. This post is how we did it.

Actor, Critic, and the rule neither of them can break

Two agents, one incident, one inviolable rule: nothing ships until the Critic signs off. Concretely:

  • Actor — reads the incident context (logs, metrics, events, recent deploys) and proposes *one* remediation with a confidence score and a predicted SLO delta.
  • Critic — re-reads the same context cold (no Actor chain-of-thought), evaluates the proposal against live SLO baselines, and emits APPROVE / REJECT / REQUEST_MORE_INFO.
  • Execution gate — only APPROVE + confidence ≥ 0.85 + predicted SLO compliance ≥ 95% fires the remediation. Anything else escalates to a human.

The Critic isn't a verifier in the theorem-proving sense — it's an independent reader. The value comes from the zero-shared-context setup: the Critic sees no hint of what the Actor was thinking. When two independent readings agree, confidence goes up nonlinearly.

Why this works better than chain-of-thought + self-critique

Self-critique ("now criticize your own answer") is theatre. The model has already committed to a reasoning path; the critique inherits the premise. Our measured disagreement rate between Actor and Critic on the same incident is 34% — and 80% of those disagreements are ones where the Actor was wrong. A self-critique variant we A/B-tested disagreed with itself 4% of the time, with negligible signal.

# Sketch — Vertex AI client calls elided for brevity
def handle_incident(ctx: IncidentContext) -> Decision:
    actor_proposal = actor.propose(
        system_prompt=ACTOR_PROMPT,
        incident=ctx,
    )  # returns (action, confidence, predicted_slo_delta)

    # Critic sees raw context — NOT the actor's reasoning.
    critic_verdict = critic.evaluate(
        system_prompt=CRITIC_PROMPT,
        incident=ctx,
        proposal=actor_proposal.action,  # just the action, no chain-of-thought
        slo_baseline=slo_baseline_for(ctx.service),
    )

    if (
        critic_verdict.approved
        and actor_proposal.confidence >= 0.85
        and critic_verdict.predicted_slo_compliance >= 0.95
    ):
        return Decision.execute(actor_proposal.action)
    return Decision.escalate_to_human(
        actor_proposal, critic_verdict, reason_codes(...)
    )
The moment the Critic sees the Actor's reasoning, the experiment is dead. Separate system prompts, separate sessions, separate model instances.

What "SLO compliance ≥ 95%" actually means in code

This is the load-bearing number. "95% SLO compliance" here isn't a request rate — it's the Critic's predicted probability that the service's hourly SLO budget stays intact *if this remediation fires*. Implementation-wise it's three pieces, combined in the sketch just after this list:

  1. Baseline: 28-day rolling per-service SLO (availability + latency) pulled from our metrics store.
  2. Blast radius model: a small table mapping action types (scale, restart, rollback) to historical SLO impact percentiles. Rollback on a stateful service has a different distribution than scale up on a stateless one.
  3. Prediction: the Critic's own estimate, explicitly asked for as P(SLO preserved | action). We prompt-engineer this into a probability, not a binary.
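
To make those three pieces concrete, here's a minimal sketch of how they fold into the single number the execution gate compares against 0.95. Every name and value in it (SLOBaseline, BLAST_RADIUS_P95, the budget fractions) is illustrative, not our production schema.

from dataclasses import dataclass

@dataclass
class SLOBaseline:
    remaining_error_budget: float  # fraction of the hourly error budget still unspent

# (2) Blast radius model: historical p95 fraction of an hourly SLO budget
#     consumed by each action type, split by service statefulness.
BLAST_RADIUS_P95 = {
    ("scale_up", "stateless"): 0.002,
    ("restart", "stateless"): 0.010,
    ("rollback", "stateful"): 0.080,
}

def predicted_slo_compliance(action_type: str, statefulness: str,
                             baseline: SLOBaseline,
                             critic_p_preserved: float) -> float:
    """Combine baseline (1), blast radius (2), and the Critic's prediction (3)."""
    impact = BLAST_RADIUS_P95.get((action_type, statefulness), 1.0)  # unknown action: worst case
    # Does the historical-impact model think the budget survives this action?
    historical_ok = 1.0 if impact <= baseline.remaining_error_budget else 0.0
    # Gate on the more pessimistic of the two signals.
    return min(historical_ok, critic_p_preserved)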

We pass the Critic's prediction through isotonic calibration against a held-out set of past decisions. Raw LLM probabilities are miscalibrated (the model overestimates low-probability events by roughly 2x). Calibration fixed it.
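
The calibration step itself is a few lines with scikit-learn's IsotonicRegression. The data-loading helper below is a placeholder for wherever past decisions live, but the fit/predict shape is the whole trick:

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Held-out past decisions: the Critic's raw probability and whether the SLO
# actually survived. load_held_out_decisions() is a placeholder.
raw_p, slo_survived = load_held_out_decisions()  # 1-D float array, 0/1 array

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_p, slo_survived)

def calibrated(p_raw: float) -> float:
    """Applied to the Critic's prediction before the >= 0.95 gate comparison."""
    return float(calibrator.predict(np.asarray([p_raw]))[0])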

The five guardrails that keep us honest

1. Blast-radius caps per action type

The Critic can approve whatever it wants — execution still won't touch anything outside a hard allowlist. kubectl scale yes. kubectl delete no. helm rollback yes, but only to the previous release, never further. The LLM doesn't get to widen the allowlist.
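
A sketch of that check as the executor sees it. The entries and field names are examples; the real list lives in the executor's own config, nothing either agent can write to.

# Hard allowlist; neither agent can edit it.
ALLOWED_ACTIONS = {
    "kubectl_scale": {"max_replica_delta": 5},
    "helm_rollback": {"max_revisions_back": 1},  # previous release only, never further
    # "kubectl_delete" is deliberately absent.
}

def within_blast_radius(action) -> bool:
    caps = ALLOWED_ACTIONS.get(action.type)
    if caps is None:
        return False  # not allowlisted: never executes, regardless of the Critic
    if action.type == "kubectl_scale":
        return abs(action.replica_delta) <= caps["max_replica_delta"]
    if action.type == "helm_rollback":
        return action.revisions_back <= caps["max_revisions_back"]
    return False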

2. Circuit breaker on consecutive approvals

If the Critic approves 3 actions within a 15-minute window for the same service, we trip open. Symptom of runaway remediation: the Actor proposes scale → doesn't fix → proposes scale more → doesn't fix → proposes scale ridiculous. Circuit breaker catches that pattern before an auto-scaler does.
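
A minimal version of that breaker; the threshold and window match the numbers above, and the class and method names are ours for illustration. The executor checks is_open() before firing and calls record_approval() after each Critic approval; a tripped breaker escalates to a human just like a REJECT.

import time
from collections import defaultdict, deque

APPROVAL_LIMIT = 3        # three approvals for one service...
WINDOW_SECONDS = 15 * 60  # ...inside 15 minutes opens the breaker

class ApprovalCircuitBreaker:
    def __init__(self):
        self._approvals = defaultdict(deque)  # service -> recent approval timestamps

    def record_approval(self, service: str, now: float | None = None) -> None:
        self._approvals[service].append(time.time() if now is None else now)

    def is_open(self, service: str, now: float | None = None) -> bool:
        """True once the service has hit the approval limit inside the window."""
        now = time.time() if now is None else now
        window = self._approvals[service]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()  # drop approvals that aged out of the window
        return len(window) >= APPROVAL_LIMIT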

3. Human override at any time

Both agents emit a Slack thread per incident with their reasoning and confidence. Clicking 👎 on either message halts execution. Engineers use this maybe once a week — rare, but the fact that they *can* matters more than the rate.
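
The Slack wiring is elided here; mechanically the override is just a halt flag that the reaction handler flips and the executor checks right before (and during) a remediation. Function names are illustrative.

import threading

_halted: set[str] = set()          # incident ids a human has vetoed
_halted_lock = threading.Lock()

def on_thumbs_down(incident_id: str) -> None:
    """Called by the Slack bot when either agent's message gets a 👎."""
    with _halted_lock:
        _halted.add(incident_id)

def execution_allowed(incident_id: str) -> bool:
    """Checked immediately before the remediation fires, and between steps."""
    with _halted_lock:
        return incident_id not in _halted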

4. Separate model instances

Actor is Gemini 1.5 Pro (slower, more reasoning depth). Critic is Gemini 1.5 Flash (faster, cheaper, but surprisingly good at the disagreement task). Same-model Actor/Critic pairs share too much prior. Different models = genuine independent reads.
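
For reference, the two-instance setup on Vertex AI is just two GenerativeModel objects with different model ids and system prompts. The project id and temperature below are placeholders, and the model ids may have newer revisions by the time you read this.

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder project

# Different models, different system prompts, no shared chat session.
actor_model = GenerativeModel("gemini-1.5-pro", system_instruction=ACTOR_PROMPT)
critic_model = GenerativeModel("gemini-1.5-flash", system_instruction=CRITIC_PROMPT)

def actor_propose(incident_text: str) -> str:
    # One-shot generate_content call per incident; no history carried over
    # between incidents (see guardrail 5).
    response = actor_model.generate_content(
        incident_text,
        generation_config=GenerationConfig(temperature=0.2),
    )
    return response.text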

5. No memory between incidents

Each incident is a fresh session. We considered giving the Actor long-term memory of past incidents (the original Reflexion paper does). In production it caused regression — the Actor over-fit to one type of failure and mis-diagnosed a new one. Short-term memory only.

The 77% number — where it came from

90 days of observation across 142 incidents in a mid-sized SaaS production environment (monitored services: 24, traffic tier: ~2M req/hr). Broken down:

  • Baseline MTTR (before): 23 minutes median, measured from alert-fire to SLO-restored.
  • With Reflexion Engine: 5.3 minutes median. Distribution shifted left, tail got fatter (a few incidents the Engine couldn't handle took longer because of escalation delay).
  • Reduction: 77% median, 61% p90.
  • Auto-remediation rate: 68% of incidents. The other 32% escalated to humans, correctly — most were novel failure modes the Actor hadn't seen before.
💡 Not a silver bullet
MTTR dropped 77% on incidents where the auto-remediation fired. On incidents the Engine couldn't handle, MTTR got slightly worse because of the 30-second evaluation delay before escalating. Net across all 142: still a clear win, but it's honest to split the numbers.

What we'd do differently

  • Ship the Critic first. We built Actor-only for two months, then added Critic. Should've reversed — the Critic has higher value-per-token and doesn't need as much prompt iteration.
  • Budget the prompt. Our early Actor prompt was 4,800 tokens. Refactored to 1,200 with no quality loss. Long prompts aren't more thorough; they're more expensive.
  • Calibrate probabilities early. We ran on raw LLM probabilities for 6 weeks. Isotonic calibration against held-out decisions took two days to implement and moved the false-approve rate from 4.2% to 0.7%.

Takeaways

  1. Two independent agents beat one agent + self-critique. The value comes from unshared context, not from adding a 'review' step.
  2. Different model instances per role — Actor and Critic shouldn't be the same model with different prompts.
  3. Probabilities from LLMs are miscalibrated; run them through isotonic calibration against your own held-out data before trusting them as gates.
  4. The guardrails are the real product. Actor/Critic is how you generate the decision; blast-radius caps + circuit breaker + human override is how you make it safe.
💡 Note on numbers
The 77% MTTR reduction, 34% disagreement rate, 68% auto-remediation rate, and 142-incident figures in this post are representative scenarios based on the Reflexion paper (Shinn et al., arXiv:2303.11366) and published enterprise SRE studies, not direct measurements from Warble production telemetry. The architectural patterns, guardrails, and calibration approach are what we'd ship (and are implementing); specific performance numbers will be published separately once we have our own 90-day window to cite.