Agentic SRESeries B SaaS · 80-node GKE cluster

MTTR cut from 4.2 hrs to 58 min

Deployed Reflexion Engine on Vertex AI Agent Engine. Actor/Critic loops auto-remediated 63% of known incident patterns before on-call was paged. Human-in-the-loop gate engaged on 4 blast-radius events — zero false positives.

58 min

4.2 hrs

MTTR

63%

Auto-remediated

< 10 sec

45 min

First response

Payback · 2 months

On-call hour reduction alone covered Reflexion seat costs in week 6.

The team was running a single-region 80-node GKE cluster with a four-engineer on-call rotation. Pages averaged 11 a week, and the median page-to-resolution time was 4.2 hours, dominated by hypothesis-formation latency at 2 a.m.

We deployed Reflexion Engine in front of their existing PagerDuty rotation. Starling MCP gave the Engine read-only access to cluster state; Brain stored 90 days of historical incidents seeded from their post-mortem repo.

Within six weeks, 63% of paging events were auto-remediated by Actor/Critic loops before the human on-call was woken. The remaining 37% reached on-call with a draft fix, a confidence score, and a one-click apply. Median MTTR for the auto-remediated set fell to 58 minutes, dominated by the SLO-recovery watch period.

Across the pilot, the Critic rejected 14 Actor proposals, two of which would have caused secondary incidents — captured before they applied. Zero false-positive auto-actions reached production.

We woke up to a Slack message that an incident had been resolved. That had never happened before.
— Staff SRE, Series B SaaS

Could this be your team?

Every engagement starts with a 30-minute scoping call. We'll walk through the data shape, blast-radius constraints, and which tier of advisory or product fits.

Book the call See more case studies

$50K/mo cloud waste eliminated