MTTR cut from 4.2 hrs to 58 min
Deployed Reflexion Engine on Vertex AI Agent Engine. Actor/Critic loops auto-remediated 63% of known incident patterns before on-call was paged. Human-in-the-loop gate engaged on 4 blast-radius events — zero false positives.
On-call hour reduction alone covered Reflexion seat costs in week 6.
The team was running a single-region 80-node GKE cluster with a four-engineer on-call rotation. Pages averaged 11 a week, and the median page-to-resolution time was 4.2 hours, dominated by hypothesis-formation latency at 2 a.m.
We deployed Reflexion Engine in front of their existing PagerDuty rotation. Starling MCP gave the Engine read-only access to cluster state; Brain stored 90 days of historical incidents seeded from their post-mortem repo.
Within six weeks, 63% of paging events were auto-remediated by Actor/Critic loops before the human on-call was woken. The remaining 37% reached on-call with a draft fix, a confidence score, and a one-click apply. Median MTTR for the auto-remediated set fell to 58 minutes, dominated by the SLO-recovery watch period.
Across the pilot, the Critic rejected 14 Actor proposals, two of which would have caused secondary incidents — captured before they applied. Zero false-positive auto-actions reached production.
We woke up to a Slack message that an incident had been resolved. That had never happened before.
Could this be your team?
Every engagement starts with a 30-minute scoping call. We'll walk through the data shape, blast-radius constraints, and which tier of advisory or product fits.