Agentic Ops · 9 min read · Feb 2026

ShrikeOps Deep-Dive: MCP + Kubernetes

Model Context Protocol opens a new primitive for AI-to-infrastructure communication. We explore how MCP bridges Claude to live cluster state — and what it means for the future of Agentic Ops.

Two things happened in the last year that together flipped the Kubernetes-operator playbook on its head. First, Anthropic published the Model Context Protocol — a JSON-RPC primitive for letting an LLM call tools over a duplex channel. Second, Kubernetes client-go finally got stable enough that wrapping it with JSON-RPC became a weekend project, not a quarter of work.

This post is what happened when we put those two together into production. The product is Starling. The lesson is that Model Context Protocol for infrastructure isn't an incremental improvement over ChatOps — it's a genuinely new primitive.

The before: ChatOps with a log aggregator

Here's the 2024 playbook for "SRE with an LLM": install a log aggregator, train the engineer to copy-paste incident context into Claude, pray the LLM's suggestion doesn't require a kubectl command the engineer doesn't have permission to run. Mean time to remediation: two humans in a loop, bounded by typing speed.

The problem is that the LLM is *downstream* of the engineer. Every piece of context has to pass through a human. The LLM can't see what it hasn't been shown. It can't ask follow-up questions about state it doesn't have.

What MCP changes

MCP flips the dataflow. The LLM becomes the one *asking* for context, calling tools directly, iterating until it's done. No human in the transcription loop. For Kubernetes, the tool surface looks like this:

// Starling MCP tool catalog — abridged
{
  "list_pods":        "namespace, label_selector → []Pod",
  "get_pod":          "namespace, name → PodStatus",
  "get_pod_logs":     "namespace, name, container, tail_lines → string",
  "list_events":      "namespace, field_selector → []Event",
  "list_deployments": "namespace → []Deployment",
  "list_hpas":        "namespace → []HPA",
  "cluster_summary":  "namespace? → NodeHealth + PhaseCounts + Warnings",
  "scan_manifest":    "manifest yaml → {score, grade, findings}",
  "scan_cluster":     "kubeconfig yaml → PostureReport"
}
17 tools total. Each one is a thin wrapper over client-go or shrikeops-scanner.

Claude decides which tool to call, when, and how many times. We don't script the remediation — we ship the tools and let the reasoning loop drive.
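
To make "thin wrapper" concrete, here is roughly what the handler behind list_pods might look like. This is a minimal sketch, not Starling's actual source: the tools package, the clientset field, and the method name are assumptions; only the client-go calls and the namespace/label_selector shape from the catalog are real.

package tools

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// Server holds the shared client-go clientset the tools dispatch against
// (hypothetical layout for illustration).
type Server struct {
    clientset kubernetes.Interface
}

// listPods backs the list_pods tool: namespace, label_selector → []Pod,
// exactly the shape the catalog above advertises.
func (s *Server) listPods(ctx context.Context, namespace, labelSelector string) ([]corev1.Pod, error) {
    pods, err := s.clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
        LabelSelector: labelSelector,
    })
    if err != nil {
        return nil, err
    }
    return pods.Items, nil
}
The point of keeping each wrapper this small is that the interesting behavior lives in the reasoning loop, not in the tool.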

Why JSON-RPC over SSH / kubectl exec / webhook

We evaluated four transports before picking MCP. Here's the audit trail:

  • SSH + kubectl — gives the LLM full cluster-admin. No graceful authz boundary. Auditable only via bash history. Rejected.
  • Webhook per-tool — every new capability requires a new endpoint. LLM has to discover them via docs. Scales badly.
  • Raw k8s API — LLM emits raw REST; error surface is structured but the tool surface is enormous (500+ resources × CRUD). Too wide.
  • MCP (JSON-RPC 2.0 + tool manifest) — narrow tool surface we control, discovery baked in (see the descriptor sketch after this list), transport-agnostic (stdio + HTTP). Winner.
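
For a concrete sense of what "discovery baked in" means, this is roughly the shape a server hands back from tools/list. The field names follow the MCP tool descriptor; the Go types themselves are illustrative, not lifted from Starling.

package mcp

import "encoding/json"

// ToolDescriptor is what the LLM sees when it asks the server what it can do.
// The description and input schema are the documentation.
type ToolDescriptor struct {
    Name        string          `json:"name"`        // e.g. "list_pods"
    Description string          `json:"description"` // shown to the model verbatim
    InputSchema json.RawMessage `json:"inputSchema"` // JSON Schema for the tool's arguments
}

// ListToolsResult is the payload of a tools/list response.
type ListToolsResult struct {
    Tools []ToolDescriptor `json:"tools"`
}
Shipping a new tool means adding one descriptor; there is no doc page to update and nothing to retrain.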

The interesting parts of the implementation

One handler, two transports

Claude Desktop speaks stdio. Hosted agents speak HTTP. If you write two handlers, they drift — we learned this in two weeks. The fix: one Handler.Dispatch(ctx, req) called from both.

func (s *Server) ServeStdio() {
    // readLine → Dispatch → writeLine
}
func (s *Server) ServeHTTP(addr string) {
    http.HandleFunc("/mcp", func(w http.ResponseWriter, r *http.Request) {
        var req Request
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            w.WriteHeader(http.StatusBadRequest)
            return
        }
        outcome := s.keys.check(r.Header.Get("Authorization"), costOf(req))
        if !outcome.Allow {
            w.WriteHeader(outcome.Status)
            return
        }
        resp := s.Dispatch(r.Context(), req) // same call as stdio
        json.NewEncoder(w).Encode(resp)
    })
    http.ListenAndServe(addr, nil)
}
Auth + metering is middleware on the HTTP transport. Stdio inherits the kubeconfig's trust boundary — no metering needed.
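
Wiring it up is then just a choice of transport at startup. The main below is a sketch: the stdio flag and the NewServer constructor are invented for illustration.

package main

import "flag"

func main() {
    stdio := flag.Bool("stdio", false, "serve MCP over stdio (Claude Desktop)")
    addr := flag.String("addr", ":8080", "HTTP listen address (hosted agents)")
    flag.Parse()

    s := NewServer() // hypothetical constructor: clientset, key store, tool registry
    if *stdio {
        s.ServeStdio() // trust boundary is the local kubeconfig; no metering
        return
    }
    s.ServeHTTP(*addr) // auth + metering middleware applies only here
}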

Ephemeral kubeconfigs for customer clusters

The killer MCP feature for a SaaS is scan_cluster — user pastes a kubeconfig, we scan their cluster, return a posture report. Never persist the kubeconfig.

func ScanCluster(ctx context.Context, kubeconfigYAML string) (*Report, error) {
    cfg, err := clientcmd.RESTConfigFromKubeConfig([]byte(kubeconfigYAML))
    if err != nil {
        return nil, err
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return nil, err
    }
    defer func() { client = nil }()   // GC-visible; signals "lifecycle matters"
    return runChecks(ctx, client)
}
Grep our repo for 'ioutil.WriteFile' near anything kubeconfig-related and it comes up empty. That's the audit.

Per-tool credit metering

Not every tool call has the same cost to us. list_pods is a single API call. scan_cluster spins up an ephemeral client, runs 30+ posture checks, stays under a 25-second wall-clock budget. So our pricing reflects it:

var costs = map[string]int{
    "list_pods":        1,
    "get_pod_logs":     2,
    "cluster_summary":  2,
    "scan_manifest":    3,
    "scan_cluster":     10,
    "scale_deployment": 5,
}
Credit debit is atomic in a Firestore transaction. Concurrent calls on the same key don't race.
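
For a sense of what that transaction looks like, here is a minimal sketch of the debit. The api_keys collection, the credits field, and the error value are assumptions for illustration, not Starling's actual schema.

package metering

import (
    "context"
    "errors"

    "cloud.google.com/go/firestore"
)

var errInsufficientCredits = errors.New("insufficient credits")

// debit atomically subtracts cost from the key's balance. RunTransaction
// retries the closure on contention, so two concurrent calls against the
// same key serialize instead of double-spending.
func debit(ctx context.Context, fs *firestore.Client, apiKey string, cost int) error {
    ref := fs.Collection("api_keys").Doc(apiKey) // collection and field names are illustrative
    return fs.RunTransaction(ctx, func(ctx context.Context, tx *firestore.Transaction) error {
        snap, err := tx.Get(ref)
        if err != nil {
            return err
        }
        raw, err := snap.DataAt("credits")
        if err != nil {
            return err
        }
        remaining := raw.(int64) - int64(cost) // Firestore integers come back as int64
        if remaining < 0 {
            return errInsufficientCredits
        }
        return tx.Update(ref, []firestore.Update{{Path: "credits", Value: remaining}})
    })
}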

What the LLM does differently when it has MCP

We ran the same incident twice: once through the ChatOps workflow, once through MCP-enabled Claude. The differences:

  • ChatOps: 6 messages, 3 human-typed context dumps, 4:30 minutes total.
  • MCP: 1 human message ("why is ingest-worker CrashLooping?"). Claude then called get_pod, saw OOMKilled, called get_pod_logs, read the stack, called list_events, confirmed the node wasn't under pressure. 0:55 total. Zero human transcription.

The 5x speedup isn't the interesting part. The interesting part is that Claude asked a follow-up question we wouldn't have thought to answer: it checked node memory pressure to distinguish "pod greedy" from "node oversubscribed". A pre-scripted runbook wouldn't have. ChatOps with a human in the loop wouldn't have bothered.

Where MCP has teeth that you don't immediately see

  • Discoverability: the tool manifest is the documentation. New tools are self-describing. No training the LLM on new endpoints.
  • Least privilege per tool: each tool maps to a specific k8s RBAC verb. "Read pods" doesn't get you "delete pods". The LLM can't elevate.
  • Audit log is native: every tool call is a structured JSON-RPC message. Grep for "method":"tools/call","params":{"name":"delete_pod" and you have a complete admin trail.
  • Cost attribution: per-call credits mean a noisy agent has a visible bill. We've found misbehaving agents (polling list_pods every second) via cost anomalies, not alerts.

What MCP doesn't solve (yet)

Honest about the gaps:

  • Approval flows for write operations: the spec has no standard for "LLM wants to delete_pod — prompt human". We do it out-of-band via a secondary channel. Clunky.
  • Streaming tool output: MCP's JSON-RPC is request/response. Tailing logs means polling or chunked responses. Not native.
  • Cross-server context: if the LLM talks to two MCP servers, there's no standard for them sharing state. Each is an island.

💡 Try Starling
Install: curl -L https://github.com/warble-tech/starling/releases/latest/download/starling_0.1.0_linux_amd64.tar.gz | tar xz. Then starling login → point your Claude Desktop at it → ask your cluster something.

Takeaways

  1. MCP is a new primitive, not a ChatOps upgrade. The LLM becomes the orchestrator, not a text endpoint.
  2. Narrow tool surfaces beat broad ones. 17 tools we curated beats giving Claude raw kubectl.
  3. Same handler, two transports. Don't fork stdio + HTTP paths.
  4. Meter per-tool. Costs vary 10× across a reasonable tool set; uniform pricing is either too cheap for heavy tools or too expensive for light ones.

Starling releases: github.com/warble-tech/starling. Install with one curl line. Issues + feature requests welcome in the tracker.
