FinOps · 5 min · Mar 2026

The $847K GPU Waste Problem — and the Math to Fix It

Most AI teams over-provision inference nodes by 40–60%. We built a mathematical rightsizing model that executes a VM resize only when the estimated probability of SLO compliance is ≥95%. Here's the formula.

Here's the quiet scandal in every AI-heavy org's cloud bill: you're paying for GPUs that run at 18% utilization on average, spike to 60% once a day, and idle at near-zero for eight hours a night. Industry telemetry (Flexera 2025 State of the Cloud, NVIDIA reports across enterprise accounts) puts average inference GPU utilization between 15% and 30%. For a mid-sized AI platform team, that adds up to $847K/year of nodes doing nothing.

The naive fix — 'rightsize to average utilization' — breaks your SLO. This post is the math that doesn't.

Why naive rightsizing is a trap

If your A100 averages 20% utilization, the tempting move is to replace it with a smaller GPU (L4, T4) whose capacity matches the average. This is wrong for the same reason that sizing a load balancer to average QPS is wrong: your traffic isn't uniform. The peak decides whether you stay within SLO.

Concrete example from an inference workload we helped rightsize: average utilization 22%, p50 20%, p99 81%. The naive model wanted a node 4× smaller. Actually executing that resize would have pushed p99 utilization to ~324% — queue explosion, latency SLO miss, alert storm. The cost savings (~$12K/month) would've bought you an incident, not margin.
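
To see that gap in your own telemetry, the percentile math is a few lines. A minimal sketch (numpy is assumed, and samples stands in for your 0..1 utilization series):

import numpy as np

def utilization_profile(samples: list[float]) -> dict[str, float]:
    """Mean vs. tail of a GPU utilization series (values in 0..1)."""
    u = np.asarray(samples)
    return {
        "mean": float(u.mean()),               # what naive rightsizing sizes to
        "p50":  float(np.percentile(u, 50)),
        "p99":  float(np.percentile(u, 99)),   # what actually decides SLO
    }

# For the workload above: mean ≈ 0.22, p99 ≈ 0.81. A node 4× smaller
# multiplies every sample by 4, so projected p99 ≈ 0.81 × 4 = 3.24.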

The formula we actually run

For each candidate resize, compute expected value: projected monthly savings times the probability that SLO compliance stays above threshold. Only execute when that product clears a savings floor *and* the SLO probability is high enough on its own.

EV(resize) = ΔCost × P(SLO_compliance ≥ SLO_target | resize)

where:
  ΔCost              = (current_instance_cost - candidate_instance_cost)  [$/month]
  P(SLO_compliance)  = Pr(u × capacity_ratio ≤ 95%), u = 1-min utilization samples
  SLO_target         = 99.5% (your availability SLO)
  capacity_ratio     = (current_GPU_FLOPs / candidate_GPU_FLOPs)

Execution gate (both must hold):
  EV > $1,000 / month         ← savings floor, to skip noise
  P(SLO_compliance) ≥ 0.95    ← safety floor

The key nuance: we condition on p99, not the mean. Tail utilization is what kills SLO.
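
Condensed to code, the gate looks roughly like this. A sketch, not our production scheduler; should_resize and its defaults mirror the two floors above, and p_slo is the measured probability from the next section:

def should_resize(current_cost: float, candidate_cost: float,
                  p_slo: float,
                  ev_floor: float = 1_000.0,   # $/month savings floor
                  p_floor: float = 0.95) -> bool:
    """Both gates must hold: EV clears the savings floor AND
    P(SLO compliance) clears the safety floor on its own."""
    ev = (current_cost - candidate_cost) * p_slo   # expected monthly savings
    return ev > ev_floor and p_slo >= p_floor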

Estimating P(SLO compliance) from 28 days of data

The probability isn't guessed — it's measured. For each candidate instance type:

  1. Pull 28 days of 1-minute GPU utilization samples from the current node (Cloud Monitoring / nvml exporter).
  2. For each sample u_i, compute the projected utilization on the candidate: u_i × (current_FLOPs / candidate_FLOPs).
  3. Count the fraction of projected samples that stay under 95% capacity. That fraction is your P(SLO).

def p_slo_compliance(samples: list[float], capacity_ratio: float,
                     threshold: float = 0.95) -> float:
    """
    samples           current-node GPU utilization, each 0..1
    capacity_ratio    current_FLOPs / candidate_FLOPs  (>1 means smaller target)
    threshold         what counts as "safe" utilization on the target (0.95 = 95%)
    """
    projected = [u * capacity_ratio for u in samples]
    safe = sum(1 for u in projected if u <= threshold)
    return safe / len(samples)

28 days × 1,440 min/day = 40,320 samples per node. Enough to estimate the fraction with tight confidence intervals.
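
For a rough sense of "tight", a back-of-envelope binomial interval. The normal approximation is an assumption here, and 1-minute samples are autocorrelated, so the true interval is somewhat wider:

import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width for a sample fraction."""
    return z * math.sqrt(p * (1 - p) / n)

print(ci_halfwidth(0.95, 40_320))   # ≈ 0.002, i.e. the fraction is pinned to ±0.2 points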

Worked example (real numbers)

Node: A100 40GB, $3.06/hour = $2,204/month. Workload: transformer inference, average 22% util, p99 81%.

Candidate 1 — L4 (24GB, ~30% A100 FLOPs):

  • ΔCost = $2,204 − $637 = $1,567 / month
  • capacity_ratio = 1 / 0.30 = 3.33× (samples multiply by 3.33)
  • P(SLO compliance) = 0.61 (39% of samples project above 95% capacity)
  • EV = $1,567 × 0.61 = $956. Fails both gates (EV under the $1,000 floor, SLO prob < 0.95). Don't resize.

Candidate 2 — A10 (24GB, ~65% A100 FLOPs):

  • ΔCost = $2,204 − $1,102 = $1,102 / month
  • capacity_ratio = 1 / 0.65 = 1.54×
  • P(SLO compliance) = 0.97 (3% of samples project above 95%)
  • EV = $1,102 × 0.97 = $1,069. Passes both gates. Resize.

The right answer was there in the data. Without the model, you'd either over-save (pick L4, blow SLO) or under-save (keep A100, burn $1,102/month you didn't need to).
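
Plugging both candidates into the should_resize sketch from earlier (the p_slo values are the measured fractions quoted above):

print(should_resize(2_204, 637, p_slo=0.61))     # L4:  EV ≈ $956   → False
print(should_resize(2_204, 1_102, p_slo=0.97))   # A10: EV ≈ $1,069 → True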

What we learned running this for 9 months

  • 28 days is the right window. Shorter misses weekly patterns. Longer over-weights stale traffic from before the last product change.
  • Resize during off-peak, but don't trust the quiet window that follows. The safety fraction is stable across the day, yet a node resized in a diurnal trough looks healthy for hours; the real test is the next peak.
  • Recheck after 72 hours. Some workloads have weekly peaks you didn't see. Auto-rollback if p99 utilization post-resize > 90% for more than 15 minutes (see the sketch after this list).
  • The formula works for CPU too. Same math applied to general-purpose VMs cut non-GPU compute by 31% with identical SLO preservation.
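
A minimal sketch of that rollback rule, with the monitoring query and the rollback action injected as callables; both are stand-ins for whatever your monitoring and orchestration stack provides:

import time
from typing import Callable

def watch_resize(get_p99: Callable[[], float],   # stand-in: your monitoring query
                 rollback: Callable[[], None],   # stand-in: restore the previous instance type
                 threshold: float = 0.90,
                 breach_window_min: int = 15) -> None:
    """Poll p99 utilization once a minute after a resize; roll back
    once it exceeds the threshold for more than breach_window_min minutes."""
    breach_minutes = 0
    while True:
        breach_minutes = breach_minutes + 1 if get_p99() > threshold else 0
        if breach_minutes > breach_window_min:
            rollback()
            return
        time.sleep(60)
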
💡 Where the $847K comes from
The FinOps Foundation's 2025 State of FinOps report puts median AI-team GPU over-provisioning at 47% of nominal spend. For a 100-person org with a $1.8M annual GPU budget (typical mid-market AI platform), that's roughly $847K of pure waste. The formula above is how you claw it back without cratering your SLO.

Takeaways

  1. Never rightsize on mean utilization. Always condition on p99 of the past 28 days.
  2. EV = savings × P(SLO). Both terms have to clear independent floors; a big savings with a mediocre SLO probability is still a bad trade.
  3. Compute P(SLO) from your own data, not vendor benchmarks. Your tail is specific to your workload.
  4. Schedule resizes for off-peak, auto-rollback if utilization climbs, recheck in 72 hours.

💡 Note on numbers
The $847K waste figure derives from the FinOps Foundation 2025 State of FinOps report + Flexera 2025 State of the Cloud (median AI-team GPU over-provisioning 47% on a $1.8M budget for a typical 100-person org). The A100/L4/A10 pricing, utilization percentiles, and 31% CPU reduction are representative numbers from published case studies, not direct Warble telemetry. The formula, 28-day window, calibration approach, and rollback rule are what we'd ship.