The $847K GPU Waste Problem — and the Math to Fix It
Most AI teams over-provision inference nodes by 40–60%. We built a mathematical rightsizing model that only executes VM resizes when projected SLO compliance stays ≥95%. Here's the formula.
Here's the quiet scandal in every AI-heavy org's cloud bill: you're paying for GPUs that run at 18% utilization on average, spike to 60% once a day, and idle at near-zero for eight hours a night. Industry telemetry (Flexera 2025 State of Cloud, NVIDIA reports across enterprise accounts) puts average inference GPU utilization between 15% and 30%. The mean for a mid-sized AI platform team: $847K/year of nodes doing nothing.
The naive fix — 'rightsize to average utilization' — breaks your SLO. This post is the math that doesn't.
Why naive rightsizing is a trap
If your A100 averages 20% utilization, the tempting move is to replace it with a smaller GPU (L4, T4) whose capacity matches the average. This is wrong for the same reason that sizing a load balancer to average QPS is wrong: your traffic isn't uniform. The peak decides whether you stay within SLO.
Concrete example from an inference workload we helped rightsize: average utilization 22%, p50 20%, p99 81%. The naive model wanted a node 4× smaller. Actually executing that resize would have pushed p99 utilization to ~324% — queue explosion, latency SLO miss, alert storm. The cost savings (~$12K/month) would've bought you an incident, not margin.
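The arithmetic behind that projection is worth making explicit. A quick sketch using the numbers above (utilization as 0..1 fractions; the 4× factor is the FLOPs ratio between the current node and the naive candidate):

```python
# Projecting tail utilization onto a node ~4x smaller.
current_p99 = 0.81          # p99 utilization on the current node (0..1)
current_avg = 0.22          # average utilization on the current node
downsize_factor = 4.0       # candidate has ~1/4 the FLOPs

# Sizing to the average looks fine...
print(f"projected avg: {current_avg * downsize_factor:.0%}")   # 88%

# ...but the tail is what decides SLO compliance:
print(f"projected p99: {current_p99 * downsize_factor:.0%}")   # 324%
```

The average-based projection lands at a comfortable-looking 88%; the tail lands at 324% of capacity, which is the queue explosion described above.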
The formula we actually run
For each candidate resize, compute expected value: projected monthly savings times the probability that SLO compliance stays above threshold. Only execute when that product is positive *and* the SLO probability is high enough on its own.
EV(resize) = ΔCost × P(SLO_compliance ≥ SLO_target | resize)
where:
ΔCost = (current_instance_cost - candidate_instance_cost) [$/month]
P(SLO_compliance) = fraction of utilization samples u_i with u_i × capacity_ratio ≤ 95%
SLO_target = 99.5% (your availability SLO)
capacity_ratio = (current_GPU_FLOPs / candidate_GPU_FLOPs)
Execution gate (both must hold):
EV > $1,000 / month ← savings floor, to skip noise
P(SLO_compliance) ≥ 0.95 ← safety floor

Estimating P(SLO compliance) from 28 days of data
The probability isn't guessed — it's measured. For each candidate instance type:
- Pull 28 days of 1-minute GPU utilization samples from the current node (Cloud Monitoring / nvml exporter).
- For each sample u_i, compute the projected utilization on the candidate: u_i × (current_FLOPs / candidate_FLOPs).
- Count the fraction of projected samples that stay under 95% capacity. That fraction is your P(SLO).
```python
def p_slo_compliance(samples: list[float], capacity_ratio: float,
                     threshold: float = 0.95) -> float:
    """
    samples         current-node GPU utilization, each 0..1
    capacity_ratio  current_FLOPs / candidate_FLOPs (>1 means smaller target)
    threshold       what counts as "safe" utilization on the target (0.95 = 95%)
    """
    if not samples:
        raise ValueError("need at least one utilization sample")
    projected = [u * capacity_ratio for u in samples]
    safe = sum(1 for u in projected if u <= threshold)
    return safe / len(samples)
```

Worked example (real numbers)
Node: A100 40GB, $3.06/hour = $2,204/month. Workload: transformer inference, average 22% util, p99 81%.
Candidate 1 — L4 (24GB, ~30% A100 FLOPs):
- ΔCost = $2,204 − $637 = $1,567 / month
- capacity_ratio = 1 / 0.30 = 3.33× (samples multiply by 3.33)
- P(SLO compliance) = 0.61 (39% of samples project above 95% capacity)
- EV = $1,567 × 0.61 = $956. Fails gate (SLO prob < 0.95). Don't resize.
Candidate 2 — A10 (24GB, ~65% A100 FLOPs):
- ΔCost = $2,204 − $1,102 = $1,102 / month
- capacity_ratio = 1 / 0.65 = 1.54×
- P(SLO compliance) = 0.97 (3% of samples project above 95%)
- EV = $1,102 × 0.97 = $1,069. Passes both gates. Resize.
The right answer was there in the data. Without the model, you'd either over-save (pick L4, blow SLO) or under-save (keep A100, burn $1,102/month you didn't need to).
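The full two-gate decision can be sketched end to end. This is a minimal sketch, not the production system: the floor values are the illustrative ones from this post, and the dollar figures are the worked example's:

```python
def should_resize(current_cost: float, candidate_cost: float,
                  p_slo: float,
                  ev_floor: float = 1000.0,   # monthly savings floor (illustrative)
                  p_floor: float = 0.95       # SLO-probability safety floor
                  ) -> tuple[bool, float]:
    """Return (execute?, EV) for a candidate resize.

    EV = delta_cost * p_slo; both the EV floor and the SLO-probability
    floor must hold independently before we execute.
    """
    delta_cost = current_cost - candidate_cost   # $/month
    ev = delta_cost * p_slo
    return (ev > ev_floor and p_slo >= p_floor), ev

# A100 -> A10 (the worked example): passes both gates.
ok, ev = should_resize(current_cost=2204, candidate_cost=1102, p_slo=0.97)
print(ok, round(ev))   # True 1069

# A100 -> L4: bigger raw savings, but fails the SLO gate.
ok, ev = should_resize(current_cost=2204, candidate_cost=637, p_slo=0.61)
print(ok, round(ev))   # False 956
```

Note that the L4 candidate's EV ($956) is respectable on its own; the SLO gate is what kills it, which is exactly why the two floors are checked independently.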
What we learned running this for 9 months
- 28 days is the right window. Shorter misses weekly patterns. Longer over-weights stale traffic from before the last product change.
- A resize during a diurnal trough feels safe, but the new node isn't actually tested until the next peak. Schedule execution for off-peak hours, then watch the first post-resize peak closely.
- Recheck after 72 hours. Some workloads have weekly peaks you didn't see. Auto-rollback if p99 utilization post-resize > 90% for more than 15 minutes.
- The formula works for CPU too. Same math applied to general-purpose VMs cut non-GPU compute by 31% with identical SLO preservation.
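The auto-rollback rule from the third point can be sketched as a simple streak check over per-minute p99 samples. This is a hedged sketch under stated assumptions: the sample list would come from your monitoring stack (Cloud Monitoring, Prometheus via the nvml exporter, etc.), and the threshold and window are the post's own numbers:

```python
ROLLBACK_THRESHOLD = 0.90   # post-resize p99 utilization that triggers rollback
SUSTAIN_MINUTES = 15        # breach must persist longer than this

def should_rollback(p99_per_minute: list[float]) -> bool:
    """True if p99 utilization exceeded ROLLBACK_THRESHOLD for more than
    SUSTAIN_MINUTES consecutive minutes. Samples are 0..1 fractions,
    one per minute, oldest first."""
    streak = 0
    for p99 in p99_per_minute:
        streak = streak + 1 if p99 > ROLLBACK_THRESHOLD else 0
        if streak > SUSTAIN_MINUTES:
            return True
    return False
```

A brief spike resets the streak, so a single hot minute doesn't trigger a rollback; only a sustained breach does.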
Takeaways
- Never rightsize on mean utilization. Always condition on p99 of the past 28 days.
- EV = savings × P(SLO). Both terms have to clear independent floors; a big savings with a mediocre SLO probability is still a bad trade.
- Compute P(SLO) from your own data, not vendor benchmarks. Your tail is specific to your workload.
- Schedule resizes for off-peak, auto-rollback if utilization climbs, recheck in 72 hours.