GPU Cost Baseline: What to Measure, What Lies
December 29, 2025 · 4 min read
You cannot optimize what you cannot explain. Most GPU cost conversations start with "we need to reduce spend" but skip the harder question: what does spend actually look like, and what is driving it?
Here's how to build a GPU cost baseline that's useful for decision-making, not just budgeting.
Why GPU cost visibility matters
GPU compute is expensive and lumpy. Unlike CPU workloads where utilization roughly tracks cost, GPU billing has quirks:
- Idle GPUs still cost money (reserved instances, always-on inference endpoints)
- Burst capacity can 10x your bill overnight
- Spot/preemptible pricing creates variance that is hard to predict
- Multi-tenant scheduling hides true per-workload cost
Without a baseline, you are guessing. And when leadership asks "why did GPU spend spike 40% this month?", guessing is not a good answer.
The GPU cost visibility stack
Cost visibility requires data from multiple layers: the cloud bill (what you pay), orchestrator allocation (what you reserve), GPU telemetry (what you actually use), and application metrics (what that usage delivers).
You need all four layers to answer: "What does this workload actually cost, and is it worth it?"
What the cloud console shows (and hides)
Cloud cost dashboards show aggregate spend by resource type, region, and time. They do not show:
- Utilization vs allocation: You might pay for 8 GPUs but only use 3 at any given time
- Cost per request: The unit economics of your inference workload
- Idle cost: What you pay for GPUs that are provisioned but not serving traffic
- Retry/waste cost: Failed requests that consumed GPU time but produced no value
The console tells you what you spent. It does not tell you whether the spend was efficient.
Example: The hidden idle cost
Consider a team running 4x A100 GPUs (80GB) on AWS:
| Metric | Value |
|---|---|
| On-demand cost | $32.77/hr per GPU |
| Monthly cost (4 GPUs, 24/7) | ~$94,500 |
| Average utilization | 35% |
| Effective cost per GPU-hour used | $93.63 (not $32.77) |
The cloud console shows $94,500/month. It does not show that 65% of that spend is wasted on idle capacity.
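The arithmetic behind that table is simple enough to script. A small shell sketch, using the example figures above (not real quotes for your account):
# Effective cost per used GPU-hour = on-demand rate / utilization
awk 'BEGIN {
  rate = 32.77; gpus = 4; hours = 720; util = 0.35
  monthly = rate * gpus * hours
  printf "monthly spend        : $%.0f\n", monthly
  printf "effective $/GPU-hour : $%.2f\n", rate / util
  printf "idle spend           : $%.0f\n", monthly * (1 - util)
}'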
Utilization vs allocation vs burst
Three metrics that matter:
- Allocation: How many GPUs are provisioned (this is what you pay for)
- Utilization: What percentage of GPU compute is actually used
- Burst: Peak demand that exceeds steady-state allocation
Checking allocation in Kubernetes
# See GPU allocation across nodes
kubectl get nodes -l accelerator=nvidia \
-o custom-columns=\
NAME:.metadata.name,\
ALLOCATABLE:.status.allocatable."nvidia\.com/gpu",\
CAPACITY:.status.capacity."nvidia\.com/gpu"
# See GPU requests by namespace
kubectl get pods -A -o json | jq -r '
.items[] |
select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) |
[.metadata.namespace, .metadata.name,
(.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
@tsv' | sort | column -t
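To turn that into a single cluster-wide number, you can sum GPU requests against allocatable capacity. A rough sketch, assuming jq is installed and GPU limits are set per container:
# Cluster-wide GPU allocation: requested vs allocatable
requested=$(kubectl get pods -A -o json | jq '
  [.items[].spec.containers[].resources.limits."nvidia.com/gpu" // "0" | tonumber] | add')
allocatable=$(kubectl get nodes -o json | jq '
  [.items[].status.allocatable."nvidia.com/gpu" // "0" | tonumber] | add')
echo "GPUs requested: $requested of $allocatable allocatable"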
Measuring utilization with DCGM
For NVIDIA GPUs, use dcgm-exporter to get real utilization:
# Install dcgm-exporter (Helm)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--set serviceMonitor.enabled=true
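A quick check that metrics are actually flowing before you start writing queries (the service name assumes the Helm release above; adjust if yours differs):
# Port-forward the exporter and look for utilization metrics
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL | head -5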
Key Prometheus queries:
# Average GPU utilization across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL)
# GPU utilization by pod (dcgm-exporter attaches pod/namespace labels
# when Kubernetes mapping is enabled; the label may appear as "pod" or
# "exported_pod" depending on your scrape config)
avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
# Memory utilization (important for OOM prevention)
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100
# Per-GPU minutes below 5% utilization in the last hour (a proxy for idle time)
count_over_time((DCGM_FI_DEV_GPU_UTIL < 5)[1h:1m])
A healthy baseline shows:
- Utilization in the 60-70% range or higher on average
- Burst headroom planned, not reactive
- Allocation right-sized to actual demand patterns
If utilization is consistently below 40%, you are over-provisioned. If you hit burst limits frequently, you are under-provisioned or have queuing issues.
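One way to quantify burst headroom is to compare peak and average utilization over a representative window. A sketch against the Prometheus HTTP API, assuming it is reachable at $PROM_URL:
# Peak vs average cluster GPU utilization over the last 7 days
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=max_over_time(avg(DCGM_FI_DEV_GPU_UTIL)[7d:5m])' \
  | jq -r '.data.result[0].value[1]'
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=avg_over_time(avg(DCGM_FI_DEV_GPU_UTIL)[7d:5m])' \
  | jq -r '.data.result[0].value[1]'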
Building a cost model: spot vs reserved vs on-prem
Different procurement strategies have different cost profiles:
| Strategy | Best for | Watch out for | Typical savings vs on-demand |
|---|---|---|---|
| On-demand | Unpredictable workloads, experimentation | Highest per-hour cost | Baseline (0%) |
| Reserved (1yr) | Steady-state production | Commitment risk if demand drops | 30-40% |
| Reserved (3yr) | Long-term stable workloads | Locked in for 3 years | 50-60% |
| Spot/Preemptible | Batch jobs, training, fault-tolerant | Interruptions, availability | 60-90% |
| On-prem | High utilization (>60%), long-term | Upfront capex, ops overhead | 50-80% (over 3yr) |
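To make the table concrete, here is an illustrative monthly comparison using the earlier example cluster (4 GPUs at $32.77/hr on-demand, 720 hours) and the midpoints of the discount ranges above; real quotes and spot availability will differ:
awk 'BEGIN {
  base = 4 * 32.77 * 720                   # on-demand baseline from the example above
  printf "on-demand       : $%.0f/month\n", base
  printf "reserved (1yr)  : $%.0f/month\n", base * (1 - 0.35)   # midpoint of 30-40%
  printf "reserved (3yr)  : $%.0f/month\n", base * (1 - 0.55)   # midpoint of 50-60%
  printf "spot (if stable): $%.0f/month\n", base * (1 - 0.75)   # midpoint of 60-90%
}'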
Cost model template
Your cost model should answer these questions with explicit numbers:
# gpu-cost-model.yaml
workload: inference-api
period: monthly
allocation:
gpu_type: a100-80gb
count: 4
hours_per_month: 720
procurement: reserved-1yr
unit_cost_per_hour: 22.93 # 30% discount from on-demand
utilization:
average_percent: 65
peak_percent: 95
idle_hours_per_month: 180 # ~25% of time
traffic:
requests_per_month: 12000000
avg_tokens_per_request: 150
p99_latency_ms: 450
unit_economics:
cost_per_request: 0.0055 # $66,000 / 12M requests
cost_per_1k_tokens: 0.037
scaling_assumptions:
at_2x_load: 'need 6 GPUs, cost +50%'
at_5x_load: 'need 12 GPUs, consider on-prem'
This is not about perfect accuracy. It is about making assumptions explicit so you can debug them.
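A quick way to sanity-check the unit economics in that template, using the same numbers:
# Derive cost per request and per 1k tokens from the model's inputs
awk 'BEGIN {
  monthly  = 4 * 22.93 * 720               # GPUs * $/hr * hours
  requests = 12000000
  tokens   = requests * 150
  printf "monthly cost      : $%.0f\n", monthly
  printf "cost per request  : $%.4f\n", monthly / requests
  printf "cost per 1k tokens: $%.3f\n", monthly / (tokens / 1000)
}'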
Benchmark: what "good" looks like
Industry benchmarks for GPU cost efficiency:
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Avg utilization | <30% | 30-50% | 50-70% | >70% |
| Idle time | >30% | 15-30% | 5-15% | <5% |
| Cost per request tracking | None | Monthly | Weekly | Real-time |
| Burst headroom | None | 50%+ buffer | 20-30% planned | Auto-scaling |
If you are significantly below "acceptable", there is likely optimization opportunity. If you are at "excellent" on utilization but having reliability issues, you may be running too lean.
Common cost optimization levers
Once you have a baseline, these are the typical optimization paths:
1. Right-size allocation
# Find pods that request GPUs but barely use them
# (requires DCGM metrics with pod labels; see the queries above)
kubectl get pods -l app=inference -o json | jq -r '
  .items[] |
  [.metadata.name,
   (.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
  @tsv'
# Then compare against per-pod usage in Prometheus, e.g.:
#   avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
#   max by (pod) (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)
2. Shift to spot/preemptible for batch workloads
Training jobs and batch inference can often tolerate interruptions:
# Kubernetes spot/preemptible node pool (Karpenter v1alpha5 Provisioner)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot  # example name
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ['spot']
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ['p4d.24xlarge', 'p3.16xlarge']
3. Implement request-based autoscaling
Scale down during low-traffic periods:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api  # example name
spec:
  scaleTargetRef:      # the deployment serving inference traffic
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: '10'
4. Consider on-prem for steady-state workloads
If utilization is consistently high and workloads are stable, the math often favors on-prem. See Hybrid/On-Prem GPU: The Boring GitOps Path for implementation details.
Getting to a baseline
Start with these questions:
- What did GPU-related compute cost in the last 30/60/90 days?
- What is the cost per request or cost per inference call?
- What percentage of GPU hours were idle?
- What is your current procurement mix (on-demand, reserved, spot)?
From there, build a simple model with explicit assumptions. The goal is not perfect accuracy; it is a debuggable baseline you can iterate on.
Quick baseline checklist
- GPU allocation visible per namespace/workload
- Utilization metrics collected (DCGM or equivalent)
- Cost per request calculable (even if rough)
- Idle time quantified
- Procurement strategy documented
- Scaling assumptions explicit
Next steps: If you want help building a cost baseline and identifying optimization opportunities, the AI Infra Readiness Audit includes a full GPU cost model as a core deliverable.
Related reading:
- SLOs for Inference - Reliability metrics that complement cost metrics
- GPU Failure Modes - When failures drive up cost
- AI Infra Readiness Audit checklist - The full diagnostic framework