GPU Cost Baseline: What to Measure, What Lies
December 29, 2025 · 4 min read
You cannot optimize what you cannot explain. Most GPU cost conversations start with "we need to reduce spend" but skip the harder question: what does spend actually look like, and what is driving it?
Here's how to build a GPU cost baseline that's useful for decision-making, not just budgeting.
Why GPU cost visibility matters
GPU compute is expensive and lumpy. Unlike CPU workloads where utilization roughly tracks cost, GPU billing has quirks:
- Idle GPUs still cost money (reserved instances, always-on inference endpoints)
- Burst capacity can 10x your bill overnight
- Spot/preemptible pricing creates variance that is hard to predict
- Multi-tenant scheduling hides true per-workload cost
Without a baseline, you are guessing. And when leadership asks "why did GPU spend spike 40% this month?", guessing is not a good answer.
The GPU cost visibility stack
Cost visibility requires data from multiple layers: the cloud bill (what you pay), orchestrator allocation (what you reserve), GPU telemetry (what you actually use), and application metrics (what that usage delivers).
You need all four layers to answer: "What does this workload actually cost, and is it worth it?"
What the cloud console shows (and hides)
Cloud cost dashboards show aggregate spend by resource type, region, and time. They do not show:
- Utilization vs allocation: You might pay for 8 GPUs but only use 3 at any given time
- Cost per request: The unit economics of your inference workload
- Idle cost: What you pay for GPUs that are provisioned but not serving traffic
- Retry/waste cost: Failed requests that consumed GPU time but produced no value
The console tells you what you spent. It does not tell you whether the spend was efficient.
Example: The hidden idle cost
Consider a team running 4x A100 GPUs (80GB) on AWS:
| Metric | Value |
|---|---|
| On-demand cost | $32.77/hr per GPU |
| Monthly cost (4 GPUs, 24/7) | ~$94,500 |
| Average utilization | 35% |
| Effective cost per GPU-hour used | $93.63 (not $32.77) |
The cloud console shows $94,500/month. It does not show that 65% of that spend is wasted on idle capacity.
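The arithmetic behind that table is simple enough to script. A small shell sketch, using the example figures above (not real quotes for your account):
# Effective cost per used GPU-hour = on-demand rate / utilization
awk 'BEGIN {
  rate = 32.77; gpus = 4; hours = 720; util = 0.35
  monthly = rate * gpus * hours
  printf "monthly spend        : $%.0f\n", monthly
  printf "effective $/GPU-hour : $%.2f\n", rate / util
  printf "idle spend           : $%.0f\n", monthly * (1 - util)
}'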
Utilization vs allocation vs burst
Three metrics that matter:
- Allocation: How many GPUs are provisioned (this is what you pay for)
- Utilization: What percentage of GPU compute is actually used
- Burst: Peak demand that exceeds steady-state allocation
Checking allocation in Kubernetes
# See GPU allocation across nodes
kubectl get nodes -l accelerator=nvidia \
-o custom-columns=\
NAME:.metadata.name,\
ALLOCATABLE:.status.allocatable."nvidia\.com/gpu",\
CAPACITY:.status.capacity."nvidia\.com/gpu"
# See GPU requests by namespace
kubectl get pods -A -o json | jq -r '
.items[] |
select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) |
[.metadata.namespace, .metadata.name,
(.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
@tsv' | sort | column -t
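To turn that into a single cluster-wide number, you can sum GPU requests against allocatable capacity. A rough sketch, assuming jq is installed and GPU limits are set per container:
# Cluster-wide GPU allocation: requested vs allocatable
requested=$(kubectl get pods -A -o json | jq '
  [.items[].spec.containers[].resources.limits."nvidia.com/gpu" // "0" | tonumber] | add')
allocatable=$(kubectl get nodes -o json | jq '
  [.items[].status.allocatable."nvidia.com/gpu" // "0" | tonumber] | add')
echo "GPUs requested: $requested of $allocatable allocatable"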
Measuring utilization with DCGM
For NVIDIA GPUs, use dcgm-exporter to get real utilization:
# Install dcgm-exporter (Helm)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--set serviceMonitor.enabled=true
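A quick check that metrics are actually flowing before you start writing queries (the service name assumes the Helm release above; adjust if yours differs):
# Port-forward the exporter and look for utilization metrics
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL | head -5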
Key Prometheus queries:
# Average GPU utilization across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL)
# GPU utilization by pod (dcgm-exporter attaches pod/namespace labels
# when Kubernetes mapping is enabled; the label may appear as "pod" or
# "exported_pod" depending on your scrape config)
avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
# Memory utilization (important for OOM prevention)
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100
# Per-GPU minutes below 5% utilization in the last hour (a proxy for idle time)
count_over_time((DCGM_FI_DEV_GPU_UTIL < 5)[1h:1m])
A healthy baseline shows:
- Utilization in the 60-70% range or higher on average
- Burst headroom planned, not reactive
- Allocation right-sized to actual demand patterns
If utilization is consistently below 40%, you are over-provisioned. If you hit burst limits frequently, you are under-provisioned or have queuing issues.
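One way to quantify burst headroom is to compare peak and average utilization over a representative window. A sketch against the Prometheus HTTP API, assuming it is reachable at $PROM_URL:
# Peak vs average cluster GPU utilization over the last 7 days
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=max_over_time(avg(DCGM_FI_DEV_GPU_UTIL)[7d:5m])' \
  | jq -r '.data.result[0].value[1]'
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=avg_over_time(avg(DCGM_FI_DEV_GPU_UTIL)[7d:5m])' \
  | jq -r '.data.result[0].value[1]'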
Building a cost model: spot vs reserved vs on-prem
Different procurement strategies have different cost profiles:
| Strategy | Best for | Watch out for | Typical savings vs on-demand |
|---|---|---|---|
| On-demand | Unpredictable workloads, experimentation | Highest per-hour cost | Baseline (0%) |
| Reserved (1yr) | Steady-state production | Commitment risk if demand drops | 30-40% |
| Reserved (3yr) | Long-term stable workloads | Locked in for 3 years | 50-60% |
| Spot/Preemptible | Batch jobs, training, fault-tolerant | Interruptions, availability | 60-90% |
| On-prem | High utilization (>60%), long-term | Upfront capex, ops overhead | 50-80% (over 3yr) |
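To make the table concrete, here is an illustrative monthly comparison using the earlier example cluster (4 GPUs at $32.77/hr on-demand, 720 hours) and the midpoints of the discount ranges above; real quotes and spot availability will differ:
awk 'BEGIN {
  base = 4 * 32.77 * 720                   # on-demand baseline from the example above
  printf "on-demand       : $%.0f/month\n", base
  printf "reserved (1yr)  : $%.0f/month\n", base * (1 - 0.35)   # midpoint of 30-40%
  printf "reserved (3yr)  : $%.0f/month\n", base * (1 - 0.55)   # midpoint of 50-60%
  printf "spot (if stable): $%.0f/month\n", base * (1 - 0.75)   # midpoint of 60-90%
}'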
Cost model template
Your cost model should answer these questions with explicit numbers:
# gpu-cost-model.yaml
workload: inference-api
period: monthly
allocation:
gpu_type: a100-80gb
count: 4
hours_per_month: 720
procurement: reserved-1yr
unit_cost_per_hour: 22.93 # 30% discount from on-demand
utilization:
average_percent: 65
peak_percent: 95
idle_hours_per_month: 180 # ~25% of time
traffic:
requests_per_month: 12000000
avg_tokens_per_request: 150
p99_latency_ms: 450
unit_economics:
cost_per_request: 0.0055 # $66,000 / 12M requests
cost_per_1k_tokens: 0.037
scaling_assumptions:
at_2x_load: 'need 6 GPUs, cost +50%'
at_5x_load: 'need 12 GPUs, consider on-prem'
This is not about perfect accuracy. It is about making assumptions explicit so you can debug them.
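A quick way to sanity-check the unit economics in that template, using the same numbers:
# Derive cost per request and per 1k tokens from the model's inputs
awk 'BEGIN {
  monthly  = 4 * 22.93 * 720               # GPUs * $/hr * hours
  requests = 12000000
  tokens   = requests * 150
  printf "monthly cost      : $%.0f\n", monthly
  printf "cost per request  : $%.4f\n", monthly / requests
  printf "cost per 1k tokens: $%.3f\n", monthly / (tokens / 1000)
}'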
Benchmark: what "good" looks like
Industry benchmarks for GPU cost efficiency:
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Avg utilization | <30% | 30-50% | 50-70% | >70% |
| Idle time | >30% | 15-30% | 5-15% | <5% |
| Cost per request tracking | None | Monthly | Weekly | Real-time |
| Burst headroom | None | 50%+ buffer | 20-30% planned | Auto-scaling |
If you are significantly below "acceptable", there is likely optimization opportunity. If you are at "excellent" on utilization but having reliability issues, you may be running too lean.
Common cost optimization levers
Once you have a baseline, these are the typical optimization paths:
1. Right-size allocation
# Find pods that request GPUs but barely use them
# (requires DCGM metrics with pod labels; see the queries above)
kubectl get pods -l app=inference -o json | jq -r '
  .items[] |
  [.metadata.name,
   (.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
  @tsv'
# Then compare against per-pod usage in Prometheus, e.g.:
#   avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
#   max by (pod) (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100)
2. Shift to spot/preemptible for batch workloads
Training jobs and batch inference can often tolerate interruptions:
# Kubernetes spot/preemptible node pool (Karpenter v1alpha5 Provisioner)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot  # example name
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ['spot']
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ['p4d.24xlarge', 'p3.16xlarge']
3. Implement request-based autoscaling
Scale down during low-traffic periods:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api  # example name
spec:
  scaleTargetRef:      # the deployment serving inference traffic
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: '10'
4. Consider on-prem for steady-state workloads
If utilization is consistently high and workloads are stable, the math often favors on-prem. See Hybrid/On-Prem GPU: The Boring GitOps Path for implementation details.
Getting to a baseline
Start with these questions:
- What did GPU-related compute cost in the last 30/60/90 days?
- What is the cost per request or cost per inference call?
- What percentage of GPU hours were idle?
- What is your current procurement mix (on-demand, reserved, spot)?
From there, build a simple model with explicit assumptions. The goal is not perfect accuracy; it is a debuggable baseline you can iterate on.
Quick baseline checklist
- GPU allocation visible per namespace/workload
- Utilization metrics collected (DCGM or equivalent)
- Cost per request calculable (even if rough)
- Idle time quantified
- Procurement strategy documented
- Scaling assumptions explicit
Next steps: If you want help building a cost baseline and identifying optimization opportunities, the AI Infra Readiness Audit includes a full GPU cost model as a core deliverable.
Related reading:
- SLOs for Inference - Reliability metrics that complement cost metrics
- GPU Failure Modes - When failures drive up cost
- AI Infra Readiness Audit checklist - The full diagnostic framework