
GPU Cost Baseline: What to Measure, What Lies

December 29, 2025 · 4 min read

Professional · gpu · finops · cost · mlops · ai-infra-readiness

You cannot optimize what you cannot explain. Most GPU cost conversations start with "we need to reduce spend" but skip the harder question: what does spend actually look like, and what is driving it?

Here's how to build a GPU cost baseline that's useful for decision-making, not just budgeting.

Why GPU cost visibility matters

GPU compute is expensive and lumpy. Unlike CPU workloads where utilization roughly tracks cost, GPU billing has quirks:

  • Idle GPUs still cost money (reserved instances, always-on inference endpoints)
  • Burst capacity can 10x your bill overnight
  • Spot/preemptible pricing creates variance that is hard to predict
  • Multi-tenant scheduling hides true per-workload cost

Without a baseline, you are guessing. And when leadership asks "why did GPU spend spike 40% this month?", guessing is not a good answer.

The GPU cost visibility stack

Cost visibility requires data from multiple layers. Here is how they connect:

Figure 1. Four-layer stack connecting cloud billing, Kubernetes scheduling, GPU utilization metrics, and application signals, with callouts for what each layer misses on its own. Cost visibility is a join problem: each layer has blind spots until you connect them.

You need all four layers to answer: "What does this workload actually cost, and is it worth it?"
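
The join itself is mechanical once the data exists. Here is a minimal sketch, assuming you can export billing and utilization data keyed by workload; the file names and column names are illustrative, not from any particular tool:

# join_cost_and_util.py
import csv

def load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# billing.csv:      workload,gpu_hours_billed,cost_usd
# utilization.csv:  workload,avg_gpu_util_pct
billing = load_csv("billing.csv")
util = {r["workload"]: float(r["avg_gpu_util_pct"]) for r in load_csv("utilization.csv")}

for row in billing:
    workload = row["workload"]
    cost = float(row["cost_usd"])
    rate = cost / float(row["gpu_hours_billed"])   # what you pay per billed GPU-hour
    avg_util = util.get(workload)
    if avg_util is None:
        print(f"{workload}: ${cost:,.0f} billed, no utilization data (blind spot)")
        continue
    effective = rate / (avg_util / 100)            # what you pay per *used* GPU-hour
    print(f"{workload}: ${cost:,.0f} billed at ${rate:.2f}/GPU-hr, "
          f"{avg_util:.0f}% utilized -> ${effective:.2f} per used GPU-hr")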

What the cloud console shows (and hides)

Cloud cost dashboards show aggregate spend by resource type, region, and time. They do not show:

  • Utilization vs allocation: You might pay for 8 GPUs but only use 3 at any given time
  • Cost per request: The unit economics of your inference workload
  • Idle cost: What you pay for GPUs that are provisioned but not serving traffic
  • Retry/waste cost: Failed requests that consumed GPU time but produced no value

The console tells you what you spent. It does not tell you whether the spend was efficient.

Example: The hidden idle cost

Consider a team running 4x A100 GPUs (80GB) on AWS:

| Metric | Value |
| --- | --- |
| On-demand cost | $32.77/hr per GPU |
| Monthly cost (4 GPUs, 24/7) | ~$94,500 |
| Average utilization | 35% |
| Effective cost per GPU-hour used | $93.63 (not $32.77) |

The cloud console shows $94,500/month. It does not show that 65% of that spend is wasted on idle capacity.
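
The arithmetic is worth making explicit. A few lines reproduce the table, assuming 720 billed hours per month:

# idle_cost.py -- reproduce the table above
on_demand_per_gpu_hr = 32.77
gpus, hours, avg_util = 4, 720, 0.35

monthly = on_demand_per_gpu_hr * gpus * hours            # ~$94,400
effective_per_used_hr = on_demand_per_gpu_hr / avg_util  # ~$93.63
idle_spend = monthly * (1 - avg_util)                    # ~$61,300 buys no work

print(f"monthly bill:            ${monthly:,.0f}")
print(f"per *used* GPU-hour:     ${effective_per_used_hr:.2f}")
print(f"spend on idle capacity:  ${idle_spend:,.0f} ({1 - avg_util:.0%} of the bill)")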

Utilization vs allocation vs burst

Three metrics that matter:

  1. Allocation: How many GPUs are provisioned (this is what you pay for)
  2. Utilization: What percentage of GPU compute is actually used
  3. Burst: Peak demand that exceeds steady-state allocation

Checking allocation in Kubernetes

# See GPU allocation across nodes
kubectl get nodes -l accelerator=nvidia \
  -o custom-columns=\
NAME:.metadata.name,\
ALLOCATABLE:.status.allocatable."nvidia\.com/gpu",\
CAPACITY:.status.capacity."nvidia\.com/gpu"

# See GPU requests by namespace
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) |
  [.metadata.namespace, .metadata.name,
   (.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
  @tsv' | sort | column -t

Measuring utilization with DCGM

For NVIDIA GPUs, use dcgm-exporter to get real utilization:

# Install dcgm-exporter (Helm)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true

Key Prometheus queries:

# Average GPU utilization across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL)

# GPU utilization by pod (requires dcgm-exporter's Kubernetes pod mapping;
# the label may surface as pod or exported_pod depending on scrape config)
avg by (exported_pod) (DCGM_FI_DEV_GPU_UTIL)

# Memory utilization (important for OOM prevention)
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100

# Per-GPU count of minutes in the past hour with utilization below 5% (idle proxy)
count_over_time((DCGM_FI_DEV_GPU_UTIL < 5)[1h:1m])

A healthy baseline shows:

  • Utilization above 60-70% on average
  • Burst headroom planned, not reactive
  • Allocation right-sized to actual demand patterns

If utilization is consistently below 40%, you are over-provisioned. If you hit burst limits frequently, you are under-provisioned or have queuing issues.
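
You can check this continuously rather than anecdotally by pulling the numbers straight from Prometheus. Here is a minimal sketch, assuming the DCGM metrics above are being scraped and Prometheus is reachable at the URL shown (both the endpoint and the 40%/70% thresholds are assumptions to adjust):

# baseline_check.py
import requests

PROM = "http://prometheus.monitoring.svc:9090"  # assumption: adjust to your environment

def instant(promql):
    """Run an instant query and return the first value, or None if empty."""
    r = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

# Average utilization over the past week, and share of samples spent below 5% util
avg_util = instant("avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d]))")
idle_pct = instant("avg(avg_over_time((DCGM_FI_DEV_GPU_UTIL < bool 5)[7d:5m])) * 100")

if avg_util is not None:
    if avg_util < 40:
        print(f"avg utilization {avg_util:.0f}%: likely over-provisioned")
    elif avg_util > 70:
        print(f"avg utilization {avg_util:.0f}%: strong, check burst headroom")
    else:
        print(f"avg utilization {avg_util:.0f}%: acceptable, room to tighten")
if idle_pct is not None:
    print(f"idle time (<5% util): {idle_pct:.0f}% of GPU-hours")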

Building a cost model: spot vs reserved vs on-prem

Different procurement strategies have different cost profiles:

| Strategy | Best for | Watch out for | Typical savings vs on-demand |
| --- | --- | --- | --- |
| On-demand | Unpredictable workloads, experimentation | 3-5x cost vs reserved | Baseline (0%) |
| Reserved (1yr) | Steady-state production | Commitment risk if demand drops | 30-40% |
| Reserved (3yr) | Long-term stable workloads | Locked in for 3 years | 50-60% |
| Spot/Preemptible | Batch jobs, training, fault-tolerant | Interruptions, availability | 60-90% |
| On-prem | High utilization (>60%), long-term | Upfront capex, ops overhead | 50-80% (over 3yr) |
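
Before committing, it helps to compute the break-even point explicitly. A back-of-envelope sketch, reusing the example A100 on-demand rate from earlier and the table's ~30% one-year reserved discount (both are assumptions to replace with your own quotes):

# break_even.py -- on-demand vs 1-year reservation, per GPU
on_demand = 32.77           # $/GPU-hour, billed only while running
reserved = on_demand * 0.7  # ~$22.94/GPU-hour, billed 24/7 whether used or not
hours_in_month = 720

# Reserved wins once the GPU actually runs more than this fraction of the month
break_even = reserved / on_demand
print(f"break-even: {break_even:.0%} of hours (~{break_even * hours_in_month:.0f} h/month)")

for active in (0.4, 1.0):
    od = on_demand * hours_in_month * active   # pay only for active hours
    rsv = reserved * hours_in_month            # pay for the whole month
    print(f"active {active:.0%}: on-demand ${od:,.0f}/mo vs reserved ${rsv:,.0f}/mo")

A batch workload that runs 40% of the time is cheaper on-demand (or on spot); an always-on inference endpoint should be reserved or, at sustained high utilization, moved on-prem.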

Cost model template

Your cost model should answer these questions with explicit numbers:

# gpu-cost-model.yaml
workload: inference-api
period: monthly

allocation:
  gpu_type: a100-80gb
  count: 4
  hours_per_month: 720
  procurement: reserved-1yr
  unit_cost_per_hour: 22.93 # 30% discount from on-demand

utilization:
  average_percent: 65
  peak_percent: 95
  idle_hours_per_month: 180 # ~25% of time

traffic:
  requests_per_month: 12000000
  avg_tokens_per_request: 150
  p99_latency_ms: 450

unit_economics:
  cost_per_request: 0.0055 # $66,000 / 12M requests
  cost_per_1k_tokens: 0.037

scaling_assumptions:
  at_2x_load: 'need 6 GPUs, cost +50%'
  at_5x_load: 'need 12 GPUs, consider on-prem'

This is not about perfect accuracy. It is about making assumptions explicit so you can debug them.
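
One way to keep the model debuggable is to recompute the derived numbers from the raw assumptions so they can never silently drift. A minimal sketch, assuming the file above is saved as gpu-cost-model.yaml and PyYAML is installed:

# check_cost_model.py
import yaml

with open("gpu-cost-model.yaml") as f:
    model = yaml.safe_load(f)

alloc, traffic = model["allocation"], model["traffic"]

monthly_cost = alloc["count"] * alloc["hours_per_month"] * alloc["unit_cost_per_hour"]
cost_per_request = monthly_cost / traffic["requests_per_month"]
cost_per_1k_tokens = monthly_cost / (
    traffic["requests_per_month"] * traffic["avg_tokens_per_request"] / 1000
)

print(f"monthly cost:       ${monthly_cost:,.0f}")
print(f"cost per request:   ${cost_per_request:.4f}")
print(f"cost per 1k tokens: ${cost_per_1k_tokens:.3f}")

declared = model.get("unit_economics", {}).get("cost_per_request")
if declared and abs(declared - cost_per_request) / cost_per_request > 0.05:
    print("warning: declared cost_per_request drifts >5% from the recomputed value")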

Benchmark: what "good" looks like

Industry benchmarks for GPU cost efficiency:

| Metric | Poor | Acceptable | Good | Excellent |
| --- | --- | --- | --- | --- |
| Avg utilization | <30% | 30-50% | 50-70% | >70% |
| Idle time | >30% | 15-30% | 5-15% | <5% |
| Cost per request tracking | None | Monthly | Weekly | Real-time |
| Burst headroom | None | 50%+ buffer | 20-30% planned | Auto-scaling |

If you are significantly below "acceptable", there is likely optimization opportunity. If you are at "excellent" on utilization but having reliability issues, you may be running too lean.

Common cost optimization levers

Once you have a baseline, these are the typical optimization paths:

1. Right-size allocation

# Find pods holding GPU allocations so you can cross-reference them against
# per-pod DCGM utilization (kubectl top only reports CPU/memory, not GPU)
kubectl get pods -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,GPU:.spec.containers[*].resources.limits.nvidia\.com/gpu' \
  | grep -v '<none>'

2. Shift to spot/preemptible for batch workloads

Training jobs and batch inference can often tolerate interruptions:

# Kubernetes spot/preemptible node pool (Karpenter v1alpha5 Provisioner shown;
# newer Karpenter releases express the same thing as a NodePool)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot # illustrative name
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ['spot']
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ['p4d.24xlarge', 'p3.16xlarge']

3. Implement request-based autoscaling

Scale down during low-traffic periods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api # illustrative name
spec:
  scaleTargetRef: # point at the deployment serving inference
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 1
  maxReplicas: 8
  metrics:
    # External metrics require a metrics adapter (e.g. prometheus-adapter or KEDA)
    - type: External
      external:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: '10'

4. Consider on-prem for steady-state workloads

If utilization is consistently high and workloads are stable, the math often favors on-prem. See Hybrid/On-Prem GPU: The Boring GitOps Path for implementation details.

Getting to a baseline

Start with these questions:

  1. What did GPU-related compute cost in the last 30/60/90 days?
  2. What is the cost per request or cost per inference call?
  3. What percentage of GPU hours were idle?
  4. What is your current procurement mix (on-demand, reserved, spot)?

From there, build a simple model with explicit assumptions. The goal is not perfect accuracy; it is a debuggable baseline you can iterate on.

Quick baseline checklist

  • GPU allocation visible per namespace/workload
  • Utilization metrics collected (DCGM or equivalent)
  • Cost per request calculable (even if rough)
  • Idle time quantified
  • Procurement strategy documented
  • Scaling assumptions explicit

Next steps: If you want help building a cost baseline and identifying optimization opportunities, the AI Infra Readiness Audit includes a full GPU cost model as a core deliverable.
