GPU Failure Modes: What Breaks and How to Debug It
December 29, 2025 · 5 min read
GPU infrastructure fails differently than CPU infrastructure. The failure modes are less familiar, the debugging tools are less mature, and the blast radius is often larger (GPU nodes are expensive, so you have fewer of them).
Here's a field guide to the failures I see most often and how to debug them.
Diagnostic decision tree
When GPU workloads fail, start here:
Quick diagnostic commands:
# Step 1: Check pod status
kubectl get pods -l app=inference -o wide
# Step 2: Check pod events (scheduling issues show here)
kubectl describe pod <pod-name> | tail -20
# Step 3: Check GPU node status
kubectl get nodes -l accelerator=nvidia -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.conditions[-1].type,\
GPU:.status.allocatable.\"nvidia\\.com/gpu\"
# Step 4: Check GPU health (NVIDIA)
kubectl exec -it <pod-on-gpu-node> -- nvidia-smi
# Step 5: Check recent logs
kubectl logs <pod-name> --tail=100
Driver and runtime failures (CUDA/ROCm)
The GPU driver stack has multiple layers, and each can fail:
What you will see:
- CUDA error: no kernel image is available for execution
- ROCm runtime error: invalid device ordinal
- CUDA error: CUDA driver version is insufficient for CUDA runtime version
- Pods stuck in CrashLoopBackOff with GPU-related errors
- Silent failures where GPU is "available" but not functional
Root causes:
| Error pattern | Likely cause | Fix |
|---|---|---|
| no kernel image available | CUDA compute capability mismatch | Use correct base image |
| driver version insufficient | Host driver too old | Update host driver |
| invalid device ordinal | GPU not visible to container | Check device plugin |
| CUDA initialization error | Driver crash or GPU hang | Restart node or GPU reset |
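If the table points at a compute-capability mismatch, a quick check from inside the container confirms it. A minimal Python sketch, assuming PyTorch is the framework in the image:

```python
# Minimal sketch: confirm driver/runtime/compute-capability alignment from
# inside the container. Assumes PyTorch; adjust for your framework.
import torch

if not torch.cuda.is_available():
    raise SystemExit("CUDA not available: check driver, device plugin, and runtime")

print("CUDA runtime (PyTorch build):", torch.version.cuda)
print("Compiled arch list:", torch.cuda.get_arch_list())

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    # "no kernel image is available" usually means sm_{major}{minor}
    # is missing from the compiled arch list printed above
    print(f"GPU {i}: {name}, compute capability sm_{major}{minor}")
```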
Debugging driver issues:
# Check host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Check CUDA runtime version in container
kubectl exec -it <pod> -- nvcc --version
# Check for driver errors in kernel log
kubectl debug node/<gpu-node> -it --image=busybox -- \
cat /host/var/log/dmesg | grep -i "nvidia\|nvrm\|xid"
# Check GPU error codes (XID errors indicate hardware/driver issues)
nvidia-smi -q | grep -i "Xid\|Error"
# For AMD: Check ROCm version and device status
rocm-smi --showdriverversion
rocm-smi --showhw
Common XID error codes (NVIDIA):
| XID | Meaning | Action |
|---|---|---|
| 13 | Graphics engine exception | Check for OOM or driver bug |
| 31 | GPU memory page fault | Usually OOM, check memory usage |
| 43 | GPU stopped responding | GPU hang, needs reset |
| 45 | Preemptive cleanup | Normal during driver recovery |
| 48 | DBE (double-bit error) | Hardware failure, replace GPU |
| 63 | ECC page retirement | Too many ECC errors, replace GPU |
| 79 | GPU off the bus | PCIe issue, check physical connection |
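To turn raw kernel-log noise into the actions above, a small parser helps. An illustrative sketch that assumes the usual NVRM Xid message format (the filename and action mapping are mine, not NVIDIA's):

```python
# usage: dmesg | python3 scan_xid.py   (filename is hypothetical)
# Scans kernel-log lines for "NVRM: Xid (PCI:...): <code>, ..." messages and
# maps the code to the actions in the table above.
import re
import sys

XID_ACTIONS = {
    13: "Graphics engine exception: check for OOM or driver bug",
    31: "GPU memory page fault: usually OOM",
    43: "GPU stopped responding: needs reset",
    45: "Preemptive cleanup: normal during driver recovery",
    48: "Double-bit ECC error: replace GPU",
    63: "ECC page retirement: replace GPU",
    79: "GPU off the bus: check PCIe / physical connection",
}

XID_RE = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

for line in sys.stdin:
    m = XID_RE.search(line)
    if m:
        pci, code = m.group(1), int(m.group(2))
        print(f"{pci}: Xid {code} -> {XID_ACTIONS.get(code, 'see NVIDIA Xid docs')}")
```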
Prevention strategies:
# Add driver version label to GPU nodes
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    nvidia.com/cuda.driver.major: '535'
    nvidia.com/cuda.driver.minor: '129'
    nvidia.com/cuda.driver.rev: '03'
---
# Use node selector to ensure driver compatibility
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    nvidia.com/cuda.driver.major: '535'
  containers:
  - name: inference
    image: vllm/vllm-openai:v0.4.0 # Built for CUDA 12.1
OOM: GPU memory vs system memory
GPU OOM is the most common inference failure, but it is often misdiagnosed. There are two distinct OOM scenarios:
GPU OOM symptoms:
- CUDA error: out of memory
- torch.cuda.OutOfMemoryError
- Process crashes but pod shows Running (not OOMKilled)
- Inference latency spikes before crash (memory pressure)
System OOM symptoms:
- Pod status shows OOMKilled
- Container restarts with exit code 137
- Often happens during large batch processing
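In application code the two failure modes call for different handling: GPU OOM can often be caught and retried with a smaller batch, while system OOM kills the container outright. A minimal sketch of the GPU-side pattern, assuming PyTorch (run_batch and batch are hypothetical placeholders):

```python
# Sketch: catch torch.cuda.OutOfMemoryError and retry with a smaller batch
# instead of letting the process die while the pod still shows Running.
import torch

def infer_with_backoff(run_batch, batch, min_size=1):
    while True:
        try:
            return run_batch(batch)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()           # release cached blocks before retrying
            if len(batch) <= min_size:
                raise                          # genuinely too big: fail loudly
            batch = batch[: len(batch) // 2]   # halve the batch and retry
```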
Debugging memory issues:
# Check current GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total,memory.free \
--format=csv,noheader,nounits
# Monitor GPU memory over time
watch -n 1 'nvidia-smi --query-gpu=memory.used --format=csv,noheader'
# Check memory allocation by process
nvidia-smi pmon -c 1
# For detailed CUDA memory accounting in PyTorch:
# Add this to your container
kubectl exec -it <pod> -- python3 -c "
import torch
print(torch.cuda.memory_summary())
"
# Check system memory
kubectl top pod <pod-name>
kubectl describe pod <pod-name> | grep -A10 "Last State"
# For AMD GPUs
rocm-smi --showmeminfo vram
rocm-smi --showmemuse
Memory calculation for LLMs:
Model memory ≈ (parameters × bytes_per_param) + overhead
Examples (FP16):
- 7B model: 7B × 2 bytes = ~14GB
- 13B model: 13B × 2 bytes = ~26GB
- 70B model: 70B × 2 bytes = ~140GB (needs multi-GPU)
KV cache per request ≈ 2 × layers × heads × dim × context_length × 2 bytes
Example (Llama 7B, 4096 context):
- 2 × 32 × 32 × 128 × 4096 × 2 = ~2GB per concurrent request
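The same arithmetic as a small helper, using the Llama 7B shapes from the example and decimal GB throughout:

```python
# Worked version of the formulas above: weight memory plus KV cache at a given
# concurrency. The call below uses the Llama 7B shapes from the example.
def llm_memory_gb(params_b, layers, heads, head_dim, context, concurrent, bytes_per_param=2):
    weights = params_b * 1e9 * bytes_per_param
    # K and V caches: 2 tensors x layers x heads x head_dim x context x bytes
    kv_per_request = 2 * layers * heads * head_dim * context * bytes_per_param
    total = weights + kv_per_request * concurrent
    return weights / 1e9, kv_per_request / 1e9, total / 1e9

w, kv, total = llm_memory_gb(7, 32, 32, 128, 4096, concurrent=8)
print(f"weights ~{w:.0f} GB, KV/request ~{kv:.1f} GB, total ~{total:.0f} GB for 8 requests")
```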
Prevention and fixes:
# Set environment variables for better GPU memory management
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: inference
    env:
    # PyTorch memory management
    - name: PYTORCH_CUDA_ALLOC_CONF
      value: 'expandable_segments:True,max_split_size_mb:128'
    # Limit CUDA memory usage (reserve 1GB for system)
    - name: CUDA_MEMORY_FRACTION
      value: '0.95'
    # For vLLM: limit KV cache
    - name: VLLM_GPU_MEMORY_UTILIZATION
      value: '0.85'
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 32Gi # System memory limit
      requests:
        memory: 24Gi
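For plain PyTorch serving code, the same intent can be expressed in-process (vLLM manages this itself through its gpu-memory-utilization setting). A small sketch, assuming PyTorch:

```python
# In-process equivalent of the memory-fraction intent above for plain PyTorch code.
import torch

# Cap this process at 95% of VRAM on device 0
torch.cuda.set_per_process_memory_fraction(0.95, device=0)

# Inspect allocator state when debugging fragmentation or OOM
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```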
Quantization to reduce memory:
| Precision | Memory per 7B | Memory per 70B | Accuracy loss |
|---|---|---|---|
| FP32 | ~28GB | ~280GB | Baseline |
| FP16 | ~14GB | ~140GB | Negligible |
| INT8 | ~7GB | ~70GB | Minor |
| INT4 | ~3.5GB | ~35GB | Moderate |
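One way to realize the INT8 row in practice is bitsandbytes quantization through transformers. An illustrative sketch, assuming transformers, bitsandbytes, and accelerate are installed (the model ID is a placeholder):

```python
# Sketch: load weights in 8-bit via bitsandbytes to roughly halve FP16 memory.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                        # illustrative model ID
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",                                 # requires accelerate
)
print(f"footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```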
Scheduling failures (pods stuck pending)
GPU scheduling in Kubernetes has more failure modes than CPU scheduling because GPUs are scarce, expensive, and often have special node configurations.
What you will see:
- Pod stuck in Pending state indefinitely
- Events: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu
- Events: 0/5 nodes are available: 5 node(s) had untolerated taint
Scheduling failure matrix:
| Event message | Root cause | Investigation |
|---|---|---|
| Insufficient nvidia.com/gpu | No GPU capacity | Check node allocatable vs used |
| node(s) had untolerated taint | Missing toleration | Check node taints |
| node(s) didn't match node selector | Label mismatch | Check node labels |
| exceeded quota | ResourceQuota hit | Check namespace quota |
| didn't match Pod's node affinity | Affinity conflict | Check affinity rules |
| Unschedulable | Node cordoned | Check node status |
Debugging scheduling:
# Comprehensive scheduling debug
kubectl describe pod <pod-name> | grep -A20 Events
# Check total vs allocated GPUs per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints[*].key,\
GPU_CAP:.status.capacity.\"nvidia\\.com/gpu\",\
GPU_ALLOC:.status.allocatable.\"nvidia\\.com/gpu\"
# Check GPU usage per node
kubectl describe nodes | grep -A5 "Allocated resources"
# Check who is using GPUs
kubectl get pods -A -o json | jq -r '
.items[] |
select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) |
[.metadata.namespace, .metadata.name, .spec.nodeName,
(.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
@tsv
' | column -t
# Check ResourceQuota
kubectl describe resourcequota -n <namespace>
# Check node taints
kubectl get nodes -l accelerator=nvidia -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints
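When the jq one-liner is not enough, the same accounting can be scripted. A sketch using the official Kubernetes Python client, assuming kubeconfig access and the nvidia.com/gpu resource name:

```python
# Tally allocatable vs requested GPUs per node to see where capacity actually is.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

GPU = "nvidia.com/gpu"
allocatable = {
    n.metadata.name: int(n.status.allocatable.get(GPU, "0"))
    for n in v1.list_node().items
}

requested = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
    for c in pod.spec.containers:
        limits = (c.resources.limits or {}) if c.resources else {}
        requested[pod.spec.node_name] += int(limits.get(GPU, "0"))

for node, total in allocatable.items():
    if total:
        print(f"{node}: {requested[node]}/{total} GPUs allocated")
```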
Common scheduling fixes:
# Pod spec with all scheduling requirements
apiVersion: v1
kind: Pod
metadata:
  name: inference-gpu
spec:
  # Tolerate GPU node taints
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: gpu-workload-only
    operator: Equal
    value: 'true'
    effect: NoSchedule
  # Select correct GPU type
  nodeSelector:
    accelerator: nvidia
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  # Or use node affinity for more flexibility
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.memory
            operator: Gt
            values: ['40000'] # At least 40GB VRAM
  containers:
  - name: inference
    resources:
      limits:
        nvidia.com/gpu: 1
ResourceQuota for GPU namespaces:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: '8'
    limits.nvidia.com/gpu: '8'
    pods: '20'
---
# Check current usage
# kubectl describe resourcequota gpu-quota -n ml-team
Network bottlenecks in inference
GPU inference can be network-bound even when GPUs are underutilized. This is especially true for:
- Large input payloads (images, long text contexts)
- Streaming responses (token-by-token output)
- Multi-GPU inference with tensor parallelism
- High-throughput batch inference
Symptoms of network bottlenecks:
- High latency despite GPU utilization below 50%
- Time-to-first-token (TTFT) much higher than expected
- Throughput does not scale with additional GPUs
- Network interface saturation visible in node metrics
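A quick way to separate network and queueing delay from GPU time is to measure time-to-first-token directly. A sketch against an OpenAI-compatible streaming endpoint such as vLLM (URL and model name are placeholders for your deployment):

```python
# Measure TTFT vs total latency from the client side.
import time
import requests

URL = "http://inference.example.internal:8000/v1/completions"   # placeholder
payload = {"model": "llama-7b", "prompt": "Hello", "max_tokens": 128, "stream": True}

start = time.monotonic()
first_token = None
with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line and first_token is None:
            first_token = time.monotonic() - start   # TTFT: network + queue + prefill
total = time.monotonic() - start

print(f"TTFT {first_token:.2f}s, total {total:.2f}s")
# High TTFT with low GPU utilization points at the network or request queueing.
```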
Debugging network performance:
# Check network throughput between nodes
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 --expose \
  -- iperf3 -s
kubectl run iperf-client --image=networkstatic/iperf3 --rm -it \
  -- iperf3 -c iperf-server -t 10
# Check for packet drops
kubectl exec -it <pod> -- cat /proc/net/dev | grep -E "eth|ens"
# Check network latency
kubectl exec -it <pod> -- ping -c 10 <target-service>
# Monitor network throughput on GPU nodes (node-exporter metrics in Prometheus)
# rate(node_network_receive_bytes_total[5m]) and rate(node_network_transmit_bytes_total[5m])
# Check for NetworkPolicy drops (if using Calico)
kubectl exec -n calico-system -it <calico-node-pod> -- \
calico-node -felix-live | grep -i drop
Network optimization for inference:
# Use hostNetwork for lowest latency (bypasses CNI)
apiVersion: v1
kind: Pod
spec:
  hostNetwork: true # Caution: reduces isolation
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: inference
    ports:
    - containerPort: 8000
      hostPort: 8000
---
# Or use high-performance CNI settings
# For Cilium: enable native routing and disable overlay
# For Calico: use IPIP=Never for direct routing
---
# Increase TCP buffer sizes for large payloads
# Note: sysctls belong in the pod-level securityContext and the kubelet must
# allow them (--allowed-unsafe-sysctls); if the kernel does not namespace them,
# set them on the node instead (e.g. via a tuning DaemonSet)
apiVersion: v1
kind: Pod
spec:
  securityContext:
    sysctls:
    - name: net.core.rmem_max
      value: '134217728'
    - name: net.core.wmem_max
      value: '134217728'
  containers:
  - name: inference
Multi-tenant GPU isolation
Shared GPU clusters create unique isolation challenges that do not exist with CPU workloads:
Isolation issue types:
| Issue | Symptom | Detection |
|---|---|---|
| Memory fragmentation | OOM despite apparent free memory | nvidia-smi shows free but alloc fails |
| PCIe contention | Multi-GPU jobs slow down on shared nodes | Compare single- vs multi-GPU perf |
| Thermal throttling | Latency increases over time | Check GPU temperature trends |
| Context switch overhead | Inconsistent latency | Monitor GPU SM utilization spikes |
| CPU bottleneck | GPU util low despite full queue | Profile CPU during inference |
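Thermal throttling and noisy neighbors show up as trends, not single samples. A sketch that samples nvidia-smi once per second and prints temperature, utilization, and SM clocks:

```python
# Sample GPU temperature, utilization, and SM clocks over time via nvidia-smi.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,utilization.gpu,clocks.sm",
         "--format=csv,noheader,nounits"]

for _ in range(60):                      # one sample per second for a minute
    out = subprocess.check_output(QUERY, text=True)
    for row in out.strip().splitlines():
        idx, temp, util, clock = [v.strip() for v in row.split(",")]
        # A falling SM clock at high temperature is the throttling signature
        print(f"gpu{idx} temp={temp}C util={util}% sm_clock={clock}MHz")
    time.sleep(1)
```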
Debugging isolation issues:
# Check for thermal throttling
nvidia-smi -q | grep -A5 "Temperature\|Throttle"
# Monitor GPU activity in real-time
nvidia-smi dmon -s pucvmet -d 1
# Check for PCIe bandwidth limits
nvidia-smi topo -m
# Check GPU process isolation
nvidia-smi pmon -c 5
# Profile memory fragmentation
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
MIG (Multi-Instance GPU) for isolation (A100/H100):
# Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1
# Create MIG instances (example: 7 x 1g.10gb slices)
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19
# Create compute instances
sudo nvidia-smi mig -i 0 -cci
# List MIG devices
nvidia-smi mig -lgi
# In Kubernetes, use nvidia.com/mig-1g.10gb instead of nvidia.com/gpu
# Pod requesting MIG slice instead of full GPU
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: inference
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1 # 3/7 of A100, 40GB VRAM
Pod cleanup for memory isolation:
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: inference
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Clear CUDA cache before termination
            python3 -c "import torch; torch.cuda.empty_cache()" 2>/dev/null || true
            # Give time for memory release
            sleep 5
Observability: comprehensive GPU monitoring
Minimum instrumentation for production GPU workloads:
Essential Prometheus metrics:
# GPU utilization (should be 60-80% for healthy workloads)
avg(DCGM_FI_DEV_GPU_UTIL) by (gpu, pod)
# GPU memory utilization
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100
# GPU temperature (throttling starts ~80°C)
DCGM_FI_DEV_GPU_TEMP
# PCIe throughput
rate(DCGM_FI_DEV_PCIE_TX_THROUGHPUT[5m])
rate(DCGM_FI_DEV_PCIE_RX_THROUGHPUT[5m])
# Power usage (indicates actual compute)
DCGM_FI_DEV_POWER_USAGE
# Error counts (XID errors)
increase(DCGM_FI_DEV_XID_ERRORS[1h])
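The same metrics can be pulled programmatically for reports or cleanup jobs. A sketch against the Prometheus HTTP API (the Prometheus URL is a placeholder for your monitoring stack):

```python
# Query Prometheus for GPUs averaging under 20% utilization over 30 minutes.
import requests

PROM = "http://prometheus.monitoring.svc:9090"        # placeholder
query = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20"

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels, (_, value) = result["metric"], result["value"]
    print(f"{labels.get('Hostname', '?')} gpu={labels.get('gpu', '?')} util={value}%")
```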
Alert rules for GPU workloads:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
  - name: gpu-health
    rules:
    - alert: GPUHighTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'GPU {{ $labels.gpu }} temperature {{ $value }}°C'
    - alert: GPUMemoryPressure
      expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 'GPU {{ $labels.gpu }} memory >95% used'
    - alert: GPUXIDError
      expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      labels:
        severity: critical
      annotations:
        summary: 'GPU XID error detected on {{ $labels.gpu }}'
    - alert: GPULowUtilization
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: 'GPU {{ $labels.gpu }} utilization below 20% for 1h'
dcgm-exporter deployment:
# Install with Helm
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--set serviceMonitor.enabled=true \
--set arguments="{-f,/etc/dcgm-exporter/dcp-metrics-included.csv}"
Failure mode checklist
Quick reference for incident response:
| Failure type | First check | Quick fix |
|---|---|---|
| CUDA error | nvidia-smi on node | Restart pod, check driver version |
| GPU OOM | nvidia-smi --query-gpu=memory.used | Reduce batch size, enable quantization |
| System OOM | kubectl describe pod (OOMKilled) | Increase memory limit |
| Scheduling failure | kubectl describe pod (Events) | Check taints, quotas, selectors |
| High latency | GPU utilization during request | Check network, queue depth |
| Inconsistent latency | Compare across pods | Check for noisy neighbors, use MIG |
| Pod CrashLoopBackOff | kubectl logs <pod> --previous | Check driver, memory, or config error |
| Node NotReady | kubectl describe node | Check kubelet, GPU driver, hardware |
Next steps: If you want a systematic review of your GPU infrastructure failure modes and mitigations, the AI Infra Readiness Audit includes a risk register as a core deliverable.
Related reading:
- SLOs for Inference - Defining reliability targets
- GPU Cost Baseline - When failures drive up cost
- AI Infra Readiness Audit checklist - The full diagnostic framework