GPU Failure Modes: What Breaks and How to Debug It
December 29, 2025 · 5 min read
GPU infrastructure fails differently than CPU infrastructure. The failure modes are less familiar, the debugging tools are less mature, and the blast radius is often larger (GPU nodes are expensive, so you have fewer of them).
Here's a field guide to the failures I see most often and how to debug them.
Diagnostic decision tree
When GPU workloads fail, start here:
Quick diagnostic commands:
# Step 1: Check pod status
kubectl get pods -l app=inference -o wide
# Step 2: Check pod events (scheduling issues show here)
kubectl describe pod <pod-name> | tail -20
# Step 3: Check GPU node status
kubectl get nodes -l accelerator=nvidia -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.conditions[-1].type,\
GPU:.status.allocatable.\"nvidia\\.com/gpu\"
# Step 4: Check GPU health (NVIDIA)
kubectl exec -it <pod-on-gpu-node> -- nvidia-smi
# Step 5: Check recent logs
kubectl logs <pod-name> --tail=100
Driver and runtime failures (CUDA/ROCm)
The GPU driver stack has multiple layers, and each can fail:
What you will see:
- CUDA error: no kernel image is available for execution
- ROCm runtime error: invalid device ordinal
- CUDA error: CUDA driver version is insufficient for CUDA runtime version
- Pods stuck in CrashLoopBackOff with GPU-related errors
- Silent failures where GPU is "available" but not functional
Root causes:
| Error pattern | Likely cause | Fix |
|---|---|---|
| no kernel image available | CUDA compute capability mismatch | Use correct base image |
| driver version insufficient | Host driver too old | Update host driver |
| invalid device ordinal | GPU not visible to container | Check device plugin |
| CUDA initialization error | Driver crash or GPU hang | Restart node or GPU reset |
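If the table points at a compute-capability mismatch, a quick check from inside the container confirms it. A minimal Python sketch, assuming PyTorch is the framework in the image:

```python
# Minimal sketch: confirm driver/runtime/compute-capability alignment from
# inside the container. Assumes PyTorch; adjust for your framework.
import torch

if not torch.cuda.is_available():
    raise SystemExit("CUDA not available: check driver, device plugin, and runtime")

print("CUDA runtime (PyTorch build):", torch.version.cuda)
print("Compiled arch list:", torch.cuda.get_arch_list())

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    # "no kernel image is available" usually means sm_{major}{minor}
    # is missing from the compiled arch list printed above
    print(f"GPU {i}: {name}, compute capability sm_{major}{minor}")
```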
Debugging driver issues:
# Check host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Check CUDA runtime version in container
kubectl exec -it <pod> -- nvcc --version
# Check for driver errors in kernel log
kubectl debug node/<gpu-node> -it --image=busybox -- \
cat /host/var/log/dmesg | grep -i "nvidia\|nvrm\|xid"
# Check GPU error codes (XID errors indicate hardware/driver issues)
nvidia-smi -q | grep -i "Xid\|Error"
# For AMD: Check ROCm version and device status
rocm-smi --showdriverversion
rocm-smi --showhw
Common XID error codes (NVIDIA):
| XID | Meaning | Action |
|---|---|---|
| 13 | Graphics engine exception | Check for OOM or driver bug |
| 31 | GPU memory page fault | Usually OOM, check memory usage |
| 43 | GPU stopped responding | GPU hang, needs reset |
| 45 | Preemptive cleanup | Normal during driver recovery |
| 48 | DBE (double-bit error) | Hardware failure, replace GPU |
| 63 | ECC page retirement | Too many ECC errors, replace GPU |
| 79 | GPU off the bus | PCIe issue, check physical connection |
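To turn raw kernel-log noise into the actions above, a small parser helps. An illustrative sketch that assumes the usual NVRM Xid message format (the filename and action mapping are mine, not NVIDIA's):

```python
# usage: dmesg | python3 scan_xid.py   (filename is hypothetical)
# Scans kernel-log lines for "NVRM: Xid (PCI:...): <code>, ..." messages and
# maps the code to the actions in the table above.
import re
import sys

XID_ACTIONS = {
    13: "Graphics engine exception: check for OOM or driver bug",
    31: "GPU memory page fault: usually OOM",
    43: "GPU stopped responding: needs reset",
    45: "Preemptive cleanup: normal during driver recovery",
    48: "Double-bit ECC error: replace GPU",
    63: "ECC page retirement: replace GPU",
    79: "GPU off the bus: check PCIe / physical connection",
}

XID_RE = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

for line in sys.stdin:
    m = XID_RE.search(line)
    if m:
        pci, code = m.group(1), int(m.group(2))
        print(f"{pci}: Xid {code} -> {XID_ACTIONS.get(code, 'see NVIDIA Xid docs')}")
```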
Prevention strategies:
# Add driver version label to GPU nodes
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    nvidia.com/cuda.driver.major: '535'
    nvidia.com/cuda.driver.minor: '129'
    nvidia.com/cuda.driver.rev: '03'
---
# Use node selector to ensure driver compatibility
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    nvidia.com/cuda.driver.major: '535'
  containers:
  - name: inference
    image: vllm/vllm-openai:v0.4.0 # Built for CUDA 12.1
OOM: GPU memory vs system memory
GPU OOM is the most common inference failure, but it is often misdiagnosed. There are two distinct OOM scenarios:
GPU OOM symptoms:
- CUDA error: out of memory
- torch.cuda.OutOfMemoryError
- Process crashes but pod shows Running (not OOMKilled)
- Inference latency spikes before crash (memory pressure)
System OOM symptoms:
- Pod status shows OOMKilled
- Container restarts with exit code 137
- Often happens during large batch processing
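In application code the two failure modes call for different handling: GPU OOM can often be caught and retried with a smaller batch, while system OOM kills the container outright. A minimal sketch of the GPU-side pattern, assuming PyTorch (run_batch and batch are hypothetical placeholders):

```python
# Sketch: catch torch.cuda.OutOfMemoryError and retry with a smaller batch
# instead of letting the process die while the pod still shows Running.
import torch

def infer_with_backoff(run_batch, batch, min_size=1):
    while True:
        try:
            return run_batch(batch)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()           # release cached blocks before retrying
            if len(batch) <= min_size:
                raise                          # genuinely too big: fail loudly
            batch = batch[: len(batch) // 2]   # halve the batch and retry
```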
Debugging memory issues:
# Check current GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total,memory.free \
--format=csv,noheader,nounits
# Monitor GPU memory over time
watch -n 1 'nvidia-smi --query-gpu=memory.used --format=csv,noheader'
# Check memory allocation by process
nvidia-smi pmon -c 1
# For detailed CUDA memory accounting in PyTorch:
# Add this to your container
kubectl exec -it <pod> -- python3 -c "
import torch
print(torch.cuda.memory_summary())
"
# Check system memory
kubectl top pod <pod-name>
kubectl describe pod <pod-name> | grep -A10 "Last State"
# For AMD GPUs
rocm-smi --showmeminfo vram
rocm-smi --showmemuse
Memory calculation for LLMs:
Model memory ≈ (parameters × bytes_per_param) + overhead
Examples (FP16):
- 7B model: 7B × 2 bytes = ~14GB
- 13B model: 13B × 2 bytes = ~26GB
- 70B model: 70B × 2 bytes = ~140GB (needs multi-GPU)
KV cache per request ≈ 2 × layers × heads × dim × context_length × 2 bytes
Example (Llama 7B, 4096 context):
- 2 × 32 × 32 × 128 × 4096 × 2 = ~2GB per concurrent request
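The same arithmetic as a small helper, using the Llama 7B shapes from the example and decimal GB throughout:

```python
# Worked version of the formulas above: weight memory plus KV cache at a given
# concurrency. The call below uses the Llama 7B shapes from the example.
def llm_memory_gb(params_b, layers, heads, head_dim, context, concurrent, bytes_per_param=2):
    weights = params_b * 1e9 * bytes_per_param
    # K and V caches: 2 tensors x layers x heads x head_dim x context x bytes
    kv_per_request = 2 * layers * heads * head_dim * context * bytes_per_param
    total = weights + kv_per_request * concurrent
    return weights / 1e9, kv_per_request / 1e9, total / 1e9

w, kv, total = llm_memory_gb(7, 32, 32, 128, 4096, concurrent=8)
print(f"weights ~{w:.0f} GB, KV/request ~{kv:.1f} GB, total ~{total:.0f} GB for 8 requests")
```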
Prevention and fixes:
# Set environment variables for better GPU memory management
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: inference
    env:
    # PyTorch memory management
    - name: PYTORCH_CUDA_ALLOC_CONF
      value: 'expandable_segments:True,max_split_size_mb:128'
    # Limit CUDA memory usage (reserve 1GB for system)
    - name: CUDA_MEMORY_FRACTION
      value: '0.95'
    # For vLLM: limit KV cache
    - name: VLLM_GPU_MEMORY_UTILIZATION
      value: '0.85'
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 32Gi # System memory limit
      requests:
        memory: 24Gi
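For plain PyTorch serving code, the same intent can be expressed in-process (vLLM manages this itself through its gpu-memory-utilization setting). A small sketch, assuming PyTorch:

```python
# In-process equivalent of the memory-fraction intent above for plain PyTorch code.
import torch

# Cap this process at 95% of VRAM on device 0
torch.cuda.set_per_process_memory_fraction(0.95, device=0)

# Inspect allocator state when debugging fragmentation or OOM
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```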
Quantization to reduce memory:
| Precision | Memory per 7B | Memory per 70B | Accuracy loss |
|---|---|---|---|
| FP32 | ~28GB | ~280GB | Baseline |
| FP16 | ~14GB | ~140GB | Negligible |
| INT8 | ~7GB | ~70GB | Minor |
| INT4 | ~3.5GB | ~35GB | Moderate |
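One way to realize the INT8 row in practice is bitsandbytes quantization through transformers. An illustrative sketch, assuming transformers, bitsandbytes, and accelerate are installed (the model ID is a placeholder):

```python
# Sketch: load weights in 8-bit via bitsandbytes to roughly halve FP16 memory.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                        # illustrative model ID
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",                                 # requires accelerate
)
print(f"footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```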
Scheduling failures (pods stuck pending)
GPU scheduling in Kubernetes has more failure modes than CPU scheduling because GPUs are scarce, expensive, and often have special node configurations.
What you will see:
- Pod stuck in Pending state indefinitely
- Events: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu
- Events: 0/5 nodes are available: 5 node(s) had untolerated taint
Scheduling failure matrix:
| Event message | Root cause | Investigation |
|---|---|---|
| Insufficient nvidia.com/gpu | No GPU capacity | Check node allocatable vs used |
| node(s) had untolerated taint | Missing toleration | Check node taints |
| node(s) didn't match node selector | Label mismatch | Check node labels |
| exceeded quota | ResourceQuota hit | Check namespace quota |
| didn't match Pod's node affinity | Affinity conflict | Check affinity rules |
| Unschedulable | Node cordoned | Check node status |
Debugging scheduling:
# Comprehensive scheduling debug
kubectl describe pod <pod-name> | grep -A20 Events
# Check total vs allocated GPUs per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints[*].key,\
GPU_CAP:.status.capacity.\"nvidia\\.com/gpu\",\
GPU_ALLOC:.status.allocatable.\"nvidia\\.com/gpu\"
# Check GPU usage per node
kubectl describe nodes | grep -A5 "Allocated resources"
# Check who is using GPUs
kubectl get pods -A -o json | jq -r '
.items[] |
select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) |
[.metadata.namespace, .metadata.name, .spec.nodeName,
(.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
@tsv
' | column -t
# Check ResourceQuota
kubectl describe resourcequota -n <namespace>
# Check node taints
kubectl get nodes -l accelerator=nvidia -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints
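When the jq one-liner is not enough, the same accounting can be scripted. A sketch using the official Kubernetes Python client, assuming kubeconfig access and the nvidia.com/gpu resource name:

```python
# Tally allocatable vs requested GPUs per node to see where capacity actually is.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

GPU = "nvidia.com/gpu"
allocatable = {
    n.metadata.name: int(n.status.allocatable.get(GPU, "0"))
    for n in v1.list_node().items
}

requested = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
    for c in pod.spec.containers:
        limits = (c.resources.limits or {}) if c.resources else {}
        requested[pod.spec.node_name] += int(limits.get(GPU, "0"))

for node, total in allocatable.items():
    if total:
        print(f"{node}: {requested[node]}/{total} GPUs allocated")
```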
Common scheduling fixes:
# Pod spec with all scheduling requirements
apiVersion: v1
kind: Pod
metadata:
  name: inference-gpu
spec:
  # Tolerate GPU node taints
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: gpu-workload-only
    operator: Equal
    value: 'true'
    effect: NoSchedule
  # Select correct GPU type
  nodeSelector:
    accelerator: nvidia
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  # Or use node affinity for more flexibility
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.memory
            operator: Gt
            values: ['40000'] # At least 40GB VRAM
  containers:
  - name: inference
    resources:
      limits:
        nvidia.com/gpu: 1
ResourceQuota for GPU namespaces:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: '8'
    limits.nvidia.com/gpu: '8'
    pods: '20'
---
# Check current usage
# kubectl describe resourcequota gpu-quota -n ml-team
Network bottlenecks in inference
GPU inference can be network-bound even when GPUs are underutilized. This is especially true for:
- Large input payloads (images, long text contexts)
- Streaming responses (token-by-token output)
- Multi-GPU inference with tensor parallelism
- High-throughput batch inference
Symptoms of network bottlenecks:
- High latency despite GPU utilization below 50%
- Time-to-first-token (TTFT) much higher than expected
- Throughput does not scale with additional GPUs
- Network interface saturation visible in node metrics
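A quick way to separate network and queueing delay from GPU time is to measure time-to-first-token directly. A sketch against an OpenAI-compatible streaming endpoint such as vLLM (URL and model name are placeholders for your deployment):

```python
# Measure TTFT vs total latency from the client side.
import time
import requests

URL = "http://inference.example.internal:8000/v1/completions"   # placeholder
payload = {"model": "llama-7b", "prompt": "Hello", "max_tokens": 128, "stream": True}

start = time.monotonic()
first_token = None
with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line and first_token is None:
            first_token = time.monotonic() - start   # TTFT: network + queue + prefill
total = time.monotonic() - start

print(f"TTFT {first_token:.2f}s, total {total:.2f}s")
# High TTFT with low GPU utilization points at the network or request queueing.
```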
Debugging network performance:
# Check network throughput between nodes
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 --expose \
  -- iperf3 -s
kubectl run iperf-client --image=networkstatic/iperf3 --rm -it \
  -- iperf3 -c iperf-server -t 10
# Check for packet drops
kubectl exec -it <pod> -- cat /proc/net/dev | grep -E "eth|ens"
# Check network latency
kubectl exec -it <pod> -- ping -c 10 <target-service>
# Monitor network throughput on GPU nodes (node-exporter metrics in Prometheus)
# rate(node_network_receive_bytes_total[5m]) and rate(node_network_transmit_bytes_total[5m])
# Check for NetworkPolicy drops (if using Calico)
kubectl exec -n calico-system -it <calico-node-pod> -- \
calico-node -felix-live | grep -i drop
Network optimization for inference:
# Use hostNetwork for lowest latency (bypasses CNI)
apiVersion: v1
kind: Pod
spec:
  hostNetwork: true # Caution: reduces isolation
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: inference
    ports:
    - containerPort: 8000
      hostPort: 8000
---
# Or use high-performance CNI settings
# For Cilium: enable native routing and disable overlay
# For Calico: use IPIP=Never for direct routing
---
# Increase TCP buffer sizes for large payloads
# Note: sysctls belong in the pod-level securityContext and the kubelet must
# allow them (--allowed-unsafe-sysctls); if the kernel does not namespace them,
# set them on the node instead (e.g. via a tuning DaemonSet)
apiVersion: v1
kind: Pod
spec:
  securityContext:
    sysctls:
    - name: net.core.rmem_max
      value: '134217728'
    - name: net.core.wmem_max
      value: '134217728'
  containers:
  - name: inference
Multi-tenant GPU isolation
Shared GPU clusters create unique isolation challenges that do not exist with CPU workloads:
Isolation issue types:
| Issue | Symptom | Detection |
|---|---|---|
| Memory fragmentation | OOM despite apparent free memory | nvidia-smi shows free but alloc fails |
| PCIe contention | Multi-GPU jobs slow down on shared nodes | Compare single- vs multi-GPU perf |
| Thermal throttling | Latency increases over time | Check GPU temperature trends |
| Context switch overhead | Inconsistent latency | Monitor GPU SM utilization spikes |
| CPU bottleneck | GPU util low despite full queue | Profile CPU during inference |
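Thermal throttling and noisy neighbors show up as trends, not single samples. A sketch that samples nvidia-smi once per second and prints temperature, utilization, and SM clocks:

```python
# Sample GPU temperature, utilization, and SM clocks over time via nvidia-smi.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,utilization.gpu,clocks.sm",
         "--format=csv,noheader,nounits"]

for _ in range(60):                      # one sample per second for a minute
    out = subprocess.check_output(QUERY, text=True)
    for row in out.strip().splitlines():
        idx, temp, util, clock = [v.strip() for v in row.split(",")]
        # A falling SM clock at high temperature is the throttling signature
        print(f"gpu{idx} temp={temp}C util={util}% sm_clock={clock}MHz")
    time.sleep(1)
```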
Debugging isolation issues:
# Check for thermal throttling
nvidia-smi -q | grep -A5 "Temperature\|Throttle"
# Monitor GPU activity in real-time
nvidia-smi dmon -s pucvmet -d 1
# Check for PCIe bandwidth limits
nvidia-smi topo -m
# Check GPU process isolation
nvidia-smi pmon -c 5
# Profile memory fragmentation
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
MIG (Multi-Instance GPU) for isolation (A100/H100):
# Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1
# Create MIG instances (example: 7 x 1g.10gb slices)
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19
# Create compute instances
sudo nvidia-smi mig -i 0 -cci
# List MIG devices
nvidia-smi mig -lgi
# In Kubernetes, use nvidia.com/mig-1g.10gb instead of nvidia.com/gpu
# Pod requesting MIG slice instead of full GPU
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: inference
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1 # 3/7 of A100, 40GB VRAM
Pod cleanup for memory isolation:
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: inference
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Clear CUDA cache before termination
            python3 -c "import torch; torch.cuda.empty_cache()" 2>/dev/null || true
            # Give time for memory release
            sleep 5
Observability: comprehensive GPU monitoring
Minimum instrumentation for production GPU workloads:
Essential Prometheus metrics:
# GPU utilization (should be 60-80% for healthy workloads)
avg(DCGM_FI_DEV_GPU_UTIL) by (gpu, pod)
# GPU memory utilization
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100
# GPU temperature (throttling starts ~80°C)
DCGM_FI_DEV_GPU_TEMP
# PCIe throughput
rate(DCGM_FI_DEV_PCIE_TX_THROUGHPUT[5m])
rate(DCGM_FI_DEV_PCIE_RX_THROUGHPUT[5m])
# Power usage (indicates actual compute)
DCGM_FI_DEV_POWER_USAGE
# Error counts (XID errors)
increase(DCGM_FI_DEV_XID_ERRORS[1h])
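The same metrics can be pulled programmatically for reports or cleanup jobs. A sketch against the Prometheus HTTP API (the Prometheus URL is a placeholder for your monitoring stack):

```python
# Query Prometheus for GPUs averaging under 20% utilization over 30 minutes.
import requests

PROM = "http://prometheus.monitoring.svc:9090"        # placeholder
query = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20"

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels, (_, value) = result["metric"], result["value"]
    print(f"{labels.get('Hostname', '?')} gpu={labels.get('gpu', '?')} util={value}%")
```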
Alert rules for GPU workloads:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
  - name: gpu-health
    rules:
    - alert: GPUHighTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'GPU {{ $labels.gpu }} temperature {{ $value }}°C'
    - alert: GPUMemoryPressure
      expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 'GPU {{ $labels.gpu }} memory >95% used'
    - alert: GPUXIDError
      expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      labels:
        severity: critical
      annotations:
        summary: 'GPU XID error detected on {{ $labels.gpu }}'
    - alert: GPULowUtilization
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: 'GPU {{ $labels.gpu }} utilization below 20% for 1h'
dcgm-exporter deployment:
# Install with Helm
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--set serviceMonitor.enabled=true \
--set arguments="{-f,/etc/dcgm-exporter/dcp-metrics-included.csv}"
Failure mode checklist
Quick reference for incident response:
| Failure type | First check | Quick fix |
|---|---|---|
| CUDA error | nvidia-smi on node | Restart pod, check driver version |
| GPU OOM | nvidia-smi --query-gpu=memory.used | Reduce batch size, enable quantization |
| System OOM | kubectl describe pod (OOMKilled) | Increase memory limit |
| Scheduling failure | kubectl describe pod (Events) | Check taints, quotas, selectors |
| High latency | GPU utilization during request | Check network, queue depth |
| Inconsistent latency | Compare across pods | Check for noisy neighbors, use MIG |
| Pod CrashLoopBackOff | kubectl logs <pod> --previous | Check driver, memory, or config error |
| Node NotReady | kubectl describe node | Check kubelet, GPU driver, hardware |
Next steps: If you want a systematic review of your GPU infrastructure failure modes and mitigations, the AI Infra Readiness Audit includes a risk register as a core deliverable.
Related reading:
- SLOs for Inference - Defining reliability targets
- GPU Cost Baseline - When failures drive up cost
- AI Infra Readiness Audit checklist - The full diagnostic framework