
GPU Failure Modes: What Breaks and How to Debug It

December 29, 2025 · 5 min read

Professional · reliability · gpu · debugging · kubernetes · ai-infra-readiness

GPU infrastructure fails differently than CPU infrastructure. The failure modes are less familiar, the debugging tools are less mature, and the blast radius is often larger (GPU nodes are expensive, so you have fewer of them).

Here's a field guide to the failures I see most often and how to debug them.

Diagnostic decision tree

When GPU workloads fail, start here:

Decision tree for GPU incidents: start with whether the pod is stuck pending, then distinguish runtime errors from performance degradations.
Figure 1. Start with scheduling vs runtime vs performance. It narrows the search space quickly.

Quick diagnostic commands:

# Step 1: Check pod status
kubectl get pods -l app=inference -o wide

# Step 2: Check pod events (scheduling issues show here)
kubectl describe pod <pod-name> | tail -20

# Step 3: Check GPU node status
kubectl get nodes -l accelerator=nvidia -o custom-columns=\
NAME:.metadata.name,\
READY:.status.conditions[-1].status,\
GPU:.status.allocatable.\"nvidia\\.com/gpu\"

# Step 4: Check GPU health (NVIDIA)
kubectl exec -it <pod-on-gpu-node> -- nvidia-smi

# Step 5: Check recent logs
kubectl logs <pod-name> --tail=100

Driver and runtime failures (CUDA/ROCm)

The GPU driver stack has multiple layers, and each can fail:

Layered GPU driver stack from application down to hardware, with callouts for where version mismatches, driver crashes, and hardware errors tend to surface.
Figure 2. Most “mystery” GPU failures are boundary failures: runtime ↔ driver ↔ kernel.

What you will see:

  • CUDA error: no kernel image is available for execution
  • ROCm runtime error: invalid device ordinal
  • CUDA error: CUDA driver version is insufficient for CUDA runtime version
  • Pods stuck in CrashLoopBackOff with GPU-related errors
  • Silent failures where GPU is "available" but not functional

Root causes:

Error pattern | Likely cause | Fix
no kernel image available | CUDA compute capability mismatch | Use correct base image
driver version insufficient | Host driver too old | Update host driver
invalid device ordinal | GPU not visible to container | Check device plugin
CUDA initialization error | Driver crash or GPU hang | Restart node or GPU reset
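
When the table above points at a runtime/driver mismatch, it is often fastest to confirm the pairing from inside the pod. A minimal sketch, assuming PyTorch is installed in the serving image (the checks map directly onto the errors above):

# check_cuda_stack.py - quick runtime/driver sanity check inside a GPU pod
import torch

print("CUDA runtime bundled with PyTorch:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
              f"(compute capability {major}.{minor})")
else:
    # Typical causes: host driver too old for the bundled runtime, or the
    # device plugin did not expose the GPU to this container.
    print("No usable GPU; compare the host driver version from nvidia-smi "
          "with the runtime version printed above.")

Run it with kubectl exec -it <pod> -- python3 check_cuda_stack.py.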

Debugging driver issues:

# Check host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check CUDA toolkit version in the container (if nvcc is installed;
# otherwise: kubectl exec -it <pod> -- python3 -c "import torch; print(torch.version.cuda)")
kubectl exec -it <pod> -- nvcc --version

# Check for driver errors in kernel log
kubectl debug node/<gpu-node> -it --image=busybox -- \
  cat /host/var/log/dmesg | grep -i "nvidia\|nvrm\|xid"

# Check GPU error codes (XID errors indicate hardware/driver issues)
nvidia-smi -q | grep -i "Xid\|Error"

# For AMD: Check ROCm version and device status
rocm-smi --showdriverversion
rocm-smi --showhw

Common XID error codes (NVIDIA):

XID | Meaning | Action
13 | Graphics engine exception | Check for OOM or driver bug
31 | GPU memory page fault | Usually OOM, check memory usage
43 | GPU stopped responding | GPU hang, needs reset
45 | Preemptive cleanup | Normal during driver recovery
48 | DBE (double-bit error) | Hardware failure, replace GPU
63 | ECC page retirement | Too many ECC errors, replace GPU
79 | GPU off the bus | PCIe issue, check physical connection
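
If you scrape kernel logs regularly, a small parser turns raw XID lines into the actions above. A minimal sketch, assuming you pipe it dmesg output from the GPU node; the mapping mirrors the table and is not exhaustive:

# xid_scan.py - map NVIDIA XID errors found in dmesg output to likely actions
import re
import sys

# Subset of XID codes from the table above; extend as needed.
XID_ACTIONS = {
    13: "Graphics engine exception: check for OOM or driver bug",
    31: "GPU memory page fault: usually OOM, check memory usage",
    43: "GPU stopped responding: GPU hang, needs reset",
    45: "Preemptive cleanup: normal during driver recovery",
    48: "Double-bit ECC error: hardware failure, replace GPU",
    63: "ECC page retirement: too many ECC errors, replace GPU",
    79: "GPU fell off the bus: PCIe issue, check physical connection",
}

# Typical kernel log line: "NVRM: Xid (PCI:0000:3b:00): 31, pid=1234, ..."
XID_RE = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

for line in sys.stdin:
    match = XID_RE.search(line)
    if match:
        device, code = match.group(1), int(match.group(2))
        action = XID_ACTIONS.get(code, "Unknown XID, consult the NVIDIA XID documentation")
        print(f"{device}: XID {code} -> {action}")

Usage on the node (or via kubectl debug): dmesg | python3 xid_scan.py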

Prevention strategies:

# Driver version labels on GPU nodes (GPU Feature Discovery applies these automatically)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    nvidia.com/cuda.driver.major: '535'
    nvidia.com/cuda.driver.minor: '129'
    nvidia.com/cuda.driver.rev: '03'

---
# Use node selector to ensure driver compatibility
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    nvidia.com/cuda.driver.major: '535'
  containers:
    - name: inference
      image: vllm/vllm-openai:v0.4.0 # Built for CUDA 12.1

OOM: GPU memory vs system memory

GPU OOM is the most common inference failure, but it is often misdiagnosed. There are two distinct OOM scenarios:

Two-column diagram contrasting system memory (RAM) versus GPU memory (VRAM) and how out-of-memory failures present in Kubernetes.
Figure 3. RAM OOMs are Kubernetes-visible; VRAM OOMs often look like “random” runtime crashes.

GPU OOM symptoms:

  • CUDA error: out of memory
  • torch.cuda.OutOfMemoryError
  • Process crashes but pod shows Running (not OOMKilled)
  • Inference latency spikes before crash (memory pressure)

System OOM symptoms:

  • Pod status shows OOMKilled
  • Container restarts with exit code 137
  • Often happens during large batch processing
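
Because a VRAM OOM never shows up as OOMKilled, it helps to catch it in the serving code and log allocator state before the process dies. A minimal sketch, assuming a PyTorch-based server; the model and batch arguments are placeholders:

# Sketch: surface GPU OOM explicitly instead of crashing silently.
import torch

def run_inference(model, batch):
    try:
        with torch.no_grad():
            return model(batch)
    except torch.cuda.OutOfMemoryError:
        # Log allocator state so the crash is attributable to VRAM, not RAM.
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU OOM: {allocated:.1f} GB allocated, "
              f"{reserved:.1f} GB reserved by the allocator")
        torch.cuda.empty_cache()  # release cached blocks before retrying smaller batches
        raise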

Debugging memory issues:

# Check current GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total,memory.free \
  --format=csv,noheader,nounits

# Monitor GPU memory over time
watch -n 1 'nvidia-smi --query-gpu=memory.used --format=csv,noheader'

# Check memory allocation by process
nvidia-smi pmon -c 1

# For detailed CUDA memory accounting in PyTorch:
# Add this to your container
kubectl exec -it <pod> -- python3 -c "
import torch
print(torch.cuda.memory_summary())
"

# Check system memory
kubectl top pod <pod-name>
kubectl describe pod <pod-name> | grep -A10 "Last State"

# For AMD GPUs
rocm-smi --showmeminfo vram
rocm-smi --showmemuse

Memory calculation for LLMs:

Model memory ≈ (parameters × bytes_per_param) + overhead

Examples (FP16):
- 7B model:   7B × 2 bytes = ~14GB
- 13B model: 13B × 2 bytes = ~26GB
- 70B model: 70B × 2 bytes = ~140GB (needs multi-GPU)

KV cache per request ≈ 2 × layers × heads × dim × context_length × 2 bytes

Example (Llama 7B, 4096 context):
- 2 × 32 × 32 × 128 × 4096 × 2 = ~2GB per concurrent request
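
The same arithmetic is easy to script when sizing GPUs for a new model. A minimal sketch using the formulas above; the architecture numbers (layers, heads, head dim) are the Llama-7B-style values from the example and the overhead allowance is a rough assumption:

# Sketch: rough VRAM sizing from the formulas above (estimates only).
def model_memory_gb(params_billion, bytes_per_param=2, overhead_gb=1.0):
    # Weights plus a small allowance for activations and runtime overhead.
    return params_billion * bytes_per_param + overhead_gb

def kv_cache_gb_per_request(layers, heads, head_dim, context_len, bytes_per_val=2):
    # 2x for keys and values, stored per layer, head, and position.
    return 2 * layers * heads * head_dim * context_len * bytes_per_val / 1e9

weights = model_memory_gb(7)                      # ~15 GB in FP16
kv = kv_cache_gb_per_request(32, 32, 128, 4096)   # ~2.1 GB per request
concurrent = (80 - weights) / kv                  # what fits on an 80 GB GPU

print(f"Weights: {weights:.0f} GB, KV cache/request: {kv:.1f} GB, "
      f"max concurrent requests on 80 GB: {int(concurrent)}")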

Prevention and fixes:

# Configure GPU memory management (allocator env vars + framework flags)
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: inference
      env:
        # PyTorch allocator tuning (reduces fragmentation)
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: 'expandable_segments:True,max_split_size_mb:128'
      # Cap GPU memory in the framework itself, not via env vars:
      # - PyTorch: torch.cuda.set_per_process_memory_fraction(0.95) in code
      # - vLLM: bound weights + KV cache with --gpu-memory-utilization
      args: ['--gpu-memory-utilization', '0.85']
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: 32Gi # System memory limit
        requests:
          memory: 24Gi

Quantization to reduce memory:

Precision | Memory per 7B | Memory per 70B | Accuracy loss
FP32 | ~28GB | ~280GB | Baseline
FP16 | ~14GB | ~140GB | Negligible
INT8 | ~7GB | ~70GB | Minor
INT4 | ~3.5GB | ~35GB | Moderate
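
As a concrete example of the INT4 row, here is a minimal sketch of loading a model in 4-bit via Hugging Face transformers with bitsandbytes. This assumes transformers and bitsandbytes are installed and the model id is a placeholder; vLLM has its own quantization support via its --quantization flag instead:

# Sketch: load a causal LM in 4-bit to cut VRAM roughly 4x vs FP16.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",     # placeholder: any causal LM repo id
    quantization_config=bnb_config,
    device_map="auto",
)

# Rough check of what actually landed in memory.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")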

Scheduling failures (pods stuck pending)

GPU scheduling in Kubernetes has more failure modes than CPU scheduling because GPUs are scarce, expensive, and often have special node configurations.

What you will see:

  • Pod stuck in Pending state indefinitely
  • Events: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu
  • Events: 0/5 nodes are available: 5 node(s) had untolerated taint

Scheduling failure matrix:

Event message | Root cause | Investigation
Insufficient nvidia.com/gpu | No GPU capacity | Check node allocatable vs used
node(s) had untolerated taint | Missing toleration | Check node taints
node(s) didn't match node selector | Label mismatch | Check node labels
exceeded quota | ResourceQuota hit | Check namespace quota
didn't match Pod's node affinity | Affinity conflict | Check affinity rules
Unschedulable | Node cordoned | Check node status

Debugging scheduling:

# Comprehensive scheduling debug
kubectl describe pod <pod-name> | grep -A20 Events

# Check total vs allocated GPUs per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints[*].key,\
GPU_CAP:.status.capacity.\"nvidia\\.com/gpu\",\
GPU_ALLOC:.status.allocatable.\"nvidia\\.com/gpu\"

# Check GPU usage per node
kubectl describe nodes | grep -A5 "Allocated resources"

# Check who is using GPUs
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) |
  [.metadata.namespace, .metadata.name, .spec.nodeName,
   (.spec.containers[].resources.limits."nvidia.com/gpu" // "0")] |
  @tsv
' | column -t

# Check ResourceQuota
kubectl describe resourcequota -n <namespace>

# Check node taints
kubectl get nodes -l accelerator=nvidia -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints

Common scheduling fixes:

# Pod spec with all scheduling requirements
apiVersion: v1
kind: Pod
metadata:
  name: inference-gpu
spec:
  # Tolerate GPU node taints
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: gpu-workload-only
      operator: Equal
      value: 'true'
      effect: NoSchedule

  # Select correct GPU type
  nodeSelector:
    accelerator: nvidia
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB

  # Or use node affinity for more flexibility
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.memory
                operator: Gt
                values: ['40000'] # At least 40GB VRAM

  containers:
    - name: inference
      resources:
        limits:
          nvidia.com/gpu: 1

ResourceQuota for GPU namespaces:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: '8'
    limits.nvidia.com/gpu: '8'
    pods: '20'
---
# Check current usage
# kubectl describe resourcequota gpu-quota -n ml-team

Network bottlenecks in inference

GPU inference can be network-bound even when GPUs are underutilized. This is especially true for:

  • Large input payloads (images, long text contexts)
  • Streaming responses (token-by-token output)
  • Multi-GPU inference with tensor parallelism
  • High-throughput batch inference

Network path diagram from client to GPU with numbered bottleneck callouts for external bandwidth, ingress/TLS, pod networking, and internal GPU I/O.
Figure 4. If latency is high but GPU util is low, walk the network path end-to-end.

Symptoms of network bottlenecks:

  • High latency despite GPU utilization below 50%
  • Time-to-first-token (TTFT) much higher than expected
  • Throughput does not scale with additional GPUs
  • Network interface saturation visible in node metrics
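
A quick way to separate network and queueing time from GPU time is to measure time-to-first-token from outside the cluster and compare it against server-side latency metrics. A minimal sketch against an OpenAI-compatible streaming endpoint; the URL and model name are placeholders:

# Sketch: measure time-to-first-token (TTFT) and total latency over the wire.
import time
import requests

URL = "http://inference.example.com/v1/completions"  # placeholder endpoint
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 64, "stream": True}

start = time.monotonic()
ttft = None
with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line and ttft is None:
            ttft = time.monotonic() - start  # first streamed chunk arrived
total = time.monotonic() - start

print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s")
# If TTFT here is much higher than what the server reports, the time is being
# spent in the network path (ingress, TLS, CNI), not on the GPU.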

Debugging network performance:

# Check network throughput between nodes
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 \
  -- iperf3 -s
kubectl expose pod iperf-server --port=5201   # so the DNS name resolves

kubectl run iperf-client --image=networkstatic/iperf3 --rm -it \
  -- iperf3 -c iperf-server -t 10

# Check for packet drops
kubectl exec -it <pod> -- cat /proc/net/dev | grep -E "eth|ens"

# Check network latency
kubectl exec -it <pod> -- ping -c 10 <target-service>

# Monitor network on GPU nodes (kubectl top only reports CPU/memory;
# use node-exporter metrics, e.g. rate(node_network_receive_bytes_total[5m]))

# Check for NetworkPolicy drops (if using Calico, inspect iptables drop counters)
kubectl exec -n calico-system -it <calico-node-pod> -- \
  iptables-save -c | grep -i drop

Network optimization for inference:

# Use hostNetwork for lowest latency (bypasses CNI)
apiVersion: v1
kind: Pod
spec:
  hostNetwork: true # Caution: reduces isolation
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: inference
      ports:
        - containerPort: 8000
          hostPort: 8000

---
# Or use high-performance CNI settings
# For Cilium: enable native routing and disable overlay
# For Calico: use IPIP=Never for direct routing

---
# Increase TCP buffer sizes for large payloads
# (net.core.* are unsafe sysctls; the kubelet must allow them via --allowed-unsafe-sysctls)
apiVersion: v1
kind: Pod
spec:
  securityContext:
    # sysctls are set at the pod level, not per container
    sysctls:
      - name: net.core.rmem_max
        value: '134217728'
      - name: net.core.wmem_max
        value: '134217728'
  containers:
    - name: inference

Multi-tenant GPU isolation

Shared GPU clusters create unique isolation challenges that do not exist with CPU workloads:

Isolation issue types:

Issue | Symptom | Detection
Memory fragmentation | OOM despite apparent free memory | nvidia-smi shows free memory but allocation fails
PCIe contention | Multi-GPU jobs slow on shared nodes | Compare single vs multi-GPU perf
Thermal throttling | Latency increases over time | Check GPU temperature trends
Context switch overhead | Inconsistent latency | Monitor GPU SM utilization spikes
CPU bottleneck | GPU util low despite full queue | Profile CPU during inference
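
Several of these issues only show up as trends, so a lightweight poller on the node helps catch them between dashboard checks. A minimal sketch that samples temperature and SM clocks via nvidia-smi; the 80°C threshold and 10-second interval are illustrative:

# Sketch: poll nvidia-smi for signs of thermal throttling or clock drops.
import subprocess
import time

QUERY = "index,temperature.gpu,clocks.sm,utilization.gpu"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for row in out.strip().splitlines():
        idx, temp, sm_clock, util = (v.strip() for v in row.split(","))
        if int(temp) >= 80:  # throttling typically starts around here
            print(f"GPU {idx}: {temp}C at {sm_clock} MHz, util {util}% "
                  "- possible thermal throttling")
    time.sleep(10)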

Debugging isolation issues:

# Check for thermal throttling
nvidia-smi -q | grep -A5 "Temperature\|Throttle"

# Monitor GPU activity in real-time
nvidia-smi dmon -s pucvmet -d 1

# Check for PCIe bandwidth limits
nvidia-smi topo -m

# Check GPU process isolation
nvidia-smi pmon -c 5

# Profile memory fragmentation
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1

MIG (Multi-Instance GPU) for isolation (A100/H100):

# Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Create MIG instances (example: 7 x 1g.10gb slices)
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19

# Create compute instances
sudo nvidia-smi mig -i 0 -cci

# List MIG devices
nvidia-smi mig -lgi

# In Kubernetes, use nvidia.com/mig-1g.10gb instead of nvidia.com/gpu
# Pod requesting MIG slice instead of full GPU
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: inference
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1 # 3/7 of A100, 40GB VRAM

Pod cleanup for memory isolation:

apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: inference
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                # The driver releases VRAM when the process exits; the sleep
                # gives the server time to drain in-flight requests first.
                sleep 5
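
The preStop delay only helps if the serving process itself exits cleanly on SIGTERM so its CUDA context is torn down and the driver frees the memory. A minimal sketch of the signal handling, assuming a simple Python serving loop (most frameworks already do this for you); next_request and run_inference are placeholders:

# Sketch: exit cleanly on SIGTERM so VRAM is released before the pod is replaced.
import signal
import sys

shutting_down = False

def handle_sigterm(signum, frame):
    # Stop accepting new work; in-flight requests get the pod's
    # terminationGracePeriodSeconds to drain.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def serve_forever(next_request, run_inference):
    while not shutting_down:
        request = next_request()        # placeholder: pull work from your queue
        if request is not None:
            run_inference(request)      # placeholder: your model call
    # A normal interpreter exit tears down the CUDA context and frees VRAM.
    sys.exit(0)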

Observability: comprehensive GPU monitoring

Minimum instrumentation for production GPU workloads:

GPU observability stack showing DCGM exporter feeding Prometheus feeding Grafana, with supporting tools for raw checks (nvidia-smi/rocm-smi), alerting, and logs.
Figure 5. Wire metrics → dashboards → alerts → logs, or you’ll debug blind during incidents.

Essential Prometheus metrics:

# GPU utilization (should be 60-80% for healthy workloads)
avg(DCGM_FI_DEV_GPU_UTIL) by (gpu, pod)

# GPU memory utilization
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100

# GPU temperature (throttling starts ~80°C)
DCGM_FI_DEV_GPU_TEMP

# PCIe throughput (DCGM already reports a rate in KB/s, so average it rather than rate() it)
avg_over_time(DCGM_FI_DEV_PCIE_TX_THROUGHPUT[5m])
avg_over_time(DCGM_FI_DEV_PCIE_RX_THROUGHPUT[5m])

# Power usage (indicates actual compute)
DCGM_FI_DEV_POWER_USAGE

# XID errors (the metric exposes the most recent XID error code, so nonzero means an error occurred)
DCGM_FI_DEV_XID_ERRORS > 0

Alert rules for GPU workloads:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
    - name: gpu-health
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU {{ $labels.gpu }} temperature {{ $value }}°C'

        - alert: GPUMemoryPressure
          expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'GPU {{ $labels.gpu }} memory >95% used'

        - alert: GPUXIDError
          expr: DCGM_FI_DEV_XID_ERRORS > 0
          labels:
            severity: critical
          annotations:
            summary: 'GPU XID error detected on {{ $labels.gpu }}'

        - alert: GPULowUtilization
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: 'GPU {{ $labels.gpu }} utilization below 20% for 1h'

dcgm-exporter deployment:

# Install with Helm
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true \
  --set arguments="{-f,/etc/dcgm-exporter/dcp-metrics-included.csv}"

Failure mode checklist

Quick reference for incident response:

Failure type | First check | Quick fix
CUDA error | nvidia-smi on node | Restart pod, check driver version
GPU OOM | nvidia-smi --query-gpu=memory.used | Reduce batch size, enable quantization
System OOM | kubectl describe pod (OOMKilled) | Increase memory limit
Scheduling failure | kubectl describe pod (Events) | Check taints, quotas, selectors
High latency | GPU utilization during request | Check network, queue depth
Inconsistent latency | Compare across pods | Check for noisy neighbors, use MIG
Pod CrashLoopBackOff | kubectl logs <pod> --previous | Check driver, memory, or config error
Node NotReady | kubectl describe node | Check kubelet, GPU driver, hardware

Next steps: If you want a systematic review of your GPU infrastructure failure modes and mitigations, the AI Infra Readiness Audit includes a risk register as a core deliverable.
