FlexInfer docs

Operations

Common day-2 workflows: inspect, debug, and clean up.


Inspect what’s running

kubectl -n flexinfer-system get deploy,ds,svc
kubectl -n flexinfer-system get pods -o wide

Watch lifecycle state

v1alpha2

kubectl -n flexinfer-system get models -w
kubectl -n flexinfer-system describe model <name>

v1alpha1

kubectl -n flexinfer-system get modeldeployments -w
kubectl -n flexinfer-system describe modeldeployment <name>

Debug a model that won't become ready

  1. Check events:
    kubectl -n flexinfer-system describe pod <pod>
  2. Check backend logs:
    kubectl -n flexinfer-system logs <pod> -c model --tail=200
  3. Confirm GPU resources:
    kubectl get nodes -o wide
    kubectl describe node <node> | grep -E "nvidia.com/gpu|amd.com/gpu"

NVIDIA GPU requirements

Why runtimeClassName: nvidia is required

NVIDIA GPU workloads require the NVIDIA container runtime, which injects the driver device nodes and libraries into the container. FlexInfer automatically sets runtimeClassName: nvidia on pods requesting nvidia.com/gpu resources.

Without this runtime class:

  • The pod may schedule successfully (it requests nvidia.com/gpu and a node has capacity)
  • But /dev/nvidia* device nodes won't be mounted into the container
  • CUDA will report no devices available
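
As an illustration, the pod spec FlexInfer produces for an NVIDIA workload would contain a fragment like the following (names and image are placeholders; runtimeClassName and the resource limit are the fields that matter):

```yaml
# Illustrative pod spec fragment; name and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: example-model-pod
spec:
  runtimeClassName: nvidia        # injected automatically by FlexInfer
  containers:
    - name: model
      image: example/inference:latest
      resources:
        limits:
          nvidia.com/gpu: "1"     # requesting this resource triggers the injection
```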

Verifying NVIDIA runtime is working

  1. Check if the runtime class exists:

    kubectl get runtimeclass nvidia
  2. Verify devices are visible inside the pod:

    kubectl -n flexinfer-system exec <pod> -- ls /dev/nvidia*
    # Should show: /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm ...
  3. Check CUDA availability:

    kubectl -n flexinfer-system exec <pod> -- python -c "import torch; print(torch.cuda.is_available())"
    # Should print: True

Common failure symptoms

Symptom → Likely cause

  • torch.cuda.is_available() == False → missing runtimeClassName: nvidia, or NVIDIA driver not installed
  • Pod stuck in ContainerCreating → RuntimeClass nvidia doesn't exist on the cluster
  • CUDA error: no CUDA-capable device is detected → device nodes not mounted; check the runtime class
  • Pod runs but inference is slow → fell back to CPU; check device availability

Cluster prerequisites

  1. NVIDIA device plugin must be deployed (creates nvidia.com/gpu resources)
  2. NVIDIA container runtime must be installed on GPU nodes
  3. RuntimeClass nvidia must exist:
    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: nvidia
    handler: nvidia

Clean up (important: delete parents)

FlexInfer resources are hierarchical. Delete the parent, not the children.

  • v1alpha2: delete the Model
    kubectl -n flexinfer-system delete model <name>
  • v1alpha1: delete the ModelDeployment (not the Deployment)
    kubectl -n flexinfer-system delete modeldeployment <name>
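
If you are unsure which parent owns a child object, inspect its ownerReferences before deleting anything (a standard kubectl/jsonpath sketch):

```
# Show the kind/name of the owner; delete that owner, not this object.
kubectl -n flexinfer-system get deploy <name> \
  -o jsonpath='{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}'
```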

For detailed cleanup guidance (including RAM caches and stuck Jobs), see the "Resource Cleanup Procedures" section in services/flexinfer/AGENTS.md.

AMD ROCm GPU requirements

Container setup

AMD GPUs require ROCm-compatible container images. FlexInfer uses ROCm variants automatically when AMD GPUs are detected.

# Verify ROCm device visibility (ROCm needs both /dev/kfd and /dev/dri)
kubectl -n flexinfer-system exec <pod> -- ls /dev/kfd /dev/dri/
# Should show: /dev/kfd, card0, renderD128 (or similar)

kubectl -n flexinfer-system exec <pod> -- rocm-smi
# Should show GPU(s) with temperature, utilization, etc.

Common AMD issues

Symptom → Likely cause

  • No GPU detected → missing ROCm container toolkit or device plugin
  • HSA_STATUS_ERROR_OUT_OF_RESOURCES → insufficient GPU memory; reduce batch size
  • Slow inference → using CPU fallback; check /dev/kfd visibility

Cluster prerequisites for AMD

  1. AMD device plugin deployed (creates amd.com/gpu resources)
  2. ROCm drivers installed on GPU nodes (6.0+ recommended)
  3. Container runtime configured for AMD GPUs

Backend-specific quirks

Ollama

  • Model naming: Ollama uses model:tag format (e.g., llama3.2:1b)
  • Pull on first use: Model downloads on first request if not cached
  • Memory: Ollama manages its own memory; set OLLAMA_NUM_PARALLEL for concurrency
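
OLLAMA_NUM_PARALLEL is a real Ollama setting; how it is passed through FlexInfer is sketched below under the assumption that the spec exposes an env list (the env key is an assumption, not a documented FlexInfer field):

```yaml
# Sketch only: the `env` key is assumed; the value is illustrative.
spec:
  env:
    - name: OLLAMA_NUM_PARALLEL   # concurrent requests served per loaded model
      value: "4"
```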

vLLM

  • Memory configuration: vLLM pre-allocates GPU memory
    spec:
      config:
        gpu-memory-utilization: "0.9"  # Use 90% of GPU memory
  • Tensor parallelism: For multi-GPU, set tensor-parallel-size
  • Known issue: vLLM 0.4+ requires specific CUDA versions; check compatibility
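
Building on the config block above, a multi-GPU setup might look like the following sketch (tensor-parallel-size follows the same kebab-case key convention shown; the value is illustrative and must match the number of GPUs requested):

```yaml
# Sketch: shard weights across 2 GPUs; values are illustrative.
spec:
  config:
    gpu-memory-utilization: "0.9"
    tensor-parallel-size: "2"
```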

MLC-LLM

  • Model format: Requires MLC-compiled models (.mlc format)
  • Source URI: Use HF://mlc-ai/<model>-MLC for pre-compiled models
  • Maxwell GPUs (sm_52): Use the Maxwell-specific image variant
    spec:
      image: registry.harbor.lan/flexinfer/mlc-llm:cuda-maxwell-v7
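
Putting the two bullets together, a Maxwell deployment of a pre-compiled model might be sketched like this (the source key and the exact model repo name are assumptions for illustration):

```yaml
# Sketch: the `source` key and model name are illustrative assumptions.
spec:
  source: HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC
  image: registry.harbor.lan/flexinfer/mlc-llm:cuda-maxwell-v7
```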

llama.cpp

  • Model format: Requires GGUF format models
  • CPU fallback: Works without GPU; useful for testing
  • Memory mapping: loads models with mmap by default, which can reduce resident memory usage
  • GPU offload: control how many layers are placed on the GPU
    spec:
      config:
        n-gpu-layers: "35"  # Number of layers to offload to GPU
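
Since llama.cpp only loads GGUF files, a quick local sanity check on a downloaded model is to read its 4-byte magic, which is literally "GGUF" (the temp file created here is just a stand-in for illustration):

```shell
# GGUF files begin with the 4-byte magic "GGUF".
# Create a stand-in file for illustration, then read its magic.
printf 'GGUF-rest-of-file' > /tmp/example.gguf
magic=$(head -c 4 /tmp/example.gguf)
echo "$magic"    # prints: GGUF
if [ "$magic" = "GGUF" ]; then
  echo "looks like a GGUF model"
else
  echo "not a GGUF file"
fi
```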

ComfyUI / Diffusers

  • Image generation: These backends are for image models, not LLMs
  • VRAM requirements: Typically need 8GB+ VRAM for image models
  • Workflow files: ComfyUI requires workflow JSON in the request
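
For reference, a request carrying a workflow graph might be sketched as follows (the endpoint, port, and JSON envelope shown are assumptions based on ComfyUI's usual prompt API, not documented FlexInfer behavior; the "..." is a placeholder for the exported workflow JSON):

```
# Sketch only: endpoint path and payload shape are assumptions.
curl -s http://localhost:8188/prompt \
  -H 'Content-Type: application/json' \
  -d '{"prompt": { ... }}'
```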

Troubleshooting decision tree

Model not becoming Ready?
├── Check phase: kubectl describe model <name>
│   ├── Pending → No matching nodes (check GPU labels, node selector)
│   ├── Downloading → Network issue or invalid source URI
│   ├── Creating → Check pod events and logs
│   └── Error → Check conditions for specific reason
│
├── Pod not starting?
│   ├── ImagePullBackOff → Check image name/registry access
│   ├── ContainerCreating → Check RuntimeClass, volume mounts
│   └── CrashLoopBackOff → Check container logs
│
└── Pod running but model not responding?
    ├── Check model container logs
    ├── Verify port-forward to pod directly
    └── Check health endpoint: /health or /v1/models
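
The last branch of the tree can be walked with a direct port-forward to the pod, bypassing the proxy (the container port 8000 is an assumption; substitute the port your backend listens on):

```
# Sketch: probe the pod directly; port 8000 is assumed.
kubectl -n flexinfer-system port-forward pod/<pod> 8000:8000 &
curl -s localhost:8000/health
curl -s localhost:8000/v1/models
```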

Metrics and monitoring

FlexInfer exposes Prometheus metrics:

# Scrape metrics from controller
kubectl -n flexinfer-system port-forward deploy/flexinfer-controller 8080:8080
curl localhost:8080/metrics

# Scrape metrics from proxy
kubectl -n flexinfer-system port-forward svc/flexinfer-proxy 8080:8080
curl localhost:8080/metrics

Key metrics:

  • flexinfer_models_total{phase} - Models by phase
  • flexinfer_proxy_requests_total{model,status} - Request counts
  • flexinfer_proxy_queue_depth{model} - Pending requests per model
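
Once scraped, these can be queried in Prometheus; for example (the Error phase label matches the decision tree above, while the 5xx status convention is an assumption):

```promql
# Models currently in the Error phase
flexinfer_models_total{phase="Error"}

# Per-model server-error rate over the last 5 minutes (5xx convention assumed)
sum by (model) (rate(flexinfer_proxy_requests_total{status=~"5.."}[5m]))
```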