# Operations

Common day-2 workflows: inspect, debug, and clean up.
## Inspect what's running

```bash
kubectl -n flexinfer-system get deploy,ds,svc
kubectl -n flexinfer-system get pods -o wide
```
## Watch lifecycle state

### v1alpha2

```bash
kubectl -n flexinfer-system get models -w
kubectl -n flexinfer-system describe model <name>
```

### v1alpha1

```bash
kubectl -n flexinfer-system get modeldeployments -w
kubectl -n flexinfer-system describe modeldeployment <name>
```
## Debug a model that won't become ready

- Check events:

  ```bash
  kubectl -n flexinfer-system describe pod <pod>
  ```

- Check backend logs:

  ```bash
  kubectl -n flexinfer-system logs <pod> -c model --tail=200
  ```

- Confirm GPU resources:

  ```bash
  kubectl get nodes -o wide
  kubectl describe node <node> | rg -n "nvidia.com/gpu|amd.com/gpu"
  ```
## NVIDIA GPU requirements

### Why `runtimeClassName: nvidia` is required

NVIDIA GPU workloads require the NVIDIA container runtime to function. FlexInfer automatically sets `runtimeClassName: nvidia` on pods requesting `nvidia.com/gpu` resources.

Without this runtime class:

- The pod may schedule successfully (it requests `nvidia.com/gpu` and a node has capacity)
- But `/dev/nvidia*` device nodes won't be mounted into the container
- CUDA will report no devices available
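To see the failure mode in isolation, a minimal hand-written pod such as the sketch below can be used. FlexInfer normally injects the runtime class for you; the pod name and CUDA image tag here are illustrative, not required values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia      # omit this line to reproduce the failure above
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
      command: ["nvidia-smi"]   # fails if /dev/nvidia* is not mounted
      resources:
        limits:
          nvidia.com/gpu: 1
```

With the `runtimeClassName` line removed, the pod still schedules, but `nvidia-smi` cannot find any devices.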
Verifying NVIDIA runtime is working
-
Check if the runtime class exists:
kubectl get runtimeclass nvidia -
Verify devices are visible inside the pod:
kubectl -n flexinfer-system exec <pod> -- ls /dev/nvidia* # Should show: /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm ... -
Check CUDA availability:
kubectl -n flexinfer-system exec <pod> -- python -c "import torch; print(torch.cuda.is_available())" # Should print: True
### Common failure symptoms

| Symptom | Likely Cause |
|---|---|
| `torch.cuda.is_available()` returns `False` | Missing `runtimeClassName: nvidia`, or NVIDIA driver not installed |
| Pod stuck in `ContainerCreating` | RuntimeClass `nvidia` doesn't exist on the cluster |
| `CUDA error: no CUDA-capable device is detected` | Device nodes not mounted; check runtime class |
| Pod runs but inference is slow | Fell back to CPU; check device availability |
### Cluster prerequisites

- NVIDIA device plugin must be deployed (creates `nvidia.com/gpu` resources)
- NVIDIA container runtime must be installed on GPU nodes
- RuntimeClass `nvidia` must exist:

  ```yaml
  apiVersion: node.k8s.io/v1
  kind: RuntimeClass
  metadata:
    name: nvidia
  handler: nvidia
  ```
## Clean up (important: delete parents)

FlexInfer resources are hierarchical. Delete the parent, not the children.

- v1alpha2: delete the `Model`:

  ```bash
  kubectl -n flexinfer-system delete model <name>
  ```

- v1alpha1: delete the `ModelDeployment` (not the Deployment):

  ```bash
  kubectl -n flexinfer-system delete modeldeployment <name>
  ```

For detailed cleanup guidance (including RAM caches and stuck Jobs), see the "Resource Cleanup Procedures" section in services/flexinfer/AGENTS.md.
## AMD ROCm GPU requirements

### Container setup

AMD GPUs require ROCm-compatible container images. FlexInfer uses ROCm variants automatically when AMD GPUs are detected.

```bash
# Verify ROCm device visibility
kubectl -n flexinfer-system exec <pod> -- ls /dev/dri/
# Should show: card0 renderD128 (or similar)

kubectl -n flexinfer-system exec <pod> -- rocm-smi
# Should show GPU(s) with temperature, utilization, etc.
```
### Common AMD issues

| Symptom | Likely Cause |
|---|---|
| No GPU detected | Missing ROCm container toolkit or device plugin |
| `HSA_STATUS_ERROR_OUT_OF_RESOURCES` | Insufficient GPU memory; reduce batch size |
| Slow inference | Using CPU fallback; check `/dev/kfd` visibility |
### Cluster prerequisites for AMD

- AMD device plugin deployed (creates `amd.com/gpu` resources)
- ROCm drivers installed on GPU nodes (6.0+ recommended)
- Container runtime configured for AMD GPUs
## Backend-specific quirks

### Ollama

- Model naming: Ollama uses `model:tag` format (e.g., `llama3.2:1b`)
- Pull on first use: the model downloads on the first request if not already cached
- Memory: Ollama manages its own memory; set `OLLAMA_NUM_PARALLEL` to control concurrency
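As a sketch, `OLLAMA_NUM_PARALLEL` could be passed through the container environment. The `spec.env` field name below is an assumption about the Model schema, not something this page confirms:

```yaml
spec:
  env:
    - name: OLLAMA_NUM_PARALLEL  # how many requests Ollama serves concurrently
      value: "4"
```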
### vLLM

- Memory configuration: vLLM pre-allocates GPU memory

  ```yaml
  spec:
    config:
      gpu-memory-utilization: "0.9"  # Use 90% of GPU memory
  ```

- Tensor parallelism: for multi-GPU, set `tensor-parallel-size`
- Known issue: vLLM 0.4+ requires specific CUDA versions; check compatibility
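Combining the two keys above, a config sketch for a two-GPU node might look like this (the values are illustrative, not recommendations):

```yaml
spec:
  config:
    gpu-memory-utilization: "0.9"  # fraction of each GPU vLLM pre-allocates
    tensor-parallel-size: "2"      # shard the model across two GPUs
```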
### MLC-LLM

- Model format: requires MLC-compiled models (`.mlc` format)
- Source URI: use `HF://mlc-ai/<model>-MLC` for pre-compiled models
- Maxwell GPUs (sm_52): use the Maxwell-specific image variant

  ```yaml
  spec:
    image: registry.harbor.lan/flexinfer/mlc-llm:cuda-maxwell-v7
  ```
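Putting the two bullets together, a spec sketch for a Maxwell node could combine the source URI and image override. The `spec.source` field name is an assumption about the Model schema, and `<model>` is left as the page's placeholder:

```yaml
spec:
  source: HF://mlc-ai/<model>-MLC                                # pre-compiled MLC model
  image: registry.harbor.lan/flexinfer/mlc-llm:cuda-maxwell-v7   # sm_52 image variant
```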
### llama.cpp

- Model format: requires GGUF-format models
- CPU fallback: works without a GPU; useful for testing
- Memory mapping: uses mmap by default, which can reduce memory usage

```yaml
spec:
  config:
    n-gpu-layers: "35"  # Number of layers to offload to GPU
```
### ComfyUI / Diffusers

- Image generation: these backends serve image models, not LLMs
- VRAM requirements: image models typically need 8GB+ VRAM
- Workflow files: ComfyUI requires a workflow JSON in the request
## Troubleshooting decision tree

```
Model not becoming Ready?
├── Check phase: kubectl describe model <name>
│   ├── Pending → No matching nodes (check GPU labels, node selector)
│   ├── Downloading → Network issue or invalid source URI
│   ├── Creating → Check pod events and logs
│   └── Error → Check conditions for specific reason
│
├── Pod not starting?
│   ├── ImagePullBackOff → Check image name/registry access
│   ├── ContainerCreating → Check RuntimeClass, volume mounts
│   └── CrashLoopBackOff → Check container logs
│
└── Pod running but model not responding?
    ├── Check model container logs
    ├── Verify port-forward to pod directly
    └── Check health endpoint: /health or /v1/models
```
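The first branch of the tree can be captured in a small local helper for scripting. This is a sketch that simply mirrors the phase names above, not part of FlexInfer itself:

```bash
# Map a Model phase to the next debugging step (mirrors the decision tree).
diagnose_model_phase() {
  case "$1" in
    Pending)     echo "No matching nodes (check GPU labels, node selector)" ;;
    Downloading) echo "Network issue or invalid source URI" ;;
    Creating)    echo "Check pod events and logs" ;;
    Error)       echo "Check conditions for specific reason" ;;
    *)           echo "Unknown phase: $1" ;;
  esac
}

diagnose_model_phase Pending
# -> No matching nodes (check GPU labels, node selector)
```

Feed it the phase reported by `kubectl describe model <name>` to get the matching next step.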
## Metrics and monitoring

FlexInfer exposes Prometheus metrics:

```bash
# Scrape metrics from the controller
kubectl -n flexinfer-system port-forward deploy/flexinfer-controller 8080:8080
curl localhost:8080/metrics

# Scrape metrics from the proxy
kubectl -n flexinfer-system port-forward svc/flexinfer-proxy 8080:8080
curl localhost:8080/metrics
```
Key metrics:

- `flexinfer_models_total{phase}` - Models by phase
- `flexinfer_proxy_requests_total{model,status}` - Request counts
- `flexinfer_proxy_queue_depth{model}` - Pending requests per model
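As a quick sketch of working with the scrape output, the snippet below sums `flexinfer_models_total` across all non-Ready phases. The sample values are invented for illustration; real numbers come from `/metrics`:

```bash
# Invented sample scrape output (illustrative values only)
cat > /tmp/flexinfer-metrics.txt <<'EOF'
flexinfer_models_total{phase="Ready"} 3
flexinfer_models_total{phase="Error"} 1
flexinfer_models_total{phase="Downloading"} 2
EOF

# Sum models that are not Ready yet
awk '/^flexinfer_models_total/ && !/phase="Ready"/ {s += $2} END {print s}' /tmp/flexinfer-metrics.txt
# -> 3
```

The same one-liner works against a live scrape piped in from `curl localhost:8080/metrics`.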