# Models (v1alpha2)
The recommended single-resource API for running models.
`ai.flexinfer/v1alpha2` introduces a simplified CRD: `kind: Model`.

- One resource per served model
- Optional GPU sharing via `spec.gpu.shared`
- Optional scale-to-zero via `spec.serverless`
- Optional proxy/LiteLLM discovery via annotations
## Cluster snapshot (homelab)

For a point-in-time view of what is currently deployed in `flexinfer-system` (models, shared groups, LiteLLM aliases, benchmarks), see:

- `docs/user/flexinfer-system-snapshot.md`
## Minimal example

```yaml
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: llama3-8b
spec:
  backend: ollama
  source: ollama://llama3:8b
```
## spec fields (high-level)

### spec.backend (required)
Backend plugin name. Common values:
- `ollama`
- `vllm`
- `mlc-llm` (alias: `mlc`)
- `llamacpp` (alias: `llama.cpp`)
- `diffusers`
- `comfyui`
- `vllm-omni`
Exact images/args/ports are defined by the backend registry in `services/flexinfer/backend/`.
### spec.source (required)
Model source URI. Supported formats:
- `HF://org/model` (HuggingFace model)
- `ollama://model:tag` (Ollama registry name)
- `file:///path/to/model` (host path inside the container)
- `pvc://pvc-name/path` (PVC-backed model path)
Notes:

- For `backend: llamacpp` with `HF://` sources, set `spec.config.ggufFile: <file>.gguf` to select a GGUF file within the downloaded repo.
- FlexInfer now auto-prefetches only the required llama.cpp files for HF GGUF repos:
  - If `spec.config.ggufFile` is set, the prefetch job downloads just that GGUF (plus the optional relative `spec.config.mmproj`).
  - This avoids pulling full multi-quant repos by default.
- Optional advanced download controls (all in `spec.config`):
  - `hfAllowPatterns`: list (or comma-separated string) passed to `snapshot_download(..., allow_patterns=...)`
  - `hfIgnorePatterns`: list (or comma-separated string) passed to `snapshot_download(..., ignore_patterns=...)`
  - `hfRevision`: revision/tag/commit passed to `snapshot_download(..., revision=...)`
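Putting the llama.cpp download controls together, a minimal sketch (the model name, repo, and file names below are hypothetical, not real resources):

```yaml
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: example-gguf                       # hypothetical name
spec:
  backend: llamacpp
  source: HF://some-org/some-model-GGUF    # hypothetical HF repo
  config:
    ggufFile: some-model.Q4_K_M.gguf       # select one GGUF from the repo
    hfRevision: main                       # optional: pin a revision/tag/commit
```

With `ggufFile` set, the prefetch job should download only that file instead of the whole multi-quant repo.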
### spec.gpu (optional)

Controls GPU allocation and optional time-sharing.

- `vendor`: `auto`, `nvidia`, `amd`, or `cpu`
- `shared`: group name; models with the same value compete for the same GPU
- `priority`: higher wins preemption decisions
- `count`: GPUs required (default 1)
- `vramEstimateMB`: hint for scheduling/binpacking
Notes:

- If you omit `spec.gpu`, the model runs CPU-only.
- If you set `spec.gpu.vendor: cpu`, omit `spec.gpu.count` (it is rejected by CRD validation).
- If you set `spec.gpu.vendor: nvidia` or `amd`, the controller will only schedule on matching GPU nodes (it will not automatically fall back to the other vendor).
- `vramEstimateMB` is optional but strongly recommended on mixed GPU clusters (e.g., Maxwell 6GB + gfx1100 24GB). The scheduler extender uses it, along with the node agent's `flexinfer.ai/gpu-free-memory` annotation, to avoid placing large models onto low-VRAM nodes.
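The GPU fields above can be combined like this; the group name and VRAM estimate are illustrative values, not defaults:

```yaml
spec:
  gpu:
    vendor: nvidia
    shared: small-models   # models in this group time-share one GPU
    priority: 10           # higher wins preemption
    vramEstimateMB: 5000   # binpacking hint for mixed-VRAM clusters
```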
### spec.serverless (optional)

Scale-to-zero behavior.

- `enabled`: default true (homelab-friendly)
- `idleTimeout`: scale down after this idle window
- `coldStartTimeout`: request timeout budget during activation
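A sketch of the serverless block; the duration values (and their Go-style duration format) are assumptions for illustration, not documented defaults:

```yaml
spec:
  serverless:
    enabled: true
    idleTimeout: 15m       # illustrative idle window
    coldStartTimeout: 5m   # illustrative activation budget
```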
### spec.cache (optional)

Model caching strategy.

- `strategy`: `Memory`, `SharedPVC`, or `None`
- `pvcName` / `storageClass` / `size`: only relevant for `SharedPVC`
Notes:

- If `spec.cache.strategy: SharedPVC` and `spec.cache.pvcName` is omitted, the controller auto-creates a PVC named `<model>-cache`.
- If `spec.source` is `pvc://...`, FlexInfer mounts that PVC at `/models` and ignores `spec.cache` for volume provisioning.
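For example, a `SharedPVC` cache that relies on the auto-created PVC; the storage class and size are illustrative assumptions:

```yaml
spec:
  cache:
    strategy: SharedPVC
    storageClass: local-path   # illustrative
    size: 20Gi                 # illustrative
    # pvcName omitted: the controller creates <model>-cache
```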
### spec.config (optional)

Backend-specific configuration as JSON (passed through to the backend plugin).

Example:

```yaml
spec:
  config:
    mode: server
    maxNumSequence: 4
```
### Maxwell (sm_5x) notes

On NVIDIA Maxwell GPUs (compute capability 5.x, e.g. GTX 980 Ti, sm_52), FlexInfer enforces backend compatibility:

- `vllm`, `vllm-omni`, and `diffusers` are rejected on Maxwell.
- `mlc-llm` requires a pre-compiled library (FP32 quantization only). Prefer compiling to `/models/<modelName>/maxwell-lib.so` and setting `jitPolicy: READONLY`.
Example:

```yaml
spec:
  backend: mlc-llm
  gpu:
    vendor: nvidia
  config:
    jitPolicy: READONLY
    # Optional if you compile to /models/<modelName>/maxwell-lib.so:
    # modelLibPath: /models/<modelName>/maxwell-lib.so
```
### spec.resources / spec.nodeSelector (optional)

Pod resources and node selection. If you omit `nodeSelector`, the controller picks GPU nodes automatically.
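As an illustration, using standard Kubernetes resource and selector syntax (the node label and GPU resource name are assumptions about your cluster):

```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: "1"
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1   # hypothetical node
```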
### spec.litellm (optional)

Adds `litellm.flexinfer.ai/*` annotations so a LiteLLM proxy can discover and route requests.
### spec.serviceLabels (optional)

Semantic labels describing the model (for dynamic routing). Example: `["textgen", "code", "fast"]`.
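As a YAML fragment, the example labels above look like:

```yaml
spec:
  serviceLabels:
    - textgen
    - code
    - fast
```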
## status fields (high-level)

`status` reflects lifecycle and routing metadata:

- `phase`: `Idle`, `Pending`, `Loading`, `Ready`, `Preempted`, `Failed`
- `endpoint`: service URL (cluster-internal)
- `lastActiveTime`: last time the proxy observed traffic (used for scale-to-zero)
- `cache`: cache readiness details (`ready`, plus the prefetch/check Job state)
## Examples

- `services/flexinfer/examples/v1alpha2/model-basic.yaml`
- `services/flexinfer/examples/v1alpha2/model-shared-gpu.yaml`
- `services/flexinfer/examples/v1alpha2/model-amd-rocm.yaml`
- `services/flexinfer/examples/v1alpha2/model-image-gen.yaml`