# Models (v1alpha2)
The recommended single-resource API for running models.
`ai.flexinfer/v1alpha2` introduces a simplified CRD, `kind: Model`:

- One resource per served model
- Optional GPU sharing via `spec.gpu.shared`
- Optional scale-to-zero via `spec.serverless`
- Optional proxy/LiteLLM discovery via annotations
## Minimal example

```yaml
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: llama3-8b
spec:
  backend: ollama
  source: ollama://llama3:8b
```
## spec fields (high-level)
### spec.backend (required)

Backend plugin name. Common values:

- `ollama`
- `vllm`
- `mlc-llm` (alias: `mlc`)
- `llamacpp` (alias: `llama.cpp`)
- `diffusers`
- `comfyui`
- `vllm-omni`
Exact images/args/ports are defined by the backend registry in `services/flexinfer/backend/`.
### spec.source (required)

Model source URI. Supported formats:

- `HF://org/model` (Hugging Face model)
- `ollama://model:tag` (Ollama registry name)
- `file:///path/to/model` (host path inside the container)
- `pvc://pvc-name/path` (PVC-backed model path)
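A PVC-backed source follows the same shape as the minimal example; here is a sketch where the PVC name `models-pvc`, the path, and the model name are placeholders:

```yaml
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: mistral-7b                      # hypothetical model name
spec:
  backend: vllm
  source: pvc://models-pvc/mistral-7b   # PVC "models-pvc" and path are assumptions
```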
### spec.gpu (optional)

Controls GPU allocation and optional time-sharing.

- `shared`: group name; models with the same value compete for the same GPU
- `priority`: higher wins preemption decisions
- `count`: GPUs required (default 1)
- `vramEstimateMB`: hint for scheduling/binpacking
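Two models that set the same `shared` group name would compete for one GPU; a sketch with illustrative values (none of these are documented defaults):

```yaml
spec:
  backend: vllm
  source: HF://org/model
  gpu:
    shared: homelab-gpu0   # hypothetical group name; models with this value share the GPU
    priority: 10           # higher priority wins preemption decisions
    count: 1
    vramEstimateMB: 16000  # rough VRAM hint for scheduling/binpacking
```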
### spec.serverless (optional)

Scale-to-zero behavior.

- `enabled`: default true (homelab-friendly)
- `idleTimeout`: scale down after this idle window
- `coldStartTimeout`: request timeout budget during activation
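A sketch of scale-to-zero tuning; the duration syntax and the specific values shown are assumptions, not documented defaults:

```yaml
spec:
  serverless:
    enabled: true
    idleTimeout: 15m        # scale down after 15 idle minutes (assumed duration syntax)
    coldStartTimeout: 120s  # requests wait up to 2 minutes while the model activates
```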
### spec.cache (optional)

Model caching strategy.

- `strategy`: `Memory`, `SharedPVC`, or `None`
- `pvcName` / `storageClass` / `size`: only relevant for `SharedPVC`
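A `SharedPVC` cache might be configured like this; the PVC name, storage class, and size are placeholders:

```yaml
spec:
  cache:
    strategy: SharedPVC
    pvcName: model-cache      # hypothetical PVC name
    storageClass: local-path  # hypothetical storage class
    size: 100Gi
```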
### spec.config (optional)

Backend-specific configuration as JSON (passed through to the backend plugin).

Example:

```yaml
spec:
  config:
    mode: server
    maxNumSequence: 4
```
### spec.resources / spec.nodeSelector (optional)

Pod resources and node selection. If you omit `nodeSelector`, the controller picks GPU nodes automatically.
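Pinning a model to a specific node could look like this; the node name and resource values are illustrative (only `nvidia.com/gpu` is the standard NVIDIA device-plugin resource name):

```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: "1"
      memory: 16Gi
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1   # hypothetical node name
```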
### spec.litellm (optional)

Adds `litellm.flexinfer.ai/*` annotations so a LiteLLM proxy can discover and route requests.
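The exact shape of `spec.litellm` isn't shown here; assuming a simple enable flag plus an optional alias, discovery might be switched on like this:

```yaml
spec:
  litellm:
    enabled: true          # hypothetical field
    modelAlias: llama3-8b  # hypothetical field: the name LiteLLM clients would route by
```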
### spec.serviceLabels (optional)

Semantic labels describing the model (for dynamic routing). Example: `["textgen", "code", "fast"]`.
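In a manifest, the example labels above would appear as a YAML list:

```yaml
spec:
  serviceLabels:
    - textgen
    - code
    - fast
```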
## status fields (high-level)

`status` reflects lifecycle + routing metadata:

- `phase`: `Idle`, `Pending`, `Loading`, `Ready`, `Preempted`, or `Failed`
- `endpoint`: service URL (cluster-internal)
- `lastActiveTime`: last time the proxy observed traffic (used for scale-to-zero)
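A Ready model's status might look like the following sketch; the service URL, port, and timestamp are illustrative, not guaranteed formats:

```yaml
status:
  phase: Ready
  endpoint: http://llama3-8b.flexinfer.svc.cluster.local:11434  # hypothetical cluster-internal URL
  lastActiveTime: "2025-01-01T12:00:00Z"
```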
## Examples

- `services/flexinfer/examples/v1alpha2/model-basic.yaml`
- `services/flexinfer/examples/v1alpha2/model-shared-gpu.yaml`
- `services/flexinfer/examples/v1alpha2/model-amd-rocm.yaml`
- `services/flexinfer/examples/v1alpha2/model-image-gen.yaml`