# Legacy API (v1alpha1)

`ModelDeployment` + `ModelCache` + `GPUGroup` (more knobs, more moving parts).

`ai.flexinfer/v1alpha1` is the original multi-resource API. It is still supported, but the project direction is:

- Prefer the v1alpha2 `Model` for new setups.
- Use v1alpha1 only when you need knobs that haven’t been ported yet.

## Resources

- `ModelDeployment`: the primary “served model” object (sketched below)
- `ModelCache`: model artifact caching + pre-warming
- `GPUGroup`: multi-model GPU sharing with preemption, anti-thrashing, and demand signaling
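
To make the primary object concrete, here is a minimal `ModelDeployment` manifest. The `apiVersion` and the kind come from this page; every `spec` field shown (`modelURI`, `backend`, `gpus`) is a hypothetical placeholder, so treat this as a sketch and check the actual CRD schema.

```yaml
# Minimal ModelDeployment sketch. apiVersion and kind are real;
# all spec field names are assumed, for illustration only.
apiVersion: ai.flexinfer/v1alpha1
kind: ModelDeployment
metadata:
  name: chat-model
spec:
  modelURI: hf://example-org/chat-model-7b  # where to fetch weights (assumed field)
  backend: vllm                             # serving backend to run (assumed field)
  gpus: 1                                   # GPUs requested (assumed field)
```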

## When v1alpha1 is still useful

- You want explicit `GPUGroup` swap policies and anti-thrashing controls (see the sketch after this list).
- You’re relying on older example manifests and workflows.
- You want to pre-stage downloads via `ModelCache` and reuse a cache across multiple deployments.
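
As a rough illustration of those knobs, a `GPUGroup` might look like the sketch below. The kind and the concepts (swap policy, anti-thrashing, shared membership) come from this page, but the field names (`swapPolicy`, `antiThrashing.minResidencySeconds`, `members`) are assumptions, not the documented schema.

```yaml
# GPUGroup sketch: kind and concepts are from the docs;
# spec field names and values are illustrative assumptions.
apiVersion: ai.flexinfer/v1alpha1
kind: GPUGroup
metadata:
  name: shared-gpu
spec:
  swapPolicy: lru              # which resident model to evict first (assumed field)
  antiThrashing:
    minResidencySeconds: 120   # keep a swapped-in model resident this long (assumed field)
  members:                     # ModelDeployments sharing this GPU (assumed field)
    - chat-model
    - embed-model
```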

## Quick mental model

In v1alpha1 you typically create:

- (Optional) `ModelCache` to download and store model artifacts
- `ModelDeployment` to run the backend container (the typical pairing is sketched after this list)
- (Optional) `GPUGroup` to coordinate multiple `ModelDeployment`s sharing a GPU
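
Putting the pieces together, the sketch below wires a pre-staged `ModelCache` into a `ModelDeployment`. As before, the kinds and `apiVersion` come from this page, while `sourceURI` and `cacheRef` are hypothetical field names; a second deployment pointing at the same `cacheRef` would be the cache-reuse case mentioned above.

```yaml
# Typical v1alpha1 pairing: a ModelCache pre-stages weights,
# a ModelDeployment consumes them. spec field names are assumed.
apiVersion: ai.flexinfer/v1alpha1
kind: ModelCache
metadata:
  name: chat-model-weights
spec:
  sourceURI: hf://example-org/chat-model-7b  # artifact to download and store (assumed field)
---
apiVersion: ai.flexinfer/v1alpha1
kind: ModelDeployment
metadata:
  name: chat-model
spec:
  cacheRef: chat-model-weights  # reuse the pre-staged artifacts (assumed field)
  backend: vllm                 # serving backend to run (assumed field)
```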

## Examples

- `services/flexinfer/examples/serverless-multi-backend.yaml`
- `services/flexinfer/examples/gpugroup-multi-model.yaml`
- `services/flexinfer/examples/ram-cached-models.yaml`