# Caching
Keep models warm (RAM) or warm-ish (PVC) to reduce cold start time.
FlexInfer supports caching at two layers:
- Artifact caching: model weights are present locally (disk or RAM)
- Runtime warmup: backend is already running and ready to serve
## v1alpha2 cache strategies
`Model.spec.cache.strategy`:

- `Memory`: fastest reloads; uses RAM (best for homelabs)
- `SharedPVC`: persistent storage shared across nodes (slower than RAM, but durable)
- `None`: always download/prepare on demand (slowest)
If spec.cache is omitted, FlexInfer infers a strategy based on GPU sharing.
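As a concrete illustration, a Model keeping weights in RAM might look like this (a minimal sketch; the apiVersion group, metadata, and source value are illustrative assumptions, not confirmed by this page):

```yaml
# Hypothetical v1alpha2 Model using the Memory strategy for fast reloads.
# The apiVersion group and the model/source names are assumptions.
apiVersion: flexinfer.io/v1alpha2
kind: Model
metadata:
  name: llama-3-8b
  namespace: flexinfer-system
spec:
  source: hf://meta-llama/Meta-Llama-3-8B   # HuggingFace source scheme from this page
  cache:
    strategy: Memory                        # fastest reloads; uses RAM
```

Omitting the `cache` block entirely lets the controller infer a strategy, as described above.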
## v1alpha2 SharedPVC provisioning
For backends that mount `/models`:
- If `spec.cache.strategy: SharedPVC` and `spec.cache.pvcName` is omitted, the controller auto-creates a PVC named `<model>-cache`.
- Defaults: `spec.cache.storageClass: longhorn`, `spec.cache.size: 50Gi`
- `status.cache.ready` reflects whether the model artifact has been verified/prefetched onto the mounted volume.
- `status.cache.jobName` / `status.cache.jobPhase` show the Job responsible for prefetch/verification.
If you use `spec.source: pvc://<pvc-name>/<path>`, FlexInfer mounts that PVC directly at `/models` and uses the `/<path>` subdirectory as the model path.
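For example, here is a SharedPVC configuration relying on the auto-created PVC, followed by a variant that mounts an existing PVC directly (a sketch; the apiVersion group and all names are assumptions):

```yaml
# SharedPVC with defaults: pvcName omitted, so the controller auto-creates
# a PVC named mistral-7b-cache (storageClass longhorn, size 50Gi).
apiVersion: flexinfer.io/v1alpha2   # group is an assumption
kind: Model
metadata:
  name: mistral-7b                  # illustrative name
spec:
  source: hf://mistralai/Mistral-7B-v0.1
  cache:
    strategy: SharedPVC
---
# Direct PVC source: FlexInfer mounts the "model-weights" PVC at /models
# and serves from the /mistral-7b subdirectory.
apiVersion: flexinfer.io/v1alpha2
kind: Model
metadata:
  name: mistral-7b-local
spec:
  source: pvc://model-weights/mistral-7b
```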
## HuggingFace prefetch
For `hf://...` sources with `SharedPVC`, the controller runs a one-shot prefetch Job that materializes the repo under:

```
/models/<model-name>/
```
This makes cold starts deterministic and keeps status.cache.ready meaningful.
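To watch a prefetch in flight, the Job recorded in status can be tailed (a sketch; `model` as the kubectl resource name and `llama-3-8b` are assumptions):

```shell
# Look up the prefetch Job from the Model's status.cache.jobName,
# then follow its logs. Resource and model names are illustrative.
JOB=$(kubectl -n flexinfer-system get model llama-3-8b \
  -o jsonpath='{.status.cache.jobName}')
kubectl -n flexinfer-system logs "job/${JOB}" --follow
```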
FlexInfer also sets HuggingFace cache env vars to keep secondary caches under the `/models` volume:

- `HF_HOME=/models/.cache/huggingface`
- `HF_HUB_CACHE=/models/.cache/huggingface/hub`
## v1alpha1 ModelCache
ModelCache lets you pre-download (and optionally pre-warm) a model artifact.
Common pattern:

- Create a `ModelCache` (`SharedPVC` or `Memory`)
- Reference it from `ModelDeployment.spec.modelCacheRef`
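The pattern above might be sketched as follows (hedged: the apiVersion group, the `strategy` field on `ModelCache`, and the shape of `modelCacheRef` beyond its name are assumptions):

```yaml
# v1alpha1 pattern: a ModelCache plus a ModelDeployment referencing it.
# Field names other than modelCacheRef are illustrative assumptions.
apiVersion: flexinfer.io/v1alpha1
kind: ModelCache
metadata:
  name: llama-cache
spec:
  strategy: SharedPVC               # or Memory
---
apiVersion: flexinfer.io/v1alpha1
kind: ModelDeployment
metadata:
  name: llama
spec:
  modelCacheRef:
    name: llama-cache               # points at the ModelCache above
```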
Examples:
- `services/flexinfer/examples/ram-cached-models.yaml`
- `docs/DEVELOPMENT.md` (MLC-LLM caching example)
## Troubleshooting cache issues
- Inspect `ModelCache` status:

  ```shell
  kubectl -n flexinfer-system get modelcache -o wide
  kubectl -n flexinfer-system describe modelcache <name>
  ```

- Check downloader logs (v1alpha1):

  ```shell
  kubectl -n flexinfer-system logs -l job-name=<cache-name>-downloader
  ```