# Caching
Keep models warm (RAM) or warm-ish (PVC) to reduce cold start time.
FlexInfer supports caching at two layers:
- Artifact caching: model weights are present locally (disk or RAM)
- Runtime warmup: backend is already running and ready to serve
## v1alpha2 cache strategies
`Model.spec.cache.strategy`:

- `Memory`: fastest reloads; uses RAM (best for homelabs)
- `SharedPVC`: persistent storage shared across nodes (slower than RAM, but durable)
- `None`: always download/prepare on demand (slowest)
If `spec.cache` is omitted, FlexInfer infers a strategy based on GPU sharing.
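
A minimal sketch of a `Model` with an explicit strategy. Only `spec.cache.strategy` and its three values come from this page; the API group/version string, `spec.source`, and the names are assumptions for illustration, so check the examples listed further down for the real schema:

```yaml
# Hypothetical Model manifest; only spec.cache.strategy is documented above.
apiVersion: flexinfer.io/v1alpha2   # assumed group/version string
kind: Model
metadata:
  name: llama-3-8b                  # assumed name
spec:
  source: hf://meta-llama/Meta-Llama-3-8B-Instruct  # assumed source field
  cache:
    strategy: Memory  # Memory | SharedPVC | None; omit spec.cache to let FlexInfer infer
```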
## v1alpha1 ModelCache
`ModelCache` lets you pre-download (and optionally pre-warm) a model artifact.
Common pattern (sketched after this list):

- Create a `ModelCache` (`SharedPVC` or `Memory`)
- Reference it from `ModelDeployment.spec.modelCacheRef`
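
A minimal sketch of that pattern. Only `ModelCache`, `ModelDeployment`, and `spec.modelCacheRef` come from this page; the group/version string, `spec.source`, and the storage field name are assumptions, so prefer the examples listed below:

```yaml
# Hypothetical v1alpha1 pair; fields not documented above are assumed.
apiVersion: flexinfer.io/v1alpha1     # assumed group/version string
kind: ModelCache
metadata:
  name: llama-3-8b-cache
  namespace: flexinfer-system
spec:
  source: hf://meta-llama/Meta-Llama-3-8B-Instruct  # assumed source field
  storage: SharedPVC                  # assumed field name; SharedPVC or Memory
---
apiVersion: flexinfer.io/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-3-8b
  namespace: flexinfer-system
spec:
  modelCacheRef:
    name: llama-3-8b-cache            # points at the ModelCache above
```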
Examples:

- `services/flexinfer/examples/ram-cached-models.yaml`
- `docs/DEVELOPMENT.md` (MLC-LLM caching example)
## Troubleshooting cache issues
- Inspect `ModelCache` status:

  ```bash
  kubectl -n flexinfer-system get modelcache -o wide
  kubectl -n flexinfer-system describe modelcache <name>
  ```

- Check downloader logs (v1alpha1):

  ```bash
  kubectl -n flexinfer-system logs -l job-name=<cache-name>-downloader
  ```