Caching

Keep models warm (RAM) or warm-ish (PVC) to reduce cold-start time.

FlexInfer supports caching at two layers:

  • Artifact caching: the model weights are already present locally (on disk or in RAM)
  • Runtime warmup: the backend is already running and ready to serve requests

v1alpha2 cache strategies

Model.spec.cache.strategy:

  • Memory: fastest reloads; uses RAM (best for homelabs)
  • SharedPVC: persistent storage shared across nodes (slower than RAM, but durable)
  • None: always download/prepare on demand (slowest)

If spec.cache is omitted, FlexInfer infers a strategy based on GPU sharing.
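
For example, a Model that pins its weights in RAM could be declared as below. This is a minimal sketch: the apiVersion group (flexinfer.io) and the metadata are assumptions; only spec.cache.strategy and its three values come from this page.

  # Sketch only -- apiVersion group and model name are illustrative.
  apiVersion: flexinfer.io/v1alpha2
  kind: Model
  metadata:
    name: llama-3-8b
  spec:
    cache:
      strategy: Memory   # Memory | SharedPVC | None; omit spec.cache to let FlexInfer choose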

v1alpha1 ModelCache

ModelCache lets you pre-download (and optionally pre-warm) a model artifact.

Common pattern (sketched below):

  • Create a ModelCache (SharedPVC or Memory)
  • Reference it from ModelDeployment.spec.modelCacheRef
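
A minimal sketch of that pattern, assuming the same flexinfer.io API group as above; the exact shape of modelCacheRef is an assumption, while ModelCache, the strategy names, and ModelDeployment.spec.modelCacheRef come from this page.

  # Sketch only -- field shapes beyond those documented above are illustrative.
  apiVersion: flexinfer.io/v1alpha1
  kind: ModelCache
  metadata:
    name: llama3-cache
    namespace: flexinfer-system
  spec:
    strategy: SharedPVC   # or Memory
  ---
  apiVersion: flexinfer.io/v1alpha1
  kind: ModelDeployment
  metadata:
    name: llama3
    namespace: flexinfer-system
  spec:
    modelCacheRef:
      name: llama3-cache   # assumed to be a by-name reference to the ModelCache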

Examples:

  • services/flexinfer/examples/ram-cached-models.yaml
  • docs/DEVELOPMENT.md (MLC-LLM caching example)

Troubleshooting cache issues

  • Inspect ModelCache status:
    kubectl -n flexinfer-system get modelcache -o wide
    kubectl -n flexinfer-system describe modelcache <name>
  • Check downloader logs (v1alpha1):
    kubectl -n flexinfer-system logs -l job-name=<cache-name>-downloader