FlexInfer docs

Caching

Keep models warm (in RAM) or warm-ish (on a PVC) to reduce cold-start time.


FlexInfer supports caching at two layers:

  • Artifact caching: model weights are present locally (disk or RAM)
  • Runtime warmup: backend is already running and ready to serve

v1alpha2 cache strategies

Model.spec.cache.strategy:

  • Memory: fastest reloads; uses RAM (best for homelabs)
  • SharedPVC: persistent storage shared across nodes (slower than RAM, but durable)
  • None: always download/prepare on demand (slowest)

If spec.cache is omitted, FlexInfer infers a strategy based on GPU sharing.
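A minimal sketch of a Model pinned to the Memory strategy. The `apiVersion` group and the model source are assumptions for illustration (not taken from the FlexInfer API reference); only the `spec.cache.strategy` field and its values come from the list above:

```yaml
# Sketch only: the API group "flexinfer.io" and the source repo are assumed.
apiVersion: flexinfer.io/v1alpha2
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: hf://meta-llama/Meta-Llama-3-8B-Instruct  # assumed source value
  cache:
    strategy: Memory  # fastest reloads; holds weights in RAM
```

Omitting the `cache` block entirely lets the controller infer a strategy from GPU sharing, as noted above.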

v1alpha2 SharedPVC provisioning

For backends that mount /models:

  • If spec.cache.strategy: SharedPVC and spec.cache.pvcName is omitted, the controller auto-creates a PVC named <model>-cache.
  • Defaults:
    • spec.cache.storageClass: longhorn
    • spec.cache.size: 50Gi
  • status.cache.ready reflects whether the model artifact has been verified/prefetched onto the mounted volume.
  • status.cache.jobName / status.cache.jobPhase show the Job responsible for prefetch/verification.
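The auto-provisioning behavior above can be sketched as a `spec.cache` fragment. The surrounding Model fields are elided; the field names and default values match the list above, and the comments note what the controller does when a field is omitted:

```yaml
# Fragment of Model.spec (sketch; defaults per the docs above)
spec:
  cache:
    strategy: SharedPVC
    # pvcName omitted -> controller auto-creates a PVC named <model>-cache
    storageClass: longhorn  # default
    size: 50Gi              # default
```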

If you use spec.source: pvc://<pvc-name>/<path>, FlexInfer mounts that PVC directly at /models and uses the /<path> subdirectory as the model path.
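For the direct-mount case, a hedged sketch (the PVC name `model-store` and subdirectory `llama-3-8b` are placeholders, not real names):

```yaml
# Fragment of Model.spec (sketch)
spec:
  # Mounts the PVC "model-store" at /models and uses
  # /models/llama-3-8b as the model path.
  source: pvc://model-store/llama-3-8b
```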

HuggingFace prefetch

For hf://... sources with SharedPVC, the controller runs a one-shot prefetch Job that materializes the repo under:

  • /models/<model-name>/

This makes cold starts deterministic and keeps status.cache.ready meaningful.

FlexInfer also sets HuggingFace cache env vars to keep secondary caches under the /models volume:

  • HF_HOME=/models/.cache/huggingface
  • HF_HUB_CACHE=/models/.cache/huggingface/hub
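Putting the prefetch behavior together, a sketch of an hf:// Model with SharedPVC caching (the `apiVersion` group and the repo name are assumptions; the on-volume layout in the comments follows the paths documented above):

```yaml
# Sketch only: API group and source repo are assumed.
apiVersion: flexinfer.io/v1alpha2
kind: Model
metadata:
  name: qwen2-7b
spec:
  source: hf://Qwen/Qwen2-7B-Instruct
  cache:
    strategy: SharedPVC
# After the prefetch Job completes, the mounted volume holds:
#   /models/qwen2-7b/                  <- materialized repo (model path)
#   /models/.cache/huggingface/hub/    <- HF_HUB_CACHE (secondary cache)
# and status.cache.ready flips to true once the artifact is verified.
```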

v1alpha1 ModelCache

ModelCache lets you pre-download (and optionally pre-warm) a model artifact.

Common pattern:

  • Create a ModelCache (SharedPVC or Memory)
  • Reference it from ModelDeployment.spec.modelCacheRef

Examples:

  • services/flexinfer/examples/ram-cached-models.yaml
  • docs/DEVELOPMENT.md (MLC-LLM caching example)
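The common pattern above might look like the following sketch. Only `ModelCache`, `ModelDeployment`, and `spec.modelCacheRef` come from this page; the API group, the `strategy` field placement, and all other fields are assumptions (see the referenced example files for real manifests):

```yaml
# Sketch only: API group and most fields are assumed, not from the API reference.
apiVersion: flexinfer.io/v1alpha1
kind: ModelCache
metadata:
  name: mistral-cache
spec:
  strategy: SharedPVC  # or Memory
---
apiVersion: flexinfer.io/v1alpha1
kind: ModelDeployment
metadata:
  name: mistral
spec:
  modelCacheRef:
    name: mistral-cache  # pre-downloaded artifact is reused at deploy time
```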

Troubleshooting cache issues

  • Inspect ModelCache status:
    kubectl -n flexinfer-system get modelcache -o wide
    kubectl -n flexinfer-system describe modelcache <name>
  • Check downloader logs (v1alpha1):
    kubectl -n flexinfer-system logs -l job-name=<cache-name>-downloader