Caching

Keep models warm (RAM) or warm-ish (PVC) to reduce cold-start time.

FlexInfer supports caching at two layers:

  • Artifact caching: the model weights are already present locally (on disk or in RAM)
  • Runtime warmup: the backend is already running and ready to serve requests

v1alpha2 cache strategies

Model.spec.cache.strategy:

  • Memory: fastest reloads; uses RAM (best for homelabs)
  • SharedPVC: persistent storage shared across nodes (slower than RAM, but durable)
  • None: always download/prepare on demand (slowest)

If spec.cache is omitted, FlexInfer infers a strategy based on GPU sharing.
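
For example, a Model that pins its weights in RAM could be declared as below. This is a minimal sketch: the apiVersion group (flexinfer.io) and the metadata are assumptions; only spec.cache.strategy and its three values come from this page.

  # Sketch only -- apiVersion group and model name are illustrative.
  apiVersion: flexinfer.io/v1alpha2
  kind: Model
  metadata:
    name: llama-3-8b
  spec:
    cache:
      strategy: Memory   # Memory | SharedPVC | None; omit spec.cache to let FlexInfer choose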

v1alpha1 ModelCache

ModelCache lets you pre-download (and optionally pre-warm) a model artifact.

Common pattern (sketched below):

  • Create a ModelCache (SharedPVC or Memory)
  • Reference it from ModelDeployment.spec.modelCacheRef
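
A minimal sketch of that pattern, assuming the same flexinfer.io API group as above; the exact shape of modelCacheRef is an assumption, while ModelCache, the strategy names, and ModelDeployment.spec.modelCacheRef come from this page.

  # Sketch only -- field shapes beyond those documented above are illustrative.
  apiVersion: flexinfer.io/v1alpha1
  kind: ModelCache
  metadata:
    name: llama3-cache
    namespace: flexinfer-system
  spec:
    strategy: SharedPVC   # or Memory
  ---
  apiVersion: flexinfer.io/v1alpha1
  kind: ModelDeployment
  metadata:
    name: llama3
    namespace: flexinfer-system
  spec:
    modelCacheRef:
      name: llama3-cache   # assumed to be a by-name reference to the ModelCache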

Examples:

  • services/flexinfer/examples/ram-cached-models.yaml
  • docs/DEVELOPMENT.md (MLC-LLM caching example)

Troubleshooting cache issues

  • Inspect ModelCache status:
    kubectl -n flexinfer-system get modelcache -o wide
    kubectl -n flexinfer-system describe modelcache <name>
  • Check downloader logs (v1alpha1):
    kubectl -n flexinfer-system logs -l job-name=<cache-name>-downloader