GPU sharing

Time-share a GPU across multiple models.


FlexInfer supports two approaches to GPU sharing:

  • v1alpha2: Model.spec.gpu.shared (simple, homelab-friendly)
  • v1alpha1: GPUGroup (explicit policies + anti-thrashing)

v1alpha2: spec.gpu.shared

Models that set the same shared value compete for the same GPU; when more than one needs it, the higher priority wins.

apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-8b
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Qwen3-8B-q4f16_1-MLC
  gpu:
    shared: homelab-gpu
    priority: 100
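
For example, a second Model that sets the same shared value contends for the same GPU; with a lower priority, it yields to qwen3-8b whenever both have demand. The name and source below are placeholders:

apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: llama3-8b              # placeholder name
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC   # placeholder source
  gpu:
    shared: homelab-gpu        # same value as qwen3-8b -> same GPU
    priority: 50               # lower than 100, so qwen3-8b wins under contention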

v1alpha1: GPUGroup

GPUGroup enables:

  • one-active-at-a-time swapping
  • anti-thrashing controls
  • proxy-driven demand signaling based on real queued requests
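
As a rough sketch of how these pieces could fit together in a manifest, the spec fields below (swapPolicy, antiThrash, members) are illustrative assumptions, not the actual v1alpha1 schema; see the example manifest linked below for the real shape:

apiVersion: ai.flexinfer/v1alpha1
kind: GPUGroup
metadata:
  name: homelab-gpu
spec:
  # All fields below are assumed for illustration; consult the linked
  # example manifest for the actual v1alpha1 schema.
  swapPolicy: OneActive          # only one member model loaded at a time
  antiThrash:
    minResidencySeconds: 120     # keep the active model loaded at least this long
    cooldownSeconds: 60          # minimum gap between swaps
  members:
    - qwen3-8b
    - llama3-8b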

See:

  • services/flexinfer/examples/gpugroup-multi-model.yaml
  • docs/DEPLOYMENT_RUNBOOK.md (operational notes)

Practical guidance

  • Use shared or a GPUGroup when you have one GPU and multiple “sometimes” models that only need it occasionally.
  • Set priorities to encode which model should win when demand arrives.
  • When possible, combine GPU sharing with caching (Memory or SharedPVC) to reduce swap latency; see the sketch after this list.
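
To illustrate the last point, a Model could pair GPU sharing with a cache so weights stay warm across swaps. The cache block below is a guessed shape: the type values Memory and SharedPVC come from the guidance above, but the field names and the PVC name are assumptions; check your FlexInfer version's Model schema.

apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-8b
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Qwen3-8B-q4f16_1-MLC
  gpu:
    shared: homelab-gpu
    priority: 100
  cache:                   # assumed field shape; check the Model schema
    type: SharedPVC        # or Memory, per the guidance above
    pvcName: model-cache   # assumed PVC name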