GPU sharing

Time-share a GPU across multiple models.


FlexInfer supports two approaches to GPU sharing:

  • v1alpha2: Model.spec.gpu.shared (simple, homelab-friendly)
  • v1alpha1: GPUGroup (explicit policies + anti-thrashing)

v1alpha2: spec.gpu.shared

Models that set the same shared value compete for the same GPU; when more than one needs it, the higher priority wins.

apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-8b
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Qwen3-8B-q4f16_1-MLC
  gpu:
    shared: homelab-gpu
    priority: 100
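
For example, a second Model that sets the same shared value contends for the same GPU; with a lower priority, it yields to qwen3-8b whenever both have demand. The name and source below are placeholders:

apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: llama3-8b              # placeholder name
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC   # placeholder source
  gpu:
    shared: homelab-gpu        # same value as qwen3-8b -> same GPU
    priority: 50               # lower than 100, so qwen3-8b wins under contention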

v1alpha1: GPUGroup

GPUGroup enables:

  • one-active-at-a-time swapping
  • anti-thrashing controls
  • proxy-driven demand signaling based on real queued requests
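
As a rough sketch of how these pieces could fit together in a manifest, the spec fields below (swapPolicy, antiThrash, members) are illustrative assumptions, not the actual v1alpha1 schema; see the example manifest linked below for the real shape:

apiVersion: ai.flexinfer/v1alpha1
kind: GPUGroup
metadata:
  name: homelab-gpu
spec:
  # All fields below are assumed for illustration; consult the linked
  # example manifest for the actual v1alpha1 schema.
  swapPolicy: OneActive          # only one member model loaded at a time
  antiThrash:
    minResidencySeconds: 120     # keep the active model loaded at least this long
    cooldownSeconds: 60          # minimum gap between swaps
  members:
    - qwen3-8b
    - llama3-8b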

See:

  • services/flexinfer/examples/gpugroup-multi-model.yaml
  • docs/DEPLOYMENT_RUNBOOK.md (operational notes)

Practical guidance

  • Use shared or a GPUGroup when you have one GPU and multiple “sometimes” models that only need it occasionally.
  • Set priorities to encode which model should win when demand arrives.
  • When possible, combine GPU sharing with caching (Memory or SharedPVC) to reduce swap latency; see the sketch after this list.
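
To illustrate the last point, a Model could pair GPU sharing with a cache so weights stay warm across swaps. The cache block below is a guessed shape: the type values Memory and SharedPVC come from the guidance above, but the field names and the PVC name are assumptions; check your FlexInfer version's Model schema.

apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-8b
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Qwen3-8B-q4f16_1-MLC
  gpu:
    shared: homelab-gpu
    priority: 100
  cache:                   # assumed field shape; check the Model schema
    type: SharedPVC        # or Memory, per the guidance above
    pvcName: model-cache   # assumed PVC name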