# GPU sharing

Time-share a GPU across multiple models.
FlexInfer supports two GPU-sharing models:
- v1alpha2: `Model.spec.gpu.shared` (simple, homelab-friendly)
- v1alpha1: `GPUGroup` (explicit policies + anti-thrashing)
## v1alpha2: `spec.gpu.shared`

Models with the same `shared` value compete for the same GPU. Higher `priority` wins.
```yaml
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-8b
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Qwen3-8B-q4f16_1-MLC
  gpu:
    shared: homelab-gpu
    priority: 100
```
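To see the competition in action, here is a second model pinned to the same `shared` key. This is an illustrative sketch: the model name and source URL are assumptions, not taken from this page. Because its `priority` is lower, `qwen3-8b` wins the GPU when both models have demand.

```yaml
# Hypothetical second model: name and source are illustrative assumptions.
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: llama3-8b
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
  gpu:
    shared: homelab-gpu   # same key as qwen3-8b, so the two compete
    priority: 50          # lower than qwen3-8b's 100, so qwen3-8b wins
```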
## v1alpha1: `GPUGroup`

`GPUGroup` enables:
- one-active-at-a-time swapping
- anti-thrashing controls
- proxy-driven demand signaling based on real queued requests
See:

- `services/flexinfer/examples/gpugroup-multi-model.yaml`
- `docs/DEPLOYMENT_RUNBOOK.md` (operational notes)
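As a rough orientation, a `GPUGroup` manifest might look like the sketch below. Every field under `spec` is an assumption made for illustration, not the confirmed v1alpha1 schema; consult the example manifest above for the real one.

```yaml
# Sketch only: spec fields are assumed, not the confirmed schema.
apiVersion: ai.flexinfer/v1alpha1
kind: GPUGroup
metadata:
  name: homelab-gpu
spec:
  maxActive: 1              # assumed knob: one model active at a time
  minResidencySeconds: 120  # assumed anti-thrashing control: minimum time
                            # a model stays resident before it can swap out
```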
## Practical guidance
- Use `shared`/`GPUGroup` when you have one GPU and multiple "sometimes" models.
- Set priorities to encode "what should win" when demand arrives.
- When possible, combine GPU sharing with caching (`Memory` or `SharedPVC`) to reduce swap latency; see the sketch after this list.
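A minimal sketch of combining the two, assuming caching is configured via a `spec.cache` stanza (an assumption for illustration; the field name is not confirmed by this page):

```yaml
# Hypothetical: the cache stanza is an assumed field name, used here to
# illustrate Memory/SharedPVC caching alongside GPU sharing.
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-8b
spec:
  backend: mlc-llm
  source: HF://mlc-ai/Qwen3-8B-q4f16_1-MLC
  cache:
    type: SharedPVC   # keep weights staged locally so swap-in is fast
  gpu:
    shared: homelab-gpu
    priority: 100
```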