# Proxy & requests
How to route requests and how scale-to-zero activation works.
The FlexInfer proxy (`flexinfer-proxy`) is the entrypoint for:
- OpenAI-style model selection (`"model": "..."`)
- Scale-to-zero activation with request queueing
- GPUGroup demand signaling (for shared-GPU swaps)
- Model discovery via `GET /v1/models`
## Endpoints

- `GET /healthz` → `200 ok`
- `GET /metrics` → Prometheus metrics
- `GET /v1/models` → OpenAI-compatible model list
- `/*` → reverse proxy to the active model backend
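A quick smoke test of these endpoints (the examples on this page assume the proxy Service is reachable as `http://proxy`):

```bash
# Liveness probe; returns 200 with body "ok"
curl -s http://proxy/healthz

# Prometheus metrics in the standard text exposition format
curl -s http://proxy/metrics | head

# OpenAI-compatible model list
curl -s http://proxy/v1/models | jq .
```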
## Model selection (priority order)
The proxy determines the target model name using:
1. `X-Model-ID` HTTP header
2. URL prefix `/model/<name>/...` (the prefix is stripped before proxying upstream)
3. OpenAI JSON body field `{ "model": "<name>" }` (for `POST` + `application/json`)
## OpenAI-style usage (recommended)
```bash
curl -s http://proxy/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-8b",
    "messages": [{ "role": "user", "content": "Explain KV cache in one paragraph." }]
  }'
```
The proxy forwards the request to the backend Service for that model.
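Because the API is OpenAI-compatible, standard tooling works against the response; for example, extracting just the assistant text with `jq` (assuming the backend returns the usual chat-completions shape):

```bash
curl -s http://proxy/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-8b",
    "messages": [{ "role": "user", "content": "Explain KV cache in one paragraph." }]
  }' | jq -r '.choices[0].message.content'
```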
## Scale-to-zero behavior
When a model is idle (replicas = 0), the proxy:
1. Queues the request (bounded queue)
2. Triggers activation (scale-up)
3. Waits until the backend becomes ready (bounded timeout)
4. Drains queued requests to the backend
Timeouts and queue sizing are configured via environment variables (see `docs/CONFIGURATION.md`).
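You can observe this from the client side: the first request to an idle model pays the activation latency, while a repeat request hits the warm backend. A rough illustration (absolute timings depend on model size and hardware):

```bash
# First call: queued while the backend scales up from zero
time curl -s http://proxy/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{ "model": "qwen3-8b", "messages": [{ "role": "user", "content": "hi" }] }' \
  > /dev/null

# Second call: the backend is warm, so only inference latency remains
time curl -s http://proxy/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{ "model": "qwen3-8b", "messages": [{ "role": "user", "content": "hi" }] }' \
  > /dev/null
```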
## GPUGroup demand signaling (v1alpha1)
For models in a GPUGroup, only one model is active at a time. When a request arrives for an inactive model, the proxy:
1. Queues the request
2. Writes per-model queue-depth annotations to the GPUGroup:
   - `flexinfer.ai/queue.<modelName>: "<depth>"`
   - `flexinfer.ai/queue-since.<modelName>: "<rfc3339>"`
3. Waits for the GPUGroup controller to swap the active model
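To watch the demand signal while a request is queued, dump the GPUGroup and filter for the queue annotations (the group name `shared-pool` is illustrative; substitute your own):

```bash
kubectl -n flexinfer-system get gpugroup shared-pool -o yaml \
  | grep 'flexinfer.ai/queue'
```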
## Troubleshooting

- List models: `curl -s http://proxy/v1/models | jq .`
- Watch proxy logs: `kubectl -n flexinfer-system logs -f deploy/flexinfer-proxy`
- Watch model readiness:
  - v1alpha2: `kubectl -n flexinfer-system get models -w`
  - v1alpha1: `kubectl -n flexinfer-system get modeldeployments -w`