Proxy API

HTTP entrypoint for routing, activation, and discovery.

Binary: flexinfer-proxy (services/flexinfer/cmd/flexinfer-proxy)

The proxy is intentionally small: it doesn’t define a new inference protocol. Instead, it:

  • routes requests to backend services
  • uses the "model" selection semantics common to OpenAI-compatible APIs
  • queues requests during cold starts

Machine-readable OpenAPI spec:

  • services/flexinfer/specs/openapi/flexinfer-proxy.openapi.yaml

HTTP endpoints

GET /healthz

Returns 200 OK when the process is up.
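
A minimal liveness-check sketch; the localhost:8080 address is an assumption and depends on your deployment.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// The proxy address is an assumption; adjust for your deployment.
	resp, err := http.Get("http://localhost:8080/healthz")
	if err != nil {
		fmt.Println("proxy unreachable:", err)
		return
	}
	defer resp.Body.Close()
	// 200 OK means the proxy process is up; it says nothing about whether
	// any particular model backend is ready.
	fmt.Println("up:", resp.StatusCode == http.StatusOK)
}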

GET /metrics

Prometheus metrics for proxy queueing and activation behavior.

GET /v1/models

OpenAI-compatible model listing. It aggregates:

  • v1alpha1 ModelDeployment resources
  • v1alpha2 Model resources

Response shape:

{
  "object": "list",
  "data": [
    {
      "id": "llama3-8b",
      "object": "model",
      "created": 0,
      "owned_by": "flexinfer",
      "metadata": {}
    }
  ]
}
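
A small client sketch that lists models and decodes the response shape above; the proxy address localhost:8080 is an assumption.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// modelList mirrors the documented response shape; the "metadata" field is
// omitted because this sketch does not use it.
type modelList struct {
	Object string `json:"object"`
	Data   []struct {
		ID      string `json:"id"`
		Object  string `json:"object"`
		Created int64  `json:"created"`
		OwnedBy string `json:"owned_by"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get("http://localhost:8080/v1/models")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list modelList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}
	for _, m := range list.Data {
		fmt.Printf("%s (owned by %s)\n", m.ID, m.OwnedBy)
	}
}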

/* (reverse proxy)

All other paths are forwarded to the selected model backend (subject to scale-to-zero/activation).

Model selection

The proxy extracts the model name from the following sources (a client-side sketch follows the list):

  1. X-Model-ID header
  2. /model/<name>/... path prefix
  3. OpenAI JSON body { "model": "<name>" }
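
A sketch of the three selection mechanisms from the client's point of view. The proxy address and the /v1/chat/completions backend path are assumptions (any path the backend understands is forwarded as-is), and llama3-8b is the example model from the listing above.

package main

import (
	"bytes"
	"net/http"
)

// base is an assumption: the address of your flexinfer-proxy instance.
const base = "http://localhost:8080"

func main() {
	// Error handling and response bodies are omitted for brevity.
	body := `{"messages":[{"role":"user","content":"hi"}]}`

	// 1. X-Model-ID header
	req, _ := http.NewRequest("POST", base+"/v1/chat/completions", bytes.NewBufferString(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Model-ID", "llama3-8b")
	http.DefaultClient.Do(req)

	// 2. /model/<name>/... path prefix
	http.Post(base+"/model/llama3-8b/v1/chat/completions", "application/json", bytes.NewBufferString(body))

	// 3. "model" field in the OpenAI-style JSON body
	withModel := `{"model":"llama3-8b","messages":[{"role":"user","content":"hi"}]}`
	http.Post(base+"/v1/chat/completions", "application/json", bytes.NewBufferString(withModel))
}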

Scale-to-zero contract

If a model has zero replicas (idle), the proxy may:

  • enqueue the request (bounded by PROXY_MAX_QUEUE_SIZE)
  • activate the model
  • wait for readiness (bounded by PROXY_QUEUE_TIMEOUT / PROXY_COLD_START_TIMEOUT)

Configuration details live in docs/CONFIGURATION.md.
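
On the client side, the main consequence is latency: a request against an idle model can block for the full cold start. A minimal sketch, assuming the proxy at localhost:8080 and a hypothetical 5-minute client timeout sized to cover the cold-start window:

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Allow the request to block while the proxy queues it and activates the
	// model. The 5-minute value is an assumption; align it with the proxy's
	// PROXY_QUEUE_TIMEOUT / PROXY_COLD_START_TIMEOUT settings.
	client := &http.Client{Timeout: 5 * time.Minute}

	body := `{"model":"llama3-8b","messages":[{"role":"user","content":"hi"}]}`
	resp, err := client.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewBufferString(body))
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}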