FlexInfer docs
Architecture
Components, control loops, and code layout.
FlexInfer is a set of cooperating components:
- flexinfer-agent: node-level hardware discovery + labeling
- flexinfer-manager: Kubernetes controller manager (CRDs → Deployments/Services)
- flexinfer-sched: scheduler extender (/filter, /score) for placement decisions
- flexinfer-bench: benchmark runner for tokens/sec measurement (v1alpha1 workflow)
- flexinfer-proxy: request router + scale-to-zero activator + GPUGroup demand signaling
Primary contracts
- CRDs:
  - v1alpha2: Model
  - v1alpha1: ModelDeployment, ModelCache, GPUGroup
- Scheduler extender HTTP API (kube-scheduler extender v1)
- Proxy HTTP behavior (OpenAI-style model selection)
- Labels / annotations for discovery and routing
See docs/specs/README.md for the contract-level docs.
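To make "OpenAI-style model selection" concrete, here is a minimal sketch of how a proxy could pick a backend from the request body. OpenAI-style requests carry the target model in a top-level "model" field; everything else here is an assumption made for illustration: the function names `modelFromBody` and `backendURL`, the model name `llama3-8b`, and the Service naming scheme and port are hypothetical and are not taken from flexinfer-proxy.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// chatRequest captures only the field needed for routing. OpenAI-style
// request bodies name the target model in a top-level "model" field.
type chatRequest struct {
	Model string `json:"model"`
}

// modelFromBody extracts the requested model name from a JSON request body.
func modelFromBody(body []byte) (string, error) {
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", fmt.Errorf("decode request body: %w", err)
	}
	if req.Model == "" {
		return "", errors.New(`request body has no "model" field`)
	}
	return req.Model, nil
}

// backendURL maps a model name to an in-cluster Service URL.
// ASSUMPTION: a Service named after the model, listening on port 8000.
// This is a hypothetical convention for the sketch only.
func backendURL(model, namespace string) string {
	return fmt.Sprintf("http://%s.%s.svc.cluster.local:8000", model, namespace)
}

func main() {
	body := []byte(`{"model": "llama3-8b", "messages": []}`)
	model, err := modelFromBody(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("route to:", backendURL(model, "default"))
}
```

A real proxy would also need to handle the case where the selected backend is scaled to zero, which this sketch leaves out.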
Code layout
High-signal directories:
- services/flexinfer/api/: Go types for CRDs (source of schema)
- services/flexinfer/controllers/: reconciliation logic
- services/flexinfer/backend/: backend plugin registry (images/args/probes per backend)
- services/flexinfer/scheduler/: scheduler extender logic
- services/flexinfer/agents/: node agent + benchmarker implementations
- services/flexinfer/cmd/: main packages for binaries
Control-plane flow (simplified)
- Agent labels GPU nodes (flexinfer.ai/gpu.*)
- Controller reconciles CRDs into:
  - Deployments (backend pods)
  - Services (stable in-cluster endpoint)
- Scheduler extender biases placement using:
  - benchmark results (tokens/sec)
  - node annotations (util/cost/cache hints, if present)
- Proxy routes requests and triggers scale-up when needed
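The scheduler step in the flow above can be illustrated with a small scoring sketch. It shows one way benchmark throughput and a cache hint might be combined into a per-node score; it is not FlexInfer's actual scoring logic. The types `nodeInfo` and `hostScore`, the 0 to 10 score range, and the fixed cache bonus are all assumptions chosen for the example.

```go
package main

import "fmt"

const (
	maxScore   int64 = 10 // ASSUMPTION: scores are reported on a 0..10 scale
	cacheBonus int64 = 2  // ASSUMPTION: fixed bonus when the model is cached
)

// nodeInfo is a hypothetical, simplified view of a candidate node.
type nodeInfo struct {
	Name         string
	TokensPerSec float64 // benchmark result; 0 if the node was never benchmarked
	CacheHit     bool    // hypothetical hint: model weights already on the node
}

// hostScore pairs a node name with its computed score.
type hostScore struct {
	Host  string
	Score int64
}

// scoreNodes scales each node's throughput relative to the fastest
// candidate into 0..(maxScore-cacheBonus), then adds cacheBonus for
// nodes that report a cache hit.
func scoreNodes(nodes []nodeInfo) []hostScore {
	var best float64
	for _, n := range nodes {
		if n.TokensPerSec > best {
			best = n.TokensPerSec
		}
	}
	out := make([]hostScore, 0, len(nodes))
	for _, n := range nodes {
		var s int64
		if best > 0 {
			s = int64(n.TokensPerSec / best * float64(maxScore-cacheBonus))
		}
		if n.CacheHit {
			s += cacheBonus
		}
		out = append(out, hostScore{Host: n.Name, Score: s})
	}
	return out
}

func main() {
	nodes := []nodeInfo{
		{Name: "gpu-a", TokensPerSec: 100},
		{Name: "gpu-b", TokensPerSec: 50, CacheHit: true},
		{Name: "gpu-c"},
	}
	for _, hs := range scoreNodes(nodes) {
		fmt.Printf("%s: %d\n", hs.Host, hs.Score)
	}
}
```

Scoring relative to the fastest candidate keeps the result independent of absolute tokens/sec values, so nodes without benchmark data simply score lowest instead of being excluded.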