# FlexInfer Documentation

This directory is the canonical documentation for `services/flexinfer`.

## Getting Started

| Guide | Description |
| --- | --- |
| Installation | Install FlexInfer on your cluster |
| Quickstart | Deploy your first model in 5 minutes |
| Configuration | Environment variables and settings |

## User Guides

| Guide | Description |
| --- | --- |
| Models (v1alpha2) | Creating and managing Model resources |
| Proxy & Requests | Sending inference requests |
| API Compatibility | OpenAI API compatibility |
| Routing | Session affinity, prefix routing, load balancing |
| GPU Sharing | Time-sharing GPUs between models |
| Caching | Model weight caching strategies |
| Quantization Pipelines | GGUF/AWQ/GPTQ ModelCache quantization workflows |
| Operations | Day-2 operations and troubleshooting |

## Developer Guides

| Guide | Description |
| --- | --- |
| Local Development | Setting up a dev environment |
| Architecture | System design and components |
| Backends | Supported inference backends |
| Testing | Running tests |
| Release & Images | Building and releasing |

## Specifications

| Spec | Description |
| --- | --- |
| CRDs | Custom Resource Definitions |
| Proxy API | Proxy HTTP endpoints |
| Scheduler Extender | Kubernetes scheduler integration |
| Labels & Annotations | Resource metadata conventions |
| Metrics | Prometheus metrics reference |

## Planning

| Document | Description |
| --- | --- |
| Feature Inventory | Current feature status |
| Next Roadmap | Upcoming work |
| Multi-Tenancy Design | M1 namespace isolation foundation |
| Phase 1 | Controller & API hardening |
| Phase 2 | Serverless hardening |
| Phase 3 | Routing & performance |
| Phase 4 | Operational polish |
| Phase 5 | Multi-cluster (future) |

## Site Integration

These docs are intentionally written to be "site-syncable": plain Markdown with optional YAML frontmatter, so `services/flexinfer-site` can copy and render them as part of the playground/docs experience.
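As a minimal sketch of what "plain Markdown, optional YAML frontmatter" means in practice, a doc page might look like the following. The field names (`title`, `description`) are illustrative assumptions, not a schema the site is known to enforce:

```markdown
---
title: Quickstart
description: Deploy your first model in 5 minutes
---

# Quickstart

Deploy your first model on an existing cluster.
```

Because the frontmatter is optional, a file with no leading `---` block should render the same way, just without page metadata.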

Navigation is defined in `docs/nav.yaml`.
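The exact schema of `nav.yaml` is set by the site tooling; as a rough sketch, assuming a simple section-and-items layout (section names taken from this index, filenames hypothetical):

```yaml
# Hypothetical shape of docs/nav.yaml -- illustrative only,
# check the actual file for the real schema and paths.
- section: Getting Started
  items:
    - installation.md
    - quickstart.md
    - configuration.md
- section: User Guides
  items:
    - models.md
    - routing.md
```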