# FlexInfer Documentation

This directory is the canonical documentation for `services/flexinfer`.

## Getting Started

| Guide | Description |
| --- | --- |
| Installation | Install FlexInfer on your cluster |
| Quickstart | Deploy your first model in 5 minutes |
| Configuration | Environment variables and settings |

## User Guides

| Guide | Description |
| --- | --- |
| Models (v1alpha2) | Creating and managing Model resources |
| Proxy & Requests | Sending inference requests |
| API Compatibility | OpenAI API compatibility |
| Routing | Session affinity, prefix routing, load balancing |
| GPU Sharing | Time-sharing GPUs between models |
| Caching | Model weight caching strategies |
| Quantization Pipelines | GGUF/AWQ/GPTQ ModelCache quantization workflows |
| Operations | Day-2 operations and troubleshooting |

## Developer Guides

| Guide | Description |
| --- | --- |
| Local Development | Setting up a dev environment |
| Architecture | System design and components |
| Backends | Supported inference backends |
| Testing | Running tests |
| Release & Images | Building and releasing |

## Specifications

| Spec | Description |
| --- | --- |
| CRDs | Custom Resource Definitions |
| Proxy API | Proxy HTTP endpoints |
| Scheduler Extender | Kubernetes scheduler integration |
| Labels & Annotations | Resource metadata conventions |
| Metrics | Prometheus metrics reference |

## Planning

| Document | Description |
| --- | --- |
| Feature Inventory | Current feature status |
| Next Roadmap | Upcoming work |
| Multi-Tenancy Design | M1 namespace isolation foundation |
| Phase 1 | Controller & API hardening |
| Phase 2 | Serverless hardening |
| Phase 3 | Routing & performance |
| Phase 4 | Operational polish |
| Phase 5 | Multi-cluster (future) |

## Site Integration

These docs are intentionally written to be "site-syncable": plain Markdown with optional YAML frontmatter, so `services/flexinfer-site` can copy and render them as part of the playground/docs experience.
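As a minimal sketch of what "plain Markdown, optional YAML frontmatter" means in practice, a doc page might look like the following. The field names (`title`, `description`) are illustrative assumptions, not a schema the site is known to enforce:

```markdown
---
title: Quickstart
description: Deploy your first model in 5 minutes
---

# Quickstart

Deploy your first model on an existing cluster.
```

Because the frontmatter is optional, a file with no leading `---` block should render the same way, just without page metadata.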

Navigation is defined in `docs/nav.yaml`.
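The exact schema of `nav.yaml` is set by the site tooling; as a rough sketch, assuming a simple section-and-items layout (section names taken from this index, filenames hypothetical):

```yaml
# Hypothetical shape of docs/nav.yaml -- illustrative only,
# check the actual file for the real schema and paths.
- section: Getting Started
  items:
    - installation.md
    - quickstart.md
    - configuration.md
- section: User Guides
  items:
    - models.md
    - routing.md
```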