FlexInfer
FlexInfer is the private inference and model-customization layer of the stack. It manages model lifecycle, GPU-aware scheduling, scale-to-zero activation, OpenAI-compatible routing, quantization, adapters, and model artifacts without pushing runtime control out to a shared SaaS layer.
Kubernetes-native control for private inference
FlexInfer brings model lifecycle, routing, scheduling, artifact delivery, customization, and runtime policy into one control plane for private and hybrid AI workloads.
Serverless OpenAI endpoints
Expose model backends behind OpenAI-compatible APIs with cold-start handling and routing controls built in.
Operationally legible rollout
Use Helm, GitOps, Prometheus, and explicit runtime controls instead of hidden side-channel deployment logic.
Mixed-GPU pragmatism
Support AMD, NVIDIA, and CPU-oriented paths with backend-specific images, tuning, and placement constraints.
Advanced runtime surface
Ship quantization, LoRA, OCI catalogs, image generation, and federation on the same control-plane model.
From model intent to private endpoint
The product pitch is not just serving a model. It is the whole path from artifact and adapter decisions to a routable endpoint that other systems can call through a standard API.
Bring the model inside
Treat model intent, source URI, cache policy, backend choice, and runtime constraints as Kubernetes state instead of a hand-run serving script.
Customize the artifact path
Use quantization workflows, OCI model catalogs, cache warmup, flash-loader preload, and LoRA adapter hot-swap to adapt models before traffic arrives.
Expose a normal API surface
Serve chat, completions, embeddings, and image flows through OpenAI-compatible routing so applications and orchestration systems do not need bespoke runtime adapters.
Operate the rollout
Keep scheduling, readiness, scale-to-zero activation, routing strategy, Prometheus metrics, and GitOps delivery visible to the platform team.
What operators need before models serve traffic
Each area maps to a concrete operating concern: preparing artifacts, placing workloads, activating endpoints, routing traffic, and keeping the runtime observable.
Runtime lifecycle
FlexInfer treats model runtime as a first-class Kubernetes workload instead of an ad hoc pod template.
- Single v1alpha2 Model CRD for model lifecycle management
- Backend plugins for vLLM, MLC-LLM, llama.cpp, Ollama, diffusers, ComfyUI, and related runtimes
- Cache-gated rollout so pods do not become ready before artifacts are prepared
- Flash-loader preload path for faster model staging from PVC to tmpfs
Routing and activation
The proxy surface is designed for real application traffic, not just lab demos.
- OpenAI-compatible proxy endpoint for chat, completions, embeddings, and image flows
- Scale-to-zero activation with queueing, cold-start budgets, and bounded retries
- Routing strategies for session affinity, prefix locality, and least-loaded dispatch
- Multipart request handling for image-editing and model-aware request extraction
GPU-aware placement
Placement decisions combine node facts, runtime constraints, and model demand instead of static node pinning alone.
- Node agent labels GPU vendor, architecture, VRAM, and capacity hints
- Scheduler extender scores nodes with benchmark results and runtime telemetry
- Shared GPU groups support priority-based preemption and anti-thrashing controls
- KV-cache and free-VRAM hints help reduce avoidable placement mistakes under load
Model supply chain and advanced features
Model delivery, format choices, and adapter workflows are managed as part of the same runtime system.
- Quantization pipelines and validation for GGUF, AWQ, GPTQ, EXL2, and FP8
- OCI ModelCatalog support for Harbor, GHCR, and ECR-backed model delivery
- LoRAAdapter hot-swap for vLLM-based adapter workflows
- Cluster, FederatedModel, and GlobalProxy resources for multi-cluster execution
Control plane, runtime lifecycle, federation, and delivery loop
The diagrams show the practical boundaries: how requests enter, how models become routable, how clusters participate, and how changes move through GitOps into running infrastructure.
Control plane, activation, and runtime boundary
CRDs and policy live at the control plane, OpenAI-compatible proxying sits at the edge, and GPU-aware model execution stays inside the cluster.
Runtime lifecycle and artifact preparation
How model intent, OCI artifacts, quantization, cache warmup, and readiness gates turn into a routable endpoint.
Federation and global routing
How GlobalProxy, Cluster, and FederatedModel resources coordinate cross-cluster reach without collapsing everything into one giant runtime.
Deployment and operations loop
Git-driven delivery path for GPU workloads and product services running on the platform.
Where FlexInfer fits best
FlexInfer is a strong fit when the runtime boundary matters as much as the model itself: private data, mixed GPU hardware, predictable rollouts, and product teams that want normal ops tooling instead of bespoke scripts.
Private or hybrid AI deployments, internal platforms, and teams running sensitive workloads on Kubernetes.
Helm-friendly, GitOps-compatible, and instrumented enough to support real rollout, rollback, and troubleshooting loops.
Text, embeddings, image-generation, quantized artifacts, and multi-cluster execution share one operational model.
Context and orchestration around the runtime
FlexInfer is the runtime anchor. Loom Core and MentatLab cover the adjacent context-governance and operator-UX layers.
Loom Core
Use Loom Core when agent and tool access need policy-aware context routing around the runtime layer FlexInfer provides.
Loom Core product →MentatLab
Use MentatLab when operators need DAG orchestration UX and run visibility on top of the private platform stack.
MentatLab product →