Product

FlexInfer

FlexInfer is the private inference and model-customization layer of the stack. It manages model lifecycle, GPU-aware scheduling, scale-to-zero activation, OpenAI-compatible routing, quantization, adapters, and model artifacts without pushing runtime control out to a shared SaaS layer.

Read FlexInfer docs Open playground Source code

Why it matters

Kubernetes-native control for private inference

FlexInfer brings model lifecycle, routing, scheduling, artifact delivery, customization, and runtime policy into one control plane for private and hybrid AI workloads.

Serverless OpenAI endpoints

Expose model backends behind OpenAI-compatible APIs with cold-start handling and routing controls built in.

Operationally legible rollout

Use Helm, GitOps, Prometheus, and explicit runtime controls instead of hidden side-channel deployment logic.

Mixed-GPU pragmatism

Support AMD, NVIDIA, and CPU-oriented paths with backend-specific images, tuning, and placement constraints.

Advanced runtime surface

Ship quantization, LoRA, OCI catalogs, image generation, federation, and declarative model experiments on the same control-plane model.

Customization loop

From model intent to private endpoint

The product pitch is not just serving a model. It is the whole path from artifact and adapter decisions to a routable endpoint that other systems can call through a standard API.

Bring the model inside

Treat model intent, source URI, cache policy, backend choice, and runtime constraints as Kubernetes state instead of a hand-run serving script.

Customize the artifact path

Use quantization workflows, OCI model catalogs, cache warmup, flash-loader preload, and LoRA adapter hot-swap to adapt models before traffic arrives.

Expose a normal API surface

Serve chat, completions, embeddings, and image flows through OpenAI-compatible routing so applications and orchestration systems do not need bespoke runtime adapters.

Operate the rollout

Keep scheduling, readiness, scale-to-zero activation, routing strategy, Prometheus metrics, and GitOps delivery visible to the platform team.

Feature map

What operators need before models serve traffic

Each area maps to a concrete operating concern: preparing artifacts, placing workloads, activating endpoints, routing traffic, and keeping the runtime observable.

Shipped surface

Runtime lifecycle

FlexInfer treats model runtime as a first-class Kubernetes workload instead of an ad hoc pod template.

Single v1alpha2 Model CRD for model lifecycle management
Backend plugins for vLLM, MLC-LLM, llama.cpp, Ollama, diffusers, ComfyUI, and related runtimes
Cache-gated rollout so pods do not become ready before artifacts are prepared
Flash-loader preload path for faster model staging from PVC to tmpfs

Shipped surface

Routing and activation

The proxy surface is designed for real application traffic, not just lab demos.

OpenAI-compatible proxy endpoint for chat, completions, embeddings, and image flows
Scale-to-zero activation with queueing, cold-start budgets, and bounded retries
Routing strategies for session affinity, prefix locality, and least-loaded dispatch
Multipart request handling for image-editing and model-aware request extraction

Shipped surface

GPU-aware placement

Placement decisions combine node facts, runtime constraints, and model demand instead of static node pinning alone.

Node agent labels GPU vendor, architecture, VRAM, and capacity hints
Scheduler extender scores nodes with benchmark results and runtime telemetry
Shared GPU groups support priority-based preemption and anti-thrashing controls
KV-cache and free-VRAM hints help reduce avoidable placement mistakes under load

Shipped surface

Model supply chain and advanced features

Model delivery, format choices, and adapter workflows are managed as part of the same runtime system.

Quantization pipelines and validation for GGUF, AWQ, GPTQ, EXL2, and FP8
OCI ModelCatalog support for Harbor, GHCR, and ECR-backed model delivery
LoRAAdapter hot-swap for vLLM-based adapter workflows
Cluster, FederatedModel, and GlobalProxy resources for multi-cluster execution

Shipped surface

Evaluation and experiments

Model changes are certified through declarative, bounded lifecycles instead of ad hoc canary scripts that mutate production.

ModelExperiment CRD runs an isolated candidate model through the gauntlet, records a typed pass/fail verdict, then releases the hardware
ModelBackfill schedules bounded evaluation jobs against warm models only during foreground-idle windows, yielding the moment demand returns
Benchmark gauntlet runs weekly and after every model publish, with chat-aware coherence probes and long-context needle benches
Autotune covers n-gram speculative decoding behind an over-optimization guard that vetoes throughput gains which regress any workload class

Shipped surface

Gaming mode and node repurposing

GPU nodes are declaratively repurposable: the same hardware can flip from inference fleet to cloud-gaming host and back.

GamingSession CRD drains inference from a node and starts a headless Sunshine host that Moonlight clients pair against
Sunshine, headless sway, Mesa RADV Vulkan, and VA-API hardware encode validated on RDNA3 hardware at native ultrawide resolution
Bounded sessions with expiry, opt-in idle auto-revert, crash supervision, and node-mode metrics
Deleting the session returns the node to the inference fleet with no manual cleanup

Architecture

Control plane, runtime lifecycle, federation, and delivery loop

The diagrams show the practical boundaries: how requests enter, how models become routable, how clusters participate, and how changes move through GitOps into running infrastructure.

Architecture

Control plane, activation, and runtime boundary

CRDs and policy live at the control plane, OpenAI-compatible proxying sits at the edge, and GPU-aware model execution stays inside the cluster.

Open full-size Download SVG

Architecture

Runtime lifecycle and artifact preparation

How model intent, OCI artifacts, quantization, cache warmup, and readiness gates turn into a routable endpoint.

FlexInfer runtime lifecycle diagram showing model intent and artifact registries feeding cache preparation, quantization, preload, and ready-state serving.

Open full-size Download SVG

Architecture

Federation and global routing

How GlobalProxy, Cluster, and FederatedModel resources coordinate cross-cluster reach without collapsing everything into one giant runtime.

FlexInfer federation diagram showing a global proxy routing requests across multiple cluster-local runtime surfaces based on federation resources and health.

Open full-size Download SVG

Architecture

Deployment and operations loop

Git-driven delivery path for GPU workloads and product services running on the platform.

GitOps flow from GitLab through CI and Flux into a K3s cluster on Harvester with GPU nodes and product workloads.

Open full-size Download SVG

Fit

Where FlexInfer fits best

FlexInfer is a strong fit when the runtime boundary matters as much as the model itself: private data, mixed GPU hardware, predictable rollouts, and product teams that want normal ops tooling instead of bespoke scripts.

Best for

Private or hybrid AI deployments, internal platforms, and teams running sensitive workloads on Kubernetes.

Operational model

Helm-friendly, GitOps-compatible, and instrumented enough to support real rollout, rollback, and troubleshooting loops.

Runtime breadth

Text, embeddings, image-generation, quantized artifacts, and multi-cluster execution share one operational model.

Context and orchestration around the runtime

FlexInfer is the runtime anchor. Loom Core and MentatLab cover the adjacent context-governance and operator-UX layers.

Companion product

Loom Core

Use Loom Core when agent and tool access need policy-aware context routing around the runtime layer FlexInfer provides.

Loom Core product →

Companion product

MentatLab

Use MentatLab when operators need DAG orchestration UX and run visibility on top of the private platform stack.

MentatLab product →