Skip to main content
Product

FlexInfer

FlexInfer is the private inference and model-customization layer of the stack. It manages model lifecycle, GPU-aware scheduling, scale-to-zero activation, OpenAI-compatible routing, quantization, adapters, and model artifacts without pushing runtime control out to a shared SaaS layer.

Why it matters

Kubernetes-native control for private inference

FlexInfer brings model lifecycle, routing, scheduling, artifact delivery, customization, and runtime policy into one control plane for private and hybrid AI workloads.

Serverless OpenAI endpoints

Expose model backends behind OpenAI-compatible APIs with cold-start handling and routing controls built in.

Operationally legible rollout

Use Helm, GitOps, Prometheus, and explicit runtime controls instead of hidden side-channel deployment logic.

Mixed-GPU pragmatism

Support AMD, NVIDIA, and CPU-oriented paths with backend-specific images, tuning, and placement constraints.

Advanced runtime surface

Ship quantization, LoRA, OCI catalogs, image generation, and federation on the same control-plane model.

Customization loop

From model intent to private endpoint

The product pitch is not just serving a model. It is the whole path from artifact and adapter decisions to a routable endpoint that other systems can call through a standard API.

01

Bring the model inside

Treat model intent, source URI, cache policy, backend choice, and runtime constraints as Kubernetes state instead of a hand-run serving script.

02

Customize the artifact path

Use quantization workflows, OCI model catalogs, cache warmup, flash-loader preload, and LoRA adapter hot-swap to adapt models before traffic arrives.

03

Expose a normal API surface

Serve chat, completions, embeddings, and image flows through OpenAI-compatible routing so applications and orchestration systems do not need bespoke runtime adapters.

04

Operate the rollout

Keep scheduling, readiness, scale-to-zero activation, routing strategy, Prometheus metrics, and GitOps delivery visible to the platform team.

Feature map

What operators need before models serve traffic

Each area maps to a concrete operating concern: preparing artifacts, placing workloads, activating endpoints, routing traffic, and keeping the runtime observable.

Shipped surface

Runtime lifecycle

FlexInfer treats model runtime as a first-class Kubernetes workload instead of an ad hoc pod template.

  • Single v1alpha2 Model CRD for model lifecycle management
  • Backend plugins for vLLM, MLC-LLM, llama.cpp, Ollama, diffusers, ComfyUI, and related runtimes
  • Cache-gated rollout so pods do not become ready before artifacts are prepared
  • Flash-loader preload path for faster model staging from PVC to tmpfs
Shipped surface

Routing and activation

The proxy surface is designed for real application traffic, not just lab demos.

  • OpenAI-compatible proxy endpoint for chat, completions, embeddings, and image flows
  • Scale-to-zero activation with queueing, cold-start budgets, and bounded retries
  • Routing strategies for session affinity, prefix locality, and least-loaded dispatch
  • Multipart request handling for image-editing and model-aware request extraction
Shipped surface

GPU-aware placement

Placement decisions combine node facts, runtime constraints, and model demand instead of static node pinning alone.

  • Node agent labels GPU vendor, architecture, VRAM, and capacity hints
  • Scheduler extender scores nodes with benchmark results and runtime telemetry
  • Shared GPU groups support priority-based preemption and anti-thrashing controls
  • KV-cache and free-VRAM hints help reduce avoidable placement mistakes under load
Shipped surface

Model supply chain and advanced features

Model delivery, format choices, and adapter workflows are managed as part of the same runtime system.

  • Quantization pipelines and validation for GGUF, AWQ, GPTQ, EXL2, and FP8
  • OCI ModelCatalog support for Harbor, GHCR, and ECR-backed model delivery
  • LoRAAdapter hot-swap for vLLM-based adapter workflows
  • Cluster, FederatedModel, and GlobalProxy resources for multi-cluster execution
Architecture

Control plane, runtime lifecycle, federation, and delivery loop

The diagrams show the practical boundaries: how requests enter, how models become routable, how clusters participate, and how changes move through GitOps into running infrastructure.

Architecture

Control plane, activation, and runtime boundary

CRDs and policy live at the control plane, OpenAI-compatible proxying sits at the edge, and GPU-aware model execution stays inside the cluster.

FlexInfer architecture diagram showing applications hitting an OpenAI-compatible proxy, control-plane services managing model state and routing policy, and GPU worker nodes serving inference workloads inside the cluster.
Architecture

Runtime lifecycle and artifact preparation

How model intent, OCI artifacts, quantization, cache warmup, and readiness gates turn into a routable endpoint.

FlexInfer runtime lifecycle diagram showing model intent and artifact registries feeding cache preparation, quantization, preload, and ready-state serving.
Architecture

Federation and global routing

How GlobalProxy, Cluster, and FederatedModel resources coordinate cross-cluster reach without collapsing everything into one giant runtime.

FlexInfer federation diagram showing a global proxy routing requests across multiple cluster-local runtime surfaces based on federation resources and health.
Architecture

Deployment and operations loop

Git-driven delivery path for GPU workloads and product services running on the platform.

GitOps flow from GitLab through CI and Flux into a K3s cluster on Harvester with GPU nodes and product workloads.
Fit

Where FlexInfer fits best

FlexInfer is a strong fit when the runtime boundary matters as much as the model itself: private data, mixed GPU hardware, predictable rollouts, and product teams that want normal ops tooling instead of bespoke scripts.

Best for

Private or hybrid AI deployments, internal platforms, and teams running sensitive workloads on Kubernetes.

Operational model

Helm-friendly, GitOps-compatible, and instrumented enough to support real rollout, rollback, and troubleshooting loops.

Runtime breadth

Text, embeddings, image-generation, quantized artifacts, and multi-cluster execution share one operational model.

Related products

Context and orchestration around the runtime

FlexInfer is the runtime anchor. Loom Core and MentatLab cover the adjacent context-governance and operator-UX layers.

Companion product

Loom Core

Use Loom Core when agent and tool access need policy-aware context routing around the runtime layer FlexInfer provides.

Loom Core product
Companion product

MentatLab

Use MentatLab when operators need DAG orchestration UX and run visibility on top of the private platform stack.

MentatLab product