# Quickstart

Install FlexInfer and deploy your first model. This guide gets FlexInfer installed and serving a model in a Kubernetes cluster.
## Prerequisites

- Kubernetes 1.25+
- At least one GPU-capable node (NVIDIA or AMD) with drivers + device plugin installed
- `kubectl` and `helm` v3
## Install (Helm)

From the repo root:

```bash
helm upgrade --install flexinfer services/flexinfer/charts/flexinfer \
  --namespace flexinfer-system \
  --create-namespace
```
## Verify install

```bash
kubectl -n flexinfer-system get pods
kubectl get crds | rg 'flexinfer' || true
```
You should see (names vary by chart release name):

- Controller: `*-controller`
- Proxy: `*-proxy`
- Scheduler extender: `*-scheduler`
- Node agent: `*-agent` (DaemonSet)
## Deploy a model (recommended: v1alpha2 Model)

Start with the minimal example:

```bash
kubectl apply -n flexinfer-system -f services/flexinfer/examples/v1alpha2/model-basic.yaml
```
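The authoritative manifest is the `model-basic.yaml` file in the repo; as a rough illustration only, a minimal v1alpha2 `Model` might look like the sketch below. The `apiVersion` group and the `serverless` field names here are assumptions (the `flexinfer.ai` prefix and `coldStartTimeout` field do appear elsewhere in this guide), not the authoritative schema:

```yaml
# Illustrative sketch only -- see services/flexinfer/examples/v1alpha2/model-basic.yaml
# for the real schema.
apiVersion: flexinfer.ai/v1alpha2   # assumed group, based on the flexinfer.ai label prefix
kind: Model
metadata:
  name: llama3-8b
spec:
  serverless:
    enabled: true           # scale-to-zero with on-demand activation
    coldStartTimeout: 300s  # field shown in the Troubleshooting section
```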
Watch it progress:

```bash
kubectl -n flexinfer-system get models -w
kubectl -n flexinfer-system describe model llama3-8b
```
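If you prefer to script the wait rather than watch interactively, the loop can be sketched as below. This is a minimal sketch: phase names other than "Pending" are assumptions, and `get_phase` is a hypothetical callable you would back with something like `kubectl get model llama3-8b -o jsonpath='{.status.phase}'`.

```python
import time

def wait_for_phase(get_phase, want="Ready", timeout=300.0, interval=2.0):
    """Poll get_phase() until it returns `want`; raise if the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_phase() == want:
            return want
        time.sleep(interval)
    raise TimeoutError(f"model never reached phase {want!r}")
```

Injecting `get_phase` keeps the polling logic separate from how the status is fetched, so the same loop works against `kubectl`, the Kubernetes API, or a stub in tests.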
## List models (proxy)

Port-forward the proxy Service:

```bash
kubectl -n flexinfer-system port-forward svc/flexinfer-proxy 8080:80
```

Then:

```bash
curl -s http://127.0.0.1:8080/v1/models | jq .
```
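Assuming the proxy mirrors the standard OpenAI list shape (`{"object": "list", "data": [...]}` — an assumption worth checking against your deployment's actual output), extracting the served model IDs client-side is straightforward:

```python
def model_ids(models_response: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_response.get("data", [])]

# Example shape -- the real response comes from the curl call above.
sample = {"object": "list", "data": [{"id": "llama3-8b", "object": "model"}]}
print(model_ids(sample))  # -> ['llama3-8b']
```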
## Send a request (OpenAI-style)

The proxy can infer the target model from the OpenAI-style JSON body:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b",
    "messages": [{ "role": "user", "content": "Say hello in one sentence." }]
  }' | jq .
```
If the model is scaled to zero, the proxy queues the request, triggers activation, and forwards it once the backend is ready.
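That queue-then-forward behavior can be modeled with a toy sketch — an illustration of the semantics described above, not FlexInfer's actual proxy code:

```python
from collections import deque

class ScaleToZeroProxy:
    """Toy model: queue requests while the backend activates, forward once ready."""

    def __init__(self, activate):
        self.activate = activate  # callback that starts the backend
        self.ready = False
        self.queue = deque()

    def handle(self, request):
        if self.ready:
            return f"forwarded:{request}"
        if not self.queue:
            self.activate()       # first queued request triggers activation
        self.queue.append(request)
        return "queued"

    def on_ready(self):
        """Called when the backend comes up; flush everything queued so far."""
        self.ready = True
        return [f"forwarded:{r}" for r in self.queue]
```

Note that activation fires only once, on the first queued request; later arrivals simply join the queue until `on_ready` flushes them.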
## Clean up

```bash
kubectl -n flexinfer-system delete model llama3-8b
```
## Backend-Specific Quickstarts
FlexInfer supports multiple inference backends. Choose one based on your GPU and requirements.
### Ollama (Simplest)

Best for: Quick deployments, auto-downloading models, NVIDIA or AMD GPUs.

```bash
kubectl apply -n flexinfer-system -f examples/quickstart-ollama.yaml
```

Test:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama31-8b-ollama",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```
Characteristics:
- Auto-pulls models from Ollama registry
- Cold start: 5-15 seconds
- Works with NVIDIA and AMD GPUs
### vLLM (High-Throughput)

Best for: Production workloads, NVIDIA GPUs, high concurrency.

```bash
kubectl apply -n flexinfer-system -f examples/quickstart-vllm.yaml
```

Test:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen25-7b-vllm",
    "messages": [{"role": "user", "content": "Explain quantum computing in 2 sentences."}],
    "max_tokens": 100
  }'
```
Characteristics:
- PagedAttention for efficient memory usage
- Cold start: 1-3 minutes (CUDA kernel compilation)
- NVIDIA GPUs only
- Best tokens/second throughput
### MLC-LLM (AMD + NVIDIA)

Best for: AMD ROCm GPUs, optimized performance on both vendors.

```bash
kubectl apply -n flexinfer-system -f examples/quickstart-mlc-llm.yaml
```

Test:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-mlc",
    "messages": [{"role": "user", "content": "Write a haiku about programming."}]
  }'
```
Characteristics:
- Works with both NVIDIA and AMD GPUs
- Pre-quantized models (q4f16, q4f32)
- Cold start: 10-30 seconds
- Best for AMD ROCm environments
## Troubleshooting

### Model stuck in "Pending" phase

Check GPU availability:

```bash
kubectl describe nodes | grep -A5 "nvidia.com/gpu\|amd.com/gpu"
```

Check pod events:

```bash
kubectl describe pods -l flexinfer.ai/model=<model-name> -n flexinfer-system
```
### Model fails to start

Check model logs:

```bash
kubectl logs -l flexinfer.ai/model=<model-name> -n flexinfer-system --tail=100
```

Common issues:

- OOM: Reduce `gpuMemoryUtilization` or use a smaller model
- CUDA errors: Ensure the NVIDIA driver matches the container CUDA version
- ROCm errors: Use `q4f32` quantization instead of `q4f16`
### Requests timeout during cold start

Increase the cold start timeout in your Model spec:

```yaml
spec:
  serverless:
    coldStartTimeout: 300s  # 5 minutes
```
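Clients should also allow at least that long before giving up on the first request after scale-to-zero. Assuming the field accepts Go-style duration strings, as the `300s` example suggests, a small client-side helper (hypothetical, not part of FlexInfer) can size the HTTP timeout from the configured value:

```python
def parse_duration(s: str) -> float:
    """Parse a simple single-unit Go-style duration ('300s', '5m', '1h') into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return float(s[:-1]) * units[s[-1]]

# Give the client more slack than the server's cold start budget.
http_timeout = parse_duration("300s") + 30.0  # -> 330.0 seconds
```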
Or disable serverless for always-on operation:

```yaml
spec:
  serverless:
    enabled: false
```