# Building a Multi-Model LLM Platform on Consumer GPUs
How I run multiple OpenAI-compatible LLM endpoints on a small K3s cluster with AMD Radeon GPUs, and what I had to do to make it stable.
## Overview
I wanted a boring interface and a boring operational loop:
- An OpenAI-compatible endpoint on my LAN so apps can use the standard SDKs.
- Multiple models (general chat, code, embeddings) without playing “which node has VRAM free?” every time.
- A fixed-cost ceiling. I’m fine with real hardware costs; I’m not fine with “surprise, the GPU bill doubled.”
This case study is what I built to get there: a small K3s cluster with dedicated AMD GPU workers running vLLM model servers behind a LiteLLM router.
## The Challenge
There are two hard problems hiding inside “run a few models”:
- VRAM budgeting: you can fit a model, then OOM on the first real workload because KV cache and concurrency weren’t accounted for.
- Operational stability: drivers, runtime versions, and “it worked yesterday” ROCm weirdness matter more than they should.
I needed a solution that would:
- Support multiple concurrent LLM models
- Provide low-latency inference for interactive applications
- Keep cost predictable (power + amortized hardware instead of metered GPU-hours)
- Integrate cleanly with Kubernetes so upgrades, rollbacks, and monitoring are normal work
## The Approach

### Hardware Selection
I chose an AMD Radeon RX 7900 XTX because it’s the best “I can buy this at a store” GPU I could make work reliably for inference:
- 24GB VRAM: enough headroom for several 7B-class models with quantization and room left for KV cache.
- ROCm support: workable for the serving stack I wanted (with version pinning and some scars).
- Repairability: if something breaks, I can swap a card, reimage a node, and keep moving.
### Architecture Design

The platform runs on a K3s cluster: dedicated AMD GPU worker nodes host the vLLM model servers, and a LiteLLM router in front of them presents a single OpenAI-compatible endpoint to the LAN.
### Model Quantization Strategy
To fit multiple models on a single 24GB GPU, I leaned hard on quantization. The goal isn’t “max benchmark score,” it’s “don’t OOM when two users hit the endpoint at once.”
| Model | Original Size (FP16) | Quantized Size | Method |
|---|---|---|---|
| Qwen 2.5 7B | 14GB | 4.5GB | AWQ 4-bit |
| Mistral 7B | 14GB | 4.5GB | AWQ 4-bit |
| CodeLlama 7B | 14GB | 4.5GB | AWQ 4-bit |
| Nomic Embed | 500MB | 500MB | FP16 |
Total VRAM usage: ~14GB, leaving headroom for KV cache and concurrent requests.
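The headroom claim can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a generic 7B GQA architecture (32 layers, 8 KV heads, head dimension 128, FP16 KV cache) — check the real numbers in each model's `config.json`:

```python
# Back-of-envelope VRAM budget for one AWQ 4-bit 7B model plus KV cache.
# The architecture numbers are assumptions for a generic 7B GQA model
# (32 layers, 8 KV heads, head_dim 128), not any specific checkpoint.

def kv_cache_bytes(seq_len: int, batch: int, layers: int = 32,
                   kv_heads: int = 8, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors, cached per layer per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * batch

def weights_bytes(params: float, bits: int = 4) -> int:
    # AWQ stores weights at ~4 bits/param (ignoring scales/zeros overhead,
    # which is why the table above shows 4.5GB rather than ~3.3GB).
    return int(params * bits / 8)

GIB = 1024 ** 3
weights = weights_bytes(7e9) / GIB     # 4-bit weights for a 7B model
cache = kv_cache_bytes(8192, 8) / GIB  # 8k context, 8 concurrent requests
print(f"weights ~{weights:.1f} GiB, kv cache ~{cache:.1f} GiB")
# → weights ~3.3 GiB, kv cache ~8.0 GiB
```

The KV cache dominating the weights at full context and full concurrency is exactly why the budget is planned around the workload, not the checkpoint size.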
## Implementation Details

### vLLM Configuration
Each model runs in its own vLLM deployment with resource limits. (In real deployments: pin the image tag and treat driver/runtime versions as part of the contract.)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
        - name: vllm
          # Pin a ROCm-compatible tag in practice; :latest is for illustration.
          image: vllm/vllm-openai:latest
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct-AWQ
            - --quantization=awq
            - --max-model-len=8192
            - --gpu-memory-utilization=0.35
          resources:
            limits:
              amd.com/gpu: 1
          env:
            - name: VLLM_ATTENTION_BACKEND
              value: ROCM_FLASH
```
### LiteLLM Router
LiteLLM provides a unified API gateway for all models:
```yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/qwen2.5-7b-instruct
      api_base: http://vllm-qwen:8000/v1
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/mistral-7b-instruct
      api_base: http://vllm-mistral:8000/v1
  - model_name: text-embedding-ada-002
    litellm_params:
      model: openai/nomic-embed-text
      api_base: http://embedding:8000/v1
```
This allows applications to use familiar OpenAI SDK calls while routing to local models.
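For illustration, here is roughly what a client request to the gateway looks like at the HTTP level, using only the standard library. The gateway address and API key are placeholders; in practice an application would simply point the official OpenAI SDK's `base_url` at the router and change nothing else:

```python
# Build a minimal OpenAI-compatible chat request against the LiteLLM gateway.
# GATEWAY and the API key are placeholders for this sketch.
import json
import urllib.request

GATEWAY = "http://litellm.lan:4000/v1"  # assumed gateway address

def chat_request(model: str, prompt: str,
                 api_key: str = "sk-local") -> urllib.request.Request:
    """Build the POST the OpenAI SDK would send to /chat/completions."""
    body = json.dumps({
        "model": model,  # routed by LiteLLM's model_list, e.g. "gpt-4"
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = chat_request("gpt-4", "Say hello")  # lands on the local Qwen 2.5 7B
# Sending it would be: urllib.request.urlopen(req)
```

The point of the indirection: asking for `"gpt-4"` is a routing decision in the gateway config, not a code change in the client.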
### Monitoring Setup
Prometheus scrapes vLLM metrics for observability:
- Tokens per second: Track inference throughput
- Queue depth: Monitor request backlog
- GPU utilization: Ensure efficient resource usage
- Latency percentiles: P50, P95, P99 response times
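vLLM serves Prometheus-format metrics from its HTTP port at `/metrics`, so a plain static scrape config is enough to start. A sketch, with target names assuming the Services that front the deployments above (use a ServiceMonitor instead if you run the Prometheus Operator):

```yaml
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics            # vLLM exposes Prometheus metrics here
    static_configs:
      - targets:
          - vllm-qwen.ai.svc:8000
          - vllm-mistral.ai.svc:8000
```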
## Results

### Performance Metrics
After optimization, the platform achieves:
| Metric | Value |
|---|---|
| P50 Latency | 85ms |
| P95 Latency | 145ms |
| P99 Latency | 220ms |
| Throughput | 45 tokens/sec |
| Concurrent requests | 8 |
### Cost Analysis
I don’t pretend this is “free.” It’s just predictable:
- Power draw is steady-state and bounded.
- The hardware is amortized over time instead of metered per hour.
- I can leave services running without watching a bill.
### Reliability
Over 6 months of operation:
- Uptime: 99.7%
- Unplanned restarts: 3 (all GPU driver related)
- Data loss: None
## Lessons Learned

### What Worked Well
- AWQ quantization: Minimal quality loss with roughly 3x memory reduction in practice (4x on raw weights, less after quantization overhead)
- vLLM's continuous batching: Excellent throughput for concurrent requests
- LiteLLM abstraction: Easy integration with existing OpenAI-based code
- Prometheus metrics: Essential for identifying bottlenecks
### Challenges Encountered
- ROCm driver stability: Required careful version pinning
- Memory fragmentation: Needed periodic vLLM restarts initially
- Thermal management: Added dedicated GPU cooling solution
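While the fragmentation issue lasted, the periodic restarts could be scheduled rather than done by hand. A sketch, assuming a `vllm-restarter` ServiceAccount with permission to patch deployments in the namespace (image tag and schedule are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-nightly-restart
  namespace: ai
spec:
  schedule: "0 4 * * *"                    # quiet hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vllm-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.29  # pin the tag
              args: ["rollout", "restart", "deployment/vllm-qwen", "-n", "ai"]
```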
### Future Improvements
- Add a second GPU for horizontal scaling
- Implement speculative decoding for faster inference
- Explore GGUF models with llama.cpp for CPU fallback
## Conclusion
Building a local multi-model platform is viable if you treat it like any other production system:
- budget VRAM for the real workload (KV cache + concurrency),
- pin versions (drivers, runtime, containers),
- and instrument the boring stuff (latency, queue depth, GPU saturation) so you can debug regressions.
The key insight: you don’t need hyperscaler hardware to get useful, reliable endpoints. You need tight constraints and operational discipline.
Interested in similar solutions?
Let's discuss how I can help with your project.