Building a Multi-Model LLM Platform on Consumer GPUs
How I built a cost-effective multi-model LLM inference platform using AMD Radeon GPUs and Kubernetes, achieving 95% cost savings compared to cloud GPU pricing.
Overview
Running large language models in production typically requires expensive cloud GPU instances. This case study explores how I built a local LLM inference platform using consumer-grade AMD Radeon GPUs, achieving comparable performance at a fraction of the cost.
The Challenge
Cloud GPU pricing for LLM inference is prohibitive for experimentation and personal projects:
- A100 instances: $2-4/hour on major cloud providers
- Continuous availability: 24/7 access costs $1,500-3,000/month
- Multi-model requirements: Running multiple specialized models multiplies costs
- Data privacy: Sensitive data shouldn't leave local infrastructure
I needed a solution that would:
- Support multiple concurrent LLM models
- Provide low-latency inference for interactive applications
- Cost less than $50/month to operate
- Integrate with existing Kubernetes infrastructure
The Approach
Hardware Selection
After researching options, I chose the AMD Radeon 7900 XTX for several reasons:
- 24GB VRAM: Sufficient for 7B-13B parameter models with quantization
- ROCm support: AMD's GPU compute stack has mature vLLM support
- Cost: $900 one-time vs $1,500+/month cloud rental
- Power draw: 355W board power, manageable for a homelab
Architecture Design
The platform runs on a K3s cluster with a dedicated GPU worker node. Each model is served by its own vLLM Deployment, LiteLLM fronts them as a single OpenAI-compatible gateway, and Prometheus handles observability.
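The manifests below don't show how pods get pinned to the GPU node, so here is a minimal scheduling sketch. The node name, label, and taint key are placeholders rather than values from this cluster; the pattern is simply to taint the GPU node and have each vLLM Deployment tolerate that taint:

```yaml
# Sketch: dedicate a node to GPU inference (node name, label, and taint key are placeholders).
# One-time node setup:
#   kubectl label node gpu-node-1 gpu=amd
#   kubectl taint node gpu-node-1 gpu=amd:NoSchedule
# Pod template fragment added to each vLLM Deployment so it lands on that node:
spec:
  nodeSelector:
    gpu: amd
  tolerations:
    - key: gpu
      operator: Equal
      value: amd
      effect: NoSchedule
```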
Model Quantization Strategy
To fit multiple models on a single 24GB GPU, I use aggressive quantization:
| Model | Original Size | Quantized Size | Method |
|---|---|---|---|
| Qwen 2.5 7B | 14GB | 4.5GB | AWQ 4-bit |
| Mistral 7B | 14GB | 4.5GB | AWQ 4-bit |
| CodeLlama 7B | 14GB | 4.5GB | AWQ 4-bit |
| Nomic Embed | 500MB | 500MB | FP16 |
Total VRAM usage: ~14GB, leaving headroom for KV cache and concurrent requests.
Implementation Details
vLLM Configuration
Each model runs in its own vLLM deployment with resource limits:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct-AWQ
            - --quantization=awq
            - --max-model-len=8192
            # ~1/3 of the 24GB card, leaving room for the other models and their KV caches
            - --gpu-memory-utilization=0.35
          ports:
            - containerPort: 8000  # OpenAI-compatible API (also serves /metrics)
          resources:
            limits:
              amd.com/gpu: 1
          env:
            - name: VLLM_ATTENTION_BACKEND
              value: ROCM_FLASH
```
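The router config below reaches each model through an in-cluster DNS name, so each Deployment needs a matching Service. A minimal sketch for the Qwen deployment (the name mirrors the `http://vllm-qwen:8000/v1` address used in the router config; the selector matches the pod labels above):

```yaml
# Sketch: ClusterIP Service so LiteLLM can reach the Qwen vLLM pod as
# http://vllm-qwen:8000 from within the "ai" namespace.
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen
  namespace: ai
spec:
  selector:
    app: vllm-qwen   # matches the Deployment's pod labels
  ports:
    - port: 8000
      targetPort: 8000
```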
LiteLLM Router
LiteLLM provides a unified API gateway for all models:
```yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/qwen2.5-7b-instruct
      api_base: http://vllm-qwen:8000/v1
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/mistral-7b-instruct
      api_base: http://vllm-mistral:8000/v1
  - model_name: text-embedding-ada-002
    litellm_params:
      model: openai/nomic-embed-text
      api_base: http://embedding:8000/v1
```
This allows applications to use familiar OpenAI SDK calls while routing to local models.
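For example, an application that already speaks the OpenAI API only needs its base URL (plus a placeholder key) pointed at the gateway. A sketch, assuming LiteLLM is exposed in-cluster as a Service named `litellm` on its default port 4000 (neither is shown in the config above):

```yaml
# Sketch: env fragment for an application Deployment that uses the OpenAI SDK.
# The Service name "litellm" and port 4000 are assumptions.
env:
  - name: OPENAI_BASE_URL          # read by current OpenAI SDKs
    value: http://litellm:4000/v1
  - name: OPENAI_API_KEY
    value: not-needed-locally      # the SDK requires a value; the local gateway doesn't validate it
```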
Monitoring Setup
Prometheus scrapes vLLM metrics for observability:
- Tokens per second: Track inference throughput
- Queue depth: Monitor request backlog
- GPU utilization: Ensure efficient resource usage
- Latency percentiles: P50, P95, P99 response times
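The exact scrape setup depends on how Prometheus is deployed. As one sketch, a static scrape job against the vLLM Services works, since vLLM serves Prometheus metrics on `/metrics` of its API port (the `vllm-codellama` target name is an assumption; the others mirror the Services referenced in the router config):

```yaml
# Sketch: static Prometheus scrape job for the vLLM metrics endpoints.
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    static_configs:
      - targets:
          - vllm-qwen.ai.svc:8000
          - vllm-mistral.ai.svc:8000
          - vllm-codellama.ai.svc:8000   # assumed name for the CodeLlama Service
```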
Results
Performance Metrics
After optimization, the platform achieves:
| Metric | Value |
|---|---|
| P50 Latency | 85ms |
| P95 Latency | 145ms |
| P99 Latency | 220ms |
| Throughput | 45 tokens/sec |
| Concurrent requests | 8 |
Cost Analysis
Monthly operating costs:
| Item | Cloud (A100) | Local |
|---|---|---|
| GPU compute | $2,190 | $0 |
| Electricity | N/A | ~$25 |
| Hardware amortization | N/A | ~$25 |
| Total | $2,190 | ~$50 |
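For reference, the cloud figure and the amortization line are simple arithmetic, assuming a $3/hour A100 rate (the midpoint of the range quoted earlier) and writing the $900 card off over 36 months; both are assumptions of this comparison:

$$
\$3/\text{hr} \times 730\ \text{hr/mo} \approx \$2{,}190/\text{mo},
\qquad
\$900 \div 36\ \text{mo} = \$25/\text{mo}
$$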
That's a cost reduction of over 95%, with performance that's equivalent for my workloads.
Reliability
Over 6 months of operation:
- Uptime: 99.7%
- Unplanned restarts: 3 (all GPU driver related)
- Data loss: None
Lessons Learned
What Worked Well
- AWQ quantization: Minimal quality loss with 4x memory reduction
- vLLM's continuous batching: Excellent throughput for concurrent requests
- LiteLLM abstraction: Easy integration with existing OpenAI-based code
- Prometheus metrics: Essential for identifying bottlenecks
Challenges Encountered
- ROCm driver stability: Required careful version pinning
- Memory fragmentation: Needed periodic vLLM restarts initially
- Thermal management: Added dedicated GPU cooling solution
Future Improvements
- Add a second GPU for horizontal scaling
- Implement speculative decoding for faster inference
- Explore GGUF models with llama.cpp for CPU fallback
Conclusion
Building a local LLM platform is viable and cost-effective for developers and small teams. The combination of consumer GPUs, quantization, and modern serving frameworks like vLLM makes self-hosted AI inference practical.
The key insight: You don't need A100s to run useful LLMs. Consumer hardware with the right optimizations can serve production workloads at a fraction of cloud costs.
Interested in similar solutions?
Let's discuss how I can help with your project.