AI Infrastructure · 3 months · Completed

Building a Multi-Model LLM Platform on Consumer GPUs

How I built a cost-effective multi-model LLM inference platform using AMD Radeon GPUs and Kubernetes, achieving 95% cost savings compared to cloud GPU pricing.

January 15, 2026 · 3 min read

Key results
  • 95% cost savings vs cloud GPU pricing
  • 4 models running concurrently
  • 120ms average latency for token generation

Tech Stack

  • Infrastructure: Kubernetes
  • ML Serving: vLLM, LiteLLM
  • Hardware: AMD Radeon 7900 XTX
  • Monitoring: Prometheus, Grafana

Overview

Running large language models in production typically requires expensive cloud GPU instances. This case study explores how I built a local LLM inference platform using consumer-grade AMD Radeon GPUs, achieving comparable performance at a fraction of the cost.

The Challenge

Cloud GPU pricing for LLM inference is prohibitive for experimentation and personal projects:

  • A100 instances: $2-4/hour on major cloud providers
  • Continuous availability: 24/7 access costs $1,500-3,000/month
  • Multi-model requirements: Running multiple specialized models multiplies costs
  • Data privacy: Sensitive data shouldn't leave local infrastructure

I needed a solution that would:

  1. Support multiple concurrent LLM models
  2. Provide low-latency inference for interactive applications
  3. Cost less than $50/month to operate
  4. Integrate with existing Kubernetes infrastructure

The Approach

Hardware Selection

After researching options, I chose the AMD Radeon 7900 XTX for several reasons:

  • 24GB VRAM: Sufficient for 7B-13B parameter models with quantization
  • ROCm support: AMD's GPU compute stack has mature vLLM support
  • Cost: $900 one-time vs $1,500+/month cloud rental
  • Power efficiency: ~300W TDP, manageable for homelab

Architecture Design

The platform runs on a K3s cluster with dedicated GPU worker nodes:

Architecture diagram showing a K3s cluster with control plane nodes, a GPU worker running vLLM with multiple quantized models on an AMD Radeon 7900 XTX, and a LiteLLM router providing an OpenAI-compatible endpoint.
Figure 1. Multi-model inference on consumer GPUs: vLLM on dedicated GPU workers, routed behind a LiteLLM gateway.

Model Quantization Strategy

To fit multiple models on a single 24GB GPU, I use aggressive quantization:

Model        | Original Size | Quantized Size | Method
Qwen 2.5 7B  | 14GB          | 4.5GB          | AWQ 4-bit
Mistral 7B   | 14GB          | 4.5GB          | AWQ 4-bit
CodeLlama 7B | 14GB          | 4.5GB          | AWQ 4-bit
Nomic Embed  | 500MB         | 500MB          | FP16

Total VRAM usage: ~14GB, leaving headroom for KV cache and concurrent requests.
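
The ~14GB figure is easy to sanity-check against the table above. A quick back-of-the-envelope sketch (quantized weight sizes only; real usage also depends on vLLM's allocator and the --gpu-memory-utilization setting):

# Rough VRAM budget for the 24GB card, using the quantized sizes from the table above.
GPU_VRAM_GB = 24.0

model_weights_gb = {
    "qwen2.5-7b-awq": 4.5,
    "mistral-7b-awq": 4.5,
    "codellama-7b-awq": 4.5,
    "nomic-embed-fp16": 0.5,
}

weights_total = sum(model_weights_gb.values())   # ~14 GB of weights
headroom = GPU_VRAM_GB - weights_total           # ~10 GB left for KV cache and batching

print(f"weights: {weights_total:.1f} GB, headroom: {headroom:.1f} GB")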

Implementation Details

vLLM Configuration

Each model runs in its own vLLM deployment with resource limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct-AWQ
            - --quantization=awq
            - --max-model-len=8192
            # Fraction of the 24GB card this model may claim, so several models can share it
            - --gpu-memory-utilization=0.35
          ports:
            - containerPort: 8000  # vLLM's OpenAI-compatible API
          resources:
            limits:
              amd.com/gpu: 1
          env:
            - name: VLLM_ATTENTION_BACKEND
              value: ROCM_FLASH
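
Once a deployment is running, the model can be smoke-tested directly against vLLM's OpenAI-compatible API. A minimal sketch, assuming the pod is exposed as the vllm-qwen service on port 8000 (or reached locally via kubectl port-forward):

# Smoke test against a single vLLM deployment's OpenAI-compatible endpoint.
# Adjust BASE_URL to your Service name or port-forward address.
import requests

BASE_URL = "http://vllm-qwen:8000/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])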

LiteLLM Router

LiteLLM provides a unified API gateway for all models:

model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/qwen2.5-7b-instruct
      api_base: http://vllm-qwen:8000/v1
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/mistral-7b-instruct
      api_base: http://vllm-mistral:8000/v1
  - model_name: text-embedding-ada-002
    litellm_params:
      model: openai/nomic-embed-text
      api_base: http://embedding:8000/v1

This allows applications to use familiar OpenAI SDK calls while routing to local models.
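
For example, an application written against the OpenAI Python SDK only needs its base URL and model alias pointed at the gateway. A minimal sketch, assuming the LiteLLM proxy is reachable in-cluster at http://litellm:4000 and accepts a locally configured key (hostname, port, and key are placeholders, not values from this setup):

# Point the standard OpenAI SDK at the LiteLLM gateway instead of api.openai.com.
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000/v1",  # LiteLLM proxy endpoint (placeholder hostname/port)
    api_key="sk-local-placeholder",     # whatever key the proxy is configured to accept
)

reply = client.chat.completions.create(
    model="gpt-4",  # alias that LiteLLM routes to the local Qwen 2.5 7B deployment
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(reply.choices[0].message.content)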

Monitoring Setup

Prometheus scrapes vLLM metrics for observability:

  • Tokens per second: Track inference throughput
  • Queue depth: Monitor request backlog
  • GPU utilization: Ensure efficient resource usage
  • Latency percentiles: P50, P95, P99 response times
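
As a concrete example, the latency percentiles in the list above can be pulled from Prometheus with a histogram_quantile query. A sketch assuming Prometheus is reachable at http://prometheus:9090 and that vLLM exposes its end-to-end request latency histogram as vllm:e2e_request_latency_seconds (metric names vary by vLLM version, so confirm against your deployment's /metrics output):

# Query Prometheus's HTTP API for the P95 of vLLM's request latency.
# The metric name is an assumption; check your vLLM /metrics endpoint.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))"
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(f"P95 latency: {float(result['value'][1]) * 1000:.0f} ms")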

Results

Performance Metrics

After optimization, the platform achieves:

Metric              | Value
P50 Latency         | 85ms
P95 Latency         | 145ms
P99 Latency         | 220ms
Throughput          | 45 tokens/sec
Concurrent requests | 8

Cost Analysis

Monthly operating costs:

Item                  | Cloud (A100) | Local
GPU compute           | $2,190       | $0
Electricity           | N/A          | ~$25
Hardware amortization | N/A          | ~$25
Total                 | $2,190       | ~$50

95% cost reduction with equivalent performance for my workloads.
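
The local-side figures are straightforward to reproduce. A quick sketch of the arithmetic, assuming roughly $0.12/kWh for electricity and a 3-year hardware amortization window (both my assumptions, not stated in the table):

# Back-of-the-envelope check on the monthly cost comparison.
# Electricity rate and amortization period are assumptions, not measurements.
GPU_PRICE_USD = 900        # one-time hardware cost
TDP_WATTS = 300            # ~300W TDP from the hardware section
KWH_RATE = 0.12            # assumed electricity price, $/kWh
AMORTIZATION_MONTHS = 36   # assumed 3-year useful life

electricity = TDP_WATTS / 1000 * 24 * 30 * KWH_RATE   # ~$26/month at full load
amortization = GPU_PRICE_USD / AMORTIZATION_MONTHS    # ~$25/month
cloud = 3.0 * 730                                     # ~$3/hr A100 x 730 hrs = ~$2,190/month

print(f"local ~${electricity + amortization:.0f}/mo vs cloud ~${cloud:.0f}/mo")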

Reliability

Over 6 months of operation:

  • Uptime: 99.7%
  • Unplanned restarts: 3 (all GPU driver related)
  • Data loss: None

Lessons Learned

What Worked Well

  1. AWQ quantization: Minimal quality loss with 4x memory reduction
  2. vLLM's continuous batching: Excellent throughput for concurrent requests
  3. LiteLLM abstraction: Easy integration with existing OpenAI-based code
  4. Prometheus metrics: Essential for identifying bottlenecks

Challenges Encountered

  1. ROCm driver stability: Required careful version pinning
  2. Memory fragmentation: Needed periodic vLLM restarts initially
  3. Thermal management: Added dedicated GPU cooling solution

Future Improvements

  • Add a second GPU for horizontal scaling
  • Implement speculative decoding for faster inference
  • Explore GGUF models with llama.cpp for CPU fallback

Conclusion

Building a local LLM platform is viable and cost-effective for developers and small teams. The combination of consumer GPUs, quantization, and modern serving frameworks like vLLM makes self-hosted AI inference practical.

The key insight: You don't need A100s to run useful LLMs. Consumer hardware with the right optimizations can serve production workloads at a fraction of cloud costs.
