AI Infrastructure · 3 months · Completed

Building a Multi-Model LLM Platform on Consumer GPUs

How I run multiple OpenAI-compatible LLM endpoints on a small K3s cluster with AMD Radeon GPUs, and what I had to do to make it stable.

January 15, 2026 · 4 min read
  • 24GB GPU VRAM (AMD Radeon RX 7900 XTX)
  • 4 models served (behind a single router)
  • OpenAI API interface (apps use standard SDKs)

Tech Stack

  • Infrastructure: Kubernetes
  • ML Serving: vLLM, LiteLLM
  • Hardware: AMD Radeon 7900 XTX
  • Monitoring: Prometheus, Grafana

Overview

I wanted a boring interface and a boring operational loop:

  • An OpenAI-compatible endpoint on my LAN so apps can use the standard SDKs.
  • Multiple models (general chat, code, embeddings) without playing “which node has VRAM free?” every time.
  • A fixed-cost ceiling. I’m fine with real hardware costs; I’m not fine with “surprise, the GPU bill doubled.”

This case study is what I built to get there: a small K3s cluster with dedicated AMD GPU workers running vLLM model servers behind a LiteLLM router.

The Challenge

There are two hard problems hiding inside “run a few models”:

  • VRAM budgeting: you can fit a model, then OOM on the first real workload because KV cache and concurrency weren’t accounted for.
  • Operational stability: drivers, runtime versions, and “it worked yesterday” ROCm weirdness matter more than they should.

I needed a solution that would:

  1. Support multiple concurrent LLM models
  2. Provide low-latency inference for interactive applications
  3. Keep cost predictable (power + amortized hardware instead of metered GPU-hours)
  4. Integrate cleanly with Kubernetes so upgrades, rollbacks, and monitoring are normal work

The Approach

Hardware Selection

I chose an AMD Radeon RX 7900 XTX because it’s the best “I can buy this at a store” GPU I could make work reliably for inference:

  • 24GB VRAM: enough headroom for several 7B-class models with quantization and room left for KV cache.
  • ROCm support: workable for the serving stack I wanted (with version pinning and some scars).
  • Repairability: if something breaks, I can swap a card, reimage a node, and keep moving.

Architecture Design

The platform runs on a K3s cluster with dedicated GPU worker nodes:

Architecture diagram showing a K3s cluster with control plane nodes, a GPU worker running vLLM with multiple quantized models on an AMD Radeon 7900 XTX, and a LiteLLM router providing an OpenAI-compatible endpoint.
Figure 1. Multi-model inference on consumer GPUs: vLLM on dedicated GPU workers, routed behind a LiteLLM gateway.

Model Quantization Strategy

To fit multiple models on a single 24GB GPU, I leaned hard on quantization. The goal isn’t “max benchmark score,” it’s “don’t OOM when two users hit the endpoint at once.”

| Model | Original Size | Quantized Size | Method |
| --- | --- | --- | --- |
| Qwen 2.5 7B | 14GB | 4.5GB | AWQ 4-bit |
| Mistral 7B | 14GB | 4.5GB | AWQ 4-bit |
| CodeLlama 7B | 14GB | 4.5GB | AWQ 4-bit |
| Nomic Embed | 500MB | 500MB | FP16 |

Total VRAM usage: ~14GB, leaving headroom for KV cache and concurrent requests.
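The budget above is worth sanity-checking with a few lines of arithmetic before touching the cluster. A minimal sketch; the weight sizes come from the table, and anything about KV cache sizing is left to per-model measurement:

```python
# Rough VRAM budget for a 24GB card serving the four models above.
# Weight sizes are the post-quantization figures from the table;
# the remaining headroom is what vLLM can spend on KV cache and
# activations (tune via --gpu-memory-utilization per server).

GPU_VRAM_GB = 24.0

model_weights_gb = {
    "qwen2.5-7b-awq": 4.5,
    "mistral-7b-awq": 4.5,
    "codellama-7b-awq": 4.5,
    "nomic-embed-fp16": 0.5,
}

weights_total = sum(model_weights_gb.values())
headroom = GPU_VRAM_GB - weights_total

print(f"weights: {weights_total:.1f} GB, "
      f"headroom for KV cache + overhead: {headroom:.1f} GB")
# → weights: 14.0 GB, headroom for KV cache + overhead: 10.0 GB
```

The point of writing it down is that the headroom, not the weights, is what OOMs first under concurrency.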

Implementation Details

vLLM Configuration

Each model runs in its own vLLM deployment with resource limits. (In real deployments: pin the image tag and treat driver/runtime versions as part of the contract.)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest  # pin a specific tag in production
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct-AWQ
            - --quantization=awq
            - --max-model-len=8192
            - --gpu-memory-utilization=0.35
          resources:
            limits:
              amd.com/gpu: 1
          env:
            - name: VLLM_ATTENTION_BACKEND
              value: ROCM_FLASH
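The router reaches each deployment by service name (e.g. `http://vllm-qwen:8000/v1`), which implies a ClusterIP Service in front of each one. A minimal sketch, assuming the vLLM pods carry an `app: vllm-qwen` label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen    # must match the host used in the router's api_base
  namespace: ai
spec:
  selector:
    app: vllm-qwen   # assumes the Deployment's pods carry this label
  ports:
    - port: 8000
      targetPort: 8000
```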

LiteLLM Router

LiteLLM provides a unified API gateway for all models:

model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/qwen2.5-7b-instruct
      api_base: http://vllm-qwen:8000/v1
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/mistral-7b-instruct
      api_base: http://vllm-mistral:8000/v1
  - model_name: text-embedding-ada-002
    litellm_params:
      model: openai/nomic-embed-text
      api_base: http://embedding:8000/v1

This allows applications to use familiar OpenAI SDK calls while routing to local models.
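To make "familiar OpenAI SDK calls" concrete, here is a stdlib-only sketch of the request shape a client sends to the router. The gateway URL and API key are placeholders; with the official `openai` Python SDK you would instead point `base_url` at the LiteLLM endpoint and use it as normal:

```python
import json
from urllib import request

# LiteLLM exposes the standard /v1/chat/completions shape, so any
# OpenAI-compatible client works. URL and key are placeholders for
# whatever your LAN gateway actually uses.
LITELLM_URL = "http://litellm.local:4000/v1/chat/completions"
API_KEY = "sk-local-placeholder"

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        # Alias from model_list, e.g. "gpt-4" routes to the local Qwen server.
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def send(payload: dict) -> dict:
    """POST the payload to the gateway and decode the JSON response."""
    req = request.Request(
        LITELLM_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("gpt-4", "Summarize this log line.")
```

Because the aliases mimic OpenAI model names, existing code paths need only a base-URL change to switch between the cloud API and the local cluster.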

Monitoring Setup

Prometheus scrapes vLLM metrics for observability:

  • Tokens per second: Track inference throughput
  • Queue depth: Monitor request backlog
  • GPU utilization: Ensure efficient resource usage
  • Latency percentiles: P50, P95, P99 response times
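A minimal static scrape job covering those signals might look like the following. The job name and targets are illustrative, and the exact metric names vary across vLLM versions, so verify them against the server's `/metrics` output:

```yaml
scrape_configs:
  - job_name: vllm              # illustrative name
    scrape_interval: 15s
    static_configs:
      - targets:
          - vllm-qwen:8000      # vLLM exposes Prometheus metrics on /metrics
          - vllm-mistral:8000

# Example queries against these metrics (names vary by vLLM version):
#   queue depth:  vllm:num_requests_waiting
#   p95 latency:  histogram_quantile(0.95,
#                   rate(vllm:e2e_request_latency_seconds_bucket[5m]))
```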

Results

Performance Metrics

After optimization, the platform achieves:

| Metric | Value |
| --- | --- |
| P50 Latency | 85ms |
| P95 Latency | 145ms |
| P99 Latency | 220ms |
| Throughput | 45 tokens/sec |
| Concurrent requests | 8 |
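Reproducing a table like this only requires raw per-request timings; the percentiles fall out of the standard library. A sketch with made-up sample latencies (collect real ones from a load-test run against the gateway):

```python
import statistics

# Made-up per-request latencies in milliseconds, purely illustrative.
latencies_ms = [72, 80, 85, 88, 91, 95, 103, 120, 140, 210]

# statistics.quantiles with n=100 yields the 1st..99th percentile cut
# points; "inclusive" treats the sample as the whole population.
pct = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = pct[49], pct[94], pct[98]

print(f"p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
# → p50=93.0ms p95=178.5ms p99=203.7ms
```

With only a handful of samples the tail percentiles are noisy; in practice, collect at least a few hundred requests per configuration before trusting P99.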

Cost Analysis

I don’t pretend this is “free.” It’s just predictable:

  • Power draw is steady-state and bounded.
  • The hardware is amortized over time instead of metered per hour.
  • I can leave services running without watching a bill.

Reliability

Over 6 months of operation:

  • Uptime: 99.7%
  • Unplanned restarts: 3 (all GPU driver related)
  • Data loss: None

Lessons Learned

What Worked Well

  1. AWQ quantization: Minimal quality loss with 4x memory reduction
  2. vLLM's continuous batching: Excellent throughput for concurrent requests
  3. LiteLLM abstraction: Easy integration with existing OpenAI-based code
  4. Prometheus metrics: Essential for identifying bottlenecks

Challenges Encountered

  1. ROCm driver stability: Required careful version pinning
  2. Memory fragmentation: Needed periodic vLLM restarts initially
  3. Thermal management: Added dedicated GPU cooling solution

Future Improvements

  • Add a second GPU for horizontal scaling
  • Implement speculative decoding for faster inference
  • Explore GGUF models with llama.cpp for CPU fallback

Conclusion

Building a local multi-model platform is viable if you treat it like any other production system:

  • budget VRAM for the real workload (KV cache + concurrency),
  • pin versions (drivers, runtime, containers),
  • and instrument the boring stuff (latency, queue depth, GPU saturation) so you can debug regressions.

The key insight: you don’t need hyperscaler hardware to get useful, reliable endpoints. You need tight constraints and operational discipline.
