AI Infrastructure · 3 months · Completed

Building a Multi-Model LLM Platform on Consumer GPUs

How I run multiple OpenAI-compatible LLM endpoints on a small K3s cluster with AMD Radeon GPUs, and what I had to do to make it stable.

January 15, 2026 · 4 min read
  • 24GB GPU VRAM (AMD Radeon RX 7900 XTX)
  • 4 models served (behind a single router)
  • OpenAI API interface (apps use standard SDKs)

Tech Stack

  • Infrastructure: Kubernetes
  • ML Serving: vLLM, LiteLLM
  • Hardware: AMD Radeon 7900 XTX
  • Monitoring: Prometheus, Grafana

Overview

I wanted a boring interface and a boring operational loop:

  • An OpenAI-compatible endpoint on my LAN so apps can use the standard SDKs.
  • Multiple models (general chat, code, embeddings) without playing “which node has VRAM free?” every time.
  • A fixed-cost ceiling. I’m fine with real hardware costs; I’m not fine with “surprise, the GPU bill doubled.”

This case study is what I built to get there: a small K3s cluster with dedicated AMD GPU workers running vLLM model servers behind a LiteLLM router.

The Challenge

There are two hard problems hiding inside “run a few models”:

  • VRAM budgeting: you can fit a model, then OOM on the first real workload because KV cache and concurrency weren’t accounted for.
  • Operational stability: drivers, runtime versions, and “it worked yesterday” ROCm weirdness matter more than they should.

I needed a solution that would:

  1. Support multiple concurrent LLM models
  2. Provide low-latency inference for interactive applications
  3. Keep cost predictable (power + amortized hardware instead of metered GPU-hours)
  4. Integrate cleanly with Kubernetes so upgrades, rollbacks, and monitoring are normal work

The Approach

Hardware Selection

I chose an AMD Radeon RX 7900 XTX because it’s the best “I can buy this at a store” GPU I could make work reliably for inference:

  • 24GB VRAM: enough headroom for several 7B-class models with quantization and room left for KV cache.
  • ROCm support: workable for the serving stack I wanted (with version pinning and some scars).
  • Repairability: if something breaks, I can swap a card, reimage a node, and keep moving.

Architecture Design

The platform runs on a K3s cluster with dedicated GPU worker nodes:

Architecture diagram showing a K3s cluster with control plane nodes, a GPU worker running vLLM with multiple quantized models on an AMD Radeon 7900 XTX, and a LiteLLM router providing an OpenAI-compatible endpoint.
Figure 1. Multi-model inference on consumer GPUs: vLLM on dedicated GPU workers, routed behind a LiteLLM gateway.

Model Quantization Strategy

To fit multiple models on a single 24GB GPU, I leaned hard on quantization. The goal isn’t “max benchmark score,” it’s “don’t OOM when two users hit the endpoint at once.”

| Model | Original Size | Quantized Size | Method |
| --- | --- | --- | --- |
| Qwen 2.5 7B | 14GB | 4.5GB | AWQ 4-bit |
| Mistral 7B | 14GB | 4.5GB | AWQ 4-bit |
| CodeLlama 7B | 14GB | 4.5GB | AWQ 4-bit |
| Nomic Embed | 500MB | 500MB | FP16 |

Total VRAM usage: ~14GB, leaving headroom for KV cache and concurrent requests.
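The budget above is worth sanity-checking with a few lines of arithmetic before touching the cluster. A minimal sketch; the weight sizes come from the table, and anything about KV cache sizing is left to per-model measurement:

```python
# Rough VRAM budget for a 24GB card serving the four models above.
# Weight sizes are the post-quantization figures from the table;
# the remaining headroom is what vLLM can spend on KV cache and
# activations (tune via --gpu-memory-utilization per server).

GPU_VRAM_GB = 24.0

model_weights_gb = {
    "qwen2.5-7b-awq": 4.5,
    "mistral-7b-awq": 4.5,
    "codellama-7b-awq": 4.5,
    "nomic-embed-fp16": 0.5,
}

weights_total = sum(model_weights_gb.values())
headroom = GPU_VRAM_GB - weights_total

print(f"weights: {weights_total:.1f} GB, "
      f"headroom for KV cache + overhead: {headroom:.1f} GB")
# → weights: 14.0 GB, headroom for KV cache + overhead: 10.0 GB
```

The point of writing it down is that the headroom, not the weights, is what OOMs first under concurrency.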

Implementation Details

vLLM Configuration

Each model runs in its own vLLM deployment with resource limits. (In real deployments: pin the image tag and treat driver/runtime versions as part of the contract.)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest  # pin a specific tag in production
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct-AWQ
            - --quantization=awq
            - --max-model-len=8192
            - --gpu-memory-utilization=0.35
          resources:
            limits:
              amd.com/gpu: 1
          env:
            - name: VLLM_ATTENTION_BACKEND
              value: ROCM_FLASH
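The router reaches each deployment by service name (e.g. `http://vllm-qwen:8000/v1`), which implies a ClusterIP Service in front of each one. A minimal sketch, assuming the vLLM pods carry an `app: vllm-qwen` label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen    # must match the host used in the router's api_base
  namespace: ai
spec:
  selector:
    app: vllm-qwen   # assumes the Deployment's pods carry this label
  ports:
    - port: 8000
      targetPort: 8000
```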

LiteLLM Router

LiteLLM provides a unified API gateway for all models:

model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/qwen2.5-7b-instruct
      api_base: http://vllm-qwen:8000/v1
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/mistral-7b-instruct
      api_base: http://vllm-mistral:8000/v1
  - model_name: text-embedding-ada-002
    litellm_params:
      model: openai/nomic-embed-text
      api_base: http://embedding:8000/v1

This allows applications to use familiar OpenAI SDK calls while routing to local models.
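To make "familiar OpenAI SDK calls" concrete, here is a stdlib-only sketch of the request shape a client sends to the router. The gateway URL and API key are placeholders; with the official `openai` Python SDK you would instead point `base_url` at the LiteLLM endpoint and use it as normal:

```python
import json
from urllib import request

# LiteLLM exposes the standard /v1/chat/completions shape, so any
# OpenAI-compatible client works. URL and key are placeholders for
# whatever your LAN gateway actually uses.
LITELLM_URL = "http://litellm.local:4000/v1/chat/completions"
API_KEY = "sk-local-placeholder"

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        # Alias from model_list, e.g. "gpt-4" routes to the local Qwen server.
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def send(payload: dict) -> dict:
    """POST the payload to the gateway and decode the JSON response."""
    req = request.Request(
        LITELLM_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("gpt-4", "Summarize this log line.")
```

Because the aliases mimic OpenAI model names, existing code paths need only a base-URL change to switch between the cloud API and the local cluster.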

Monitoring Setup

Prometheus scrapes vLLM metrics for observability:

  • Tokens per second: Track inference throughput
  • Queue depth: Monitor request backlog
  • GPU utilization: Ensure efficient resource usage
  • Latency percentiles: P50, P95, P99 response times
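A minimal static scrape job covering those signals might look like the following. The job name and targets are illustrative, and the exact metric names vary across vLLM versions, so verify them against the server's `/metrics` output:

```yaml
scrape_configs:
  - job_name: vllm              # illustrative name
    scrape_interval: 15s
    static_configs:
      - targets:
          - vllm-qwen:8000      # vLLM exposes Prometheus metrics on /metrics
          - vllm-mistral:8000

# Example queries against these metrics (names vary by vLLM version):
#   queue depth:  vllm:num_requests_waiting
#   p95 latency:  histogram_quantile(0.95,
#                   rate(vllm:e2e_request_latency_seconds_bucket[5m]))
```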

Results

Performance Metrics

After optimization, the platform achieves:

| Metric | Value |
| --- | --- |
| P50 Latency | 85ms |
| P95 Latency | 145ms |
| P99 Latency | 220ms |
| Throughput | 45 tokens/sec |
| Concurrent requests | 8 |
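Reproducing a table like this only requires raw per-request timings; the percentiles fall out of the standard library. A sketch with made-up sample latencies (collect real ones from a load-test run against the gateway):

```python
import statistics

# Made-up per-request latencies in milliseconds, purely illustrative.
latencies_ms = [72, 80, 85, 88, 91, 95, 103, 120, 140, 210]

# statistics.quantiles with n=100 yields the 1st..99th percentile cut
# points; "inclusive" treats the sample as the whole population.
pct = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = pct[49], pct[94], pct[98]

print(f"p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
# → p50=93.0ms p95=178.5ms p99=203.7ms
```

With only a handful of samples the tail percentiles are noisy; in practice, collect at least a few hundred requests per configuration before trusting P99.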

Cost Analysis

I don’t pretend this is “free.” It’s just predictable:

  • Power draw is steady-state and bounded.
  • The hardware is amortized over time instead of metered per hour.
  • I can leave services running without watching a bill.

Reliability

Over 6 months of operation:

  • Uptime: 99.7%
  • Unplanned restarts: 3 (all GPU driver related)
  • Data loss: None

Lessons Learned

What Worked Well

  1. AWQ quantization: Minimal quality loss with 4x memory reduction
  2. vLLM's continuous batching: Excellent throughput for concurrent requests
  3. LiteLLM abstraction: Easy integration with existing OpenAI-based code
  4. Prometheus metrics: Essential for identifying bottlenecks

Challenges Encountered

  1. ROCm driver stability: Required careful version pinning
  2. Memory fragmentation: Needed periodic vLLM restarts initially
  3. Thermal management: Added dedicated GPU cooling solution

Future Improvements

  • Add a second GPU for horizontal scaling
  • Implement speculative decoding for faster inference
  • Explore GGUF models with llama.cpp for CPU fallback

Conclusion

Building a local multi-model platform is viable if you treat it like any other production system:

  • budget VRAM for the real workload (KV cache + concurrency),
  • pin versions (drivers, runtime, containers),
  • and instrument the boring stuff (latency, queue depth, GPU saturation) so you can debug regressions.

The key insight: you don’t need hyperscaler hardware to get useful, reliable endpoints. You need tight constraints and operational discipline.
