Running LLMs on Radeon GPUs with ROCm
November 20, 2025
I run LLM inference in my homelab because I want:
- predictable cost for always-on endpoints,
- low latency on my LAN,
- and the ability to test “production-ish” operational work (deploys, rollbacks, telemetry) without renting GPUs.
My current default GPU is an AMD Radeon RX 7900 XTX. ROCm has improved a lot, but the important part isn’t “does it run?” It’s “does it stay running after upgrades, restarts, and real traffic?”
TL;DR
- Treat ROCm like part of the platform: pin versions and don’t casually drift.
- Validate the stack bottom-up: device nodes -> host tooling -> container runtime -> model server.
- In a homelab, llama.cpp (GGUF) is often the least painful path to stable, memory-efficient serving.
What I'm Running
I split GPU workloads by node to keep contention predictable:
- Text inference: `llama.cpp` on `cblevins-7900xtx`
- Image/video generation: `comfyui` on `cblevins-5930k`
I’ve also run vLLM and MLC-LLM on this hardware. They can be great, but llama.cpp won for day-to-day “leave it running” service because GGUF quantization is operationally forgiving.
Bottom-Up Checklist (Host -> Container -> Model)
When ROCm is “broken,” it’s usually one of these:
- The host doesn’t have the right device nodes or kernel modules.
- The container can’t see the device nodes (permissions / runtime mismatch).
- The model server runs, but VRAM behavior under load causes OOMs, fragmentation, or tail latency.
This is the checklist I run before I blame the model.
1) Host sanity: device nodes and basic tooling
On AMD ROCm nodes, I want to see:
- `/dev/kfd` (ROCm kernel driver interface)
- `/dev/dri/*` (DRM devices)
And I want at least one host-level tool to confirm the GPU is visible:
```shell
ls -la /dev/kfd /dev/dri
rocminfo | head -n 50
rocm-smi || true
```
If rocminfo fails on the host, nothing above it is going to be stable.
2) Container sanity: can a pod see the GPU?
Before running a model server, I run a tiny “does the container see the device?” check. The goal is to catch runtime issues (missing device mounts, wrong security context) early.
If you’re on Kubernetes, make sure the pod has access to /dev/kfd + /dev/dri and that your device plugin/runtime setup is consistent. A lot of “ROCm is flaky” reports are really “my containers don’t reliably get the device nodes.”
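A minimal sketch of that check, written as a plain POSIX shell function so the same logic can run on the host or inside a pod (the function name and the overridable device root are my own inventions; adjust for your setup):

```shell
# check_rocm_nodes: report whether the ROCm device nodes are visible.
# Takes an optional device root so the logic can also be exercised
# against a fake tree; defaults to /dev.
check_rocm_nodes() {
  dev_root="${1:-/dev}"
  missing=0
  for node in kfd dri; do
    if [ -e "$dev_root/$node" ]; then
      echo "ok: $dev_root/$node"
    else
      echo "missing: $dev_root/$node"
      missing=1
    fi
  done
  return $missing
}

# In a healthy ROCm host/pod this prints two "ok" lines; in a broken
# pod it tells you exactly which node didn't make it through.
check_rocm_nodes /dev || echo "device nodes missing: fix the runtime before the model server"
```

Run it in a trivial pod first; if it fails there, no amount of model-server configuration will help.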
3) Runtime knobs: make RDNA3 less surprising
For RDNA3 (gfx1100), I’ve had the best luck being explicit about a few environment variables instead of letting every framework guess:
- `PYTORCH_ROCM_ARCH=gfx1100` (PyTorch/ROCm builds, when applicable)
- `HSA_OVERRIDE_GFX_VERSION=11.0.0` (workaround for tooling/runtime mismatches on consumer RDNA3)
- `HIP_VISIBLE_DEVICES=0` or `1` (pick a GPU deterministically)
For PyTorch-heavy workloads (ComfyUI, diffusion backends), I also use allocator tuning like:
`PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:256`
These aren’t magic. They just reduce “why did this behave differently after a restart?” variance.
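Collected into one place, the pinning described above looks like this (values target gfx1100 / RX 7900 XTX; swap in your own architecture and device index):

```shell
# RDNA3 (gfx1100) runtime pinning: set these explicitly instead of
# letting each framework auto-detect.
export PYTORCH_ROCM_ARCH=gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0

# Allocator tuning for PyTorch-heavy workloads (ComfyUI, diffusion):
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:256
```

I keep these in the deployment manifest rather than a shell profile, so a fresh node gets the same behavior.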
4) vLLM on gfx1100: the defaults that stopped the hangs
If you want vLLM on consumer RDNA3, plan for some platform-specific guardrails. In my services/flexinfer work, I ended up encoding the “don’t let this auto-detect itself into a hang” defaults:
- base ROCm vars (RDNA3 sanity):
  - `HSA_OVERRIDE_GFX_VERSION=11.0.0`
  - `PYTORCH_ROCM_ARCH=gfx1100`
  - `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
- vLLM-specific stability gates:
  - force the vLLM V0 engine: `VLLM_USE_V1=0`
  - disable Triton flash attention: `VLLM_USE_TRITON_FLASH_ATTN=0`
  - disable AITER: `VLLM_ROCM_USE_AITER=0`
The meta-point: for AMD consumer GPUs, backend choice is only half the story. The other half is deciding which runtime knobs are “part of the platform” and keeping them pinned.
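Concretely, “pinned as part of the platform” means the gates live in the deployment environment, not in someone’s shell history. A sketch (same values as the base ROCm vars above):

```shell
# vLLM-on-gfx1100 stability gates, in addition to the base ROCm vars:
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export VLLM_USE_V1=0                 # force the V0 engine
export VLLM_USE_TRITON_FLASH_ATTN=0  # disable Triton flash attention
export VLLM_ROCM_USE_AITER=0         # disable AITER
```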
Real-World Configuration: Text Inference
I've moved from vLLM to llama.cpp's `llama-server` for better GGUF quantization support and lower VRAM usage. Here is the actual production configuration running on the primary node:
```yaml
# Qwen2.5-7B with speculative decoding (Excerpt from llamacpp-qwen2p5-7b-spec.yaml)
args:
  - |
    exec /opt/src/llama.cpp/build/bin/llama-server \
      --model /models/qwen2.5-7b-abliterated/Qwen2.5-7B-Instruct-abliterated-v2.Q4_K_M.gguf \
      --model-draft /models/qwen2.5-0.5b/qwen2.5-0.5b-instruct-q8_0.gguf \
      --ctx-size 16384 \
      --n-gpu-layers 9999 \
      --n-gpu-layers-draft 9999 \
      --draft-max 16 \
      --draft-min 4 \
      --flash-attn on \
      --cache-type-k q8_0 \
      --parallel 4
```
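Once the pod is up, I like a smoke test that exercises the real API path, not just liveness. `llama-server` exposes an OpenAI-compatible endpoint; the host and port below are assumptions for my setup:

```shell
# Liveness check (host/port are examples; use your service address):
curl -s http://cblevins-7900xtx:8080/health

# Exercise the OpenAI-compatible chat endpoint with a tiny request:
curl -s http://cblevins-7900xtx:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say ok."}],"max_tokens":8}'
```

If the second call works, the whole stack (device nodes, runtime, offload, sampling) is alive end to end.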
Speculative Decoding Workflow
Speculative decoding is the biggest “free speed” lever I’ve found for interactive chat. The idea: run a small draft model to propose tokens and let the target model verify them in batches.
- Draft model: `Qwen2.5-0.5B-Instruct`
- Target model: `Qwen2.5-7B-Instruct` (abliterated)
Two operational notes that matter more than the theory:
- This is sensitive to VRAM headroom. If you’re tight on memory, the extra model can push you into OOM territory under concurrency.
- It changes the “shape” of latency. Measure p95/p99, not just average tokens/sec.
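To make the “measure p95/p99” point actionable: if your load generator writes one latency per line, tail percentiles fall out of `sort` and `awk` using the nearest-rank method (the function name is my own):

```shell
# percentile P FILE: nearest-rank percentile of one-number-per-line data.
percentile() {
  sort -n "$2" | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      idx = int(NR * p / 100 + 0.999999)   # ceil(n * p / 100)
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Usage, against a file of per-request latencies in ms:
#   percentile 50 latencies.txt
#   percentile 95 latencies.txt
#   percentile 99 latencies.txt
```

Compare p99 with and without the draft model before declaring speculative decoding a win.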
A Practical Starting Point (GGUF on 24GB)
If you’re running GGUF models on a 24GB card and you want “works reliably” before “chase the last 10%,” here are the knobs I start with:
- context: `--ctx-size 8192` (bigger is nice until it isn’t)
- batching: `--batch-size 512` (throughput vs. latency tradeoff)
- offload: `--n-gpu-layers` as high as you can go without OOM
Then I increase concurrency and context slowly and watch tail latency and VRAM behavior.
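Put together, a conservative starting invocation might look like this (the binary path and model file are placeholders; raise knobs one at a time from here):

```shell
# Conservative llama-server starting point for a 24GB card (GGUF Q4):
/opt/src/llama.cpp/build/bin/llama-server \
  --model /models/your-model.Q4_K_M.gguf \
  --ctx-size 8192 \
  --batch-size 512 \
  --n-gpu-layers 9999 \
  --parallel 1
```

`--n-gpu-layers 9999` is shorthand for “offload everything”; if that OOMs, lower it until the model fits with headroom.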
Failure Modes I Actually Hit
These are the ones that cost me time:
- Version drift: a kernel update or ROCm update that “sort of works” and then fails under load. Fix: pin versions, upgrade intentionally, write it down.
- VRAM lies: a model fits until you add concurrency and a real context window. Fix: budget KV cache and set explicit limits (model len, parallelism).
- Container visibility: the pod is “Running” but doesn’t have usable `/dev/kfd` access. Fix: validate the device nodes in a trivial pod before deploying the model server.
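For the “VRAM lies” failure mode, a back-of-envelope KV cache budget catches most surprises before deployment. This is an approximation, not a measurement; the shape numbers below are illustrative for a Qwen2.5-7B-class model (check your model’s config for the real layer/head counts):

```shell
# KV cache bytes ~= 2 (K and V) * layers * ctx * kv_heads * head_dim
#                   * bytes_per_element
kv_cache_gib() {
  awk -v layers="$1" -v ctx="$2" -v kv_heads="$3" -v head_dim="$4" -v bytes="$5" \
    'BEGIN { printf "%.3f\n", 2 * layers * ctx * kv_heads * head_dim * bytes / (1024 ^ 3) }'
}

# Illustrative: 28 layers, 16384 total ctx, 4 KV heads, 128 head dim,
# 2 bytes (fp16). With llama-server, --ctx-size is the total context
# shared across --parallel slots; with vLLM, budget per sequence.
kv_cache_gib 28 16384 4 128 2   # -> 0.875
```

Quantizing the K cache (`--cache-type-k q8_0`, as in the config above) roughly halves the K half of that budget.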
Next Steps
I’m building out repeatable benchmarks and “known-good” configs so I can answer questions like “did that upgrade make us worse?” without guessing.
If you want the deeper war story version (dual nodes, MLC-LLM compilation/JIT, and the real “what broke first” timeline), see: Deploying MLC-LLM on Dual RX 7900 XTX GPUs.
Check out the AI Infra projects for the full setup.