Deploying MLC-LLM on Dual RX 7900 XTX GPUs: Debugging VRAM, KV Cache, and K8s GPU Scheduling
January 4, 2026 · 10 min read
MLC-LLM + ROCm + Kubernetes is a pretty esoteric corner of inference. It’s also one of the few paths where AMD consumer GPUs can feel surprisingly competitive, if you respect the memory model and treat “GPU scheduling” as an operations problem, not a vibe.
I went into this thinking “two 24GB GPUs should be plenty for some 7B–32B-ish inference experiments.” I was half right: the GPUs were plenty. My assumptions about what actually consumes VRAM were not.
This post is a write-up of deploying MLC-LLM across two RX 7900 XTX nodes, what blew up (mostly OOMs), and the changes that made it boring.
TL;DR
- VRAM math is more than weights: weights + KV cache + temp buffers + driver overhead is what actually decides whether a model fits.
- In MLC-LLM, quantization codes like q4f16_1/q4f32_1 encode activation precision too; that f32 can explode temporary/workspace memory even when the weights “fit”.
- --mode is a sizing preset: interactive (1 req), local (low concurrency), server (many reqs). server will happily allocate KV cache to fill VRAM unless you cap max_num_sequence/max_total_seq_length.
- context_window_size (per request) and max_total_seq_length (pooled across requests) interact; big context + server defaults is an easy path to “allocates to the cliff, then OOMs”.
- In Kubernetes, GPU usage must be schedulable: if a pod touches the GPU but doesn’t request amd.com/gpu, you’ll get “invisible” noisy neighbors and mysterious OOMs.
- Treat compilation/JIT and ROCm quirks as operational reality: plan for warmup, and expect some platform-specific stability knobs on RDNA3.
Why MLC-LLM here (and why it feels different)
Most “run a model on a GPU” stacks are runtime-first: you ship Python + a runtime (PyTorch, vLLM, etc.) and rely on vendor kernels.
MLC-LLM is closer to a compilation pipeline: you end up with a model library (compiled via TVM) and a serving engine that targets a specific backend (ROCm in this case). The payoff is that you can get very good throughput on non-NVIDIA GPUs when the kernels and the runtime line up. The cost is that you’re now debugging a system that looks like “compiler + runtime + driver”, not just “a container crashed”.
TVM, briefly (what “ML compilation” means)
Apache TVM is an open-source machine learning compilation framework. At a high level, it takes a trained model and compiles it into deployable modules for a specific target (CUDA, ROCm, Vulkan, CPU, etc.), with the goal of generating fast kernels and a minimal runtime surface.
If you want the official wording and ecosystem context:
- Apache TVM: tvm.apache.org
- TVM repository: github.com/apache/tvm
The reason this matters for ROCm is simple: instead of waiting for every kernel to exist in every framework/runtime, you’re leaning on a compiler stack that’s explicitly designed to target multiple backends.
Where MLC-LLM fits on top of TVM
MLC-LLM describes itself as a “machine learning compiler and high-performance deployment engine for large language models,” and it compiles and runs on “MLCEngine” (their unified inference engine). In practice that means:
- Compile step: produce a model library tuned for your target backend.
- Serve step: run the compiled artifact via an OpenAI-compatible API (plus other integrations).
Project home/docs:
- MLC-LLM documentation: llm.mlc.ai
- MLC-LLM repository: github.com/mlc-ai/mlc-llm
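Before any Kubernetes wiring, the quickest smoke test is the Python MLCEngine API. A minimal sketch, assuming an MLC-packaged model — the HF path below is illustrative, swap in whatever you've converted:

from mlc_llm import MLCEngine

# Any MLC-packaged model works; this prebuilt HF path is illustrative.
model = "HF://mlc-ai/Qwen2.5-Coder-7B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# MLCEngine speaks OpenAI-style chat completions directly from Python.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Write a hello world in Rust."}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()

The same OpenAI-compatible surface is what the REST server exposes later, which is why LiteLLM can sit in front of it without adapters.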
What I was trying to run
Two dedicated GPU nodes, each running a different model:
| Node | GPU | Model | Why |
|---|---|---|---|
| cblevins-7900xtx | RX 7900 XTX (24GB) | DeepSeek-R1-Distill-Qwen-7B | Reasoning / “thinky” responses |
| cblevins-5930k | RX 7900 XTX (24GB) | Qwen2.5-Coder-7B | Code generation |
Both are served via MLC-LLM (TVM-compiled kernels) on ROCm, and fronted by LiteLLM for routing.
A simple mental model for VRAM (why the napkin math lies)
The mistake I made early on was treating VRAM like it’s mostly “model weights”. In practice, steady-state VRAM is usually:
- Weights (fairly fixed once loaded)
- KV cache (scales with context_window_size and concurrency)
- Temporary buffers / workspace (often the surprise; depends on kernels and precision)
- Driver/runtime overhead + fragmentation (the tax you pay for living near 100% utilization)
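To make the napkin math a bit less napkin, here's the back-of-envelope I now use. It's a sketch: the KV formula (2 × layers × KV heads × head dim × bytes per element) is the standard one, but the temp/overhead terms are fudge factors, and the architecture numbers are assumptions you swap per model.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, elem_bytes=2):
    # K and V, per layer, per KV head, fp16 cache by default.
    return 2 * num_layers * num_kv_heads * head_dim * elem_bytes

def vram_estimate_gb(params_b, weight_bits, kv_tokens,
                     num_layers, num_kv_heads, head_dim,
                     temp_gb=3.0, overhead_gb=1.0):
    # Weights: parameter count * weight bits, ignoring group-quant scales.
    weights_gb = params_b * 1e9 * weight_bits / 8 / 2**30
    kv_gb = kv_tokens * kv_bytes_per_token(num_layers, num_kv_heads, head_dim) / 2**30
    # temp_gb/overhead_gb are guesses, not derivable; the engine prints its own estimate.
    return weights_gb + kv_gb + temp_gb + overhead_gb

# A Qwen2.5-7B-shaped model (28 layers, 4 KV heads, head_dim 128) at 4-bit weights,
# budgeting a 32k-token KV pool:
print(vram_estimate_gb(7.6, 4, 32_768, 28, 4, 128))  # ~9.3 GB

The point isn't the exact number (MLC-LLM prints its own estimate at startup); it's that the KV and temp terms are the ones that move.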
Kubernetes adds one more gotcha: it can schedule “1 GPU per pod”, but it cannot enforce “this pod may only use 18GB of VRAM”. So once you’re near the cliff, the system becomes sensitive to “noisy neighbors” and VRAM fragmentation.
Failure #1: the 32B model “fit” until it didn’t
I started with Qwen2.5-Coder-32B in q4f32_1. The napkin math looked fine:
- Weights: ~19.5GB
- VRAM: 24GB
- Headroom: ~4.5GB
Then MLC-LLM helpfully told me what I forgot to budget for:
Insufficient GPU memory error:
Available: 20876 MB
Model weights: 19532 MB
Temporary buffer: 17119 MB
Total needed: ~36GB
The key detail is the quantization code. In MLC-LLM, weight-only quantization is described as qAfB(_id), where A is weight bits and B is activation bits (q4f16_1, q4f32_1, etc.). That f32 activation format is expensive in working memory, especially once you factor in temporary buffers.
Reference: MLC-LLM quantization docs (qAfB format and available modes)
I pivoted to the 7B model because I wanted something stable before I went back to “big model” debugging.
Failure #2: --mode server and the KV cache that ate my GPU
With the 7B model, I hit a different OOM. MLC-LLM’s serve mode presets do a lot of auto-sizing for you, and --mode server is intentionally aggressive.
Here’s what I saw on startup:
Under mode "server", max KV cache token capacity: 247,329 tokens
Estimated total: 20,876 MB
- Parameters: 4,086 MB
- KV Cache: 13,607 MB
- Temp buffer: 3,184 MB
What’s happening: per the MLC-LLM server docs, --mode server “automatically infer[s] the largest possible max batch size and max total sequence length” and tries to use GPU memory as much as possible. That’s exactly what you want for high concurrency. It’s not what you want if you’re deploying a single model per node and care more about “don’t crash” than “maximize utilization”.
Reference: MLC-LLM REST server docs (--mode local|interactive|server behavior)
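You can sanity-check that KV number. Assuming a Qwen2.5-7B-shaped model (28 layers, 4 KV heads, head_dim 128, fp16 cache — architecture numbers I'm supplying, not something the log prints), the per-token cost lands almost exactly on the engine's estimate:

# Per-token KV cache cost: K and V, per layer, per KV head, fp16.
num_layers, num_kv_heads, head_dim, elem_bytes = 28, 4, 128, 2
per_token = 2 * num_layers * num_kv_heads * head_dim * elem_bytes  # 57,344 bytes

tokens = 247_329                    # capacity reported under --mode server
print(tokens * per_token / 2**20)   # ~13,526 MB vs. 13,607 MB in the log

So server mode really did decide to spend ~13.6 GB of a 24 GB card on cache for concurrency I was never going to have.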
What the modes change (in real deployments)
MLC-LLM’s --mode isn’t a “performance toggle” so much as a preset that chooses defaults for a few sizing knobs:
- max_num_sequence: how many concurrent sequences/requests the engine is willing to keep live.
- max_total_seq_length: the total KV-cache token budget across all live sequences.
- prefill_chunk_size: how much prompt the engine tries to prefill per step (affects peak memory during prefill).
The practical differences look like this:
| Mode | What it’s optimized for | Default sizing behavior | What you get | What you risk |
|---|---|---|---|---|
| interactive | one user / one request at a time | max batch = 1; total seq length = context window | lowest steady VRAM; predictable behavior | less throughput under concurrency |
| local | low concurrency “local server” | max batch = 4; total seq length = context window | some parallelism; still bounded | can still OOM if other processes eat VRAM |
| server | many concurrent requests | infers “largest possible” batch + total seq length | highest throughput potential | will happily spend all remaining VRAM on KV cache unless you cap it |
In my setup, LiteLLM was routing requests to “one model per node” with low concurrency. That’s why local made sense: I wasn’t trying to run a high-QPS multi-tenant inference server, I was trying to keep a single model stable while still allowing a little parallelism.
Yes: context_window_size and --mode server interact
This is the confusing bit: context_window_size is a per-request limit, but max_total_seq_length is a pool across all concurrent requests. In local/interactive, MLC-LLM ties the pool to the context window, so memory stays bounded and predictable.
In server, MLC-LLM tries to infer the “largest possible” values and will allocate KV cache to fill the GPU. If you also set a large context_window_size, you’re telling the engine “large prompts are allowed”, and the inferred KV/cache budget can get pushed right up to the edge of VRAM. In Kubernetes (and on consumer GPUs), “right up to the edge” is where you lose to fragmentation, driver-reserved memory, and any other pod that accidentally touches the GPU.
Another way to say it:
- context_window_size controls the maximum size of one conversation.
- max_total_seq_length controls the total token budget you’re reserving across all conversations.
Server mode is trying to maximize throughput by reserving a large token budget (KV cache) for many concurrent requests. If you’re not actually serving many concurrent requests, that extra KV cache is just unused VRAM that makes OOMs more likely.
If you want server mode and stability, you usually need to make the concurrency and token budget explicit, e.g. (illustrative numbers):
args:
- '--mode'
- 'server'
- '--overrides'
- 'context_window_size=32768;max_num_sequence=4;max_total_seq_length=131072'
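The nice side effect of pinning those values is that the KV reservation becomes arithmetic instead of a surprise. Reusing the ~57 KB/token figure from the 7B example above (an assumption tied to that architecture):

# max_total_seq_length=131072 at ~57,344 bytes/token:
print(131_072 * 57_344 / 2**30)   # 7.0 GB reserved for KV cache, by choice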
The fix: pick the right mode (then override deliberately)
For low concurrency, --mode local was the right default for me:
args:
- '--mode'
- 'local'
- '--overrides'
- 'context_window_size=32768'
And the resulting estimate was dramatically smaller:
Estimated total: 7,620 MB
- Parameters: 4,086 MB
- KV Cache: 528 MB
- Temp buffer: 3,005 MB
If you do need server mode throughput, the docs explicitly call out that you can override the inferred values (e.g. max_num_sequence, max_total_seq_length, prefill_chunk_size) via --overrides. The important move is making those constraints explicit instead of letting "fill the GPU" be the default.
Failure #3: the "invisible" GPU workload Kubernetes didn't schedule
Even after switching to local, I still hit OOM during parameter load (LoadParams()), before KV cache was even a factor.
The root cause: another pod on the node was already using the GPU, but it didn’t declare a GPU resource in Kubernetes.
This is the anti-pattern:
resources:
limits:
cpu: '12'
memory: 24Gi
# No amd.com/gpu declared
From Kubernetes’ perspective, that pod is “CPU-only”, so it can be scheduled alongside your “real” GPU workload. From ROCm’s perspective, both processes are fighting over the same VRAM. You don’t get a clean scheduling failure; you get runtime OOMs that look like “your model is too big”.
The fix: make GPU usage schedulable (and auditable)
Declare the GPU and (optionally) pin the workload to GPU nodes:
resources:
limits:
amd.com/gpu: '1'
cpu: '12'
memory: 24Gi
If you're running AMD's k8s-device-plugin, you can sanity-check what the scheduler thinks exists with something like:
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name) \(.status.allocatable[\"amd.com/gpu\"] // \"0\")"'
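The node-level view tells you what's allocatable; the other half is auditing which pods on a GPU node actually declared the device. A small sketch with the official Kubernetes Python client (node name is mine; adjust namespaces and filters to taste):

# Lists pods scheduled on a node and how many amd.com/gpu they requested.
# Requires the `kubernetes` package and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node = "cblevins-7900xtx"
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")

for pod in pods.items:
    gpus = 0
    for c in pod.spec.containers:
        limits = (c.resources.limits or {}) if c.resources else {}
        gpus += int(limits.get("amd.com/gpu", 0))
    print(f"{pod.metadata.namespace}/{pod.metadata.name}: amd.com/gpu={gpus}")

Anything printing amd.com/gpu=0 on a GPU node is worth a second look: it might be fine, or it might be the invisible neighbor from Failure #3.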
Where I landed (stable and boring)
Once those three issues were fixed, the topology was straightforward: one 7B model per node (DeepSeek-R1-Distill-Qwen-7B on cblevins-7900xtx, Qwen2.5-Coder-7B on cblevins-5930k), each served by MLC-LLM in local mode on ROCm, each pod explicitly requesting its GPU, and LiteLLM routing in front.
Operational notes
Two things that don't fit the "failure → fix" narrative but matter in production:
Compilation and JIT overhead
MLC-LLM can run with a prebuilt model library (--model-lib), or it can JIT compile when a library isn't provided or when configuration changes force regeneration.
For this setup, I ran one-off Kubernetes Jobs to compile each model before deploying the serving pods. The compilation job runs on the target GPU, produces a model library artifact, and writes it to a PVC. The serving deployment then mounts that PVC and points --model-lib at the precompiled artifact. This keeps cold starts fast and avoids burning GPU time on recompilation every time a pod restarts.
The workflow lives in GitOps: the compilation Job manifests sit alongside the serving Deployments, and the PVC is provisioned by Longhorn with Retain reclaim policy. When I need to update a model or change quantization settings, I update the Job spec, run it manually (or let Flux reconcile it), wait for completion, then the serving pods pick up the new artifact on their next restart. The cache survives pod churn and node reboots.
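For completeness, the Job's entrypoint is essentially three MLC-LLM CLI calls in sequence. This is a sketch: the subcommand names (convert_weight, gen_config, compile) come from the MLC-LLM docs, but the paths, quantization, and conv template below are illustrative for my setup, not canonical flags to copy blindly.

# One-off compile step the Kubernetes Job runs before the serving pods start.
import subprocess

SRC = "/models/Qwen2.5-Coder-7B-Instruct"          # HF checkout on the PVC
OUT = "/artifacts/Qwen2.5-Coder-7B-q4f16_1-MLC"    # serving pods mount this PVC

steps = [
    # 1. Quantize weights into MLC's layout.
    ["mlc_llm", "convert_weight", SRC, "--quantization", "q4f16_1", "-o", OUT],
    # 2. Generate mlc-chat-config.json (conv template is model-specific).
    ["mlc_llm", "gen_config", SRC, "--quantization", "q4f16_1",
     "--conv-template", "qwen2", "-o", OUT],
    # 3. Compile the model library for ROCm on the target GPU.
    ["mlc_llm", "compile", f"{OUT}/mlc-chat-config.json",
     "--device", "rocm", "-o", f"{OUT}/qwen2.5-coder-7b-rocm.so"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)

The serving Deployment then points --model-lib at the .so on the PVC.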
Even with precompiled libraries, a few things still matter:
- Config changes can trigger recompilation. Context window sizing and parallelism settings can influence which artifact gets built; if you change config, you may need to recompile.
- Warmup matters. I treat "deployment is up" and "first request is fast" as two separate milestones. The first request still has some overhead even with a prebuilt library.
If you skip the compilation job and rely on JIT, plan for it: make sure the pod has enough CPU/RAM for compilation phases, and consider persistent storage for caches if you're redeploying frequently.
ROCm on RDNA3
This is the most "homelab" and least universal part, but it mattered for my RX 7900 XTX nodes (gfx1100). I ended up needing these ROCm settings for stability:
env:
- name: HSA_OVERRIDE_GFX_VERSION
value: '11.0.0'
- name: HSA_ENABLE_SDMA
value: '0'
If you're on a different ROCm version or kernel/driver stack, you might not need these, or you might need different ones. I'm including them because they were the difference between "random flakiness" and "boring".
Directional performance notes
On these nodes (RX 7900 XTX, ROCm 6.x), both 7B models can sustain ~100+ tokens/sec once warm, in single-request scenarios. Prompt length, context window, and mode (batching/KV cache policy) all matter a lot, so treat these as directional.
One thing I didn't expect: platform age didn't matter. The cblevins-5930k node is an Intel Core i7-5930K (Haswell-E, 2014) on X99 with DDR4. The cblevins-7900xtx node is an AMD Ryzen 9 7900X3D on AM5 with DDR5, a decade newer. Both nodes hit similar token throughput with the same GPU. For GPU-bound inference at this scale, the CPU and memory subsystem just aren't the bottleneck.
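For the throughput numbers above I used a crude single-request probe against the OpenAI-compatible endpoint rather than a real benchmark harness. A sketch — base URL and model name are placeholders for whatever LiteLLM exposes, and it includes prefill time, so it understates steady-state decode speed:

# Rough single-request tokens/sec against an OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://litellm.example.local/v1", api_key="placeholder")

start = time.time()
resp = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=512,
)
elapsed = time.time() - start
print(resp.usage.completion_tokens / elapsed, "tokens/sec (incl. prefill)")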
Practical takeaways
- Budget VRAM for more than weights. Parameters are the floor. KV cache and temp buffers are the ceiling.
- Pick --mode for your concurrency goal. server is a throughput preset; it will aggressively grow cache/batch sizing unless you override.
- Make GPU use visible to the scheduler. If you’re using the GPU, declare it in resources.limits, even for “small” sidecars.
- Expect JIT/compile overhead on first run. When a model lib isn’t provided (or config changes require recompilation), startup will take longer. The MLC docs explicitly call this out as JIT compilation.
The Kubernetes manifests and exact configs for this setup live in my internal GitOps repo (not public), but the failure modes and fixes above should transfer cleanly to most ROCm + MLC-LLM deployments.
If you're new to MLC-LLM, start at the project documentation or GitHub repository.