Finding the Real Context Ceiling: Needle-Benchmarking Forced RoPE Extrapolation
5 min read
TL;DR
- We wanted more context on a self-hosted uncensored 35B MoE. Its config looks like it supports 256K, but the real text window is 32K.
- Forcing vLLM past that (
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1) let it load at 64K and even 96K — no errors, no OOM. - But "it loaded" is not "it works." A progressive needle-in-haystack bench showed output stays coherent to ~60K and then falls off a cliff into
!!!!!garbage. - The limit was RoPE coherence, not VRAM. We pinned the lane at 64K — the largest length that actually retrieves facts — and shipped the bench as a reusable tool.
The setup
One of our daily-driver lanes is a quantized 35B Mixture-of-Experts model with a multi-token-prediction draft head, served on a single 24 GB AMD card via vLLM. It shipped at a 32K context window. The ask was simple: can we make it bigger?
The first surprise was in the model config. It advertised two different limits:
{
"max_position_embeddings": 262144, // top-level — looks like 256K!
"text_config": {
"max_position_embeddings": 32768 // the real text window
},
"vision_config": { ... } // this is a multimodal config
}
That top-level 262144 is the vision envelope, not the text RoPE. vLLM correctly derives the text model's max_model_len from the nested text_config value: 32768. Ask for more and it refuses:
User-specified max_model_len (65536) > derived max_model_len
(max_position_embeddings=32768). VLLM_ALLOW_LONG_MAX_MODEL_LEN must be
used with extreme caution...
Lesson one: when a model claims a giant context, check whether that number is the text window or a multimodal envelope. They are frequently not the same.
Forcing it open
The model's RoPE base frequency (rope_theta) was set to 10,000,000 — an aggressive value associated with long-context extrapolation. That made it plausible the model could run past its declared 32K even though it was never trained there. So we forced the door open with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and asked for 64K.
It loaded. Short prompts answered fine. vLLM reported healthy KV headroom (2.91× concurrency at 64K). Easy win, right?
This is the trap. A model that loads at a longer context will happily accept long prompts and emit confident-looking tokens — whether or not those tokens are correct. RoPE extrapolation past the trained window doesn't throw an exception; it silently degrades. You cannot see it in a health check, a short smoke test, or a metrics dashboard. You can only see it by asking the model to use the long context and checking the answer.
The bench: a needle in a growing haystack
So we built a progressive needle-in-haystack coherence test. The recipe is deliberately boring:
- Plant a unique fact at the very start of the prompt: "The vault override passphrase is MARMALADE-73118."
- Pad with filler until the prompt hits a target token count.
- At the very end, ask the model to repeat the passphrase verbatim, at temperature 0.
- Pass = the exact needle comes back. Fail = wrong answer, or degenerate output.
Because the needle sits at the start and the question at the end, a pass requires the model to attend coherently across the entire window — exactly the thing RoPE extrapolation breaks. We swept the target length upward:
| Prompt tokens | Result | Output |
|---|---|---|
| 42,649 | ✅ PASS | exact recall |
| 60,589 | ✅ PASS | exact recall |
| 73,469 | ❌ FAIL | !!!!!!!!!!!!!!!!!!!! |
| 87,499 | ❌ FAIL | !!!!!!!!!!!!!!!!!!!! |
There it is. Coherent to ~60K, then a hard cliff between 60K and 73K into pure degenerate repetition. And to be explicit about the headline finding: we also tried 96K. It loaded cleanly — vLLM reported 2.01× KV concurrency, zero OOM. And every long generation past ~64K was garbage.
The GPU was never the constraint. The trained context was. Dropping concurrency to squeeze in a longer window would have bought nothing — the longer window doesn't work at any concurrency.
Where we landed
We pinned the lane at 64K — the largest window that reliably retrieves facts, which also still fits two concurrent requests with comfortable KV margin. That's a 2× usable-context improvement over the original 32K, and every token of it is backed by a passing coherence test rather than a hopeful config value.
The probe is now a checked-in tool, context-needle-bench.py:
# explicit points
context-needle-bench.py --base-url http://localhost:8000 --model my-model \
--points 32k,48k,64k,80k,96k
# progressive sweep, stop at first failure
context-needle-bench.py ... --points 32k,64k,96k,128k --stop-on-fail
# binary-search the exact cliff
context-needle-bench.py ... --bisect 64k:128k --bisect-tol 8k
# depth grid: one needle at several depths -> length x depth recall grid
context-needle-bench.py ... --points 48k,60k,73k --depths 0,25,50,75,100
# multi-needle: N labeled facts spread across one prompt, recall all of them
context-needle-bench.py ... --points 48k,60k --needles 5
It has no third-party dependencies, so it runs anywhere — including inside a serving pod, straight against localhost:
kubectl exec -i <pod> -c model -- python3 - < context-needle-bench.py -- \
--model my-model --points 48k,64k,80k
But is one needle enough?
A single fact at the very start of the prompt is the easy case. Two failure modes it doesn't catch: a model that loses the middle of a long context (the well-documented "lost in the middle" effect), and a model that can echo one fact but falls apart tracking several at once. If 64K is going to be a real working window, it has to survive both. So we extended the bench and pointed it back at the live model.
Depth grid — same needle, planted at 0%, 25%, 50%, 75%, and 100% of the context:
len \ depth 0% 25% 50% 75% 100%
49152 ✓ ✓ ✓ ✓ ✓
61440 ✓ ✓ ✓ ✓ ✓
Recall is flat across every depth. Inside the coherent zone the model doesn't care where the fact lives — there's no middle sag. That's consistent with what the cliff already implied: the failure is a function of total length, not position. Either the whole window works or it falls off the edge; there's no gradual fade.
Multi-needle — five distinctly labeled passphrases scattered across one prompt, with the model required to return all five:
| Prompt tokens | Needles recalled |
|---|---|
| 50,315 | 5 / 5 |
| 63,172 | 5 / 5 |
Perfect recall of all five facts at 63K. So the 64K window isn't just "can quote one sentence" — it holds multiple facts spread end to end. That's the difference between a number that passes a toy test and a window you can actually do retrieval-style work in.
Two new modes in the bench (--depths, --needles) make both of these one-liners, so the next forced-extrapolation lane gets the same treatment for free.
The takeaway
Three things we'll keep doing:
- Read the nested config. A top-level
max_position_embeddingscan be a multimodal envelope. The text window may be far smaller. - Never trust "it loaded." Forced RoPE extrapolation fails silently. Loading, passing a health check, and answering a short prompt tell you nothing about whether the long context is usable.
- Pin to what you proved, not what fit. VRAM will happily let you allocate a context the model can't think in. Benchmark coherence, find the cliff, and pin below it.
Borrowed context length has a short half-life, too. Measure it before you depend on it.
Related Articles
Comments
Join the discussion. Be respectful.