
Have you ever noticed that a model you run locally sometimes hits a sudden drop in performance? Today I want to talk about that. I'm building an open source tool that aims to find the best configuration of a local LLM for a given machine, and I scratched my head over this issue, because it seems simple but is really tricky.

First of all, you have to determine how much of your VRAM budget the model takes. For ease of explanation, I'm going to use Qwen 3.5 9B Q4_K_M, which is the model I've been using to battle-test this specific problem. My hardware: an RTX 2070 with 8GB of VRAM, and 24GB of RAM. I loaded Qwen and it sat in my VRAM, but I was restricted to a maximum context of 16k to 32k, with some memory still left free. I asked myself: why does this happen?

The apps we use to run local models try to make them "work" under the current conditions on our computer. The heavy lifting would be determining the best configuration and then scaling down from that. The problem is that users are humans, and humans forget things. Imagine you are playing Skyrim or GTA, or watching a YouTube video. You're locking down VRAM with that. VRAM that Qwen is really eager to use to be faster and to have more context for your next prompts. GRRRRR!

As you load Qwen into VRAM, usage bumps up to 6.8GB with full offload of those holy layers. Then you unleash the KV preallocation (llama.cpp preallocates the KV cache memory when you start it), which is roughly ==22MB per 1k tokens== in my empirical tests. So if you choose 16k you get 352MB of VRAM, 32k is 704MB, and so on.

Let's do some math. 8GB is 8192MB. Say you're aware that the YouTube podcast you're listening to in the background is using 500-800MB of the GPU, so you close it. The system reserves 0.5 to 1GB (we're talking about Windows now), so to be safe... you have 7000MB available? Qwen uses 6.8GB, so it's fine. You load 131k of context, start using the chat interface, and everything is fine!
It works! You bypassed that ugly problem and now you can use the model with its full context. You start using it seriously, and the context goes up to 30k, 40k, 50k. At some point you reach 60k and it starts to feel a bit slower. At 70k it is slower still, and not a normal kind of slower: a really strong drop in generation speed and also in prompt processing. You reach 90k and you're down from 32 tok/sec to 16 tok/sec, and prompt processing takes an even harder hit, falling from the initial 488 tok/sec to 41.01 tok/sec. You start a new chat and it feels great again, but at 80-90k you have the same problem.

What's happening? Why does it work fine until it doesn't? That's the ==KV cache spilling== from VRAM into RAM. ==Once the context grows, at some point the prompts and responses will be moved from GPU to RAM==. For that reason, most applications use a constrained context to avoid this kind of issue completely. Windows is magic sometimes because it doesn't run out of memory: it uses shared memory to manage critical situations. The first part of the cache, which is in VRAM, responds really fast; only once you reach a certain amount of context does the eval speed fall drastically, and you end up using your model at about 50% less speed.

In the next part I will share how I started to notice this and what was not working, and in part 3 I will share the fixes I put in place to manage it.
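The budget arithmetic from the walkthrough above can be written down as a small sketch. It uses only the numbers quoted in the text (6.8GB for the loaded model, roughly 22MB of KV cache per 1k tokens, 8GB = 8192MB). The function names are hypothetical, and it assumes the model figure and the KV preallocation simply add up, which may not hold exactly on every setup.

```python
MB_PER_GB = 1024            # the article treats 8GB as 8192MB
KV_MB_PER_1K_TOKENS = 22    # empirical figure from the article; specific to this model and setup


def kv_prealloc_mb(context_k: int) -> int:
    """Estimated KV cache preallocation in MB for a context of `context_k` thousand tokens."""
    return context_k * KV_MB_PER_1K_TOKENS


def budget_report(model_gb: float, context_k: int, budget_mb: float) -> dict:
    """Compare model + KV preallocation against a VRAM budget (all values in MB)."""
    model_mb = model_gb * MB_PER_GB
    kv_mb = kv_prealloc_mb(context_k)
    needed_mb = model_mb + kv_mb  # assumption: the two figures simply add up
    return {
        "model_mb": model_mb,
        "kv_mb": kv_mb,
        "needed_mb": needed_mb,
        "overflow_mb": max(0.0, needed_mb - budget_mb),
    }


# Check three context sizes against the whole card, before any system reserve.
for context_k in (16, 32, 131):
    report = budget_report(model_gb=6.8, context_k=context_k, budget_mb=8 * MB_PER_GB)
    print(f"{context_k}k context: KV {report['kv_mb']}MB, "
          f"needed {report['needed_mb']:.0f}MB, overflow {report['overflow_mb']:.0f}MB")
```

Under these assumptions, 16k and 32k stay within 8192MB, while 131k asks for roughly 9845MB, more than the whole card even before the system reserve is subtracted. The sketch only says that the full preallocation cannot fit; it does not predict at which context length the slowdown becomes visible.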
These are the runs used to build the chart above.

Qwen 3.5 9B with 131k context

| Used KV | Eval t/s | Delta from 8k | Prompt t/s |
|----|----|----|----|
| 8k | 42.3 | Baseline | 488.25 |
| 65k | 32.51 | -23.1% | 87.66 |
| 90k | 16.61 | -60.7% | 41.01 |
| 105k | 15.66 | -63% | 36.89 |
| 120k | 14.81 | -65% | 45.13 |

Qwen 3.5 2B with 131k context

| Used KV | Eval t/s | Delta from 8k | Prompt t/s |
|----|----|----|----|
| 8k | 103.49 | Baseline | 3902.12 |
| 65k | 72.62 | -29.8% | 3011.60 |
| 90k | 67.49 | -34.8% | 2702.58 |
| 105k | 64.87 | -37.3% | 2498.7 |
| 120k | 60.82 | -41.2% | 2326.17 |

*The data in the image (the green line in the chart) is from a control test on generation speed with a model, Qwen 3.5 2B Q4_K_M, that I knew would stay entirely in VRAM at the same context.
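For anyone who wants to redo the "Delta from 8k" column on their own runs, here is a minimal sketch. The eval speeds are copied from the two tables above. The helper name and the dictionary layout are only illustrative.

```python
def delta_from_baseline(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline` (negative means slower)."""
    return (value - baseline) / baseline * 100


# Eval t/s per amount of used KV, copied from the tables above.
eval_tps = {
    "Qwen 3.5 9B": {"8k": 42.3, "65k": 32.51, "90k": 16.61, "105k": 15.66, "120k": 14.81},
    "Qwen 3.5 2B": {"8k": 103.49, "65k": 72.62, "90k": 67.49, "105k": 64.87, "120k": 60.82},
}

for model, runs in eval_tps.items():
    baseline = runs["8k"]
    for used_kv, tps in runs.items():
        if used_kv == "8k":
            continue  # the 8k run is the baseline itself
        delta = delta_from_baseline(baseline, tps)
        print(f"{model} at {used_kv}: {delta:.1f}%")
```

The same helper works for the prompt t/s column. Comparing the two models side by side is what separates the gradual slowdown of a model that stays in VRAM from the much sharper drop once the 9B run passes 65k.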


