Running Two LLMs on a Mini PC Sounds Great Until the Benchmarks Arrive

If you've been following my stuff, you know I'm all about squeezing maximum value out of minimal hardware. Mini PCs, home labs, self-hosted everything. So naturally, when I got my hands on a UM790Pro with 96 GB of DDR5, my first thought was: can I run two LLMs simultaneously?

The answer is yes. The better question is: should you? No. I have the benchmarks.

The Setup

The UM790Pro is a beast for its size. Here's what I'm running on:

CPU: AMD Ryzen 9 7940HS
GPU: AMD 780M iGPU (integrated, shares system memory)
RAM: 96 GB DDR5-5600
VRAM Pool: 2 GB dedicated + 46 GB GTT = 48 GB total GPU-accessible memory
Memory Bandwidth: ~80 GB/s (shared between CPU and iGPU)

That last point is the key to everything that follows. On a discrete GPU, the CPU and GPU have their own separate memory buses. On an APU like the 7940HS, the CPU and iGPU drink from the same straw. DDR5-5600 gives you roughly 80 GB/s, and both the CPU cores and the GPU compute units fight over every byte of it.

I'm running Ollama as my inference server. Four models in the ring. The 35B MoE model is the big gun, my daily driver for coding and complex reasoning. The smaller models were candidates for a sidecar role: handling quick tasks like summarization or classification while the big model crunches harder problems.

Baseline: One Model at a Time

First, I benchmarked each model running alone to get clean numbers.

The 35B model at 17.8 tok/s on an iGPU is genuinely impressive. That's usable for interactive chat. The small models are blazing fast. Gemma at 42.9 tok/s on GPU is practically instant for short responses.

Looking at these numbers, I thought: what if I keep the 35B on GPU and run a small model on CPU simultaneously? Best of both worlds, right?

The Dual-Model Experiments

I ran four combinations, firing both models at the same time with identical prompts and measuring throughput.

Test 1: Both Models on GPU

qwen3.6:35b (GPU) + gemma4-e2b (GPU).
Both models fighting for the same GPU compute units and the same memory bus. The 35B model drops from 17.8 to 13.1 tok/s, a 26% hit. Gemma drops 41%. Painful but expected.

Test 2: Big Model GPU + Tiny Model CPU (The Best Result)

qwen3.6:35b (GPU) + qwen2.5:1.5b (CPU).

This was the best result. The 1.5B model is tiny enough that its CPU inference doesn't hammer memory bandwidth too hard. The big model only drops 16%. But the small model gets cut in half, from 53.4 to 26.2 tok/s.

Test 3: Big Model GPU + Medium Model CPU-Forced

qwen3.6:35b (GPU) + gemma4-e2b (CPU, num_gpu=0).

Forcing Gemma to CPU didn't help. The 4.6B model doing CPU inference generates enough memory traffic to compete with the GPU's reads. Both models suffer. The 35B drops 27%, Gemma drops 53%.

Test 4: The Worst Case (KV Cache Explosion)

qwen3.6:35b (GPU) + qwen3:4b-instruct (CPU, num_gpu=0).

This was the disaster scenario. The 4B instruct model supports a 256K context window, and its KV cache ballooned to 24.2 GB at full context. Combined with the 35B model's 32 GB VRAM allocation, we were pushing close to the system's total memory bandwidth capacity. Both models crawled. The 35B dropped 35%, the 4B dropped 43%.

The Memory Architecture Problem

What's actually happening inside this machine is simple once you see it. The VRAM pool breaks down like this: 2 GB dedicated VRAM physically reserved for the iGPU, plus 46 GB GTT (Graphics Translation Table) which is system RAM mapped into GPU address space, for 48 GB total GPU-accessible memory.

When both a GPU model and a CPU model are running, they're both streaming weights from the same DDR5 DIMMs through the same memory controller. The GPU doesn't have its own GDDR6 with 300+ GB/s bandwidth like a discrete card. It's sharing the same 80 GB/s pipe as everything else.

It's not a compute bottleneck. It's a memory bandwidth bottleneck.
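To put rough numbers on the shared-straw argument, here is a minimal sketch of a bandwidth-bound ceiling. It assumes token generation is limited purely by memory bandwidth (every generated token streams the model's active weights from DRAM once) and that a second model simply takes a fraction of the bus. The function names, the 1.0 GB active-weight figure, and the 50/50 split are hypothetical illustrations, not measurements from these tests; only the ~80 GB/s figure and the tok/s values in the last lines come from the article.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on tokens/s if each token streams the active weights once."""
    if bandwidth_gb_s <= 0 or active_weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / active_weights_gb


def shared_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float,
                         share: float) -> float:
    """Same bound when the model only gets a fraction `share` of the bus."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return decode_ceiling_tok_s(bandwidth_gb_s * share, active_weights_gb)


def percent_drop(baseline_tok_s: float, contended_tok_s: float) -> float:
    """Throughput lost under contention, as a percentage of the baseline."""
    return (baseline_tok_s - contended_tok_s) / baseline_tok_s * 100


BUS_GB_S = 80.0               # approximate DDR5-5600 figure quoted above
HYPOTHETICAL_ACTIVE_GB = 1.0  # assumption: weights read per token, illustration only

alone = decode_ceiling_tok_s(BUS_GB_S, HYPOTHETICAL_ACTIVE_GB)
halved = shared_ceiling_tok_s(BUS_GB_S, HYPOTHETICAL_ACTIVE_GB, share=0.5)
print(f"ceiling alone: {alone:.1f} tok/s, with half the bus: {halved:.1f} tok/s")

# Measured figures quoted in the tests above
print(f"35B, both on GPU: {percent_drop(17.8, 13.1):.0f}% drop")
print(f"1.5B on CPU next to the 35B: {percent_drop(53.4, 26.2):.0f}% drop")
```

Real throughput lands below this ceiling because of compute, cache effects, and scheduling, so treat it as a sanity check on why two streams sharing one bus cannot both keep their solo speed.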
Real-World Conclusion

I was testing this because I wanted to run an agent framework: a planning model plus an execution model working together. The idea was the big 35B model handles complex reasoning while a small model handles quick tool-calling or classification.

But agent frameworks run tasks sequentially, not in parallel. The planner thinks, then the executor acts, then the planner thinks again. They take turns. At any given moment only one model is generating tokens. The other is just sitting there, loaded in memory, doing nothing but occupying VRAM or RAM that could go toward bigger context windows instead.

So the dual-model setup gives you worse throughput on the big model (11-15 tok/s vs 17.8 tok/s), no parallelism benefit in sequential agent workflows, wasted memory keeping a second model loaded, and risk of OOM crashes since Ollama's iGPU memory reporting has a known bug that can cause crashes with multiple loaded models.

The MoE Insight

Here's the moment that made me feel silly for even running these tests. The qwen3.6:35b model is a Mixture of Experts architecture. It has 256 experts but only activates 8 per token. For any given token, it's doing roughly the compute of a 4-5B parameter model while having the knowledge of a 36B parameter model.

Read that again. The big model already IS the small model in terms of per-token compute cost. MoE gives you the reasoning depth of 35B parameters with the inference speed of a much smaller model. Running a separate small model alongside it for fast tasks is solving a problem that doesn't exist.

17.8 tok/s for 35B-class reasoning is already fast enough for everything I throw at it. Adding a second model only makes it slower.

Bonus: Ollama Storage Gotchas

While poking around, I found a couple things worth mentioning.

Shared blobs save disk space. I had qwen3.6:35b, qwen3.6:latest, and qwen3.6:35b-nothink all listed as separate models. Turns out they all point to the same 23.9 GB blob on disk.
Ollama uses content-addressed storage, so identical weights are stored once regardless of how many tags reference them.

Orphan blobs waste disk space. After deleting some models, I found a 12.9 GB orphan blob sitting in the models/blobs directory that no tag referenced anymore. There's no ollama prune command yet, so I had to manually cross-reference blob hashes against manifest files and delete the orphan by hand. Check yours. You might be surprised.
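The manual cross-referencing can be scripted. What follows is a rough sketch rather than an official tool, and it leans on assumptions about the on-disk layout that you should verify against your own install: a `manifests/` tree of JSON files whose `config.digest` and `layers[].digest` fields look like `sha256:<hex>`, and a `blobs/` directory whose files are named `sha256-<hex>`. The helper names and the default `~/.ollama/models` path are my own choices for illustration. It only lists candidates and deletes nothing.

```python
import json
from pathlib import Path


def referenced_digests(manifests_dir: Path) -> set[str]:
    """Collect blob file names referenced by any manifest (assumed JSON layout)."""
    names: set[str] = set()
    for path in manifests_dir.rglob("*"):
        if not path.is_file():
            continue
        try:
            manifest = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # not a JSON manifest, skip it
        if not isinstance(manifest, dict):
            continue
        entries = [manifest.get("config") or {}] + list(manifest.get("layers") or [])
        for entry in entries:
            digest = entry.get("digest") if isinstance(entry, dict) else None
            if digest:
                # Assumption: "sha256:<hex>" in manifests maps to "sha256-<hex>" on disk
                names.add(digest.replace(":", "-"))
    return names


def find_orphan_blobs(models_dir: Path) -> list[Path]:
    """Return blob files that no manifest references, sorted by path."""
    used = referenced_digests(models_dir / "manifests")
    blobs_dir = models_dir / "blobs"
    return sorted(p for p in blobs_dir.iterdir()
                  if p.is_file() and p.name not in used)


if __name__ == "__main__":
    models = Path.home() / ".ollama" / "models"  # assumed default location
    if (models / "blobs").is_dir():
        for blob in find_orphan_blobs(models):
            print(f"{blob.stat().st_size / 1e9:6.1f} GB  {blob}")
```

Anything it prints is only a candidate: an interrupted download could leave files in the blobs directory that also show up here, so check each one before deleting by hand.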

View original source — Hacker Noon ↗
