r/LocalLLaMA (top, day) · Jun 7, 2026, 9:09 PM · /u/HockeyDadNinja

llama-server Router Mode: Pinned Model Grabs CUDA Context on All GPUs, Causing OOM

Original: llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?

In llama-server router mode, child processes initialize CUDA contexts on all GPUs, causing OOM when other cards are fully loaded.

A Reddit user highlighted a limitation in llama-server's router mode (`--models-preset`): child processes spawn and initialize CUDA contexts on all available GPUs, even when pinned to a single card. When other GPUs are fully utilized by a large model, launching a smaller model fails with a CUDA OOM error because it cannot allocate the context stub on the maxed-out cards. Currently, child processes inherit the base environment, preventing per-model `CUDA_VISIBLE_DEVICES` configuration.

This popular r/LocalLLaMA thread points to a multi-GPU memory management limitation in `llama-server`'s router mode (enabled with `--models-preset`), which is used to manage multiple models dynamically.
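
To illustrate the mechanism at issue, here is a minimal Python sketch of what a per-process `CUDA_VISIBLE_DEVICES` looks like when a parent launches a child. It is not llama-server's code and not a router-mode option; the helper names `build_child_env` and `run_pinned` are hypothetical, and the child is a stand-in Python one-liner rather than a real model server. The point is only that a child given its own copy of the environment sees just the listed GPUs, which is the per-model control the post says router mode currently lacks.

```python
import os
import subprocess
import sys


def build_child_env(base_env, gpu_ids):
    """Return a copy of base_env in which only gpu_ids are visible to CUDA."""
    env = dict(base_env)  # copy, so the parent's environment is left untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_pinned(cmd, gpu_ids):
    """Run cmd with its own CUDA_VISIBLE_DEVICES and return its stdout."""
    env = build_child_env(os.environ, gpu_ids)
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in child process: it only reports the variable it inherited.
    # A real setup would launch the model server here instead (assumption).
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    print(run_pinned(probe, [2]))
```

Because the CUDA runtime reads `CUDA_VISIBLE_DEVICES` at initialization, a process started this way would not create a context on the hidden cards. Whether an equivalent per-model setting can be expressed in a router-mode preset is exactly the open question in the thread.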


#llama #open-source #llama-cpp #multi-gpu #cuda #oom #llama-server

Summaries are AI-generated; the original article is authoritative.