r/LocalLLaMA top day | Jun 7, 2026, 9:09 PM | /u/HockeyDadNinja

llama-server Router Mode: Pinned Model Grabs CUDA Context on All GPUs, Causing OOM

Original: llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?

In llama-server router mode, child processes initialize CUDA contexts on all GPUs, causing OOM when other cards are fully loaded.

A Reddit user highlighted a limitation in llama-server's router mode (`--models-preset`): each child process the router spawns initializes a CUDA context on every available GPU, even when its model is pinned to a single card. When the other GPUs are already filled by a large model, launching a smaller model fails with a CUDA OOM error, because the child cannot allocate even the small context stub on the maxed-out cards. According to the post, child processes currently inherit the router's base environment, so `CUDA_VISIBLE_DEVICES` cannot be set per model.
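To illustrate what a per-process `CUDA_VISIBLE_DEVICES` would look like, here is a minimal Python sketch. It is not how the router works and not a confirmed fix from the post. It only shows the general mechanism: a process that starts with `CUDA_VISIBLE_DEVICES` restricted to one index will only see, and only create contexts on, that device. The helper names `build_child_env` and `run_with_gpus` are hypothetical, and the child here is a stand-in Python one-liner rather than a real llama-server instance.

```python
import os
import subprocess
import sys


def build_child_env(base_env, gpu_ids):
    """Return a copy of base_env that exposes only the given GPU indices to CUDA.

    Hypothetical helper for illustration; the parent's environment is not modified.
    """
    env = dict(base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_with_gpus(cmd, gpu_ids):
    """Run cmd with its own CUDA_VISIBLE_DEVICES and return its trimmed stdout."""
    env = build_child_env(os.environ, gpu_ids)
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Stand-in child process: it just reports the variable it was started with.
# A real CUDA program launched this way would enumerate only the listed device(s).
probe = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
visible = run_with_gpus(probe, [2])
print(visible)
```

The variable has to be in place before the child initializes CUDA, which is why it is passed through the launch environment rather than set afterwards. Whether the same effect can be achieved inside router mode is exactly the open question of the post.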

Summary compiled by AI; the original post takes precedence.