llama-server Router Mode: Pinned Model Grabs CUDA Context on All GPUs, Causing OOM
Original: llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?
In llama-server router mode, child processes initialize CUDA contexts on all GPUs, causing OOM when other cards are fully loaded.
A Reddit user highlighted a limitation in llama-server's router mode (`--models-preset`): each child process initializes a CUDA context on every available GPU, even when its model is pinned to a single card. If the memory on the other GPUs is already filled by a large model, launching a smaller model fails with a CUDA OOM error, because the child cannot allocate even the small context stub on the full cards. Child processes currently inherit the parent's environment, so `CUDA_VISIBLE_DEVICES` cannot be set per model.
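One possible workaround, not taken from the post, is to skip router mode and start one standalone llama-server per model, each with its own `CUDA_VISIBLE_DEVICES`. The Python sketch below assumes that running separate servers on separate ports is acceptable. The helper names (`build_child_env`, `build_command`, `launch`), the model paths, and the ports are hypothetical; only `-m`, `--port`, and the `CUDA_VISIBLE_DEVICES` variable are existing interfaces.

```python
import os
import subprocess


def build_child_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only `gpu_ids` to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA only enumerates the devices listed here, so the process
    # cannot create a context on the hidden cards.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


def build_command(model_path, port, binary="llama-server"):
    """Build a standalone (non-router) llama-server command line."""
    return [binary, "-m", model_path, "--port", str(port)]


def launch(model_path, port, gpu_ids):
    """Start one standalone server restricted to the given GPUs."""
    return subprocess.Popen(
        build_command(model_path, port),
        env=build_child_env(gpu_ids),
    )


if __name__ == "__main__":
    # Hypothetical paths, ports, and GPU layout, for illustration only.
    big = launch("/models/large.gguf", 8080, gpu_ids=[0, 1])
    small = launch("/models/small.gguf", 8081, gpu_ids=[2])
    big.wait()
    small.wait()
```

Inside each process the visible devices are renumbered from 0, so the small model above sees physical GPU 2 as device 0. The trade-off is that clients must address each server by its own port, since the router's single endpoint is no longer in front of them.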
Source: r/LocalLLaMA (top of the day). This summary was compiled by AI; the original post takes precedence.