Latest in AI

Showing:llama-serverDevelopersClear ×

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

llama-server Router Mode: Pinned Model Grabs CUDA Context on All GPUs, Causing OOM
r/LocalLLaMA top day50 days agoCommentary
A Reddit user highlighted a limitation in llama-server's router mode (`--models-preset`): child processes spawn and initialize CUDA contexts on all available GPUs, even when pinned to a single card. When other GPUs are fully utilized by a large model, launching a smaller model fails with a CUDA OOM error because it cannot allocate the context stub on the maxed-out cards. Currently, child processes inherit the base environment, preventing per-model `CUDA_VISIBLE_DEVICES` configuration.