Pipeline parallelism in llama.cpp may be wasting your VRAM
A Reddit test claims llama.cpp pipeline parallelism may use extra VRAM without improving single-request inference speed.
The author compared three llama.cpp Vulkan builds: the default of 4 scheduler copies, 1 scheduler copy, and pipeline parallelism disabled. In their Qwen GGUF test, input and output throughput were nearly identical across all three configurations. However, the default build used about 1.5GB more VRAM for compute buffers, which reduced usable context from roughly 113K tokens to around 88K. Benefits for parallel requests were not tested.
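To put the reported numbers in proportion, here is a back-of-envelope sketch in Python. It uses only the rounded figures quoted above (113K, 88K, 1.5GB). The per-token figure assumes that all of the extra compute-buffer VRAM would otherwise have gone to context, which the post does not confirm.

```python
# Rounded figures from the summary; not exact measurements.
CTX_WITHOUT_EXTRA_COPIES = 113_000   # usable context, tokens
CTX_DEFAULT = 88_000                 # usable context with default settings, tokens
EXTRA_VRAM_BYTES = 1.5 * 1024**3     # "about 1.5GB", read here as GiB (assumption)

tokens_lost = CTX_WITHOUT_EXTRA_COPIES - CTX_DEFAULT
percent_lost = 100 * tokens_lost / CTX_WITHOUT_EXTRA_COPIES

# Assumption: every byte of the extra compute buffer displaced KV cache.
implied_bytes_per_token = EXTRA_VRAM_BYTES / tokens_lost

print(f"context lost: {tokens_lost} tokens ({percent_lost:.1f}%)")
print(f"implied cost: {implied_bytes_per_token / 1024:.1f} KiB per token")
```

On these figures the default setting gives up roughly a fifth of the usable context in exchange for no measured single-request speedup.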
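This Reddit post shares the author's hands-on test results for llama.cpp pipeline parallelism. The author notes that llama.cpp enables pipeline parallelism by default, presumably to speed up inference. In their test environment, however, the mechanism brought no visible speed benefit and instead noticeably increased VRAM usage. The author used the Vulkan backend and compared three builds: the default GGML_SCHED_MAX_COPIES=4, a build with GGML_SCHED_MAX_COPIES=1, and a build that indirectly disables pipeline parallelism through GGML_BLAS=ON and GGML_BLAS_VENDOR=OpenBLAS. The test model was a GGUF quantization of Qwen3.6-27B-MTP, run with llama-server with all layers offloaded to the GPU, flash attention enabled, an f16 K cache, and a q8_0 V cache, among other settings.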
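The three builds described above could be configured roughly as follows. This is a minimal sketch, not the author's actual commands: the `configure_command` helper and the `build-<name>` directory names are hypothetical, and `-DGGML_VULKAN=ON` is assumed from the statement that the Vulkan backend was used. The GGML_SCHED_MAX_COPIES, GGML_BLAS, and GGML_BLAS_VENDOR values come from the post.

```python
# Extra CMake options per build variant, as named in the post.
BUILDS = {
    "default": [],                                   # GGML_SCHED_MAX_COPIES stays at 4
    "one_copy": ["-DGGML_SCHED_MAX_COPIES=1"],
    "no_pipeline": ["-DGGML_BLAS=ON", "-DGGML_BLAS_VENDOR=OpenBLAS"],
}

def configure_command(name):
    """Return the CMake configure command for one variant (hypothetical helper)."""
    # Assumption: Vulkan is enabled with -DGGML_VULKAN=ON in every variant.
    return ["cmake", "-B", f"build-{name}", "-DGGML_VULKAN=ON", *BUILDS[name]]

for name in BUILDS:
    print(" ".join(configure_command(name)))
```

Keeping each variant in its own build directory makes it possible to run the same llama-server settings against all three binaries and compare the compute buffer sizes reported at startup.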
Source: r/LocalLLaMA. Summaries are AI-generated; the original article is authoritative.