Pipeline parallelism in llama.cpp may be wasting your VRAM
r/LocalLLaMA top day·17 hours ago·Benchmark
The author compared three llama.cpp Vulkan builds: default 4 sched copies, 1 sched copy, and no pipeline parallelism. In their Qwen GGUF test, input and output throughput were nearly identical across all configurations. However, the default setting used about 1.5GB more VRAM for compute buffers and reduced usable context from roughly 113K tokens to around 88K, though parallel-request benefits were not tested.