ggml-webgpu improves prefill speeds for k-quants in llama.cpp PR
r/LocalLLaMA · 15 hours ago · Benchmark
llama.cpp PR #24225 speeds up ggml-webgpu matrix multiplication for k-quants and refactors the matmul paths for Q4/Q5/Q8 and k-quants. In pp512 (512-token prompt processing) tests on an M2 Pro, reported speedups range from about 1.33x to 3.78x across Q2_K, Q3_K, Q4_K, Q5_K, and Q6_K. The largest gains appear on Q3_K models, including Qwen and Gemma examples.
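For anyone reading the speedup figures: a multiplier like 3.78x is typically just the new prompt-processing throughput divided by the old one. A minimal Python sketch of that arithmetic follows. The `speedup` helper and the throughput values are hypothetical placeholders chosen to land on the two ends of the reported range; they are not measurements from the PR and are not tied to any particular quant.

```python
def speedup(baseline_tps: float, improved_tps: float) -> float:
    """Return improved/baseline throughput ratio (tokens per second)."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return improved_tps / baseline_tps


# Hypothetical pp512 throughputs in tokens/s (before, after).
# These are illustrative placeholders, NOT numbers from the PR.
hypothetical_runs = {
    "example_low": (300.0, 399.0),
    "example_high": (100.0, 378.0),
}

for name, (before, after) in hypothetical_runs.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```

A ratio above 1.0 means prefill got faster; the same prompt at 3.78x takes roughly a quarter of the original time to process.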