Latest in AI

Showing:gpu-backendsClear ×

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

llama.cpp Merges MTP Optimization Removing Padding and Extra D2D Copies
r/LocalLLaMA top day47 days agoRelease
llama.cpp merged PR #24086, which changes ggml_gated_delta_net so MTP passes snapshot count K as an operation parameter instead of deriving it from tensor shape. The change removes a padding workaround and copies emitted snapshots into the recurrent cache with a single strided ggml_cpy. Benchmarks on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf showed about a 4% throughput gain, with wall time falling from 21.71s to 20.91s.