llama.cpp Merges MTP Optimization Removing Padding and Extra D2D Copies

Original: Remove padding and multiple D2D copies for MTP by gaugarg-nv · Pull Request #24086 · ggml-org/llama.cpp

A merged llama.cpp PR improves MTP performance by about 4% by simplifying recurrent state handling and reducing device-to-device copies.

llama.cpp merged PR #24086, which changes ggml_gated_delta_net so that the multi-token prediction (MTP) path passes the snapshot count K as an explicit operation parameter instead of deriving it from the tensor shape. The change removes a padding workaround and copies the emitted snapshots into the recurrent cache with a single strided ggml_cpy rather than multiple device-to-device copies. Benchmarks on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf showed about a 4% throughput gain, with wall time falling from 21.71 s to 20.91 s.
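As a quick sanity check on the "about 4%" figure, the two wall times quoted above can be turned into a relative improvement. This is only arithmetic on the reported numbers; it assumes both runs processed the same workload, which the summary implies but does not state.

```python
# Wall times reported in the summary for the DGX Spark benchmark, in seconds.
before_s = 21.71
after_s = 20.91

# Fraction of wall time saved by the change.
time_saved = (before_s - after_s) / before_s

# Throughput gain for a fixed workload: the same work finishes in less time.
# (Assumption: both runs did identical work.)
throughput_gain = before_s / after_s - 1.0

print(f"wall time reduced by {time_saved:.1%}")    # wall time reduced by 3.7%
print(f"throughput up by {throughput_gain:.1%}")   # throughput up by 3.8%
```

Both figures round to the roughly 4% gain quoted for the PR.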

The Reddit post points to a merged llama.cpp pull request titled “Remove padding and multiple D2D copies for MTP,” authored by gaugarg-nv and merged into ggml-org/llama.cpp on June 10, 2026. The Reddit submission itself is short, describing it as “Another day, another MTP speedup,” but the GitHub PR provides the substance: it is a low-level performance cleanup in ggml and llama.cpp’s multi-token prediction path, centered on the ggml_gated_delta_net operation and how recurrent state snapshots are represented and copied.
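To illustrate the difference between several per-snapshot copies and one strided copy, here is a toy model in plain Python. It is not ggml code and does not reflect the real cache layout: the function names, the flat-list buffers, and the sizes (`state_size`, `dst_stride`) are all hypothetical, chosen only to show that a single copy described by a count, a block size, and a destination stride can place the same data as K separate copies.

```python
def copy_per_snapshot(src, dst, k, state_size, dst_stride):
    """Model of K separate copies: one contiguous block per snapshot."""
    calls = 0
    for i in range(k):
        block = src[i * state_size:(i + 1) * state_size]
        dst[i * dst_stride:i * dst_stride + state_size] = block
        calls += 1  # each iteration stands in for one device-to-device copy
    return calls


def copy_strided(src, dst, k, state_size, dst_stride):
    """Model of one copy whose destination is described by a stride."""
    for j in range(k * state_size):
        snapshot, offset = divmod(j, state_size)
        dst[snapshot * dst_stride + offset] = src[j]
    return 1  # the whole transfer is issued as a single operation


# Hypothetical sizes: 3 snapshots of 4 values each, written into cache
# slots that are 6 values apart.
k, state_size, dst_stride = 3, 4, 6
src = list(range(k * state_size))

dst_a = [-1] * (k * dst_stride)
dst_b = [-1] * (k * dst_stride)

calls_a = copy_per_snapshot(src, dst_a, k, state_size, dst_stride)
calls_b = copy_strided(src, dst_b, k, state_size, dst_stride)

print(dst_a == dst_b, calls_a, calls_b)
```

Both routines leave the destination in the same state; the strided version simply expresses the transfer as one operation, which is the kind of reduction in copy count the PR title describes.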

Summaries are AI-generated; the original article is authoritative.