Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question

Original: Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?

A LocalLLaMA user reports 44-55 TPS for Qwen3.6-MTP-27B on a Tesla V100 and asks how to optimize llama.cpp settings.

A Reddit user is running Qwen3.6-MTP-27B as a Q4_K_M GGUF with the llama.cpp server on a 32GB Tesla V100. They report a single peak of 55 tokens per second (TPS), with typical throughput closer to 44-48 TPS. The post asks whether the current settings (parallelism, speculative MTP draft settings, KV cache quantization, flash attention, and a 262K context window) are limiting performance, and whether they can be changed without degrading output quality.

A post on r/LocalLLaMA discusses practical inference tuning for Qwen3.6-MTP-27B running through llama.cpp on an NVIDIA Tesla V100 with 32GB of VRAM. The author uses a Q4_K_M GGUF build of the model and reports that generation speed reached 55 tokens per second once but typically lands around 44-48 tokens per second. The central question is whether this is a normal result for a V100-class GPU, or whether the configuration leaves meaningful performance on the table that could be recovered without reducing output quality.
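To give a feel for why the 262K context window and KV cache quantization matter on a 32GB card, here is a rough back-of-the-envelope sketch. It is not taken from the post: the layer count, KV head count, and head dimension below are hypothetical placeholders (the real values should be read from the GGUF metadata), and 262K is assumed to mean 262,144 tokens. The bytes-per-element figures follow the ggml block layouts for f16, q8_0, and q4_0.

```python
# Rough KV cache size estimate for one full-length sequence.
# Bytes per stored element: f16 uses 2 bytes; q8_0 packs 32 values into
# 34 bytes; q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate the combined K and V cache size in GiB."""
    # Factor of 2 covers the K tensor and the V tensor.
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


if __name__ == "__main__":
    # HYPOTHETICAL model shape, for illustration only. Only layers that use
    # standard attention keep a KV cache, so count those layers alone.
    shape = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)
    for n_ctx in (32_768, 262_144):
        for cache_type in BYTES_PER_ELEMENT:
            size = kv_cache_gib(n_ctx, cache_type=cache_type, **shape)
            print(f"ctx={n_ctx:>7} cache={cache_type:<5} ~{size:.2f} GiB")
```

The estimate scales linearly with context length, so if the full window is never used, reserving less context frees VRAM; whether that also changes generation speed on a V100 would need to be measured rather than assumed.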


Summaries are AI-generated; the original article is authoritative.