Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question
r/LocalLLaMA top day·yesterday·Benchmark
A Reddit user is running Qwen3.6-MTP-27B as a Q4_K_M GGUF on the llama.cpp server with a 32 GB Tesla V100. They report a single peak of 55 tokens per second (TPS), but typical throughput is closer to 44-48 TPS. The post asks whether their settings for parallelism, speculative MTP drafting, KV cache quantization, flash attention, and a 262K context window are limiting throughput without improving output quality.
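
To put the reported numbers in proportion, here is a minimal Python sketch of the gap between the typical range and the one-off peak. It uses only the figures quoted in the post; the helper name `shortfall_pct` is hypothetical and chosen for illustration, and nothing here measures or models llama.cpp itself.

```python
def shortfall_pct(peak_tps: float, observed_tps: float) -> float:
    """Percent by which an observed throughput falls below the peak."""
    return (peak_tps - observed_tps) / peak_tps * 100.0


# Figures as reported in the post (tokens per second).
PEAK_TPS = 55.0
TYPICAL_RANGE_TPS = (44.0, 48.0)

for tps in TYPICAL_RANGE_TPS:
    gap = shortfall_pct(PEAK_TPS, tps)
    print(f"{tps:.0f} TPS is {gap:.1f}% below the {PEAK_TPS:.0f} TPS peak")
```

By this arithmetic the typical range sits roughly 13 to 20 percent below the peak, which is the gap the poster hopes to recover through flag changes.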