r/LocalLLaMA top day | Jun 8, 2026, 3:51 PM | /u/No-Selection2972 | important 72

Xiaomi Claims 1,000+ TPS on a 1T Model Using a Standard 8-GPU Server

Original: Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server

Xiaomi MiMo claims a 1T MoE model can exceed 1,000 tokens/s on one standard 8-GPU node.

Xiaomi announced MiMo-V2.5-Pro-UltraSpeed with TileRT, claiming over 1,000 tokens/s decode speed on a 1-trillion-parameter MoE model. The company says it runs on a single standard 8-GPU commodity node, not wafer-scale or SRAM-heavy specialized hardware. The claimed stack combines FP4 MoE expert quantization, DFlash speculative decoding, and TileRT low-latency inference kernels, but independent validation is still needed.
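
As a rough back-of-the-envelope illustration of why quantizing only the MoE experts matters for memory, the sketch below estimates weight storage for a 1-trillion-parameter model. The 95% expert share and the 16-bit baseline precision are assumptions made for illustration; neither figure appears in the announcement, and the estimate ignores quantization scale factors, KV cache, and activations.

```python
def weight_bytes(total_params, expert_fraction, expert_bits, other_bits):
    """Estimate weight storage in bytes for a MoE model whose expert
    weights and non-expert weights are stored at different bit widths."""
    expert = total_params * expert_fraction * expert_bits / 8
    other = total_params * (1.0 - expert_fraction) * other_bits / 8
    return expert + other


TOTAL_PARAMS = 1e12      # "1T" model, as claimed in the announcement
EXPERT_FRACTION = 0.95   # ASSUMPTION: share of parameters held in experts
BASELINE_BITS = 16       # ASSUMPTION: unquantized weights are 16-bit

# Everything at the assumed 16-bit baseline.
baseline = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, BASELINE_BITS, BASELINE_BITS)
# Experts at 4 bits, all other modules left at the assumed baseline.
mixed = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, 4, BASELINE_BITS)

print(f"all 16-bit:        {baseline / 1e9:.0f} GB")
print(f"FP4 experts only:  {mixed / 1e9:.0f} GB")
```

Under these assumptions the weights shrink from roughly 2,000 GB to roughly 575 GB. Whether that fits a given 8-GPU node depends on per-GPU memory, which the summary does not state.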

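On June 8, 2026, Xiaomi MiMo announced MiMo-V2.5-Pro-UltraSpeed, claiming that its collaboration with TileRT pushed output decode speed for a 1-trillion-parameter-class MoE model past 1,000 tokens/s, with demos peaking at about 1,200 tokens/s. The focus of this r/LocalLLaMA post is that, if the official claim holds, the speed was achieved on a single standard 8-GPU commodity node rather than on Cerebras's wafer-scale integrated hardware or on specialized architectures with large amounts of on-chip SRAM such as Groq's. That would make it a meaningful marker for the cost and deployment threshold of real-time large-model inference.

The official article attributes the result to model-system co-design. On the model side, FP4 quantization reduces memory footprint and bandwidth pressure. Rather than quantizing the whole model indiscriminately, only the MoE experts of MiMo-V2.5-Pro are quantized; the other modules keep their original precision, and FP4 QAT is used to preserve capability as far as possible. On the decoding side, DFlash speculative decoding uses block-level masked parallel prediction so that the draft stage predicts multiple tokens at once, which the large model then verifies. The officially reported average acceptance lengths are 6.30 for coding, 5.56 for math/reasoning, and 4.29 for agent scenarios, though the article also concedes that acceptance rates in general conversation are not yet high enough. On the system side, TileRT provides a compilation engine and kernels optimized for this quantization and speculative decoding pipeline, including persistent engine kernels, warp specialization, and other designs that reduce operator launch and synchronization overhead.

Xiaomi also released the MiMo-V2.5-Pro-FP4-DFlash checkpoint on HuggingFace and is offering limited, application-based trial access and an API from June 9 to June 23. Overall, this is an efficient-inference case worth tracking, with potential impact on coding agents, real-time interaction, and high-throughput agent systems in particular. However, the main evidence so far comes from Xiaomi's and TileRT's own accounts, and no independently reproducible benchmark has appeared yet, so its importance should be assessed conservatively.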
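
To make the reported acceptance lengths concrete, the sketch below converts them into a per-cycle time budget for the 1,000 tokens/s target. It assumes that "acceptance length" means the average number of tokens emitted per draft-plus-verify cycle, which the summary does not define precisely; the function name `step_budget_ms` is hypothetical.

```python
# Average acceptance lengths as reported in the official article.
ACCEPTANCE_LENGTH = {
    "coding": 6.30,
    "math/reasoning": 5.56,
    "agent": 4.29,
}


def step_budget_ms(target_tps, tokens_per_step):
    """Milliseconds available per draft+verify cycle if each cycle emits
    `tokens_per_step` tokens and the goal is `target_tps` tokens/s.
    ASSUMPTION: one cycle covers both the draft and the verification pass."""
    return 1000.0 * tokens_per_step / target_tps


TARGET_TPS = 1000  # headline claim: 1,000+ tokens/s decode

for scenario, length in ACCEPTANCE_LENGTH.items():
    budget = step_budget_ms(TARGET_TPS, length)
    print(f"{scenario}: {budget:.2f} ms per cycle")

# Without speculation each cycle emits one token, so the budget is 1 ms.
print(f"no speculation: {step_budget_ms(TARGET_TPS, 1.0):.2f} ms per cycle")
```

Read this way, a coding workload leaves about 6.3 ms per cycle, while plain one-token-at-a-time decoding would need each full forward pass to finish in about 1 ms. That gap is why lower acceptance rates in general conversation would directly reduce the achievable speed.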

Summaries are AI-generated; the original article is authoritative.