Latest in AI

Showing: speculative-decoding


GLM-5.2: World's Top Open Frontend Coding Model + IndexShare Speculative Decoding
Latent Space · 41 days ago · Release
GLM-5.2 has claimed the leading position worldwide among open models on frontend coding benchmarks, marking a significant milestone for the open-source AI ecosystem. The release is accompanied by IndexShare, a new method targeting speculative decoding to improve inference throughput and reduce serving latency. Together, the two developments advance both capability and deployment efficiency for teams building with open models.
Why MoE Models Benefit More from Speculative Decoding
Cohere Blog · 46 days ago · Benchmark
Cohere analyzes why speculative decoding behaves differently on Mixture-of-Experts models than on dense LLMs. Its benchmarks show MoE speedups can peak at moderate batch sizes because sparse expert routing keeps verification bandwidth-bound. The post also finds that temporal expert overlap and fixed overhead amortization make multi-token verification cheaper than simple worst-case models predict.
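To make the bandwidth argument in this summary concrete, here is a minimal sketch of how many distinct experts a multi-token verification step touches. It is not Cohere's model: the function names, the expert counts, and the assumption that every token picks its experts independently and uniformly at random are all illustrative. Real routers are correlated across neighbouring tokens (the "temporal expert overlap" mentioned above), which would push the count lower still.

```python
def expected_active_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts activated by num_tokens tokens.

    Assumption (illustrative only): each token independently selects top_k
    of num_experts experts uniformly at random.
    """
    # Probability that one specific expert is missed by every token.
    miss_prob = (1.0 - top_k / num_experts) ** num_tokens
    return num_experts * (1.0 - miss_prob)


def worst_case_active_experts(num_experts: int, top_k: int, num_tokens: int) -> int:
    """Simple worst case: no two tokens ever share an expert."""
    return min(num_experts, top_k * num_tokens)


if __name__ == "__main__":
    # Hypothetical configuration: 64 experts, top-2 routing.
    for tokens in (1, 4, 8, 16):
        expected = expected_active_experts(64, 2, tokens)
        worst = worst_case_active_experts(64, 2, tokens)
        print(f"{tokens:>2} tokens: expected {expected:5.2f} experts, worst case {worst}")
```

Under these assumptions the expected count never exceeds the worst-case count, so the expert weights that must be read from memory grow more slowly than the number of tokens being verified.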
Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question
r/LocalLLaMA top day · 48 days ago · Benchmark
A Reddit user is running Qwen3.6-MTP-27B-MTP in Q4_K_M GGUF format with llama.cpp server on a 32GB Tesla V100. They report one peak of 55 tokens per second, but typical throughput is closer to 44-48 TPS. The post asks whether flags such as parallelism, speculative MTP draft settings, KV cache quantization, flash attention, and a 262K context window are limiting performance without improving output quality.
Packed twin inference doubles Qwen3.6-27B throughput on one MI50
r/LocalLLaMA top day · 49 days ago · Benchmark
A LocalLLaMA user shared an early packed-twin-inference experiment for local LLM acceleration. The idea resembles speculative decoding, but runs two copies of the same quantized model side by side instead of pairing it with a smaller draft model. On a single AMD MI50, the author reports Qwen3.6-27B improving from 19.4 to 38.1 tokens/s, with Q8-or-lower quantization as the main target.
Xiaomi Claims 1,000+ TPS on a 1T Model Using a Standard 8-GPU Server ★ 72
r/LocalLLaMA top day · 49 days ago · Benchmark
Xiaomi announced MiMo-V2.5-Pro-UltraSpeed with TileRT, claiming over 1,000 tokens/s decode speed on a 1-trillion-parameter MoE model. The company says it runs on a single standard 8-GPU commodity node, not wafer-scale or SRAM-heavy specialized hardware. The claimed stack combines FP4 MoE expert quantization, DFlash speculative decoding, and TileRT low-latency inference kernels, but independent validation is still needed.
llama.cpp Gemma4 MTP Support Merged
r/LocalLLaMA top day · 51 days ago · Release
llama.cpp PR #23398 was merged on June 7, 2026, adding MTP support for Gemma4 models. The author reports over 2x average speedup on dense models, no observed speedup on MoE, and replicated AIME-26 results around 87%. Support currently covers 31B and 26B-4B variants, while E4B and E2B are not supported yet; multi-GPU may need extra draft-device configuration.
Accelerating a Qwen3-8B Agent on Intel Core Ultra with a Depth-Pruned Draft Model ★ 75
Hugging Face Blog · 302 days ago · Tutorial
As AI Agent applications become increasingly widespread, running large language models (LLMs) efficiently on personal computers (such as AI PCs powered by…
OpenAI gpt-oss Speedup Tricks You Can Use Directly in Transformers 🫵 ★ 82
Hugging Face Blog · 320 days ago · Tutorial
When running large language models (LLMs), autoregressive generation is inherently "memory-bandwidth-bound"…
Faster Text Generation with Self-Speculative Decoding: Meta Introduces LayerSkip ★ 78
Hugging Face Blog · 615 days ago · Release
The slow autoregressive generation speed of large language models (LLMs) has long been a major bottleneck in real-world deployment. While "speculative…
Universal Assisted Generation: Assisted Generation with Any Assistant Model for Substantially Faster Decoding ★ 85
Hugging Face Blog · 637 days ago · Release
In the deployment and inference of large language models (LLMs), reducing generation latency has always been a critical challenge. The traditional approach of…
Faster Hugging Face Assisted Generation with Dynamic Speculation ★ 75
Hugging Face Blog · 658 days ago · Release
Hugging Face has published a technical blog post on "Dynamic Speculation," aimed at optimizing the inference speed of large language models (LLMs)…
Intel Gaudi Supports Faster Assisted Generation, Significantly Speeding Up LLM Inference
Hugging Face Blog · 784 days ago · Release
Hugging Face, in collaboration with Intel, has announced official support for "Assisted Generation" (also commonly known as Speculative Decoding) on Intel…
High-Performance ASR, Speaker Diarization, and Speculative Decoding with Hugging Face Inference Endpoints ★ 75
Hugging Face Blog · 818 days ago · Tutorial
This technical blog post from Hugging Face introduces how to build a powerful and efficient speech processing system using Hugging Face Inference Endpoints — a…
Accelerating StarCoder on Xeon Processors with 🤗 Optimum Intel: Q8/Q4 Quantization and Speculative Decoding
Hugging Face Blog · 910 days ago · Tutorial
This Hugging Face blog post explores in detail how to use the `Optimum Intel` library to accelerate inference for the StarCoder code-generation model on Intel…
Speeding Up Whisper Inference 2x with Speculative Decoding ★ 75
Hugging Face Blog · 951 days ago · Tutorial
The Hugging Face official blog introduces how to use "Speculative Decoding" to more than double the inference speed of OpenAI's Whisper speech-to-text model…
Optimizing Your Large Language Models (LLMs) in Production: A Hugging Face Practical Guide ★ 85
Hugging Face Blog · 1,047 days ago · Tutorial
This technical guide from Hugging Face systematically introduces the core strategies for deploying and optimizing large language models (LLMs) in production…
Hugging Face Introduces Assisted Generation: A New Direction Toward Low-Latency Text Generation ★ 85
Hugging Face Blog · 1,174 days ago · Release
Large language models (LLMs) typically generate text using an "autoregressive" mechanism, meaning the model must generate one token at a time. Each generation…
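The one-token-at-a-time loop described in this summary, and the draft-and-verify idea that the speculative decoding posts in this feed build on, can be sketched with toy stand-ins. This is a simplified greedy variant, not Hugging Face's implementation: `target`, `draft`, `greedy_decode`, and `speculative_decode` are hypothetical names, the "models" are plain functions over integer tokens, and the extra bonus token that real systems emit after a fully accepted draft is omitted.

```python
from typing import Callable, List

# A toy "model": maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Greedy draft-and-verify decoding; output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The cheap draft model proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(min(draft_len, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks each proposed position. A real system scores all
        #    positions in one batched forward pass; here it is a loop for clarity.
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # keep the target's token either way
            if correct != guess:      # first mismatch: discard the rest of the draft
                break
        tokens.extend(accepted)
    return tokens


def target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(target, draft, [0], max_new_tokens=12))
```

Because every kept token comes from the target, the draft model only affects how many tokens are confirmed per verification round, not what text is produced.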
