Latest in AI

Showing: inference-optimization

GLM-5.2: World's Top Open Frontend Coding Model + IndexShare Speculative Decoding
Latent Space | 41 days ago | Release
GLM-5.2 claims the top spot among open models on frontend coding benchmarks. The release is accompanied by IndexShare, a new speculative decoding method intended to improve inference throughput and reduce serving latency. Together, the two developments advance both capability and deployment efficiency for teams building with open models.
OrcaRouter Multi-Model Teaming Matches and Surpasses Fable 5 at Low Cost
量子位 QbitAI | 43 days ago | New Tool
OrcaRouter is a multi-model routing system that claims to replicate — and in some benchmarks surpass — the performance of Anthropic's flagship Claude Fable 5 model at substantially lower inference cost. Rather than relying on a single expensive frontier model, it orchestrates a team of smaller, cheaper models to collectively achieve top-tier output quality. The approach signals growing maturity in LLM routing and ensemble strategies as practical cost-efficiency tools for production AI systems.
Why MoE Models Benefit More from Speculative Decoding
Cohere Blog | 46 days ago | Benchmark
Cohere analyzes why speculative decoding behaves differently on Mixture-of-Experts models than on dense LLMs. Its benchmarks show MoE speedups can peak at moderate batch sizes because sparse expert routing keeps verification bandwidth-bound. The post also finds that temporal expert overlap and fixed overhead amortization make multi-token verification cheaper than simple worst-case models predict.
Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question
r/LocalLLaMA top day | 48 days ago | Benchmark
A Reddit user is running Qwen3.6-MTP-27B in Q4_K_M GGUF format with llama.cpp server on a 32GB Tesla V100. They report a single peak of 55 tokens per second (TPS), but typical throughput is closer to 44-48 TPS. The post asks whether settings such as parallelism, speculative MTP draft options, KV cache quantization, flash attention, and a 262K context window are limiting performance without improving output quality.
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
r/LocalLLaMA top day | 48 days ago | Paper
OSCAR applies offline-precomputed rotation matrices—derived from spectral covariance analysis—to reshape KV tensor distributions before 2-bit quantization, suppressing outliers and reducing rounding error. The rotation adds negligible inference overhead since it requires no runtime learning. GGUF downloads for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking are available, with llama.cpp and sglang integrations and an arXiv paper.
Watch agents fight: a live challenge to speed up Gemma 4 E4B inference on a single A10G
r/LocalLLaMA top day | 48 days ago | Benchmark
A public HuggingFace Spaces dashboard hosts a live competition where AI agents race to optimize Gemma 4 E4B inference throughput on a single NVIDIA A10G GPU. The challenge gamifies ML inference engineering, letting anyone watch agents explore quantization and scheduling strategies in real time. Optimization recipes surfaced by the competition offer practical value for developers targeting single-GPU self-hosted Gemma 4 deployments.
OpenAI gpt-oss Speedup Tricks You Can Use Directly in Transformers 🫵
Hugging Face Blog | 320 days ago | Tutorial | ★ 82
Background and the LLM inference bottleneck: when running large language models (LLMs), autoregressive generation is inherently "memory-bandwidth-bound"…
Make Your ZeroGPU Spaces Fly: Eliminating Cold-Start Latency with PyTorch Ahead-of-Time (AOT) Compilation
Hugging Face Blog | 329 days ago | Tutorial | ★ 75
Hugging Face's ZeroGPU Spaces offers developers a free and efficient way to deploy GPU-accelerated AI applications. However, ZeroGPU uses a dynamic allocation…
Fast LoRA Inference for Flux with Diffusers and PEFT
Hugging Face Blog | 370 days ago | Tutorial | ★ 80
This technical guide from Hugging Face takes an in-depth look at how to accelerate LoRA (Low-Rank Adaptation) inference for Flux.1, the powerful open-source…
Implementing KV Cache in nanoVLM from Scratch
Hugging Face Blog | 419 days ago | Tutorial | ★ 75
In the inference process of large language models (LLMs) and vision-language models (VLMs), autoregressive decoding is a major performance bottleneck. Each…
Bamba: An Inference-Efficient Hybrid Mamba2 Open Model Is Officially Released
Hugging Face Blog | 587 days ago | Release | ★ 75
Background and architectural innovation: as large language models (LLMs) have advanced rapidly, the traditional Transformer architecture faces severe…
Faster Text Generation with Self-Speculative Decoding: Meta Introduces LayerSkip
Hugging Face Blog | 615 days ago | Release | ★ 78
The slow autoregressive generation speed of large language models (LLMs) has long been a major bottleneck in real-world deployment. While "speculative…
Universal Assisted Generation: Assisted Generation with Any Assistant Model for Much Faster Decoding
Hugging Face Blog | 637 days ago | Release | ★ 85
In the deployment and inference of large language models (LLMs), reducing generation latency has always been a critical challenge. The traditional approach of…
Speeding Up Hugging Face Assisted Generation with Dynamic Speculation
Hugging Face Blog | 658 days ago | Release | ★ 75
Hugging Face has published a technical blog post on "Dynamic Speculation," aimed at optimizing the inference speed of large language models (LLMs)…
Intel Gaudi Supports Faster Assisted Generation, Significantly Boosting LLM Inference Speed
Hugging Face Blog | 784 days ago | Release
Hugging Face, in collaboration with Intel, has announced official support for "Assisted Generation" (also commonly known as Speculative Decoding) on Intel…
Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive
Hugging Face Blog | 925 days ago | Tutorial | ★ 75
SD Turbo and SDXL Turbo are single-step/few-step text-to-image models from Stability AI, with their core innovation being Adversarial Diffusion Distillation…
Speeding Up Whisper Inference 2x with Speculative Decoding
Hugging Face Blog | 951 days ago | Tutorial | ★ 75
The Hugging Face official blog introduces how to use "Speculative Decoding" to more than double the inference speed of OpenAI's Whisper speech-to-text model…
Exploring Simple Optimizations for SDXL: A Practical Guide to Large Speedups and VRAM Savings
Hugging Face Blog | 1,008 days ago | Tutorial | ★ 75
Stable Diffusion XL (SDXL) is a powerful but architecturally large text-to-image model whose parameter count far exceeds that of the previous SD 1.5, placing…
Accelerating Over 130,000 Hugging Face Models with ONNX Runtime
Hugging Face Blog | 1,028 days ago | New Tool | ★ 75
Hugging Face officially announced a deep collaboration with Microsoft to integrate ONNX Runtime (ORT) into the Hugging Face ecosystem. This partnership enables…
Accelerating Stable Diffusion XL Inference with JAX and Cloud TPU v5e
Hugging Face Blog | 1,029 days ago | Tutorial | ★ 70
With the widespread adoption of high-quality open-source image generation models like Stable Diffusion XL (SDXL), reducing inference latency and controlling…
Introducing RWKV: A New RNN Architecture with the Advantages of Transformers
Hugging Face Blog | 1,170 days ago | Release | ★ 75
Hugging Face has announced official support for RWKV (Receptance Weighted Key Value) models in its `transformers` library. RWKV is an innovative architecture…
Hugging Face Introduces Assisted Generation: A New Direction Toward Low-Latency Text Generation
Hugging Face Blog | 1,174 days ago | Release | ★ 85
Large language models (LLMs) typically generate text using an "autoregressive" mechanism, meaning the model must generate one token at a time. Each generation…
Faster Text Generation with TensorFlow and XLA
Hugging Face Blog | 1,462 days ago | Tutorial
This Hugging Face technical blog post takes an in-depth look at how to use TensorFlow's XLA (Accelerated Linear Algebra) compiler to dramatically speed up the…
Accelerating BERT Inference with Hugging Face Transformers and AWS Inferentia
Hugging Face Blog | 1,595 days ago | Tutorial
When deploying large language models such as BERT in production environments, inference latency and computational cost are often two major pain points for…
A Guide to Faster TensorFlow Models in Hugging Face Transformers and Deployment with TF Serving
Hugging Face Blog | 2,009 days ago | Tutorial
When deploying Transformer models in production environments, latency and throughput are often the deciding factors for a project's success. Hugging Face…
