Latest in AI

Showing: quantization · Researchers

Production-Ready W4A8: vLLM Integration and Quality Recovery Techniques
Cohere Blog · 46 days ago · Tutorial
Cohere’s post appears to explain how W4A8 quantization can be prepared for production inference through vLLM integration. From the title, the focus is likely on deployment mechanics and techniques for recovering model quality after aggressive quantization. Because no article body is available, specific benchmarks, supported models, implementation steps, and measured quality gains cannot be confirmed.
NVIDIA Releases NVFP4-Quantized DiffusionGemma 26B A4B IT on Hugging Face
r/LocalLLaMA top day · 47 days ago · Release
NVIDIA has released DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, a quantized version of Google DeepMind's open-weights multimodal model. Built on a Mixture-of-Experts architecture with 25.2B total but only 3.8B active parameters, it generates text in parallel 256-token blocks using discrete diffusion, exceeding 1,100 tokens per second on H100 hardware. The model supports a 256K-token context, text/image/video inputs, native function calling, reasoning mode, and 35+ languages.
LocalLLaMA User Weighs QAT Gemma 31B GGUF Quants for RTX 3060
r/LocalLLaMA top day · 47 days ago · Commentary
A Reddit user with an RTX 3060 12GB and 32GB DDR3 RAM is evaluating new QAT-based Gemma 31B GGUF quantizations. They currently run an older Unsloth Gemma 31B IQ3_XXS build at long context, with some tensor and mmproj offloading to CPU. The post asks which Q2-Q3 quant to choose, whether QAT changes quality expectations, and whether MTP would help or hurt under tight VRAM limits.
Bonsai LM 1-bit and 1.58-bit Benchmarks on Jetson Orin Nano Super
r/LocalLLaMA top day · 47 days ago · Benchmark
A LocalLLaMA post benchmarks five Bonsai LM models, from 1.7B to about 8B parameters, on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA. The tests compare 7W, 15W, 25W, and MAXN modes across latency, throughput, energy per token, and thermals. The main takeaway is that 25W is usually the best efficiency/performance point for models up to 4B, while Bonsai-8B may favor 15W for lower power.
Unsloth releases GGUF version of Cohere North-Mini-Code 1.0 (30B A3B MoE) on Hugging Face
r/LocalLLaMA top day · 48 days ago · Release
Unsloth uploaded a GGUF version of Cohere's North-Mini-Code 1.0 to Hugging Face, making local inference possible for this 30B A3B MoE coding-focused model. The poster links the release to llama.cpp PR #24260, suggesting new architecture support may be required. No benchmarks or test results have been shared yet; this is an early community resource post.
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
r/LocalLLaMA top day · 48 days ago · Paper
OSCAR applies offline-precomputed rotation matrices, derived from spectral covariance analysis, to reshape KV tensor distributions before 2-bit quantization, suppressing outliers and reducing rounding error. The rotation adds negligible inference overhead since it requires no runtime learning. GGUF downloads for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking are available, with llama.cpp and sglang integrations and an arXiv paper.
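To illustrate why rotating a tensor before very low-bit quantization can reduce error, here is a minimal sketch. It is not OSCAR's method: it swaps in a fixed orthonormal Hadamard rotation instead of a covariance-derived one, uses a plain min-max 4-level quantizer, and runs on synthetic data with two injected outliers. All function names and values are hypothetical.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(len(out))
    return [x / norm for x in out]

def quantize_2bit(v):
    """Min-max uniform quantization to 4 levels, returned already dequantized."""
    lo, hi = min(v), max(v)
    step = (hi - lo) / 3 or 1.0  # avoid division by zero on constant input
    return [lo + round((x - lo) / step) * step for x in v]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic "KV channel": mostly small values plus two large outliers (assumed data).
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]
x[3], x[40] = 20.0, -15.0

# Quantize directly: the outliers stretch the range, so small values lose resolution.
plain_mse = mse(x, quantize_2bit(x))

# Rotate, quantize, rotate back. The orthonormal Hadamard matrix is its own inverse.
restored = fwht(quantize_2bit(fwht(x)))
rotated_mse = mse(x, restored)

print(f"plain 2-bit MSE:   {plain_mse:.3f}")
print(f"rotated 2-bit MSE: {rotated_mse:.3f}")
```

The rotation spreads each outlier's energy across all coordinates, which narrows the range the 4 levels must cover. A data-dependent rotation, as the summary describes, would be computed offline from calibration statistics rather than fixed in advance.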
Unsloth Gemma 4 QAT MTP assistant models now available
r/LocalLLaMA top day · 48 days ago · Release
An r/LocalLLaMA post notes that Unsloth’s Gemma 4 QAT MTP assistant models are now available in GGUF format. The root directories include q8_0 files named mtp-gemma-4-*.gguf, while MTP folders contain q8_0 and larger quantized variants. The listed releases cover 12B, 26B-A4B, 31B, E2B, E2B mobile, E4B, and E4B mobile it-qat-GGUF repositories.
Jetson Orin NX Build for Hermes Agent + Benchmarking
r/LocalLLaMA top day · 49 days ago · Hardware
The post describes turning an unused Jetson Orin NX into a compact local LLM server for Hermes Agent testing. The goals were low noise, over 10 tok/s generation, 300 tok/s prompt processing, at least 65K context, and a custom case. After testing Gemma 4, Qwen 3.6, and many quant variants, the author reports Gemma 4 26B A4B UD Q2_K_XL reaching 66K context and 10.21 tok/s near 60K context.
Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants?
r/LocalLLaMA top day · 49 days ago · Benchmark
An r/LocalLLaMA user is looking for benchmarks comparing Gemma 4 4-bit QAT models, via Unsloth, against standard 8-bit non-QAT quantized models. They understand QAT is expected to preserve much of the BF16 baseline accuracy, but want hard numbers against traditional 8-bit PTQ. The post highlights scattered feedback but no clear head-to-head evaluation yet.
ggml-webgpu improves prefill speeds for k-quants in llama.cpp PR
r/LocalLLaMA top day · 49 days ago · Benchmark
llama.cpp PR #24225 improves ggml-webgpu matrix multiplication performance for k-quants and refactors matmul paths for Q4/Q5/Q8 and k-quants. In pp512 tests on an M2 Pro, reported speedups range from about 1.33x to 3.78x across Q2_K, Q3_K, Q4_K, Q5_K, and Q6_K. The largest gains appear on Q3_K models, including Qwen and Gemma examples.
Packed twin inference doubles Qwen3.6-27B throughput on one MI50
r/LocalLLaMA top day · 49 days ago · Benchmark
A LocalLLaMA user shared an early packed-twin-inference experiment for local LLM acceleration. The idea resembles speculative decoding, but uses the same quantized model side-by-side instead of a smaller draft model. On a single AMD MI50, the author reports Qwen3.6-27B improving from 19.4 to 38.1 tok/s, with Q8-or-lower quantization as the main target.
Quick note on recent QAT issues
r/LocalLLaMA top day · 49 days ago · Commentary
The post argues that recent Google QAT quantization has several implementation problems, including token embeddings being quantized to Q6_K instead of using a pure mode. It also claims llama-quantize has a hardcoded parameter that mismatches some optimized groups, and that 32-block groups are misaligned. The author recommends Unsloth UD Q4_K_XL as a temporary option and says they are working on a patch.
Qwen3.6-35B-A3B Tool Calling Benchmark: ByteShape vs Unsloth GGUFs
r/LocalLLaMA top day · 49 days ago · Benchmark
The post benchmarks eight Qwen3.6-35B-A3B GGUF quants from ByteShape and Unsloth using llama.cpp and tool-eval-bench. It compares f16, q8_0, and q4_0 KV cache quantization under short and long-context pressure, totaling 144 runs and roughly 300 GPU-hours. The author reports no clear ByteShape versus Unsloth winner, q8_0 as close to a free lunch, q4_0 as weaker, and long context as a major tool-calling degradation factor.
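For a rough sense of what the f16, q8_0, and q4_0 KV cache settings trade off in memory, the sketch below estimates cache size for a plain transformer. The model dimensions are hypothetical placeholders, not Qwen3.6-35B-A3B's actual configuration, and the formula assumes every layer keeps a full K and V cache. The bits-per-element figures follow the ggml block layouts: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Effective bits per cached element for each ggml cache type.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def kv_cache_gib(layers, kv_heads, head_dim, context, cache_type):
    """Estimate KV cache size in GiB, assuming full attention in every layer."""
    elements = 2 * layers * kv_heads * head_dim * context  # 2 = keys and values
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=40, kv_heads=8, head_dim=128, context=65536)

for cache_type in ("f16", "q8_0", "q4_0"):
    print(f"{cache_type:>5}: {kv_cache_gib(**shape, cache_type=cache_type):.4f} GiB")
```

Under these assumptions q8_0 roughly halves the cache relative to f16, and q4_0 saves a further quarter of the f16 size, which is the memory side of the quality trade-off the post reports.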
Was BitNet a dead end? What happened to ternary LLMs?
r/LocalLLaMA top day · 49 days ago · Commentary
An r/LocalLLaMA user questions whether BitNet and ternary LLMs were a dead end after earlier promise around efficient low-bit models. The post notes that the largest ternary model appears to remain around 2B parameters. It asks why frontier open-weight AI labs are not visibly pursuing the approach, but provides no technical evidence or definitive answer.
An Implementation of NanoQuant: A Flexible Binary Quantization Method
r/LocalLLaMA top day · 49 days ago · New Tool
An r/LocalLLaMA post presents an unofficial PyTorch implementation of NanoQuant, a 2026 post-training quantization method for dense transformers. The method factorizes weights into scaling vectors and binary matrices, then quantizes and fine-tunes blocks sequentially to reduce hardware requirements. Early Qwen3-0.6B and Qwen3-4B experiments are promising for base models, but instruct quality remains weak and highly dependent on calibration data.
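The idea of splitting weights into scales and a binary matrix can be sketched in its simplest form: one scale per row times a matrix of signs. This is a classic binary-weight approximation used here for illustration only. It is an assumption that it resembles NanoQuant's factorization, and it omits the sequential block quantization and fine-tuning the summary mentions. The function names are hypothetical.

```python
def binarize_rows(weights):
    """Approximate each row as scale * sign(row), with scale = mean(|row|)."""
    scales, signs = [], []
    for row in weights:
        signs.append([1 if w >= 0 else -1 for w in row])   # 1 bit per weight
        scales.append(sum(abs(w) for w in row) / len(row))  # one float per row
    return scales, signs

def reconstruct(scales, signs):
    """Rebuild the dense approximation from the scaling vector and sign matrix."""
    return [[s * b for b in row] for s, row in zip(scales, signs)]

W = [[1.0, -3.0], [0.5, 0.5]]
scales, signs = binarize_rows(W)
W_hat = reconstruct(scales, signs)

print("scales:", scales)
print("signs: ", signs)
print("W_hat: ", W_hat)
```

The mean absolute value is the scale that minimizes squared error for a fixed sign pattern, which is why rows with uniform magnitudes (the second row here) are reproduced exactly while rows with mixed magnitudes are not.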
What was your local daily driver for coding last week?
r/LocalLLaMA top day · 49 days ago · Commentary
This r/LocalLLaMA post is a brief community poll asking users what their local coding daily driver was last week. The post asks commenters to share their favorite model and quant, but the provided text does not include poll options, results, or specific model names. Its value is mainly as a community signal for tracking local LLM coding preferences.
Google's Official Gemma 4 QAT Q4_0 GGUFs Have Higher Precision Than Unsloth's Q4_K_XL
r/LocalLLaMA top day · 50 days ago · Commentary
An analysis of Gemma 4 QAT GGUF files reveals that Google's official 'Q4_0' releases actually employ a mixed-precision strategy. For smaller models like E2B and E4B, Google keeps critical token embeddings in Q6_K and certain projection weights in F16. This makes Google's Q4_0 files larger and more precise than Unsloth's 'Q4_K_XL' versions, which default to standard Q4_0 for almost all tensors.
Gemma 4 31B FP8 Matches Claude Sonnet 4.6 Medium in Custom Benchmark ★ 75
r/LocalLLaMA top day · 50 days ago · Benchmark
A Reddit user shared benchmark results showing Google's Gemma 4 31B (FP8) performing on par with Claude Sonnet 4.6 Medium. The custom evaluation harness tested complex tasks including Neo4j Cypher queries, entity extraction, agentic tool calling, Python coding, and multi-vector retrieval synthesis. The result suggests that quantized mid-sized open-weight models are closing the gap with proprietary frontier models, at least on this single custom benchmark.
Exploring 2-bit QAT: Can Ultra-Compressed Large Models Outperform 4-bit Models Half Their Size?
r/LocalLLaMA top day · 50 days ago · Commentary
A popular Reddit thread on r/LocalLLaMA discusses the potential of 2-bit Quantization-Aware Training (QAT) for large MoE models (120B to 400B). While current QAT efforts focus on 4-bit, users speculate whether a 2-bit QAT model could fit into consumer hardware (64GB/128GB RAM) and outperform a 4-bit model of half its size. This approach is proposed as a practical alternative to training ternary (1.58-bit) LLMs from scratch.
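The memory side of this comparison is simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below works it through for example sizes picked from the 120B to 400B range in the thread. It is a back-of-the-envelope estimate that ignores quantization scales, higher-precision embeddings, KV cache, and runtime overhead, and the helper name is hypothetical.

```python
def weight_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return params * bits_per_weight / 8 / 1e9

# A 2-bit model and a 4-bit model of half its size need the same weight storage.
big_2bit = weight_gb(200_000_000_000, 2)
half_4bit = weight_gb(100_000_000_000, 4)
print(f"200B @ 2-bit: {big_2bit:.0f} GB, 100B @ 4-bit: {half_4bit:.0f} GB")

# Check the largest size mentioned in the thread against a 128 GB machine.
largest_2bit = weight_gb(400_000_000_000, 2)
print(f"400B @ 2-bit: {largest_2bit:.0f} GB, fits in 128 GB: {largest_2bit < 128}")
```

Because the footprints match, the open question in the thread is purely about quality: whether the larger model keeps enough accuracy at 2 bits to beat the smaller one at 4 bits.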
Gemma-4-26B-A4B QAT Variant Performs Poorly in llama.cpp Compared to Non-QAT Version
r/LocalLLaMA top day · 50 days ago · Benchmark
A LocalLLaMA user highlighted that the newly released QAT (Quantization-Aware Training) variant of Google's Gemma-4-26B-A4B model underperforms compared to its non-QAT predecessor. Testing via llama.cpp on a chessboard SVG generation task showed significant rendering errors in the QAT version. The non-QAT GGUF version, however, produced highly accurate results under identical settings.
Qwen 3.6 27B KV Cache Quantization Benchmarks: KVarN, Turbo, and TCQ Evaluated
r/LocalLLaMA top day · 51 days ago · Benchmark
Reddit user Anbeeld shared comprehensive KV cache quantization benchmarks for Qwen 3.6 27B across 75 configuration pairs. Using BeeLlama.cpp (a custom llama.cpp fork), the test evaluates q8, q6, q5, and q4 quantization levels. It specifically highlights advanced implementations like KVarN, TurboQuant, and TCQ to optimize long-context inference efficiency.
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency ★ 72
Hacker News (AI keywords) · 52 days ago · Release
Google released new Gemma 4 checkpoints optimized with Quantization-Aware Training to preserve quality after compression. The release includes Q4_0 checkpoints and a mobile-focused quantization format that can reduce Gemma 4 E2B memory use to about 1GB, or below 1GB for a text-only configuration. The models are available through Hugging Face and supported across llama.cpp, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, vLLM, MLX, and Unsloth.
GGML and llama.cpp Officially Join Hugging Face to Secure the Long-Term Future of Local AI ★ 95
Hugging Face Blog · 158 days ago · Business
A historic milestone has arrived in the open-source AI world: GGML and llama.cpp — the open-source projects founded by Georgi Gerganov that laid the…
OpenAI gpt-oss Speedup Tricks You Can Use Directly in Transformers 🫵 ★ 82
Hugging Face Blog · 320 days ago · Tutorial
Background and the LLM inference bottleneck: when running large language models (LLMs), autoregressive generation is inherently "memory-bandwidth-bound"…
Arm and ExecuTorch 0.7 Team Up: Bringing Generative AI to the Mass Market ★ 80
Hugging Face Blog · 348 days ago · Release
As generative AI advances rapidly, deploying massive models to resource-constrained edge devices — such as smartphones, smart hardware, and AI PCs — has become…
A Deep Dive into Hugging Face Diffusers Quantization Backends: Running Large Diffusion Models Efficiently on Consumer GPUs ★ 80
Hugging Face Blog · 433 days ago · Tutorial
As diffusion models (such as Flux.1 and Stable Diffusion 3) continue to grow in parameter count — often reaching tens of billions or even hundreds of billions…
Introducing AutoRound: Intel's Advanced Quantization Technique for LLMs and VLMs ★ 75
Hugging Face Blog · 455 days ago · Release
As large language models (LLMs) and vision language models (VLMs) continue to scale up, running these models on limited hardware resources — such as…
The Current State of Open-Source Video Generation Models in the Diffusers Library: A Technical Analysis ★ 82
Hugging Face Blog · 547 days ago · Commentary
This official Hugging Face blog post takes an in-depth look at the current state of open-source video generation models within the Diffusers ecosystem. As…
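Open LLM Leaderboard Carbon Emissions and Model Performance Analysis: Lessons on the Trade-off Between Performance and Environmental Impact
Hugging Face Blog · 565 days ago · Commentary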
Hugging Face recently published an in-depth analysis of its well-known Open LLM Leaderboard, examining the carbon dioxide (CO₂) emissions generated during…
Model Optimization and Deployment with Optimum-Intel and OpenVINO GenAI ★ 75
Hugging Face Blog · 676 days ago · Tutorial
This article provides a detailed look at how to use Hugging Face's `optimum-intel` library and Intel's OpenVINO GenAI toolkit to optimize and deploy generative…