Latest in AI

Showing:ggufClear ×

← Home

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

LocalLLaMA User Weighs QAT Gemma 31B GGUF Quants for RTX 3060
r/LocalLLaMA top day47 days agoCommentary
A Reddit user with an RTX 3060 12GB and 32GB DDR3 RAM is evaluating new QAT-based Gemma 31B GGUF quantizations. They currently run an older Unsloth Gemma 31B IQ3_XXS build at long context, with some tensor and mmproj offloading to CPU. The post asks which Q2-Q3 quant to choose, whether QAT changes quality expectations, and whether MTP would help or hurt under tight VRAM limits.
Unsloth releases GGUF version of Cohere North-Mini-Code 1.0 (30B A3B MoE) on Hugging Face
r/LocalLLaMA top day48 days agoRelease
Unsloth uploaded a GGUF version of Cohere's North-Mini-Code 1.0 to Hugging Face, making local inference possible for this 30B A3B MoE coding-focused model. The poster links the release to llama.cpp PR #24260, suggesting new architecture support may be required. No benchmarks or test results have been shared yet; this is an early community resource post.
Unsloth Gemma 4 QAT MTP assistant models now available
r/LocalLLaMA top day48 days agoRelease
A r/LocalLLaMA post notes that Unsloth’s Gemma 4 QAT MTP assistant models are now available in GGUF format. The root directories include q8_0 files named mtp-gemma-4-*.gguf, while MTP folders contain q8_0 and larger quantized variants. The listed releases cover 12B, 26B-A4B, 31B, E2B, E2B mobile, E4B, and E4B mobile it-qat-GGUF repositories.
Pipeline parallelism in llama.cpp may be wasting your VRAM
r/LocalLLaMA top day49 days agoBenchmark
The author compared three llama.cpp Vulkan builds: default 4 sched copies, 1 sched copy, and no pipeline parallelism. In their Qwen GGUF test, input and output throughput were nearly identical across all configurations. However, the default setting used about 1.5GB more VRAM for compute buffers and reduced usable context from roughly 113K tokens to around 88K, though parallel-request benefits were not tested.
Quick note on recent QAT issues
r/LocalLLaMA top day49 days agoCommentary
The post argues that recent Google QAT quantization has several implementation problems, including token embeddings being quantized to q6k instead of using a pure mode. It also claims llama-quantize has a hardcoded parameter that mismatches some optimized groups, and that 32-block groups are misaligned. The author recommends Unsloth UD Q4_K_XL as a temporary option and says they are working on a patch.
Qwen3.6-35B-A3B Tool Calling Benchmark: ByteShape vs Unsloth GGUFs
r/LocalLLaMA top day49 days agoBenchmark
The post benchmarks eight Qwen3.6-35B-A3B GGUF quants from ByteShape and Unsloth using llama.cpp and tool-eval-bench. It compares f16, q8_0, and q4_0 KV cache quantization under short and long-context pressure, totaling 144 runs and roughly 300 GPU-hours. The author reports no clear ByteShape versus Unsloth winner, q8_0 as close to a free lunch, q4_0 as weaker, and long context as a major tool-calling degradation factor.
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax★ 72
r/LocalLLaMA top day49 days agoNew Tool
Luce Spark is an open-source MoE offload system for running 33B-35B A3B models on 16GB-class GPUs. It keeps frequently routed experts on GPU, stores the long tail in system RAM, and swaps cold experts through a bounded async cache. The author reports 13.3 GiB for Qwen3.6 35B-A3B and about 100 tok/s with Spark optimizations, but notes real 16GB GPU testing is still missing.
Google's Official Gemma 4 QAT Q4_0 GGUFs Have Higher Precision Than Unsloth's Q4_K_XL
r/LocalLLaMA top day50 days agoCommentary
An analysis of Gemma 4 QAT GGUF files reveals that Google's official 'Q4_0' releases actually employ a mixed-precision strategy. For smaller models like E2B and E4B, Google keeps critical token embeddings in Q6_K and certain projection weights in F16. This makes Google's Q4_0 files larger and more precise than Unsloth's 'Q4_K_XL' versions, which default to standard Q4_0 for almost all tensors.
MTP and QAT: What is the Relation? Running Gemma 4 31B in llama.cpp
r/LocalLLaMA top day50 days agoCommentary
A popular Reddit thread addresses user confusion over running Gemma 4 31B locally. It distinguishes between MTP (Multi-Token Prediction for inference speedup) and QAT (Quantization-Aware Training for preserving 4-bit quality). It also confirms that llama.cpp's new MTP support requires updated GGUF files and a secondary draft model file for acceleration.
NVFP4 Support Merged in llama.cpp: How to Use 4-bit Blackwell Quantization
r/LocalLLaMA top day50 days agoCommentary
Following the merge of native NVFP4 (NVIDIA FP4) support in llama.cpp, users are exploring how to leverage this format on Blackwell GPUs (such as the RTX 50-series). The discussion focuses on converting NVFP4 safetensors (like Gemma 4 QAT) to GGUF format and whether importance matrices (imatrix) are required. This enablement promises significant performance gains for local LLM execution on next-gen hardware.
Gemma-4-26B-A4B QAT Variant Performs Poorly in llama.cpp Compared to Non-QAT Version
r/LocalLLaMA top day50 days agoBenchmark
A LocalLLaMA user highlighted that the newly released QAT (Quantization-Aware Training) variant of Google's Gemma-4-26B-A4B model underperforms compared to its non-QAT predecessor. Testing via llama.cpp on a chessboard SVG generation task showed significant rendering errors in the QAT version. The non-QAT GGUF version, however, produced highly accurate results under identical settings.
Qwen3.6 35B-A3B on a Laptop: A Local LLM "Zero to One" Milestone
r/LocalLLaMA top day50 days agoOpinion
A Reddit user detailed running Qwen3.6 35B-A3B (IQ3_XXS quantization) on an ASUS Zenbook Pro 14 (RTX 4060 8GB VRAM, 64GB RAM). Using llama.cpp, they achieved 27 TPS at 32k context and 18 TPS at 256k context. This setup serves as a highly capable, fully private local agent for file operations, CLI execution, and brainstorming, bypassing cloud privacy concerns.
start-llama: A Handy CLI Launcher for llama-server with Easy Customization
r/LocalLLaMA top day50 days agoNew Tool
A developer has released 'start-llama', a command-line utility designed to simplify launching llama-server (llama.cpp). It allows users to manage sensible default configurations, support multiple server binaries, and apply per-model or command-line overrides. This tool streamlines local LLM deployment into a single, easily configurable step.
GGML 與 llama.cpp 正式加入 Hugging Face，攜手保障本地端 AI 的長期發展★ 95
Hugging Face Blog158 days agoBusiness
A historic milestone has arrived in the open-source AI world: GGML and llama.cpp — the open-source projects founded by Georgi Gerganov that laid the…
llama.cpp 全新功能：更強大的模型管理機制（Model Management）與 Hugging Face 深度整合★ 85
Hugging Face Blog228 days agoRelease
The popular local large language model (LLM) inference tool `llama.cpp` has recently partnered with Hugging Face to launch a new "Model Management" mechanism…
GGML 基礎入門介紹：讓大語言模型在消費級硬體上高效運行的關鍵技術★ 80
Hugging Face Blog714 days agoTutorial
GGML is a lightweight, zero-dependency C/C++ tensor library developed by Georgi Gerganov. It was originally designed to enable efficient local inference of the…

Latest in AI

LocalLLaMA User Weighs QAT Gemma 31B GGUF Quants for RTX 3060

Unsloth releases GGUF version of Cohere North-Mini-Code 1.0 (30B A3B MoE) on Hugging Face

Unsloth Gemma 4 QAT MTP assistant models now available

Pipeline parallelism in llama.cpp may be wasting your VRAM

Quick note on recent QAT issues

Qwen3.6-35B-A3B Tool Calling Benchmark: ByteShape vs Unsloth GGUFs

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax★ 72

Google's Official Gemma 4 QAT Q4_0 GGUFs Have Higher Precision Than Unsloth's Q4_K_XL

MTP and QAT: What is the Relation? Running Gemma 4 31B in llama.cpp

NVFP4 Support Merged in llama.cpp: How to Use 4-bit Blackwell Quantization

Gemma-4-26B-A4B QAT Variant Performs Poorly in llama.cpp Compared to Non-QAT Version

Qwen3.6 35B-A3B on a Laptop: A Local LLM "Zero to One" Milestone

start-llama: A Handy CLI Launcher for llama-server with Easy Customization

GGML 與 llama.cpp 正式加入 Hugging Face，攜手保障本地端 AI 的長期發展★ 95

llama.cpp 全新功能：更強大的模型管理機制（Model Management）與 Hugging Face 深度整合★ 85

GGML 基礎入門介紹：讓大語言模型在消費級硬體上高效運行的關鍵技術★ 80