r/LocalLLaMA top day · Jun 11, 2026, 3:28 AM · /u/pmttyji

NVIDIA Releases NVFP4-Quantized DiffusionGemma 26B A4B IT on Hugging Face

Original: nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face

NVIDIA publishes a 4-bit quantized DiffusionGemma 26B model delivering 1,100+ tokens per second via discrete diffusion on H100 GPUs.

NVIDIA has released DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, a quantized version of Google DeepMind's open-weights multimodal model. Built on a Mixture-of-Experts architecture with 25.2B total but only 3.8B active parameters, it generates text in parallel 256-token blocks using discrete diffusion, exceeding 1,100 tokens per second on H100 hardware. The model supports a 256K-token context, text/image/video inputs, native function calling, reasoning mode, and 35+ languages.
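
The summary does not describe DiffusionGemma's actual sampler, so the following is only a minimal sketch of why block-parallel discrete diffusion can outpace token-by-token decoding. It assumes a common confidence-based unmasking scheme from the masked-diffusion literature: a 256-token block starts fully masked, and each model call commits the most confident predictions for the positions that are still masked. `toy_denoiser`, `MASK`, and `steps` are hypothetical, and random logits stand in for the real model.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id marking a not-yet-decoded position


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the model: random logits for every position in the block.
    A real model would condition on the prompt and on already-committed tokens."""
    return rng.standard_normal((tokens.shape[0], vocab_size))


def decode_block(block_len=256, vocab_size=1000, steps=8, seed=0):
    """Fill one block in `steps` model calls via confidence-based unmasking."""
    rng = np.random.default_rng(seed)
    tokens = np.full(block_len, MASK, dtype=np.int64)
    per_step = int(np.ceil(block_len / steps))  # positions committed per call
    passes = 0
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_denoiser(tokens, vocab_size, rng)
        passes += 1
        # Softmax over the vocabulary for every position at once.
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        best = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        # Rank only the still-masked positions and commit the most confident ones.
        order = masked[np.argsort(-conf[masked])]
        commit = order[:per_step]
        tokens[commit] = best[commit]
    return tokens, passes


tokens, passes = decode_block()
print(passes)  # model calls used for the 256-token block
```

With `steps=8`, the block is filled in 8 model calls instead of the 256 an autoregressive decoder would need; real samplers trade the number of steps against output quality.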

The base checkpoint is Google DeepMind's open-weights DiffusionGemma 26B A4B IT multimodal model. NVIDIA quantized it with its Model Optimizer tool to NVFP4, a 4-bit floating-point format that NVIDIA introduced with its Blackwell GPU generation to raise inference throughput.
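
As NVIDIA describes the format, NVFP4 stores each value as FP4 E2M1 and shares one FP8 (E4M3) scale across every block of 16 values, plus a per-tensor FP32 scale. The sketch below fake-quantizes an array that way and estimates weight storage. It is a simplified illustration, not Model Optimizer's API: `fake_quant_nvfp4` and `weight_gb` are hypothetical names, block scales stay in float32, and the per-tensor scale is omitted.

```python
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 shares one scale across 16 consecutive values


def fake_quant_nvfp4(x):
    """Quantize then dequantize a 1-D array in 16-element blocks (simplified NVFP4).
    Simplifications: block scales are kept in float32 instead of FP8 E4M3,
    and the per-tensor FP32 scale is omitted."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-x.size) % BLOCK
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto the largest E2M1 value (6.0).
    scale = np.where(amax > 0, amax / 6.0, 1.0)
    scaled = np.abs(blocks) / scale
    # Round each magnitude to the nearest E2M1 grid point, then restore the sign.
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(blocks) * E2M1_GRID[idx]
    return (q * scale).reshape(-1)[: x.size]


def weight_gb(n_params, bits_per_value=4, scale_bits=8, block=BLOCK):
    """Rough weight storage in GB: 4-bit values plus one 8-bit scale per block."""
    bits = n_params * (bits_per_value + scale_bits / block)
    return bits / 8 / 1e9


print(fake_quant_nvfp4(np.array([6.0, -3.0, 0.4, 0.0]))[:4])
print(round(weight_gb(25.2e9), 2))
```

At 4.5 effective bits per parameter, 25.2B weights come to roughly 14.2 GB, against about 50.4 GB at 16 bits. That estimate assumes every parameter is quantized; real checkpoints may keep some layers, such as embeddings, in higher precision.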

Summaries are AI-generated; the original article is authoritative.