NVIDIA Releases NVFP4-Quantized DiffusionGemma 26B A4B IT on Hugging Face
Original: nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face
NVIDIA publishes a 4-bit quantized DiffusionGemma 26B model delivering 1,100+ tokens per second via discrete diffusion on H100 GPUs.
NVIDIA has released DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, a quantized version of Google DeepMind's open-weights multimodal model. Built on a Mixture-of-Experts architecture with 25.2B total but only 3.8B active parameters, it generates text in parallel 256-token blocks using discrete diffusion, exceeding 1,100 tokens per second on H100 hardware. The model supports a 256K-token context, text/image/video inputs, native function calling, reasoning mode, and 35+ languages.
NVIDIA has published DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, a quantized variant of Google DeepMind's open-weights DiffusionGemma 26B A4B IT multimodal model. The quantization was applied with NVIDIA's Model Optimizer using NVFP4, a 4-bit floating-point format intended to maximize throughput on NVIDIA Hopper-generation GPUs, particularly the H100.
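To put the quoted figures in perspective, here is a rough back-of-envelope sketch. It is not taken from the model card. It assumes that every weight is stored in NVFP4 (in practice some layers, such as embeddings or routers, often stay at higher precision), that NVFP4 stores one 8-bit scale per 16 four-bit values (about 4.5 bits per weight), and that the unquantized baseline is 16-bit. The variable names are illustrative only.

```python
# Back-of-envelope figures derived from the numbers quoted in the summary.
TOTAL_PARAMS = 25.2e9    # total parameters (from the summary)
ACTIVE_PARAMS = 3.8e9    # parameters active per token (from the summary)
BLOCK_TOKENS = 256       # tokens generated per diffusion block (from the summary)
TOKENS_PER_SEC = 1100    # quoted lower bound on H100 throughput (from the summary)

# Assumption: 4-bit values plus one 8-bit scale shared by 16 values.
BITS_PER_WEIGHT_NVFP4 = 4 + 8 / 16   # 4.5 bits per weight
# Assumption: the unquantized checkpoint uses 16-bit weights.
BITS_PER_WEIGHT_16BIT = 16


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
nvfp4_gb = weight_gb(TOTAL_PARAMS, BITS_PER_WEIGHT_NVFP4)
baseline_gb = weight_gb(TOTAL_PARAMS, BITS_PER_WEIGHT_16BIT)
blocks_per_sec = TOKENS_PER_SEC / BLOCK_TOKENS

print(f"active fraction per token: {active_fraction:.1%}")
print(f"NVFP4 weights (upper-bound estimate): {nvfp4_gb:.1f} GB")
print(f"16-bit weights: {baseline_gb:.1f} GB")
print(f"256-token blocks per second at 1,100 tok/s: {blocks_per_sec:.2f}")
```

Under these assumptions the weights shrink from roughly 50 GB to roughly 14 GB, and 1,100 tokens per second corresponds to a little over four 256-token blocks per second. The real checkpoint size may differ depending on which layers were actually quantized.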
Summaries are AI-generated; the original article is authoritative.