Google DeepMind Blog | Jun 10, 2026, 4:24 PM

DiffusionGemma: 4x faster text generation

Google introduced DiffusionGemma, an experimental open text diffusion model for faster local, interactive generation.

Google’s DiffusionGemma is an experimental open model, released under Apache 2.0, that uses text diffusion instead of standard autoregressive decoding. The 26B MoE model activates 3.8B parameters during inference and is designed for low-latency local workflows. Google claims up to 4x faster generation on dedicated GPUs, while noting that output quality is below that of standard Gemma 4 and that production-quality use cases should still prefer Gemma 4.
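As a rough sanity check on the parameter counts, the weight footprint can be estimated with simple arithmetic. The sketch below is a back-of-envelope illustration only: the bit widths are assumptions (the summary does not say which quantization format is used), and the estimate covers weights alone, ignoring activations, caches, and quantization metadata. Note that with a Mixture of Experts model, all experts typically need to be resident in memory even though only a subset is active per step.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9    # from the summary: 26B total MoE parameters
ACTIVE_PARAMS = 3.8e9  # from the summary: 3.8B parameters active at inference

# Bit widths below are illustrative assumptions, not published settings.
for bits in (16, 8, 4):
    total = weight_memory_gb(TOTAL_PARAMS, bits)
    active = weight_memory_gb(ACTIVE_PARAMS, bits)
    print(f"{bits:>2}-bit: all experts ~{total:.1f} GB, active path ~{active:.1f} GB")
```

Under these assumptions, 16-bit weights come to about 52 GB and 4-bit weights to about 13 GB, which is at least consistent with the claim that a quantized version fits on a high-end consumer GPU.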

Google has introduced DiffusionGemma, an experimental open model centered on text diffusion and released under the Apache 2.0 license. It builds on the parameter efficiency of the Gemma 4 series and on Gemini Diffusion research, using a 26B Mixture of Experts architecture that activates only 3.8B parameters at inference time. After quantization, it can fit within the 18GB VRAM range of high-end consumer discrete GPUs. DiffusionGemma is not intended to replace the standard Gemma 4, but to explore speed-first, highly interactive local inference scenarios.

Traditional autoregressive LLMs generate text one token at a time, from left to right. In high-concurrency cloud services, batching can keep the hardware fully utilized, but in single-user, local, or low-concurrency settings, sequential decoding often leaves GPUs underused. DiffusionGemma instead processes text in blocks of 256 tokens at once, allows bidirectional attention between tokens, and converges on the output through multiple rounds of iterative refinement. Google says this can shift the bottleneck from memory bandwidth to compute, achieving up to 4x faster generation on dedicated GPUs, with an H100 exceeding 1,000 tokens/s and an RTX 5090 exceeding 700 tokens/s.

This architecture is especially suited to tasks that need visibility into both preceding and following context, such as inline editing, rapid iteration, code infilling, nonlinear text structures, amino acid sequences, or mathematical graphs. Google also emphasizes the trade-off: DiffusionGemma is optimized for speed and parallel generation, and its overall quality is lower than that of the standard Gemma 4. If an application prioritizes quality above all else, Google still recommends using the standard Gemma 4.

In terms of ecosystem support, the model weights are available on Hugging Face and can be used with MLX, vLLM, Hugging Face Transformers, Hackable Diffusion, Unsloth, and NVIDIA NeMo, with official llama.cpp support announced as coming soon. Google has also worked with NVIDIA to optimize the hardware stack, covering deployment paths including RTX 5090, 4090, Hopper, Blackwell, DGX Spark, DGX Station, RTX PRO, and NVIDIA NIM.
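To make the block-at-a-time refinement idea more concrete, here is a toy sketch of a generic confidence-based unmasking loop. It is not DiffusionGemma's sampler, which the summary does not describe: `toy_denoiser` is a hypothetical stand-in for a model forward pass, the confidences are random, and the choice of 8 rounds is arbitrary. Only the block size of 256 comes from the summary.

```python
import random

MASK = None        # placeholder for a position that is not yet decided
BLOCK_SIZE = 256   # from the summary: text is processed in 256-token blocks

def toy_denoiser(block, rng):
    """Hypothetical stand-in for one model pass over the whole block.

    Returns {position: (token, confidence)} for every still-masked position.
    A real model would condition each proposal on the entire block
    (bidirectional attention); here confidences are just random numbers.
    """
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(block) if t is MASK}

def refine_block(num_rounds=8, seed=0):
    """Fill one block by committing the most confident proposals each round."""
    rng = random.Random(seed)
    block = [MASK] * BLOCK_SIZE
    passes = 0
    while any(t is MASK for t in block):
        proposals = toy_denoiser(block, rng)
        passes += 1
        rounds_left = max(num_rounds - passes + 1, 1)
        # Commit an even share of the remaining positions (ceiling division).
        k = -(-len(proposals) // rounds_left)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            block[i] = proposals[i][0]
    return block, passes

block, passes = refine_block()
print(f"{passes} parallel passes filled {len(block)} tokens")
```

In this toy setup, 8 passes fill the block, where strictly sequential decoding would need 256 steps. Each pass touches every position in the block, so the work per pass is larger, which matches the summary's point about shifting the bottleneck from memory bandwidth toward compute.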

Summaries are AI-generated; the original article is authoritative.