DiffusionGemma: 4x faster text generation | EveryCorner

Google has introduced DiffusionGemma, an experimental open model built around text diffusion and released under the Apache 2.0 license. It builds on the parameter efficiency of the Gemma 4 series and on Gemini Diffusion research, using a 26B Mixture of Experts architecture that activates only 3.8B parameters at inference time. After quantization, it can fit within about 18GB of VRAM, which puts it in range of high-end consumer discrete GPUs. DiffusionGemma is not intended to replace the standard Gemma 4, but to explore speed-first, highly interactive local inference scenarios.

Traditional autoregressive LLMs generate one token at a time, from left to right. In high-concurrency cloud services, batching can keep the hardware fully utilized, but in single-user, local, or low-concurrency settings, sequential decoding often leaves the GPU underused. DiffusionGemma instead processes text blocks of 256 tokens at once, allows bidirectional attention between tokens, and gradually converges on the output through multiple rounds of iterative refinement. Google says this can shift the bottleneck from memory bandwidth to compute, achieving up to 4x faster generation on dedicated GPUs, with an H100 exceeding 1000 tokens/s and an RTX 5090 exceeding 700 tokens/s.

This architecture is especially suited to tasks that need visibility into both preceding and following context, such as inline editing, rapid iteration, code infilling, nonlinear text structures, amino acid sequences, or mathematical graphs. Google also emphasizes the trade-off: DiffusionGemma is optimized for speed and parallel generation, and its overall quality is lower than that of the standard Gemma 4. For applications that prioritize quality above all else, Google still recommends the standard Gemma 4.

On the ecosystem side, the model weights are available on Hugging Face and can be used with MLX, vLLM, Hugging Face Transformers, Hackable Diffusion, Unsloth, and NVIDIA NeMo. Official llama.cpp support has been announced as coming soon.
Google has also worked with NVIDIA to optimize the hardware stack, covering deployment paths including RTX 5090, 4090, Hopper, Blackwell, DGX Spark, DGX Station, RTX PRO, and NVIDIA NIM.
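
To make the block-parallel decoding idea described above more concrete, here is a minimal toy sketch in Python. It is not DiffusionGemma's actual algorithm, which the announcement does not detail. It assumes a masked-refinement scheme in which every position in a block starts out undecided and each round commits the positions the model is most confident about. `toy_denoiser` and `refine_block` are hypothetical names, the "model" simply looks up the answer from a known target, and the choice of 8 rounds is arbitrary.

```python
import math
import random

MASK = None  # marks a position that has not been committed yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for one parallel model pass.

    For every masked position it returns (proposed_token, confidence).
    Confidence grows with the number of committed neighbours on BOTH
    sides, which loosely mimics bidirectional context. A real model
    would predict the token; this toy just reads it from `target`.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok is not MASK:
            continue
        left = i > 0 and block[i - 1] is not MASK
        right = i + 1 < len(block) and block[i + 1] is not MASK
        confidence = int(left) + int(right) + rng.random()  # random tie-breaker
        proposals[i] = (target[i], confidence)
    return proposals


def refine_block(target, steps, seed=0):
    """Fill a block in at most `steps` parallel passes; return (block, passes)."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    passes = 0
    for step in range(steps):
        if all(tok is not MASK for tok in block):
            break
        proposals = toy_denoiser(block, target, rng)
        passes += 1
        # Commit an even share of what is left, most confident first,
        # so the final round always commits everything that remains.
        k = math.ceil(len(proposals) / (steps - step))
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            block[i] = proposals[i][0]
    return block, passes


if __name__ == "__main__":
    target = list(range(256))  # a 256-token block, as in the article
    block, passes = refine_block(target, steps=8)
    print(f"filled {len(block)} tokens in {passes} parallel passes")
```

In this toy setup the 256-token block is completed in 8 passes, whereas a token-by-token decoder would need 256 sequential passes for the same block. Each pass does more work, which is the sense in which the bottleneck moves from memory bandwidth toward compute. The real number of refinement rounds and the commit strategy used by DiffusionGemma are not stated in the announcement.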