DiffusionGemma: 4x Faster Text Generation | EveryCorner

Google has introduced DiffusionGemma, an experimental open model for exploring how much “text diffusion” can speed up large language models. It is a 26B-parameter Mixture of Experts model that activates only 3.8B parameters during inference, and it is released under the Apache 2.0 license. Unlike typical autoregressive LLMs, which generate one token at a time from left to right, DiffusionGemma generates an entire block of text at once and refines it over multiple iterations.

Google says this can deliver up to 4x faster text generation on dedicated GPUs, for example more than 1000 tokens/s on a single NVIDIA H100 and more than 700 tokens/s on an RTX 5090. The advantage shows up mainly in local, low-concurrency scenarios that require real-time interaction. In high-concurrency cloud settings, autoregressive models can keep the hardware fully utilized through batching, but in single-user local inference they are more likely to leave the GPU waiting for the next token. DiffusionGemma shifts the bottleneck from memory bandwidth to computation, allowing accelerators to process larger blocks at once. Each forward pass can process 256 tokens in parallel, and its bidirectional attention may be better suited to nonlinear tasks such as inline editing, code infilling, amino acid sequences, mathematical graph structures, and closing Markdown syntax.

Google also lists important limitations. Because the model prioritizes speed and parallel generation, its overall output quality is lower than that of standard Gemma 4, so applications with demanding quality requirements should still deploy Gemma 4. In high-QPS cloud services, the speed advantage may shrink, and costs may even increase. Unified-memory architectures such as Apple Silicon may not see the same acceleration.

Google provides Hugging Face weights and says the model can be used with tools including MLX, vLLM, Hugging Face Transformers, Hackable Diffusion, Unsloth, NVIDIA NeMo, and NVIDIA NIM, while official llama.cpp support is coming soon.
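
To make the difference between token-by-token and block-parallel generation concrete, here is a minimal back-of-the-envelope sketch in Python that counts forward passes for each approach. The 256-token block size comes from the description above; the number of refinement iterations per block is not given there, so `refinement_steps` is a purely hypothetical parameter, and the function names are illustrative rather than part of any DiffusionGemma API.

```python
import math

# Tokens processed in parallel per forward pass, as stated in the article.
BLOCK_SIZE = 256


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive decoder runs one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, refinement_steps: int,
                           block_size: int = BLOCK_SIZE) -> int:
    """Forward passes for block-wise generation with iterative refinement.

    Assumption (not from the article): every block is refined for the same
    fixed number of iterations, one forward pass per iteration.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refinement_steps


def pass_reduction(num_tokens: int, refinement_steps: int) -> float:
    """Ratio of autoregressive passes to block-diffusion passes."""
    return (autoregressive_passes(num_tokens)
            / block_diffusion_passes(num_tokens, refinement_steps))


if __name__ == "__main__":
    # Hypothetical workload: 1024 output tokens, 16 refinement steps per block.
    print(autoregressive_passes(1024))       # 1024 passes
    print(block_diffusion_passes(1024, 16))  # 4 blocks * 16 steps = 64 passes
    print(pass_reduction(1024, 16))          # 16.0
```

Fewer passes does not translate directly into wall-clock speedup: each block-wide pass does far more computation than a single-token pass, which is the shift from a memory-bandwidth bottleneck to a compute bottleneck described above. The sketch only illustrates why the number of sequential steps drops.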