Hacker News (AI keywords) · Jun 10, 2026, 4:09 PM · meetpateltech · important · 76

DiffusionGemma: 4x Faster Text Generation

Google introduced DiffusionGemma, an experimental open text diffusion model for faster local, interactive generation.

Google released DiffusionGemma, a 26B MoE experimental open model using text diffusion instead of token-by-token autoregressive decoding. It can generate blocks of text in parallel, reaching up to 4x faster output on dedicated GPUs. The model targets local, speed-sensitive workflows, but Google says its output quality is below standard Gemma 4 and recommends Gemma 4 for quality-critical production use.

Google has introduced DiffusionGemma, an experimental open model for exploring how much “text diffusion” can speed up large language models. It is a 26B Mixture of Experts model that activates only 3.8B parameters during inference and is released under the Apache 2.0 license. Common autoregressive LLMs generate one token at a time, left to right. DiffusionGemma instead generates an entire block of text at once and refines it over multiple iterations.

Google says this can deliver up to 4x faster text generation on dedicated GPUs: more than 1000 tokens/s on a single NVIDIA H100 and more than 700 tokens/s on an RTX 5090. The advantage shows up mainly in local, low-concurrency scenarios that need real-time interaction. In high-concurrency cloud settings, autoregressive models can keep the hardware fully utilized through batching, but in single-user local inference they tend to leave the GPU idle while it waits for the next token. DiffusionGemma shifts the bottleneck from memory bandwidth to computation, so the accelerator can process larger blocks at once.

Each forward pass can process 256 tokens in parallel. The model's bidirectional attention may also suit nonlinear tasks such as inline editing, code infilling, amino acid sequences, mathematical graph structures, and properly closing Markdown markup.

Google also lists important limitations. Because the model prioritizes speed and parallel generation, its overall output quality is lower than standard Gemma 4, so applications with high generation-quality requirements should still deploy Gemma 4. In high-QPS cloud services, the speed advantage may shrink, and the model may even cost more to serve. Unified-memory architectures such as Apple Silicon may not see the same acceleration.

Google provides Hugging Face weights and says the model can be used with tools including MLX, vLLM, Hugging Face Transformers, Hackable Diffusion, Unsloth, NVIDIA NeMo, and NVIDIA NIM. Official llama.cpp support is coming soon.
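
To make the decoding difference concrete, here is a toy sketch that counts forward passes for block-wise iterative refinement versus token-by-token generation. It is not DiffusionGemma's actual sampler, which the summary does not describe. The confidence-based unmasking scheme, the `toy_denoiser` stand-in, and the `steps=8` refinement count are all assumptions made for illustration; only the 256-token block size comes from the text above.

```python
import math
import random

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for one forward pass over a whole block.

    Returns a (token, confidence) guess for every position at once. A real
    model would derive these from attention over the block; this toy simply
    reads the answer from `target` and attaches a random confidence score.
    """
    return [(target[i], rng.random()) for i in range(len(block))]


def diffusion_decode(target, block_size=256, steps=8, seed=0):
    """Decode block by block, committing a few positions per forward pass."""
    rng = random.Random(seed)
    out, passes = [], 0
    for start in range(0, len(target), block_size):
        tgt = target[start:start + block_size]
        block = [MASK] * len(tgt)
        per_step = math.ceil(len(tgt) / steps)  # positions fixed per pass
        while MASK in block:
            guesses = toy_denoiser(block, tgt, rng)
            passes += 1
            # Commit the most confident positions that are still masked.
            open_pos = [i for i, tok in enumerate(block) if tok is MASK]
            open_pos.sort(key=lambda i: guesses[i][1], reverse=True)
            for i in open_pos[:per_step]:
                block[i] = guesses[i][0]
        out.extend(block)
    return out, passes


def autoregressive_decode(target):
    """Baseline: one forward pass per generated token, left to right."""
    out, passes = [], 0
    for tok in target:
        out.append(tok)
        passes += 1
    return out, passes


if __name__ == "__main__":
    tokens = list(range(600))  # integer token ids as dummy "text"
    _, diff_passes = diffusion_decode(tokens)
    _, ar_passes = autoregressive_decode(tokens)
    print(diff_passes, ar_passes)  # 24 600
```

The pass count is not the speedup. Each diffusion pass covers up to 256 positions and so does far more arithmetic than a single-token pass, which is consistent with the bottleneck moving from memory bandwidth to computation and with the reported gain being up to 4x rather than 25x.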

Summaries are AI-generated; the original article is authoritative.