Gemma 4 12B Unified Audio Loses Speech Attention with Large System Prompts
Original: Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?
The Gemma 4 12B unified model stops attending to audio input once the system prompt exceeds roughly 21k tokens, and the behavior reproduces across three inference stacks.
A developer building a single-pass voice assistant with Gemma 4 12B unified (an encoder-free audio/vision/text model) finds that audio attention collapses once the system prompt grows to ~21k tokens. Past that point the model either ignores the spoken input or hallucinates a reply instead of responding to what was said. The issue reproduces identically on vLLM, llama.cpp, and LiteRT-LM, which suggests an architectural attention-saturation limit rather than a stack-specific bug.
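The summary does not say how the ~21k-token figure was measured. As a rough illustration only, one way to locate such a cutoff is to binary-search the system-prompt length against a pass/fail probe. In the sketch below, `find_collapse_threshold`, `attends_to_audio`, and `fake_probe` are hypothetical names, not part of any of the three stacks. The probe is assumed to build a system prompt of the given token length, send a fixed audio clip, and return whether the reply reflects the speech. The search also assumes the failure is monotonic in prompt length, which the post does not confirm.

```python
from typing import Callable


def find_collapse_threshold(
    attends_to_audio: Callable[[int], bool],
    low: int,
    high: int,
    tolerance: int = 256,
) -> int:
    """Return a system-prompt length (in tokens) at which the probe fails,
    no more than `tolerance` tokens above a length at which it still passes.

    Assumptions (not from the original post): the probe passes at `low`,
    fails at `high`, and the failure is monotonic in prompt length.
    """
    if not attends_to_audio(low):
        raise ValueError("probe already fails at the lower bound")
    if attends_to_audio(high):
        raise ValueError("probe still passes at the upper bound")

    while high - low > tolerance:
        mid = (low + high) // 2
        if attends_to_audio(mid):
            low = mid   # audio still handled; the cutoff is above mid
        else:
            high = mid  # audio ignored; the cutoff is at or below mid
    return high


def fake_probe(system_prompt_tokens: int) -> bool:
    """Hypothetical stand-in for a real inference call.

    A real probe would query the model with a system prompt of this length
    plus a fixed audio clip, then check the reply against the spoken content.
    The 21_000 cutoff here only mirrors the figure reported in the post.
    """
    return system_prompt_tokens < 21_000


if __name__ == "__main__":
    print(find_collapse_threshold(fake_probe, low=1_000, high=32_000))
```

Running the same probe against each inference stack and comparing the returned lengths would be one way to check whether the cutoff is really stack-independent, as the post reports.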
This Reddit post comes from a developer running the Gemma 4 12B unified model locally on an NVIDIA GB10 (Blackwell architecture) and trying to use the model's native audio understanding to build a streamlined voice assistant.
Summaries are AI-generated; the original article is authoritative.