Gemma 4 12B Unified Audio Loses Speech Attention with Large System Prompts
r/LocalLLaMA top day·13 hours ago·Commentary
A developer building a single-pass voice assistant with Gemma 4 12B unified (encoder-free audio/vision/text model) finds that audio attention collapses once the system prompt grows to ~21k tokens. The model then ignores or hallucinates instead of responding to the spoken input. The issue reproduces identically on vLLM, llama.cpp, and LiteRT-LM, pointing to an architectural attention-saturation limit rather than a stack-specific bug.