r/LocalLLaMA top dayJun 10, 2026, 6:51 AM/u/Think_Illustrator188

Gemma 4 12B Unified Audio Loses Speech Attention with Large System Prompts

Original: Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Gemma 4 12B unified model stops attending to audio input when the system prompt exceeds roughly 21k tokens, reproduced across three inference stacks.

A developer building a single-pass voice assistant with Gemma 4 12B unified (encoder-free audio/vision/text model) finds that audio attention collapses once the system prompt grows to ~21k tokens. The model then ignores or hallucinates instead of responding to the spoken input. The issue reproduces identically on vLLM, llama.cpp, and LiteRT-LM, pointing to an architectural attention-saturation limit rather than a stack-specific bug.

This Reddit post comes from a developer locally deploying the Gemma 4 12B unified model on an NVIDIA GB10 (Blackwell architecture), attempting to use the model's native audio understanding capabilities to build a streamlined voice assistant architecture.

Full summary

Free shows the 3-line summary; Pro unlocks the full deep summary (~300 words) so you never have to click through.

See Pro plans →

Want the original English / full article?

Read on r/LocalLLaMA top day →

gemini open-source vllm llama-cpp litert-lm #multimodal #audio-llm #local-inference #context-length #voice-assistant

Summaries are AI-generated; the original article is authoritative.