FlashMemory-DeepSeek-V4: Ultra-Long Context via Lookahead Sparse Attention
r/LocalLLaMA top day·23 hours ago·Paper
FlashMemory-DeepSeek-V4 introduces Lookahead Sparse Attention (LSA), a predictive inference paradigm that retains only query-critical KV chunks in GPU memory instead of the full cache. A Neural Memory Indexer, trained independently using a backbone-free dual-encoder strategy, proactively forecasts which historical tokens will matter next. The system compresses average KV cache footprint by 86.5% and exceeds 90% compression at 500K-token scales, while delivering a slight accuracy gain of +0.6% on long-context benchmarks.