Latest in AI

Showing:sparse-attentionDevelopersClear ×

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

FlashMemory-DeepSeek-V4: Ultra-Long Context via Lookahead Sparse Attention
r/LocalLLaMA top day47 days agoPaper
FlashMemory-DeepSeek-V4 introduces Lookahead Sparse Attention (LSA), a predictive inference paradigm that retains only query-critical KV chunks in GPU memory instead of the full cache. A Neural Memory Indexer, trained independently using a backbone-free dual-encoder strategy, proactively forecasts which historical tokens will matter next. The system compresses average KV cache footprint by 86.5% and exceeds 90% compression at 500K-token scales, while delivering a slight accuracy gain of +0.6% on long-context benchmarks.