Latest in AI

FlashMemory-DeepSeek-V4: Ultra-Long Context via Lookahead Sparse Attention
r/LocalLLaMA (top of the day) · 47 days ago · Paper
FlashMemory-DeepSeek-V4 introduces Lookahead Sparse Attention (LSA), a predictive inference paradigm that retains only query-critical KV chunks in GPU memory instead of the full cache. A Neural Memory Indexer, trained independently using a backbone-free dual-encoder strategy, proactively forecasts which historical tokens will matter next. The system compresses average KV cache footprint by 86.5% and exceeds 90% compression at 500K-token scales, while delivering a slight accuracy gain of +0.6% on long-context benchmarks.
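To make the idea of keeping only query-critical KV chunks concrete, here is a minimal sketch in plain Python. It is not the paper's method: the summary above does not describe how the Neural Memory Indexer scores chunks, so a simple dot product between a predicted query vector and one summary vector per chunk stands in for it. The names `chunk_scores`, `select_chunks`, and `compression` are hypothetical, as is the fixed `keep` budget.

```python
from typing import List, Sequence


def chunk_scores(query: Sequence[float],
                 chunk_keys: List[Sequence[float]]) -> List[float]:
    """Score each chunk by the dot product of its summary vector with the query.

    Assumption: the dot product is a stand-in for the learned indexer,
    whose real scoring function is not given in the summary.
    """
    return [sum(q * k for q, k in zip(query, key)) for key in chunk_keys]


def select_chunks(query: Sequence[float],
                  chunk_keys: List[Sequence[float]],
                  keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring chunks, in cache order."""
    scores = chunk_scores(query, chunk_keys)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def compression(total_chunks: int, kept_chunks: int) -> float:
    """Fraction of the KV cache that is dropped from GPU memory."""
    return 1.0 - kept_chunks / total_chunks


# Toy cache: four chunks, each summarized by a 2-dimensional vector.
chunk_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
predicted_query = [1.0, 0.0]  # hypothetical forecast of the next query

kept = select_chunks(predicted_query, chunk_keys, keep=2)
print(kept)                                 # [0, 2]
print(compression(len(chunk_keys), len(kept)))  # 0.5
```

Under this reading, the reported 86.5% average compression would correspond to keeping roughly 13.5% of the chunks resident on the GPU.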
Efficiently Fine-Tuning Llama 2 70B with PyTorch FSDP: A Practical Guide to Solving CPU Out-of-Memory Problems ★ 72
Hugging Face Blog · 1,049 days ago · Tutorial
When fine-tuning very large open-source models like Llama 2 70B, with its 70 billion parameters, developers frequently encounter a bottleneck that goes…