OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OSCAR uses precomputed spectral covariance-aware rotation matrices to enable 2-bit KV cache quantization with minimal quality loss.
OSCAR applies rotation matrices that are precomputed offline from a spectral analysis of the KV covariance. The rotation reshapes the distribution of the KV tensors before 2-bit quantization, which suppresses outliers and reduces rounding error. Because the matrices are fixed ahead of time and nothing is learned at runtime, the rotation adds negligible inference overhead. GGUF downloads are available for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking, alongside llama.cpp and sglang integrations and an arXiv paper.
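To make the rotate-then-quantize idea concrete, here is a minimal sketch in plain Python. It is not OSCAR's implementation: the rotation is a normalized 4x4 Hadamard matrix used as a stand-in for the covariance-derived matrix, the quantizer is a simple per-vector min/max scheme, and the `key` vector is a hand-picked toy example with one outlier channel. All function names are hypothetical.

```python
# Stand-in rotation: a normalized 4x4 Hadamard matrix (orthogonal).
# Assumption: this is only a placeholder for a precomputed rotation,
# NOT the spectral covariance-aware matrix that OSCAR derives offline.
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
R = [[v / 2.0 for v in row] for row in H]

def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def quantize_2bit(x):
    """Per-vector min/max quantization to 4 levels (codes 0..3)."""
    lo, hi = min(x), max(x)
    scale = (hi - lo) / 3 or 1.0  # avoid a zero scale for constant vectors
    codes = [round((v - lo) / scale) for v in x]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]

def roundtrip(x, rot=None):
    """Optionally rotate, quantize to 2 bits, dequantize, rotate back."""
    y = matvec(rot, x) if rot is not None else x
    y_hat = dequantize(*quantize_2bit(y))
    # For an orthogonal matrix the inverse is the transpose.
    return matvec(transpose(rot), y_hat) if rot is not None else y_hat

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

key = [8.0, 0.3, -0.2, 0.1]  # toy vector with one outlier channel
plain_err = mse(key, roundtrip(key))
rotated_err = mse(key, roundtrip(key, R))
print(f"no rotation: {plain_err:.4f}, with rotation: {rotated_err:.4f}")
```

On this toy vector the rotation spreads the outlier's energy across all four channels, so the four quantization levels cover a much narrower range and the reconstruction error drops. A real system would choose the matrix from calibration data rather than use a fixed Hadamard matrix, and the gain on real KV tensors depends on that choice.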
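OSCAR (Offline Spectral Covariance-Aware Rotation) is a KV cache quantization technique aimed at reducing inference memory for large language models (LLMs). Its core goal is to compress the key-value cache used by the attention mechanism to 2-bit precision while preserving as much of the model's output quality as possible.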
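As a rough illustration of the memory at stake, the sketch below compares KV cache size at 16-bit and 2-bit precision. The model configuration is hypothetical and not taken from any of the models named above, and the calculation ignores the per-group scale and zero-point metadata that a real 2-bit format would also store.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size in bytes for one sequence."""
    # Two tensors (K and V) per layer; quantization metadata is ignored.
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits // 8

# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)

fp16 = kv_cache_bytes(bits=16, **cfg)
q2 = kv_cache_bytes(bits=2, **cfg)
print(f"fp16: {fp16 / 2**30:.2f} GiB, 2-bit: {q2 / 2**30:.2f} GiB, "
      f"ratio: {fp16 / q2:.0f}x")
# prints "fp16: 4.00 GiB, 2-bit: 0.50 GiB, ratio: 8x"
```

The 8x figure is an upper bound on the savings for the cached values themselves; the effective ratio is somewhat lower once scales and zero points are counted.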
Summaries are AI-generated; the original article is authoritative.