DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu

DeltaKV studies long-range similarity in KV representations and compresses the KV cache of long-context LLMs by storing semantic residuals relative to historical references instead of discarding tokens.

arXiv 2026, KV Cache Compression, Sparse-vLLM

Overview

DeltaKV belongs to the line of work on efficient long-context LLM inference. It addresses the memory pressure of the KV cache by exploiting redundancy across distant context positions rather than pruning context outright.
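
To make the memory pressure concrete, the sketch below estimates the size of an uncompressed KV cache with the standard formula (keys plus values, per layer, per KV head, per token). The model shape (32 layers, 8 KV heads, head dimension 128, 128,000 tokens, 2-byte values) is a hypothetical configuration chosen for illustration, not one taken from the paper. The 29% factor is the KV memory figure reported on this page; applying it uniformly to the whole cache is an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed for an uncompressed KV cache.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical model shape, for illustration only.
full_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                            head_dim=128, seq_len=128_000)

# Assumption: the reported 29% applies uniformly to the whole cache.
compressed_bytes = full_bytes * 0.29

GIB = 2 ** 30
print(f"full KV cache:   {full_bytes / GIB:.3f} GiB")
print(f"at 29% of full:  {compressed_bytes / GIB:.3f} GiB")
```

Under these assumed numbers, a single 128,000-token sequence needs about 15.6 GiB of KV cache before compression, which is why the cache rather than the model weights can become the limiting factor at long context lengths.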

Find Similarity

Identifies long-range similarity in KV representations across extended contexts.

Store Residuals

Encodes semantic residuals relative to historical references to reduce memory.

Improve Serving

Targets near-lossless long-context inference with better throughput.

Related Research Areas

DeltaKV connects residual-based representation compression, KV-cache memory reduction, and sparse-first LLM serving systems such as Sparse-vLLM.

DeltaKV, KV cache compression, long-context LLMs, residual-based compression, long-range similarity, efficient inference, Sparse-vLLM

Key Takeaways

  • Problem: long-context LLM inference makes KV cache a dominant memory bottleneck.
  • Idea: use long-range similarity to encode current KV states as residuals relative to historical references.
  • Evidence: the paper overview reports KV memory reduced to 29%, near-lossless performance on SCBench and AIME, and a 2x throughput gain.
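
The residual idea in the takeaways can be illustrated with a toy example. The sketch below is not the DeltaKV algorithm: it assumes cosine similarity for choosing a reference, uses plain Python lists in place of KV tensors, and the function names `encode_residual` and `decode_residual` are hypothetical. It only shows that when a current vector closely resembles an earlier one, the difference between them is much smaller in magnitude than the vector itself, and that the original can be recovered from the reference plus the residual.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def encode_residual(current, references):
    """Pick the most similar reference and return (its index, residual)."""
    best = max(range(len(references)),
               key=lambda i: cosine(current, references[i]))
    residual = [c - r for c, r in zip(current, references[best])]
    return best, residual


def decode_residual(index, residual, references):
    """Rebuild the vector as reference + residual."""
    return [r + d for r, d in zip(references[index], residual)]


# Toy "historical" vectors and a current vector close to the second one.
references = [
    [1.0, 0.0, 2.0, -1.0],
    [0.5, 3.0, -2.0, 0.0],
]
current = [0.5, 3.1, -1.9, 0.1]

idx, residual = encode_residual(current, references)
restored = decode_residual(idx, residual, references)

print("chosen reference:", idx)
print("norm of current: ", round(norm(current), 3))
print("norm of residual:", round(norm(residual), 3))
```

The sketch stores the residual at full precision, so it saves no memory by itself. Any actual saving would come from representing the small residual more cheaply than the original vector, and how DeltaKV does that is specified in the paper rather than on this page.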

Resources

Paper, implementation, and Chinese write-up for DeltaKV.