DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao; Qiang Huang; Yaowei Wang; Min Zhang; Jun Yu

DeltaKV studies long-range similarity in KV representations and compresses the KV cache of long-context LLMs by storing semantic residuals relative to historical references instead of discarding tokens.

arXiv 2026 KV Cache Compression Sparse-vLLM

Method View

Long-range similarity: KV states contain reusable semantic patterns across distant context positions.

Residual encoding: DeltaKV stores compact residuals relative to historical references.

Near-lossless decoding: The method reduces memory while preserving token coverage for long-context tasks.

29%: KV cache memory remaining under DeltaKV compression, as reported in the paper overview.

2x: Throughput gain reported in the paper overview.

Drop-free: Compression without directly discarding tokens.

Overview

DeltaKV belongs to the efficient long-context LLM inference line. It addresses the memory pressure of KV cache by exploiting redundancy across distant context positions rather than pruning away context outright.

Find Similarity

Identifies long-range similarity in KV representations across extended contexts.
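As a rough illustration of what "long-range similarity" could mean in practice, the sketch below scans a history of earlier KV vectors and picks the one closest to the current vector by cosine similarity. The helper names (`cosine`, `best_reference`), the toy vectors, and the choice of cosine similarity are assumptions made for this example; the page does not describe how DeltaKV actually selects its references.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_reference(history, current):
    """Return (index, similarity) of the historical vector most similar to `current`.

    Hypothetical helper: a brute-force scan over all earlier positions.
    """
    best_idx, best_sim = -1, -2.0  # cosine is always >= -1, so any entry beats this
    for idx, ref in enumerate(history):
        sim = cosine(ref, current)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx, best_sim


# Toy data (assumed): position 0 is far back in the context but close in direction.
history = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
current = [0.9, 0.1, 0.0]

idx, sim = best_reference(history, current)
print(f"best reference index: {idx}, cosine similarity: {sim:.4f}")
```

A brute-force scan like this is linear in the context length per token, so a real system would presumably need a cheaper lookup; that part is outside what this overview states.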

Store Residuals

Encodes KV states as semantic residuals relative to historical references to reduce memory.
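To make the residual idea concrete, here is a minimal sketch that stores a KV vector as a coarsely quantized difference from a reference vector and reconstructs it on demand. The uniform 4-bit quantizer, the function names `encode_residual` and `decode_residual`, and the toy values are assumptions chosen for illustration; they stand in for whatever compressor the paper actually uses, which this page does not specify.

```python
def encode_residual(current, reference, bits=4):
    """Quantize (current - reference) to signed integer codes plus one scale factor.

    Assumption: simple symmetric uniform quantization, used only as a stand-in.
    """
    residual = [c - r for c, r in zip(current, reference)]
    max_abs = max(abs(x) for x in residual)
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed codes
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in residual]
    return codes, scale


def decode_residual(codes, scale, reference):
    """Rebuild an approximation of the original vector from reference + residual."""
    return [r + q * scale for r, q in zip(reference, codes)]


# Toy vectors (assumed): the current state is close to an earlier reference,
# so the residual is small and survives coarse quantization well.
reference = [0.50, -1.20, 0.80, 2.00]
current = [0.52, -1.17, 0.79, 2.05]

codes, scale = encode_residual(current, reference)
restored = decode_residual(codes, scale, reference)
max_error = max(abs(a - b) for a, b in zip(current, restored))
print("codes:", codes)
print("max reconstruction error:", max_error)
```

The point of the sketch is that the quantization error is bounded by half a step of the residual's range rather than the full vector's range, so the closer the reference, the fewer bits are needed for the same accuracy. No token is dropped: every position remains recoverable from its reference and its residual.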

Improve Serving

Targets near-lossless long-context inference with better throughput.

Related Research Areas

Key Takeaways

Problem: long-context LLM inference makes the KV cache a dominant memory bottleneck.
Idea: use long-range similarity to encode current KV states as residuals relative to historical references.
Evidence: the paper overview reports KV memory reduced to 29%, near-lossless performance on SCBench and AIME, and a 2x throughput gain.
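For a sense of scale, the back-of-the-envelope calculation below applies the standard KV cache size formula (keys plus values, per layer, per KV head, per position) and then scales it by the reported 29%. The model configuration is hypothetical and not taken from the paper, and treating 29% as a flat ratio ignores any overhead for references or metadata, which this page does not quantify.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage such as fp16 or bf16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


GIB = 1024 ** 3

# Hypothetical configuration (assumed for illustration only).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

# Apply the 29% figure from the paper overview as a flat ratio (an assumption).
compressed = full * 0.29

print(f"uncompressed KV cache: {full / GIB:.2f} GiB")
print(f"at 29% of original:    {compressed / GIB:.2f} GiB")
```

Under these assumed settings a single 128K-token sequence needs 16 GiB of KV cache uncompressed and about 4.64 GiB at 29%, which shows why the cache, rather than the weights, tends to limit batch size at long context lengths.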
