OmniKV: Dynamic Context Selection for Efficient Long-Context LLMs


Jitai Hao*, Yuke Zhu*, Tian Wang, Jun Yu, Xin Xin, Bo Zheng, Zhaochun Ren, Sheng Guo

OmniKV studies how long-context LLMs can keep the most useful context while reducing unnecessary computation. The method uses inter-layer attention similarity to dynamically select crucial context information for efficient long-context reasoning.


ICLR 2025 · Efficiency · Long-context LLMs

Overview


OmniKV is part of the efficient LLM research line. It focuses on a practical long-context question: how can a model avoid spending the same amount of computation on context that is not equally useful?


Select Context


Dynamically identifies important context information instead of treating all tokens uniformly.


Use Layer Similarity


Uses inter-layer attention similarity as the signal behind context selection (see the code sketch after these highlights).


Reduce Cost


Targets better long-context efficiency while preserving useful information for downstream tasks.

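The layer-similarity signal can be made concrete with a minimal sketch. Assuming per-layer attention weights are available, the snippet below scores each KV position by its attention mass, takes the top-k per layer, and measures how much two layers' selections overlap. The names `topk_token_indices` and `selection_overlap` are illustrative, not OmniKV's API.

```python
# Minimal sketch (not the paper's code): score each KV position by its
# attention mass in a layer, take the top-k, and measure how much two
# layers' selections overlap. High overlap is the observation that lets
# a few filter layers choose tokens on behalf of the layers after them.
import torch

def topk_token_indices(attn_weights: torch.Tensor, k: int) -> torch.Tensor:
    # attn_weights: [num_heads, q_len, kv_len] attention probabilities
    scores = attn_weights.sum(dim=(0, 1))            # total mass per KV position
    return torch.topk(scores, min(k, scores.numel())).indices

def selection_overlap(attn_a: torch.Tensor, attn_b: torch.Tensor, k: int = 256) -> float:
    # Jaccard overlap between the top-k token sets of two layers
    a = set(topk_token_indices(attn_a, k).tolist())
    b = set(topk_token_indices(attn_b, k).tolist())
    return len(a & b) / len(a | b)

# Toy usage: random attention over an 8K context from two hypothetical layers
attn_l4 = torch.rand(32, 1, 8192).softmax(dim=-1)
attn_l5 = torch.rand(32, 1, 8192).softmax(dim=-1)
print(f"top-k overlap: {selection_overlap(attn_l4, attn_l5):.2f}")
```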

Method Framework


The original framework figure shows OmniKV's decode-stage system: a Context Selector identifies useful tokens in the filter layers, while the Context Bank keeps the full KV cache offloaded and moves only the selected subsets back to the GPU for the sparse layers.


Original Figure 2 from the OmniKV paper: filter layers select important tokens from the observation window, and sparse layers load the selected KV-cache subset from CPU memory to GPU memory.

How the system runs


During prefill, every layer performs full attention and builds its KV cache. OmniKV then stores the cache of most non-filter layers in the CPU-side Context Bank, while keeping a small set of filter layers on the GPU for full attention and token selection.

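A minimal sketch of this prefill-then-offload step, assuming the KV cache is a per-layer dict of (key, value) tensors; `FILTER_LAYERS`, `ContextBank`, and `after_prefill` are hypothetical names used for illustration, not the paper's released implementation.

```python
# Illustrative sketch of prefill-then-offload, assuming a per-layer KV cache
# of (key, value) tensors. Names here are hypothetical, not OmniKV's code.
import torch

FILTER_LAYERS = {0, 8, 16, 24}  # a small set of layers kept on GPU in full

class ContextBank:
    """Keeps the full KV cache of non-filter layers in pinned CPU memory."""
    def __init__(self):
        self.cpu_kv = {}

    def offload(self, layer_idx: int, key: torch.Tensor, value: torch.Tensor):
        # pinned memory enables fast, async CPU->GPU copies at decode time
        self.cpu_kv[layer_idx] = (key.cpu().pin_memory(),
                                  value.cpu().pin_memory())

def after_prefill(kv_cache: dict, bank: ContextBank):
    """kv_cache: layer_idx -> (key, value) built by full-attention prefill."""
    for layer_idx, (k, v) in kv_cache.items():
        if layer_idx not in FILTER_LAYERS:
            bank.offload(layer_idx, k, v)  # full cache survives on CPU
            kv_cache[layer_idx] = None     # GPU copy released for these layers
```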

Why it is drop-free


During decode, the selected token indices are refreshed dynamically. The full cache is retained in the Context Bank, so tokens are never permanently discarded; only the KV cache visible to the GPU is made sparse for the current step.

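One decode step can be sketched as follows, under the assumption that a filter layer exposes its attention weights over the full context and that CPU-side tensors keep the sequence dimension first; `refresh_selection` and `gather_sparse_kv` are illustrative helpers, not the released code.

```python
# Sketch of one decode step: the filter layer's attention over the context
# refreshes the selection; the Context Bank is never modified, so no token
# is permanently dropped -- only the GPU-visible subset changes per step.
import torch

def refresh_selection(filter_attn: torch.Tensor, budget: int) -> torch.Tensor:
    # filter_attn: [num_heads, window, kv_len] attention from the
    # observation window of a filter layer
    scores = filter_attn.sum(dim=(0, 1))             # aggregate importance
    return torch.topk(scores, min(budget, scores.numel())).indices

def gather_sparse_kv(bank_kv, indices: torch.Tensor, device: str = "cuda"):
    # bank_kv: (key, value) CPU tensors with the sequence dimension first
    key_cpu, value_cpu = bank_kv
    idx = indices.cpu()
    key = key_cpu.index_select(0, idx).to(device, non_blocking=True)
    value = value_cpu.index_select(0, idx).to(device, non_blocking=True)
    return key, value
```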

Efficiency mechanism


Layers between neighboring filter layers share the selected indices, which allows packed CPU-to-GPU loading. This reduces both the attention sequence length and the volume of repeated transfers in long-context inference.

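A rough sketch of the packed transfer, assuming every sparse layer in a group between two filter layers reuses the same indices and all layers share tensor shapes (true for uniform transformer stacks); `packed_load` is a hypothetical helper, not the paper's implementation.

```python
# Rough sketch of packed loading: every sparse layer between two filter
# layers reuses the same indices, so the group's KV subsets can be stacked
# into one contiguous buffer and moved CPU->GPU in a single copy.
import torch

def packed_load(bank: dict, group_layers: list, indices: torch.Tensor,
                device: str = "cuda") -> dict:
    idx = indices.cpu()
    keys = [bank[l][0].index_select(0, idx) for l in group_layers]
    vals = [bank[l][1].index_select(0, idx) for l in group_layers]
    # one stacked tensor -> one large transfer instead of 2 * len(group) copies
    packed = torch.stack(keys + vals).to(device, non_blocking=True)
    n = len(group_layers)
    return {l: (packed[i], packed[n + i]) for i, l in enumerate(group_layers)}
```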

Experimental evidence


The paper reports the best or near-full-attention quality on LongBench and InfiniteBench, up to a 75% reduction in KV-cache memory with offloading, and a 1.7x speedup at 128K context.


Related Research Areas


OmniKV sits at the intersection of long-context inference, sparse attention, and KV-cache systems. It is most relevant to work on training-free inference acceleration, dynamic token selection, and memory-efficient LLM serving.


long-context LLMs · efficient inference · dynamic context selection · inter-layer attention similarity · KV cache efficiency · large language models · context compression

Key Takeaways


  • Problem: long-context decoding creates KV-cache memory pressure and unnecessary full-context attention.
  • Idea: use filter layers and inter-layer attention similarity to select context dynamically without permanently dropping tokens.
  • Evidence: OmniKV reports near-full-attention quality on LongBench and InfiniteBench, up to 75% KV-cache memory reduction with offloading, and 1.7x speedup at 128K context.

Resources


Paper, implementation, and Chinese write-up for OmniKV.
