A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu

Low-Rank Clone (LRC) is an efficient pretraining method for small language models. It uses low-rank modules to clone teacher knowledge, making each training token carry much richer supervision.

NeurIPS 2025 Spotlight · Efficient KD · Low-Rank Clone

Overview

LRC branches off the efficient-LLM research line into efficient knowledge distillation. The goal is to improve the training efficiency of small language models by turning each data token into a stronger learning signal.

Clone Teacher Knowledge

Uses teacher supervision to enrich the signal available to the student model.

Low-Rank Transfer

Employs low-rank modules as an efficient carrier for knowledge transfer.

Train With Fewer Tokens

Targets strong student models without requiring trillion-token scale pretraining.

Method Framework

The original framework figure shows LRC's two core steps inside each layer: low-rank projection maps teacher weights into the student space, and activation clone aligns teacher and student intermediate activations.

Figure 2 from the LRC paper: low-rank projection initializes the student's attention and FFN weights from the teacher, while activation clone aligns intermediate activations through MSE losses.

Low-rank projection

Instead of randomly initializing a smaller student, LRC learns compact projection matrices that convert teacher attention, FFN, embedding, and LM-head weights into the student's dimensionality.

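A minimal PyTorch sketch of this idea follows; the dimensions, variable names, and the two-sided factorization are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Hypothetical hidden sizes for the teacher and student.
d_teacher, d_student = 4096, 2048

# A frozen teacher weight, e.g. one attention or FFN projection.
W_teacher = torch.randn(d_teacher, d_teacher)

# Trainable projection matrices that map the teacher weight into the
# student's smaller width on both sides (an assumed factorization).
P_left = nn.Parameter(torch.randn(d_student, d_teacher) * 0.02)
P_right = nn.Parameter(torch.randn(d_teacher, d_student) * 0.02)

# The student weight is generated from the teacher weight, so gradients
# train the compact projections rather than a random full-size matrix.
W_student = P_left @ W_teacher @ P_right  # shape: (d_student, d_student)
```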

Activation clone

The teacher and student run forward passes on the same data. LRC aligns their intermediate states, including often-overlooked FFN activations, so each token carries layer-level teacher supervision.

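A minimal sketch of one layer's activation-clone loss, with hypothetical tensors standing in for real hidden states captured (for example via forward hooks) while both models process the same batch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq, d_teacher, d_student = 2, 16, 4096, 2048

# Stand-ins for one layer's hidden states from the same input batch.
h_teacher = torch.randn(batch, seq, d_teacher)  # frozen teacher
h_student = torch.randn(batch, seq, d_student, requires_grad=True)

# Map the student state to the teacher's width before comparing; the
# same MSE alignment applies to attention outputs and FFN activations.
up_proj = nn.Linear(d_student, d_teacher, bias=False)
clone_loss = F.mse_loss(up_proj(h_student), h_teacher)
clone_loss.backward()  # per-token, layer-level teacher supervision
```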

Training objective

The method combines clone losses with language-model training so the student inherits useful representations while remaining a deployable standard transformer after training.

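A sketch of how the combined objective might look; `lambda_clone` and the helper's signature are illustrative, not the paper's reported setup:

```python
import torch

lambda_clone = 1.0  # illustrative weighting, not the paper's value

def lrc_objective(lm_loss, clone_losses):
    """Next-token LM loss plus the summed per-layer clone losses."""
    return lm_loss + lambda_clone * torch.stack(clone_losses).sum()

# Dummy scalars standing in for real losses from a training step.
lm_loss = torch.tensor(2.3, requires_grad=True)
clone_losses = [torch.tensor(0.4), torch.tensor(0.2)]
total_loss = lrc_objective(lm_loss, clone_losses)
```

Because the clone losses and projection modules are used only during training, the student that remains afterwards is a plain transformer with no extra inference-time components.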

Experimental evidence

The paper reports that LRC models match or surpass strong SLMs trained on trillions of tokens: LRC-1.7B reaches an average score of 64.98 versus 63.17 for Qwen3-1.7B, while using over 1,000x fewer training tokens.

Related Research Areas

LRC connects efficient knowledge distillation with token-efficient SLM pretraining. It focuses on transferring teacher weights and intermediate activations into a compact student without relying on trillion-token pretraining.

Keywords: Low-Rank Clone (LRC), efficient knowledge distillation, token-efficient pretraining, small language models, teacher-student distillation, low-rank modules

Key Takeaways

  • Problem: strong small language models typically require very large pretraining budgets.
  • Idea: initialize and train a student through low-rank projection plus activation clone from a stronger teacher.
  • Evidence: LRC-1.7B reaches 64.98 average versus Qwen3-1.7B at 63.17 while using over 1,000x fewer training tokens.

Resources

Paper, implementation, and Chinese write-up for Low-Rank Clone.
