Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Jitai Hao; Hao Liu; Xinyan Xiao; Qiang Huang; Jun Yu

Jitai Hao*, Hao Liu*, Xinyan Xiao, Qiang Huang, Jun Yu

Uni-X targets modality conflict in unified multimodal models. Its X-shaped, two-end-separated design keeps modality-specific paths at the ends while sharing a middle representation space.

ICLR 2026 Poster | Unified Multimodal | Two-End-Separated

Method View

Modality-specific ends: Understanding and generation paths keep dedicated entry and exit structures.

Shared middle: A common representation space connects modalities without forcing every component to be shared.

Reduced modality conflict: The architecture separates conflicting gradients where needed and shares capacity where useful.

3B: language backbone scale in the expanded Uni-X setting

82.0: GenEval score reported for image generation

X: two-end-separated, middle-shared architecture

Overview

Uni-X belongs to the unified multimodal research line. It addresses a core training tension: a single model needs to support multiple modalities and tasks, but full sharing can create modality conflict.

Separate the Ends

Keeps modality-specific components where conflicts are most direct.

Share the Middle

Maintains a shared representation space for unified multimodal capability.

Balance Tasks

Mitigates conflicts between multimodal understanding and generation objectives.

Method Framework

The original architecture figure contrasts a fully shared transformer with Uni-X's two-end-separated, middle-shared layout. The design follows the paper's observation that modality conflict is strongest in shallow and deep layers and weaker in the middle.

Figure (original Figure 4 from the Uni-X paper): the baseline shared transformer suffers gradient conflict at both ends, while Uni-X uses modality-specific T-Layers and V-Layers around a shared middle block for text and vision tokens.

Conflict diagnosis

The paper measures text and vision gradients layer by layer and finds that conflict is most severe near the input and output, where low-level token statistics differ most between the two modalities.
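
One common way to quantify this kind of conflict is the cosine similarity between the gradients that the text loss and the vision loss induce on the same layer: values near -1 mean the two objectives pull the weights in opposite directions. The sketch below assumes that metric; the paper's exact measurement may differ. The function names and the gradient vectors are hypothetical toy values chosen for illustration, not reported results.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero gradient as "no measurable conflict"
    return dot / (norm_u * norm_v)


def conflict_profile(text_grads, vision_grads):
    """Per-layer cosine similarity between text and vision gradients.

    Each argument is a list holding one flattened gradient vector per layer.
    """
    return [cosine_similarity(t, v) for t, v in zip(text_grads, vision_grads)]


# Hypothetical toy gradients for a 4-layer model (illustrative values only).
toy_text_grads = [[1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [0.0, 1.0]]
toy_vision_grads = [[-1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [0.0, -1.0]]

profile = conflict_profile(toy_text_grads, toy_vision_grads)
# Layers whose gradients point in opposing directions (negative cosine).
conflicting_layers = [i for i, c in enumerate(profile) if c < 0.0]
```

In this toy profile only the first and last layers have negative similarity, which mirrors the qualitative pattern described above: disagreement at the ends, agreement in the middle.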

Two-end separation

The first few and last few layers are split into text-specific and vision-specific paths, so early feature extraction and final token projection no longer force both modalities through a single set of parameters.
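
To make the split concrete, the sketch below labels each layer position in a stack as separated or shared and counts how many layer copies must be stored. The layer counts (8 layers, 2 separated at each end) and the helper names are assumptions for illustration; this page does not state the split sizes used in the paper. The sketch also assumes each separated position holds exactly two copies, one per modality.

```python
def build_layout(num_layers, num_front, num_back):
    """Label each layer position as 'separate' or 'shared'."""
    if num_front < 0 or num_back < 0 or num_front + num_back > num_layers:
        raise ValueError("separated layers must fit inside the stack")
    layout = []
    for i in range(num_layers):
        at_front = i < num_front
        at_back = i >= num_layers - num_back
        layout.append("separate" if at_front or at_back else "shared")
    return layout


def count_layer_copies(layout):
    """Separated positions store two copies (text + vision); shared store one."""
    return sum(2 if kind == "separate" else 1 for kind in layout)


# Hypothetical configuration: 8 positions, 2 separated at each end.
layout = build_layout(num_layers=8, num_front=2, num_back=2)
stored_copies = count_layer_copies(layout)  # layer copies kept in memory
per_token_depth = len(layout)               # layers any single token passes through
```

Under these assumptions the model stores more layer copies than a fully shared stack of the same depth, while each token still traverses the same number of layers, because it is routed through only one branch at every separated position.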

Middle sharing

Intermediate layers remain shared because representations there are more semantic. Sharing them preserves cross-modal fusion and avoids the overhead of fully separate models.
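
The routing implied by this layout can be sketched as follows: each token passes through its own modality's front layers, then the shared middle, then its own modality's back layers. This is a minimal illustration under stated assumptions, not the authors' implementation. The class `TwoEndSeparatedStack` and the helper `make_layer` are hypothetical, and the toy layers act on each token independently, whereas real transformer layers also mix information across tokens through attention.

```python
def make_layer(scale, shift):
    """A stand-in 'layer': an elementwise affine map on one token's features."""
    def layer(features):
        return [scale * x + shift for x in features]
    return layer


class TwoEndSeparatedStack:
    """Modality-specific layers at both ends, shared layers in the middle."""

    def __init__(self, text_front, vision_front, shared_middle,
                 text_back, vision_back):
        self.front = {"text": text_front, "vision": vision_front}
        self.middle = shared_middle
        self.back = {"text": text_back, "vision": vision_back}

    def forward(self, tokens):
        """tokens: list of (modality, features) pairs in sequence order."""
        outputs = []
        for modality, features in tokens:
            for layer in self.front[modality]:   # separated entry path
                features = layer(features)
            for layer in self.middle:            # shared for all modalities
                features = layer(features)
            for layer in self.back[modality]:    # separated exit path
                features = layer(features)
            outputs.append((modality, features))
        return outputs


# Toy usage: one layer per segment, with arbitrary illustrative weights.
model = TwoEndSeparatedStack(
    text_front=[make_layer(2.0, 0.0)],
    vision_front=[make_layer(1.0, 1.0)],
    shared_middle=[make_layer(1.0, 10.0)],
    text_back=[make_layer(1.0, 0.0)],
    vision_back=[make_layer(-1.0, 0.0)],
)
result = model.forward([("text", [1.0]), ("vision", [1.0])])
```

Both tokens enter with the same features and pass through the same middle layer, but they leave with different outputs because the front and back layers differ per modality.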

Experimental evidence

The paper reports that Uni-X achieves the best average score, 41.6, in the controlled-training comparison. The scaled 3B / 4.5B Uni-X reaches a text average of 67.1 and a GenEval score of 82, competitive with larger 7B autoregressive (AR) unified multimodal models (UMMs).

Key Takeaways

Problem: fully shared autoregressive multimodal transformers produce strong gradient conflicts in shallow and deep layers.
Idea: keep modality-specific layers at both ends while sharing the middle semantic block.
Evidence: Uni-X reports the best controlled-training average score of 41.6; the scaled 3B / 4.5B Uni-X reaches a text average of 67.1 and a GenEval score of 82, competitive with larger 7B AR-based UMMs.
