Black text indicates original paper content; blue text indicates annotations and notes.

Model Architecture

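Document parsing has several main goals: accurately identifying and segmenting structural components such as text blocks, columns, mathematical formulas, tables, and figures; establishing a logical reading order that preserves semantic coherence; and detecting auxiliary elements, including footnotes and captions.

Document characteristics: documents come in many formats and content types, such as academic publications, legal documents, and presentation slides, and they combine heterogeneous elements including text, tables, formulas, and seals.

Document parsing paradigms: modular pipeline-based systems and end-to-end multimodal models.

Youtu-Parsing decomposes the document parsing task into three cooperating stages: shared visual feature extraction, layout analysis, and region-prompted decoding.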

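Youtu-Parsing Parsing Pipeline

1) The shared visual feature extraction module uses NaViT to generate a shared visual feature map, which serves as the unified representation for all subsequent parsing operations. Building on this, 2) the layout analysis module performs fast structural parsing to precisely identify the bounding-box coordinates and semantic category of each document element. Finally, 3) the region-based content query module extracts the textual content of each detected element.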
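The three stages above can be sketched as a small pipeline. This is a minimal sketch, not the authors' implementation: `Region`, `parse_document`, and the three callables `encode`, `detect_layout`, and `decode_region` are hypothetical names introduced only to show that the visual features are computed once and then reused by the later stages.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Region:
    """One detected layout element (hypothetical container for illustration)."""
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    category: str                    # e.g. "text", "table", "formula"


def parse_document(
    image: Any,
    encode: Callable[[Any], Any],
    detect_layout: Callable[[Any], List[Region]],
    decode_region: Callable[[Any, Region], str],
) -> List[Tuple[Region, str]]:
    # Stage 1: compute the shared visual features exactly once.
    features = encode(image)
    # Stage 2: layout analysis reads the shared features.
    regions = detect_layout(features)
    # Stage 3: each region is queried against the same shared features.
    return [(region, decode_region(features, region)) for region in regions]
```

Passing the stages in as callables keeps the sketch independent of any particular model; in a real system each callable would wrap the corresponding part of the network.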

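Youtu-Parsing Advantages

1) By using category-specific prompts, the module effectively reduces the style interference caused by the heterogeneous output requirements of different element types. This decoupled yet integrated architecture avoids the error accumulation typical of traditional pipelines, while also making it easier to train modules separately and to optimize each subtask in a targeted way.

2) Because the content of each layout element is queried independently, the framework inherently supports highly parallel decoding, which significantly improves the overall inference throughput and efficiency of document parsing.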
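The two advantages above can be illustrated together: one prompt per element category, and independent queries that can run concurrently. The following is a rough sketch under stated assumptions. The prompt strings in `CATEGORY_PROMPTS`, the functions `build_query` and `decode_regions`, and the `decode_one` callable are all hypothetical, and a thread pool stands in for whatever batching the real system uses on the accelerator.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]

# Hypothetical prompt templates; the actual prompts are not given in these notes.
CATEGORY_PROMPTS = {
    "text": "Transcribe the text in this region.",
    "table": "Transcribe this table as HTML.",
    "formula": "Transcribe this formula as LaTeX.",
}


def build_query(category: str, bbox: BBox) -> str:
    # Unknown categories fall back to the plain-text prompt.
    prompt = CATEGORY_PROMPTS.get(category, CATEGORY_PROMPTS["text"])
    return f"{prompt} bbox={bbox}"


def decode_regions(
    regions: Sequence[Tuple[str, BBox]],
    decode_one: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    queries = [build_query(category, bbox) for category, bbox in regions]
    # No query depends on another query's output, so they can run concurrently.
    # pool.map returns results in input order, so layout order is preserved.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode_one, queries))
```

Because each region gets its own prompt, a table query never has to share an output style with a formula query, which is the decoupling the first point describes.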

Document parsing tasks, particularly those centered on Optical Character Recognition (OCR), are characterized by a high degree of output determinism. Unlike open-ended natural language generation, the tokens in document parsing are rigorously grounded in visual cues, and spatial dependencies between distant tokens are relatively sparse. Drawing inspiration from existing literature on efficient sequence generation [Ran et al., 2020, Chang et al., 2022, Cai et al., 2024], we leverage these properties to propose a dual-track parallelization paradigm: Token Parallelism and Query Parallelism.

Parallel Decoding Strategy

Token Parallelism

Candidate Generation: In each inference iteration, the model simultaneously predicts a block of N candidate tokens (up to 64) by extending the input prefix with specialized mask tokens. (The model architecture is unchanged; mask tokens are simply appended to the end of the prompt to produce multiple candidate outputs.)

Example, assuming N = 3. Original sentence: Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding. Decoding phase: Youtu-Parsing: Perception [mask] [mask] [mask]. All three [mask] tokens are predicted from the prefix **"Youtu-Parsing: Perception"**.
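The prefix extension in this example can be written out in a few lines. This is only an illustrative sketch: `extend_with_masks` is a hypothetical helper, tokens are shown as whitespace-split words rather than real tokenizer output, and the literal `[mask]` string stands in for the model's actual mask token.

```python
from typing import List, Sequence

MASK = "[mask]"  # placeholder for the model's real mask token (assumption)


def extend_with_masks(prefix_tokens: Sequence[str], n: int) -> List[str]:
    """Append n mask tokens to the prefix; the model fills all n in one pass."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(prefix_tokens) + [MASK] * n


prefix = ["Youtu-Parsing:", "Perception"]
model_input = extend_with_masks(prefix, 3)
# model_input is the prefix followed by three "[mask]" placeholders.
```

Each placeholder position yields one candidate token, so a single forward pass proposes N tokens instead of one.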

Decoding Verification: The generated candidates are validated through a verification mechanism to ensure the output is identical to that of standard autoregressive decoding. This guarantees zero degradation in recognition accuracy while maintaining mathematical equivalence to the baseline.
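One way to obtain this equivalence is a greedy draft-then-verify loop, sketched below. The notes do not spell out the paper's exact verification mechanism, so this is an assumption rather than the authors' method: `next_token` is a hypothetical stand-in for the standard autoregressive prediction, `propose` for the mask-based candidate generator, and the sequential calls to `next_token` would be a single batched forward pass in a real model.

```python
from typing import Callable, List, Sequence, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]
Propose = Callable[[List[Token], int], List[Token]]


def verify_block(prefix: Sequence[Token], candidates: Sequence[Token],
                 next_token: NextToken) -> List[Token]:
    """Keep candidates while they match the autoregressive choice.

    At the first mismatch the autoregressive token is kept instead and the
    rest of the block is discarded, so the result always equals what
    token-by-token decoding would have produced.
    """
    accepted: List[Token] = []
    for cand in candidates:
        expected = next_token(list(prefix) + accepted)
        accepted.append(expected)
        if cand != expected:
            break
    return accepted


def parallel_decode(prefix: Sequence[Token], propose: Propose,
                    next_token: NextToken, n: int, eos: Token,
                    max_new: int = 256) -> Tuple[List[Token], int]:
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new:
        # Fall back to one plain autoregressive step if nothing is proposed.
        candidates = propose(out, n) or [next_token(out)]
        block = verify_block(out, candidates, next_token)
        steps += 1
        if eos in block:
            out.extend(block[: block.index(eos) + 1])
            break
        out.extend(block)
    return out, steps
```

Every iteration accepts at least one token, so the loop never does worse than plain autoregressive decoding in step count; the gain comes from iterations where several candidates match.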

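To strengthen the model's ability to predict multiple candidate tokens from the current token, Youtu-Parsing underwent additional pretraining.

Hybrid Masked Training (HMT) Strategy

During fine-tuning, masks of random position and length are added to 80% of the training samples, pushing the model to capture multi-token lookahead dependencies. The remaining 20% are left unmasked to preserve standard autoregressive performance. This strategy yields an empirical speedup of 5–11×, consistent with the theoretical estimate $S \approx k/2$, where $k$ is the average number of tokens accepted per iteration.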
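The 80/20 split can be sketched as a per-sample masking function. This is a minimal sketch under several assumptions that the notes do not confirm: masking is shown as replacing one contiguous span in place, the span is capped at 64 tokens to match the candidate block size, and at least one leading token is always left visible. `hybrid_mask` and its parameters are hypothetical names.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[mask]"  # placeholder for the model's real mask token (assumption)


def hybrid_mask(
    tokens: Sequence[str],
    rng: random.Random,
    mask_prob: float = 0.8,
    max_span: int = 64,
) -> Tuple[List[str], Optional[Tuple[int, int]]]:
    """Return (possibly masked tokens, (start, length) of the span or None)."""
    tokens = list(tokens)
    # About 20% of samples (and very short ones) stay purely autoregressive.
    if len(tokens) < 2 or rng.random() >= mask_prob:
        return tokens, None
    # Random position: keep at least one visible token before the span.
    start = rng.randrange(1, len(tokens))
    # Random length: between 1 and the smaller of max_span and what remains.
    length = rng.randint(1, min(max_span, len(tokens) - start))
    masked = tokens[:start] + [MASK] * length + tokens[start + length:]
    return masked, (start, length)
```

The training targets for the masked positions would be the original tokens at `start` through `start + length - 1`. Separately, reading $S \approx k/2$ backwards, a 5–11× speedup would correspond to roughly 10–22 accepted tokens per iteration.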