GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

1Shanghai Artificial Intelligence Laboratory 2Zhejiang University
3School of Science and Engineering, The Chinese University of Hong Kong
4Fudan University 5Shanghai Innovation Institute

Corresponding author, *Project Leader

🔥[NEW!] GDI-Bench is a document-domain benchmark with a difficulty grading system that decouples visual and reasoning complexity for systematic model evaluation and optimization.
🔥[NEW!] GDI-Model is an 8B-parameter model that demonstrates strong general performance in the document-processing domain.
🔥[NEW!] Dataset & Model have been released!

GDI-Bench Overview

GDI-Bench: The benchmark decouples document understanding complexity into visual complexity (V0-V2) and reasoning complexity (R0-R2) dimensions, creating a comprehensive evaluation framework for assessing MLLMs' capabilities across various document types and reasoning tasks. Queries marked with a “CN” tag originate in Chinese and have been translated into English using Google Translate.

Abstract

The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 2.3k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate various open-source and closed-source models on GDI-Bench, conducting decoupled analyses in the visual and reasoning domains, revealing their strengths and weaknesses. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI-Model that mitigates catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy, thereby reinforcing the inherent weaknesses of the base model. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and models are or will be open-sourced on https://huggingface.co/GDIBench.

GDI-Bench and GDI-Model

To assist Multimodal Large Language Models (MLLMs) in locating their weaknesses within the document domain and to further guide model optimization, we first constructed a benchmark. GDI-Bench decouples task complexity into two distinct dimensions—visual complexity and reasoning complexity—and establishes a graded mechanism.


Complexity Distribution in the GDI Benchmark. The visual complexity dimension is operationalized through a hierarchical categorization of document images into three levels: V0 (plain text), V1 (formal representations), and V2 (explanatory representations). In parallel, the dimension of reasoning complexity is characterized by three categories: R0 (Full Page Structured Extraction), R1 (Information Extraction), and R2 (Reasoning).
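Because every sample carries both a visual level (V0-V2) and a reasoning level (R0-R2), per-difficulty performance can be read off a 3×3 accuracy grid. The following is a minimal sketch of that aggregation, assuming hypothetical per-sample evaluation records with `visual`, `reasoning`, and `correct` fields (the field names are illustrative, not the benchmark's actual schema):

```python
from collections import defaultdict

def difficulty_grid(records):
    """Aggregate per-sample correctness into accuracies indexed by
    (visual level, reasoning level), e.g. ("V1", "R2")."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["visual"], r["reasoning"])
        totals[key] += 1
        correct[key] += int(r["correct"])
    return {k: correct[k] / totals[k] for k in totals}

# hypothetical evaluation records
records = [
    {"visual": "V0", "reasoning": "R0", "correct": True},
    {"visual": "V0", "reasoning": "R0", "correct": False},
    {"visual": "V2", "reasoning": "R2", "correct": True},
]
grid = difficulty_grid(records)
print(grid[("V0", "R0")])  # 0.5
```

A grid like this is what lets the benchmark localize a weakness, e.g. a model that scores well at (V0, R0) but drops sharply at (V2, R2).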

After using GDI-Bench to identify a model's weaknesses, we employ supervised fine-tuning (SFT) to enhance its performance. To address the catastrophic forgetting caused by SFT, we propose the Layer-wise Adaptive Freeze-Tuning (LW-AFT) method.


Overview of the Layer-wise Adaptive Freeze-Tuning method. The LW-AFT method freezes the majority of the model's parameters to preserve its general capabilities. It utilizes a specialized expert model, fine-tuned on a small subset of the dataset, to guide the freezing of parameters in each layer of the original model during the SFT process. Only domain-specific parameters are updated, thereby achieving efficient fine-tuning.
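The core idea, selecting which layers to update based on a small expert fine-tune, can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a plain weight-shift heuristic (tune the layers the expert moved most, freeze the rest), with weights represented as flat lists; the actual per-layer selection rule and granularity may differ.

```python
def select_layers_to_tune(base, expert, top_k=2):
    """Rank layers by how far a small expert fine-tune shifted their
    weights; mark the top-k most-shifted layers as trainable and
    freeze all others. `base` and `expert` map layer name -> weights."""
    shifts = {
        name: sum(abs(b - e) for b, e in zip(base[name], expert[name]))
        for name in base
    }
    ranked = sorted(shifts, key=shifts.get, reverse=True)
    tune = set(ranked[:top_k])
    return {name: (name in tune) for name in base}

# toy 3-layer model: only layer1 moved substantially during the
# expert's fine-tune, so only layer1 is unfrozen for the full SFT
base = {"layer0": [0.1, 0.2], "layer1": [0.3, 0.4], "layer2": [0.5, 0.6]}
expert = {"layer0": [0.1, 0.2], "layer1": [0.9, 0.9], "layer2": [0.5, 0.7]}
print(select_layers_to_tune(base, expert, top_k=1))
# {'layer0': False, 'layer1': True, 'layer2': False}
```

Restricting updates to the layers that the domain data actually moves is what lets the method keep most parameters frozen, preserving general capability while still adapting the domain-specific ones.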

Experiments


The performance of different training methods on multiple datasets.


Performance of various open-source and closed-source models on GDI-Bench at different levels of reasoning complexity. The GDI-Model is fine-tuned based on the InternVL3-8B model.


The performance of different open-source and closed-source large models and different training methods on GDI-Bench.

Case Studies

Table Extraction
Newspaper Title Extraction
Question Extraction by Test Point
Question Extraction by Question Number
Author Extraction
Table QA
Chart Extraction