GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

1Shanghai Artificial Intelligence Laboratory 2Zhejiang University
3School of Science and Engineering, The Chinese University of Hong Kong
4Fudan University 5Shanghai Innovation Institute

Corresponding author, *Project Leader

🔥[NEW!] GDI-Bench is a document-domain benchmark with a difficulty grading system that decouples visual and reasoning complexity for systematic model evaluation and optimization.
🔥[NEW!] GDI-Model is an 8B-parameter model that demonstrates strong general performance in the document-processing domain.
🔥[NEW!] Dataset & Model have been released!

GDI-Bench Overview

GDI-Bench: The benchmark decouples document understanding complexity into visual complexity (V0-V2) and reasoning complexity (R0-R2) dimensions, creating a comprehensive evaluation framework for assessing MLLMs' capabilities across various document types and reasoning tasks. Queries marked with a “CN” tag originate in Chinese and have been translated into English using Google Translate.

Abstract

The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 2.3k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate various open-source and closed-source models on GDI-Bench, conducting decoupled analyses in the visual and reasoning domains, revealing their strengths and weaknesses. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI-Model that mitigates catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy, thereby reinforcing the inherent weaknesses of the base model. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and models are or will be open-sourced on https://huggingface.co/GDIBench.

GDI-Bench and GDI-Model

To assist Multimodal Large Language Models (MLLMs) in locating their weaknesses within the document domain and to further guide model optimization, we first constructed a benchmark. GDI-Bench decouples task complexity into two distinct dimensions—visual complexity and reasoning complexity—and establishes a graded mechanism.


Complexity Distribution in the GDI Benchmark. The visual complexity dimension is operationalized through a hierarchical categorization of document images into three levels: V0 (plain text), V1 (formal representations), and V2 (explanatory representations). In parallel, the dimension of reasoning complexity is characterized by three categories: R0 (Full Page Structured Extraction), R1 (Information Extraction), and R2 (Reasoning).
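Because every sample carries both a visual level (V0-V2) and a reasoning level (R0-R2), per-difficulty performance can be read off a 3×3 accuracy grid. The following is a minimal sketch of that aggregation, assuming hypothetical per-sample evaluation records with `visual`, `reasoning`, and `correct` fields (the field names are illustrative, not the benchmark's actual schema):

```python
from collections import defaultdict

def difficulty_grid(records):
    """Aggregate per-sample correctness into accuracies indexed by
    (visual level, reasoning level), e.g. ("V1", "R2")."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["visual"], r["reasoning"])
        totals[key] += 1
        correct[key] += int(r["correct"])
    return {k: correct[k] / totals[k] for k in totals}

# hypothetical evaluation records
records = [
    {"visual": "V0", "reasoning": "R0", "correct": True},
    {"visual": "V0", "reasoning": "R0", "correct": False},
    {"visual": "V2", "reasoning": "R2", "correct": True},
]
grid = difficulty_grid(records)
print(grid[("V0", "R0")])  # 0.5
```

A grid like this is what lets the benchmark localize a weakness, e.g. a model that scores well at (V0, R0) but drops sharply at (V2, R2).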

After using GDI-Bench to identify a model's weaknesses, we employ supervised fine-tuning (SFT) to enhance its performance. To address the catastrophic forgetting caused by SFT, we propose the Layer-wise Adaptive Freeze-Tuning (LW-AFT) method.


Overview of the Layer-wise Adaptive Freeze-Tuning method. The LW-AFT method freezes the majority of the model's parameters to preserve its general capabilities. It utilizes a specialized expert model, fine-tuned on a small subset of the dataset, to guide the freezing of parameters in each layer of the original model during the SFT process. Only domain-specific parameters are updated, thereby achieving efficient fine-tuning.
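The core idea, selecting which layers to update based on a small expert fine-tune, can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a plain weight-shift heuristic (tune the layers the expert moved most, freeze the rest), with weights represented as flat lists; the actual per-layer selection rule and granularity may differ.

```python
def select_layers_to_tune(base, expert, top_k=2):
    """Rank layers by how far a small expert fine-tune shifted their
    weights; mark the top-k most-shifted layers as trainable and
    freeze all others. `base` and `expert` map layer name -> weights."""
    shifts = {
        name: sum(abs(b - e) for b, e in zip(base[name], expert[name]))
        for name in base
    }
    ranked = sorted(shifts, key=shifts.get, reverse=True)
    tune = set(ranked[:top_k])
    return {name: (name in tune) for name in base}

# toy 3-layer model: only layer1 moved substantially during the
# expert's fine-tune, so only layer1 is unfrozen for the full SFT
base = {"layer0": [0.1, 0.2], "layer1": [0.3, 0.4], "layer2": [0.5, 0.6]}
expert = {"layer0": [0.1, 0.2], "layer1": [0.9, 0.9], "layer2": [0.5, 0.7]}
print(select_layers_to_tune(base, expert, top_k=1))
# {'layer0': False, 'layer1': True, 'layer2': False}
```

Restricting updates to the layers that the domain data actually moves is what lets the method keep most parameters frozen, preserving general capability while still adapting the domain-specific ones.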

Experiments


The performance of different training methods on multiple datasets.


Performance of various open-source and closed-source models on GDI-Bench at different levels of reasoning complexity. The GDI-Model is fine-tuned based on the InternVL3-8B model.


The performance of different open-source and closed-source large models and different training methods on GDI-Bench.

Case Studies

Table Extraction
Newspaper Title Extraction
Question Extraction by Test Point
Question Extraction by Question Number
Author Extraction
Table QA
Chart Extraction