






The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to localize specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce the General Document Intelligence Benchmark (GDI-Bench), featuring 2.3k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity from reasoning complexity, GDI-Bench organizes tasks into graded difficulty levels, enabling performance assessment by difficulty and thereby supporting the identification of model weaknesses and the guidance of optimization. We evaluate a range of open-source and closed-source models on GDI-Bench, conducting decoupled analyses along the visual and reasoning dimensions to reveal their strengths and weaknesses. To address the diverse tasks and domains in GDI-Bench, we propose the GDI-Model, which mitigates catastrophic forgetting during supervised fine-tuning (SFT) through an intelligence-preserving training strategy, thereby addressing the inherent weaknesses of the base model. Our model achieves state-of-the-art performance on both previous benchmarks and GDI-Bench. Our benchmark and models are or will be open-sourced at https://huggingface.co/GDIBench.
To help multimodal large language models (MLLMs) locate their weaknesses in the document domain and to guide subsequent model optimization, we first construct a benchmark. GDI-Bench decouples task complexity into two distinct dimensions, visual complexity and reasoning complexity, and establishes a graded difficulty mechanism along each.
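For illustration, a benchmark sample under this decoupled grading could be represented as below; the field names and level scales are hypothetical placeholders, not GDI-Bench's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class GDITask:
    """Hypothetical record for one graded benchmark sample.

    `visual_level` and `reasoning_level` are graded independently,
    so performance can be broken down along either axis.
    """
    image_path: str
    question: str
    answer: str
    scenario: str          # one of the 9 key scenarios (assumed label)
    task: str              # one of the 19 document-specific tasks (assumed label)
    visual_level: int      # graded visual complexity (assumed 0 = easiest)
    reasoning_level: int   # graded reasoning complexity (assumed 0 = easiest)

# Example: a sample that is visually hard but needs little reasoning,
# useful for isolating weaknesses in the visual dimension.
sample = GDITask(
    image_path="doc_0001.png",
    question="What is the invoice total?",
    answer="$1,250.00",
    scenario="financial_report",
    task="key_information_extraction",
    visual_level=2,
    reasoning_level=0,
)
```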
After using GDI-Bench to identify a model's weaknesses, we apply supervised fine-tuning (SFT) to enhance the model's performance on them. To address the catastrophic forgetting caused by SFT, we propose the Layer-wise Adaptive Freezing-Tuning (LW-AFT) method, sketched below.
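As a rough sketch of the layer-wise freezing idea only (not the paper's exact LW-AFT criterion), one could score each layer's importance to the base model's existing capabilities and freeze the highest-scoring layers before fine-tuning. The `TinyModel` backbone, the `layer_scores` dictionary, and the freezing ratio below are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for an MLLM backbone: a stack of transformer blocks."""
    def __init__(self, num_layers=6, dim=32):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )

def layerwise_freeze(model, layer_scores, freeze_ratio=0.5):
    """Freeze the layers scored as most important to the base model's
    existing capabilities; only the remaining layers are updated by SFT.

    `layer_scores` maps layer index -> importance score. How LW-AFT
    actually computes these scores is not specified here; this dict is
    a placeholder for that criterion.
    """
    num_frozen = int(len(model.layers) * freeze_ratio)
    frozen = set(sorted(layer_scores, key=layer_scores.get, reverse=True)[:num_frozen])
    for idx, layer in enumerate(model.layers):
        for p in layer.parameters():
            p.requires_grad = idx not in frozen
    return frozen

model = TinyModel()
# Placeholder importance scores (e.g., they could come from gradient or
# activation statistics on held-out general-ability data).
scores = {i: float(len(model.layers) - i) for i in range(len(model.layers))}
frozen_layers = layerwise_freeze(model, scores, freeze_ratio=0.5)

# Only the unfrozen parameters are passed to the optimizer for SFT.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```

The intent of such a scheme is that layers most responsible for the base model's general intelligence stay fixed, while the remaining layers adapt to the document-specific tasks, which is one way to limit catastrophic forgetting during SFT.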