TextAtlas
A Large-scale Dataset for Dense Text Image Generation

TextAtlas5M is the first large-scale dataset for dense-text image generation.

Our Observation

Current text-to-image generation methods struggle with rendering dense text. For example, AnyText and TextDiffuser2 can render short text but struggle with longer sequences. GPT-4o with DALL·E 3 and SD3.5 Large perform better, though they still produce inaccuracies such as duplicated words or missing letters when handling extended text. For interleaved documents, all methods perform poorly because they lack layout-planning capabilities.
These results underscore that dense-text image generation remains a challenging task for current models.

🔔News

🔥[2025-02-19] We update the TextAtlasEval evaluation script to evaluate models' ability on dense-text generation. See GITHUB for more details. 🚀

🔥[2025-02-12] We introduce TextAtlas5M, a dataset specifically designed for training and evaluating multimodal generation models on dense-text image generation. 🚀


Introduction

TextAtlas5M focuses on generating dense-text images and stands out in several key ways compared to previous text-rich datasets. Unlike earlier datasets, which primarily focus on short and simple text, TextAtlas5M includes a diverse and complex range of data. It spans from interleaved documents and synthetic data to real-world images containing dense text, offering a more varied and challenging set of examples. Moreover, our dataset features longer text captions, which pose additional challenges for models, and includes human annotations for particularly difficult examples, ensuring a more thorough evaluation of model capabilities.


We design a dedicated test benchmark, TextAtlasEval, to address the long-standing gap in metrics for evaluating long-text information in image generation. By requiring models to effectively process and generate longer text, TextAtlas5M sets itself apart from existing text-rendering benchmarks.

We thoroughly evaluate proprietary and open-source models to assess their long-text generation capabilities. The results reveal the significant challenges posed by TextAtlas5M and provide valuable insights into the limitations of current models, offering key directions for advancing text-rich image generation in future research.


TextAtlas5M (Training Dataset)

Overview


Synthetic Data

CleanTextSynth, TextVisionBlend, StyledTextSynth

The synthetic subset progresses through three levels of complexity, starting with simple text on clean backgrounds. It then advances to interleaved data, blending text with visual elements, and culminates in synthetic natural images, where realistic scenes integrate seamlessly with text.
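The simplest tier described above can be pictured with a minimal sketch: rendering caption text onto a clean background with Pillow. The function name, font, and layout here are illustrative placeholders, not the dataset's actual generation pipeline.

```python
from PIL import Image, ImageDraw
import textwrap

def render_clean_text(text: str, size=(512, 512), margin=24) -> Image.Image:
    """Render a paragraph of text onto a plain white canvas."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # Wrap the caption into lines that roughly fit the image width.
    for i, line in enumerate(textwrap.wrap(text, width=48)):
        # Uses Pillow's default bitmap font; a real pipeline would vary
        # fonts, sizes, and placements for diversity.
        draw.text((margin, margin + i * 14), line, fill="black")
    return img

img = render_clean_text("TextAtlas5M pairs long captions with dense-text images.")
```

The later tiers would extend this idea by compositing text with visual elements and, finally, rendering text into realistic scenes.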

Real Data

The other subsets

The real image subset captures diverse, real-world dense-text scenarios. It includes filtered samples from datasets like AnyText and TextDiffuser, as well as detailed descriptions from PowerPoint slides, book covers, and academic PDF papers.

To enrich diversity, we also gather dense-text images guided by predefined topics. To assess models' capability in dense-text image generation, we introduce a dedicated test set, TextAtlasEval, designed for comprehensive evaluation. This test set spans four distinct data types, ensuring diversity across domains and enhancing the relevance of TextAtlas5M to real-world applications.

Subset Example


Statistics

TextAtlasEval (Evaluation Benchmark)

To evaluate the dense-text image generation ability of existing models, we further propose TextAtlasEval. We adopt stratified random sampling weighted by subset complexity: roughly one third from the advanced synthetic tier (StyledTextSynth), one third from the real-world professional domains of TextScenesHQ, and one third from the web-sourced interleaved TextVisionBlend, covering both controlled and organic scenarios.


For the StyledTextSynth and TextScenesHQ subsets, we sample data from each topic so that the evaluation set covers a wide range of topics. For TextVisionBlend, we perform random sampling. The final test set contains 3,000 samples. In this way, the evaluation set covers different domains of data, allowing us to assess model capabilities across multiple dimensions.
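The sampling strategy above can be sketched as follows. The function and data layout are hypothetical (topic-organized subsets sample per topic, flat subsets sample uniformly); the official construction code may differ.

```python
import random

def sample_eval_set(subsets: dict, total: int = 3000, seed: int = 0) -> list:
    """Build an evaluation set from several subsets.

    `subsets` maps a subset name to either a dict of topic -> samples
    (per-topic sampling, as for StyledTextSynth / TextScenesHQ) or a
    flat list of samples (plain random sampling, as for TextVisionBlend).
    """
    rng = random.Random(seed)           # fixed seed for reproducibility
    per_subset = total // len(subsets)  # roughly equal share per subset
    eval_set = []
    for name, data in subsets.items():
        if isinstance(data, dict):
            # Spread the subset's quota evenly over its topics.
            per_topic = per_subset // len(data)
            for topic, samples in data.items():
                eval_set += rng.sample(samples, min(per_topic, len(samples)))
        else:
            # Flat subset: uniform random sampling.
            eval_set += rng.sample(data, min(per_subset, len(data)))
    return eval_set
```

With three subsets and `total=3000`, each subset contributes roughly 1,000 samples, matching the one-third split described above.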

Leaderboard

To further evaluate model performance, we report FID, CLIP Score, and OCR accuracy in the tables below. We observe that SD3.5 Large achieves the best results overall.


Long-text image generation evaluation on TextAtlasEval. Metrics include FID, CLIP Score (CS), OCR word accuracy (Acc.), OCR F1 score (F1), and Character Error Rate (CER).
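The OCR-based columns can be sketched as below: word-level F1 between OCR output and the target text, and Character Error Rate as edit distance normalized by reference length. These are illustrative implementations, not the official evaluation script (see the GitHub repository for that).

```python
from collections import Counter

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    r, h = reference, hypothesis
    # Standard single-row dynamic-programming edit distance.
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,               # deletion
                        dp[j - 1] + 1,           # insertion
                        prev + (r[i - 1] != h[j - 1]))  # substitution
            prev = cur
    return dp[len(h)] / max(len(r), 1)

def word_f1(reference: str, hypothesis: str) -> float:
    """F1 over bag-of-words overlap between OCR output and target text."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(cer("dense text", "dense test"), 2))        # → 0.1
print(round(word_f1("dense text image", "dense text"), 2))  # → 0.8
```

Lower CER is better (0 means the OCR transcript matches the prompt exactly), while higher F1 is better, which matches the arrows in the table headers.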

Evaluation On TextVisionBlend

Method Date FID⬇ CS⬆ OCR(Acc.)⬆ OCR(F1.)⬆ OCR(Cer.)⬇
PixArt-Sigma 2024-03-27 81.29 0.1891 2.40 1.57 0.83
Infinity-2B 2024-12-05 95.69 0.1979 2.98 3.44 0.83
SD3.5 Large 2024-06-25 118.85 0.1846 14.55 16.25 0.88

Evaluation On StyledTextSynth

Method Date FID⬇ CS⬆ OCR(Acc.)⬆ OCR(F1.)⬆ OCR(Cer.)⬇
SD3.5 Large 2024-06-25 71.09 0.2849 27.21 33.86 0.73
PixArt-Sigma 2024-03-27 82.83 0.2764 0.42 0.62 0.90
Infinity-2B 2024-12-05 84.95 0.2727 0.80 1.42 0.93
AnyText 2023-11-03 117.71 0.2501 0.35 0.66 0.98
TextDiffuser2 2023-11-16 114.31 0.2510 0.76 1.46 0.99

Evaluation On TextScenesHQ

Method Date FID⬇ CS⬆ OCR(Acc.)⬆ OCR(F1.)⬆ OCR(Cer.)⬇
SD3.5 Large 2024-06-25 64.44 0.2363 19.03 24.45 0.73
Infinity-2B 2024-12-05 71.59 0.2346 1.06 1.74 0.88
PixArt-Sigma 2024-03-27 72.62 0.2347 0.34 0.53 0.91
TextDiffuser2 2023-11-16 84.10 0.2252 0.66 1.25 0.96
AnyText 2023-11-03 101.32 0.2174 0.42 0.80 0.95

More Examples

Potential Applications

BibTeX


        @article{wang2025large,
            title={A Large-scale Dataset for Dense Text Image Generation},
            author={Alex Jinpeng Wang and Dongxing Mao and Jiawei Zhang and Weiming Han and Zhuobai Dong and Linjie Li and Yiqi Lin and Zhengyuan Yang and Libo Qin and Fuwei Zhang and Lijuan Wang and Min Li},
            journal={arXiv preprint arXiv:2502.07870},
            year={2025},
        }