TextAtlas5M is the first large-scale dataset dedicated to dense-text image generation.
🔥[2025-02-19] We update the TextAtlasEval evaluation script for assessing models' ability at dense-text image generation. See GITHUB for more details. 🚀
🔥[2025-02-12] We introduce TextAtlas5M, a dataset specifically designed for training and evaluating multimodal generation models on dense-text image generation. 🚀
TextAtlas5M focuses on dense-text image generation and stands out in several key ways compared to previous text-rich datasets. Unlike earlier datasets, which primarily focus on short and simple text, TextAtlas5M covers a diverse and complex range of data, spanning interleaved documents and synthetic data to real-world images containing dense text, offering a more varied and challenging set of examples. Moreover, the dataset features longer text captions, which pose additional challenges for models, and includes human annotations for particularly difficult examples, ensuring a more thorough evaluation of model capabilities.
We design a dedicated test benchmark, TextAtlasEval, to address the long-standing gap in metrics for evaluating long-text rendering in image generation. By requiring models to effectively process and generate longer text, TextAtlas5M sets itself apart from existing text-rendering benchmarks.
We thoroughly evaluate proprietary and open-source models to assess their long-text generation capabilities. The results reveal the significant challenges posed by TextAtlas5M and provide valuable insights into the limitations of current models, offering key directions for advancing text-rich image generation in future research.
The synthetic subset progresses through three levels of complexity, starting with simple text on clean backgrounds. It then advances to interleaved data, blending text with visual elements, and culminates in synthetic natural images, where realistic scenes integrate seamlessly with text.
The real image subset captures diverse, real-world dense text scenarios. It includes filtered samples from datasets like AnyText and TextDiffuser, detailed descriptions from PowerPoint slides, book covers, and academic PDF papers.
To enrich diversity, we also gather dense-text images guided by predefined topics. To assess the capability of models in dense-text image generation, we introduce a dedicated test set, TextAtlasEval, designed for comprehensive evaluation. This test set spans three distinct data types, ensuring diversity across domains and enhancing the relevance of TextAtlas5M for real-world applications.
Topic distribution in the StyledTextSynth and TextScenesHQ subsets, showcasing a diverse range of text-rich topics such as weather reports, banners, and TV shopping ads. StyledTextSynth includes 18 carefully selected topics, while TextScenesHQ ultimately contains 26 distinct topics. These topics are generated using GPT-4 as a world simulator and then filtered by humans to eliminate overlap while ensuring diversity.
Dataset Comparison with Existing Text-Rich Image Generation Datasets. The last two columns detail the sources of the automatically generated labels and the average text token length derived from OCR applied to the images.
Kernel density estimations representing the distribution of perplexity scores for TextAtlas5M compared to reference datasets. The lower the perplexity for a document, the more it resembles a Wikipedia article.
Data Level, Datasets, and Annotations Overview.
CLIP Score Distribution.
To evaluate the dense-text image generation ability of existing models, we further propose TextAtlasEval. We adopt stratified random sampling weighted by subset complexity: 33% from the advanced synthetic tier (StyledTextSynth), 33% from the real-world professional domain (TextScenesHQ), and 33% from the web-sourced interleaved subset (TextVisionBlend), covering both controlled and organic scenarios.
For the StyledTextSynth and TextScenesHQ subsets, we sample data from each topic so that the evaluation set covers a wide range of topics. For TextVisionBlend, we perform random sampling, yielding a final test set of 3,000 samples. In this way, the evaluation set covers different data domains, allowing us to assess model capabilities across multiple dimensions; a minimal sketch of this sampling scheme is given below.
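As an illustration, the following Python sketch assembles an evaluation split under the sampling scheme described above. The record field names (`subset`, `topic`) and the per-subset quota of 1,000 samples are assumptions made for this example, not the exact construction pipeline.

```python
import random
from collections import defaultdict


def build_eval_set(records, per_subset=1000, seed=0):
    """Assemble an evaluation split following the sampling scheme above.

    `records` is a list of dicts with assumed keys: 'subset' in
    {'StyledTextSynth', 'TextScenesHQ', 'TextVisionBlend'} and, for the
    topic-organized subsets, a 'topic' key.
    """
    rng = random.Random(seed)
    eval_set = []

    # Topic-balanced sampling for the two topic-organized subsets.
    for subset in ("StyledTextSynth", "TextScenesHQ"):
        by_topic = defaultdict(list)
        for r in records:
            if r["subset"] == subset:
                by_topic[r["topic"]].append(r)
        quota = per_subset // max(len(by_topic), 1)  # equal share per topic
        for items in by_topic.values():
            eval_set.extend(rng.sample(items, min(quota, len(items))))

    # Plain random sampling for the web-sourced interleaved subset.
    blend = [r for r in records if r["subset"] == "TextVisionBlend"]
    eval_set.extend(rng.sample(blend, min(per_subset, len(blend))))

    rng.shuffle(eval_set)
    return eval_set
```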
To further evaluate model performance, we report FID, CLIP Score, and OCR-based metrics (accuracy, F1, and character error rate) in the tables below. We observe that SD3.5 Large achieves the best overall results. A minimal sketch of the OCR-based metrics follows this paragraph.
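The sketch below shows one plausible way to compute the OCR-based metrics: the generated image is OCR'd, and the recognized text is compared against the ground-truth prompt text using word-level accuracy, bag-of-words F1, and character error rate. The use of `jiwer` for CER and the bag-of-words matching are assumptions for illustration; the official TextAtlasEval script may use a different tokenizer and OCR backend.

```python
from collections import Counter

import jiwer  # assumed dependency for character error rate (pip install jiwer)


def ocr_metrics(reference: str, hypothesis: str):
    """Compare ground-truth text against OCR output from a generated image.

    Returns word-level accuracy, bag-of-words F1, and character error rate.
    Illustrative only; the official evaluation script may differ.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # Bag-of-words overlap between reference and OCR hypothesis.
    overlap = sum((Counter(ref_words) & Counter(hyp_words)).values())
    accuracy = overlap / max(len(ref_words), 1)      # recall-style word accuracy
    precision = overlap / max(len(hyp_words), 1)
    f1 = 2 * precision * accuracy / max(precision + accuracy, 1e-8)

    # CER: character-level edit distance normalized by reference length.
    cer = jiwer.cer(reference, hypothesis)
    return accuracy, f1, cer


if __name__ == "__main__":
    gt = "fresh organic apples two dollars per pound this weekend only"
    ocr = "fresh organic apples two dollars per pound weekend only"
    print(ocr_metrics(gt, ocr))  # word accuracy 0.9, F1 about 0.95, low CER
```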
Long-text image generation evaluation on TextAtlasEval. Metrics include FID, CLIP Score (CS), OCR accuracy (Acc.), F1 score (F1), and Character Error Rate (CER).
Evaluation On TextVisionBlend
Method | Date | FID⬇ | CS⬆ | OCR (Acc.)⬆ | OCR (F1)⬆ | OCR (CER)⬇ |
---|---|---|---|---|---|---|
PixArt-Sigma | 2024-03-27 | 81.29 | 0.1891 | 2.40 | 1.57 | 0.83 |
Infinity-2B | 2024-12-05 | 95.69 | 0.1979 | 2.98 | 3.44 | 0.83 |
SD3.5 Large | 2024-06-25 | 118.85 | 0.1846 | 14.55 | 16.25 | 0.88 |
Evaluation On StyledTextSynth
Method | Date | FID⬇ | CS⬆ | OCR (Acc.)⬆ | OCR (F1)⬆ | OCR (CER)⬇ |
---|---|---|---|---|---|---|
SD3.5 Large | 2024-06-25 | 71.09 | 0.2849 | 27.21 | 33.86 | 0.73 |
PixArt-Sigma | 2024-03-27 | 82.83 | 0.2764 | 0.42 | 0.62 | 0.90 |
Infinity-2B | 2024-12-05 | 84.95 | 0.2727 | 0.80 | 1.42 | 0.93 |
AnyText | 2023-11-03 | 117.71 | 0.2501 | 0.35 | 0.66 | 0.98 |
TextDiffuser2 | 2023-11-16 | 114.31 | 0.2510 | 0.76 | 1.46 | 0.99 |
Evaluation On TextScenesHQ
Method | Date | FID⬇ | CS⬆ | OCR (Acc.)⬆ | OCR (F1)⬆ | OCR (CER)⬇ |
---|---|---|---|---|---|---|
SD3.5 Large | 2024-06-25 | 64.44 | 0.2363 | 19.03 | 24.45 | 0.73 |
Infinity-2B | 2024-12-05 | 71.59 | 0.2346 | 1.06 | 1.74 | 0.88 |
PixArt-Sigma | 2024-03-27 | 72.62 | 0.2347 | 0.34 | 0.53 | 0.91 |
TextDiffuser2 | 2023-11-16 | 84.10 | 0.2252 | 0.66 | 1.25 | 0.96 |
AnyText | 2023-11-03 | 101.32 | 0.2174 | 0.42 | 0.80 | 0.95 |
Example images from the TextScenesHQ subset.
Example images from the StyledTextSynth subset.
@article{wang2025large,
title={A Large-scale Dataset for Dense Text Image Generation},
author={Alex Jinpeng Wang and Dongxing Mao and Jiawei Zhang and Weiming Han and Zhuobai Dong and Linjie Li and Yiqi Lin and Zhengyuan Yang and Libo Qin and Fuwei Zhang and Lijuan Wang and Min Li},
journal={arXiv preprint arXiv:2502.07870},
year={2025},
}