OmniCaptioner: One Captioner to Rule Them All!

1Shanghai Artificial Intelligence Laboratory 2University of Science and Technology of China
3Fudan University 4The Chinese University of Hong Kong

*Equal contribution, 📧Corresponding author

🔥[NEW!] OmniCaptioner demonstrates powerful captioning ability across diverse visual domains!
🔥[NEW!] Weights & Code have been released!


OmniCaptioner: the top section demonstrates its capability to process diverse visual domains. The bottom section highlights its applications in visual reasoning (paired with reasoning LLMs), image generation (integrated with T2I generation models), and efficient adaptation to downstream SFT tasks.

Abstract

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
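Because the weights and code are public, a caption can be produced with the standard Hugging Face multimodal chat flow. The sketch below is a minimal illustration only: the hub id is hypothetical, and the Qwen2-VL-style chat interface is an assumption, so consult the released repository for the exact usage.

```python
# Minimal captioning sketch. Assumptions: the hub id is hypothetical and the
# checkpoint follows a Qwen2-VL-style chat interface; check the released code
# for the exact model class and prompts.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "your-org/OmniCaptioner"  # hypothetical hub id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, device_map="auto")

image = Image.open("chart.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```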

OmniCaptioner

To achieve a unified multimodal pretraining paradigm and handle diverse visual domains, we first construct a diverse caption dataset.


OmniCaptioner's diverse visual captioning pipeline. The pipeline consists of Seed-Caption Generation and Caption Extension. OmniCaptioner utilizes a 21M-caption dataset, covering diverse domains beyond natural images, enabling more comprehensive captioning capabilities.
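As a rough illustration of the Caption Extension stage named above, one detailed seed caption can be rewritten into variants of different styles and lengths by any instruction-tuned LLM. The style names and prompt wording below are assumptions for illustration, not the project's released prompts.

```python
# Sketch of the Caption Extension step: rewrite one detailed "seed" caption
# into shorter or differently styled variants. Style names and prompt
# wording are illustrative assumptions.

EXTENSION_PROMPTS = {
    "medium": "Condense the caption below into one informative paragraph.",
    "short": "Summarize the caption below in a single sentence.",
    "tags": "Convert the caption below into a comma-separated list of tags.",
}

def build_extension_requests(seed_caption: str) -> dict[str, str]:
    """Return one rewriting prompt per target caption style."""
    return {
        style: f"{instruction}\n\nCaption:\n{seed_caption}"
        for style, instruction in EXTENSION_PROMPTS.items()
    }

requests = build_extension_requests(
    "A bar chart comparing the quarterly revenue of three companies in 2023 ..."
)
for style, prompt in requests.items():
    print(f"--- {style} ---\n{prompt}\n")
    # extended = llm.generate(prompt)  # any chat-LLM endpoint
```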

After a unified pretraining process, OmniCaptioner can effectively adapt to diverse downstream tasks, including (i) Improved Visual Reasoning Tasks with LLMs, (ii) Enhanced Image Generation and Conversion, and (iii) Efficient SFT Process.
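For application (i), the plug-and-play setup reduces to two calls: caption the image, then hand the caption plus the question to a text-only reasoning LLM such as DeepSeek-R1. A minimal sketch follows, with both model calls as hypothetical stand-ins.

```python
# Sketch of the plug-and-play visual reasoning pipeline:
# image -> long-context caption -> text-only reasoning LLM.
# Both helper functions are hypothetical stand-ins for real model calls.

def generate_caption(image_path: str) -> str:
    """Stand-in for an OmniCaptioner forward pass (see the sketch above)."""
    return "A geometry diagram of triangle ABC with a right angle at B ..."

def reasoning_llm(prompt: str) -> str:
    """Stand-in for a text-only reasoning model such as DeepSeek-R1."""
    return "..."

def solve_visual_question(image_path: str, question: str) -> str:
    """Answer a question about an image using only its textual caption."""
    caption = generate_caption(image_path)
    prompt = (
        "Below is a detailed description of an image.\n\n"
        f"{caption}\n\n"
        f"Question: {question}\n"
        "Answer using only the description above."
    )
    return reasoning_llm(prompt)

print(solve_visual_question("geometry.png", "What is the measure of angle A?"))
```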


Illustration of OmniCaptioner's plug-and-play applications (sub-figures a and b) and a comparison between OmniCaptioner and LLaVA-OneVision-7B on non-natural image captioning (sub-figure c).

Experiments


Performance comparison on various visual benchmarks between LLMs paired with OmniCaptioner and previous SOTA MLLMs.


Performance comparison of models trained with different captioners on GenEval.


SFT performance comparison across diverse evaluation benchmarks.


Performance comparison across various benchmarks for different LLMs/MLLMs (7B), with or without visual input.

Rule Them All!

Qualitative captioning examples across diverse visual domains:

Natural Image
Table
Chart
Flow Chart
Poster
Equation
Geometry
UI
PDF
Video