OmniCaptioner

OmniCaptioner: One Captioner to Rule Them All!

¹Shanghai Artificial Intelligence Laboratory ²University of Science and Technology of China
³Fudan University ⁴The Chinese University of Hong Kong

*Equal contribution, 📧Corresponding author

Abstract

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

OmniCaptioner

To achieve a unified multimodal pretraining paradigm and handle diverse visual domains, we first construct a diverse caption dataset.

OmniCaptioner's diverse visual captioning pipeline. The pipeline consists of Seed-Caption Generation and Caption Extension. OmniCaptioner utilizes a 21M-caption dataset, covering diverse domains beyond natural images, enabling more comprehensive captioning capabilities.

After a unified pretraining process, OmniCaptioner can effectively adapt to diverse downstream tasks including (i) Improved Visual Reasoning Tasks with LLMs, (ii) Enhanced Image Generation and Conversion, and (iii) Efficient SFT Process.

Illustration of OmniCaptioner's plug-and-play applications (Sub-figures a, b) and comparison between OmniCaptioner and LLaVA-OneVision-7B on non-natural image captioning (Sub-figure c).
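
The plug-and-play use with a reasoning LLM can be pictured as a two-stage pipeline: a captioner turns the image into a detailed description, and a text-only LLM answers the question from that description alone. The following is a minimal sketch under stated assumptions, not the released implementation. The `Captioner` and `TextLLM` interfaces, the prompt wording, and the function names are all hypothetical, and the stubs in the usage section stand in for real models.

```python
from typing import Callable

# Hypothetical interfaces: neither signature comes from the OmniCaptioner release.
Captioner = Callable[[bytes], str]  # image bytes -> detailed caption
TextLLM = Callable[[str], str]      # text prompt -> answer


def build_prompt(caption: str, question: str) -> str:
    """Combine a caption and a question into a text-only prompt (assumed template)."""
    return (
        "Image description:\n"
        f"{caption.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer using only the description above."
    )


def caption_then_reason(
    image: bytes, question: str, captioner: Captioner, llm: TextLLM
) -> str:
    """Two-stage pipeline: the LLM never sees pixels, only the caption text."""
    caption = captioner(image)
    return llm(build_prompt(caption, question))


if __name__ == "__main__":
    # Stand-in stubs so the sketch runs without any model weights.
    fake_captioner = lambda image: "A bar chart with two bars: A = 3, B = 5."
    echo_llm = lambda prompt: f"[LLM received {len(prompt)} characters]"
    print(caption_then_reason(b"", "Which bar is taller?", fake_captioner, echo_llm))
```

Keeping the captioner and the LLM behind plain callables means either side can be swapped independently, which is the sense in which the captioner is plug-and-play here.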

Experiments

Performance comparison on various visual benchmarks between our OmniCaptioner-inserted LLMs and previous SOTA MLLMs.

Performance comparison of models trained with different captioners on GenEval.

SFT performance comparison across diverse evaluation benchmarks.

Performance comparison across various benchmarks for different LLMs/MLLMs (7B), with or without visual input.

🔥[NEW!] OmniCaptioner demonstrates powerful captioning ability across various visual domains!
🔥[NEW!] Weights & Code have been released!

OmniCaptioner: the top section demonstrates its capability to process diverse visual domains. The bottom section highlights its applications in visual reasoning (paired with reasoning LLMs), image generation (integrated with T2I generation models), and efficient adaptation to downstream SFT tasks.
