
Retrievatar: A Multimodal Dataset for Entity-Centric Retrieval-Augmented Generation
Category:  Datasets
Date:  
Author:  CausalLM

Retrievatar is a multimodal dataset designed to enhance the retrieval-augmented generation capabilities of vision-language models, specifically focusing on fictional anime characters and real-world celebrities across various fields. This release represents a subset of 100,000 samples extracted from a significantly larger synthetic image-text corpus. The dataset is being open-sourced to facilitate further research into entity-centric multimodal understanding, with plans to evaluate and potentially release additional thematic subsets in the future.
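For orientation, the snippet below sketches how such a release might be loaded and inspected with the Hugging Face `datasets` library. The repository id, split name, and column names are hypothetical placeholders for illustration, not confirmed details of this release.

```python
# A minimal sketch, assuming the subset is published as a Hugging Face dataset.
# The repository id, split, and column names below are hypothetical placeholders.
from datasets import load_dataset

ds = load_dataset("CausalLM/Retrievatar", split="train")  # hypothetical repo id

print(ds)             # number of rows and column names
sample = ds[0]
print(sample.keys())  # e.g. image, caption, language, metadata (assumed schema)
```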

Data Construction and Methodology

The image captions within this dataset were generated using the Gemini-2.5-pro GA model, leveraging Grounding with Google Search via the Gemini API. The generation process involved a comprehensive input strategy where the model was provided with the source image along with extensive metadata. This metadata included intrinsic image information and contextual content derived from reverse image search web results. By utilizing search-grounded generation, the resulting captions offer a high degree of factual accuracy and contextual richness that goes beyond simple visual description.
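As an illustration of this kind of pipeline, the sketch below shows one way to request a search-grounded caption for an image through the google-genai Python SDK. The prompt wording, metadata content, and file paths are assumptions made for the example; the exact prompts and metadata used to build Retrievatar are not specified here.

```python
# A minimal sketch of search-grounded caption generation with the google-genai SDK.
# Prompt text, metadata content, and file paths are illustrative assumptions.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

with open("avatar.jpg", "rb") as f:  # hypothetical source image
    image_bytes = f.read()

# Contextual metadata gathered beforehand (e.g. from reverse image search) -- assumed format.
metadata = "Candidate entity: <name>; source pages: <urls>; image dimensions: 1024x1024"

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Write an entity-centric caption for this image. "
        "Use the search tool to verify who is depicted and add factual background.\n"
        f"Metadata: {metadata}",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # Grounding with Google Search
    ),
)

print(response.text)
```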

Motivation and Problem Statement

The primary objective of Retrievatar is to mitigate a limitation of traditional vision-language model training, which often relies on hard matching between an individual's name and their visual avatar. Such rigid associations frequently produce downstream models that lack sufficient understanding of the entity's background, creating a disconnect between linking an identity to information and linking a face to a name. Retrievatar addresses this by providing data that bridges the two tasks, fostering a more holistic representation of both fictional and real-world figures.

Languages and Temporal Context

The dataset features multilingual captions in English, Chinese, Japanese, and German to support diverse research applications. Researchers should be aware that the synthetic data construction was completed in August 2025. Consequently, the information contained in the captions and metadata reflects the state of the web at that time and may not capture the most recent developments or changes regarding the subjects depicted.
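For example, a researcher interested in a single language could filter the subset before training. The sketch below assumes a per-sample language field; the column name and repository id are hypothetical, not documented parts of the schema.

```python
# A minimal sketch of selecting captions in one language.
# The "language" column name and repository id are hypothetical assumptions.
from datasets import load_dataset

ds = load_dataset("CausalLM/Retrievatar", split="train")   # hypothetical repo id
ja_only = ds.filter(lambda ex: ex["language"] == "ja")     # keep Japanese captions only
print(len(ja_only))
```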