We have a Slack channel for authors and participants! Please click here to receive an invite. This will be our primary method of communication for questions and announcements.
We welcome submissions of novel papers to the 1st Workshop on Multimodal Augmented Generation via MultimodAl Retrieval, on topics including (but not limited to):
We invite both long (8-page) and short (4-page) submissions, with unlimited pages for references. Camera-ready versions of accepted papers will be allowed an extra page of content to address reviewer feedback.
Submissions should follow the ACL guidelines. Archival and non-archival submissions are permitted, and all papers submitted before April 1, 2025 will be considered. We are using OpenReview to process submissions, linked here.
While information retrieval systems for text documents have been studied extensively for decades, the landscape has shifted: vast amounts of information today are stored as videos with minimal text metadata. For instance, as of January 2024, YouTube hosted over 14 billion videos. Despite this explosion of multimodal data, there remains a dearth of research on the efficient retrieval, processing, and synthesis of these massive multimodal collections. Existing systems still rely largely on text metadata (e.g., YouTube descriptions), overlooking the rich semantic content embedded in the multimodal data itself.
Individual research groups have independently begun addressing this challenge, leading to parallel yet disconnected efforts to define the research space. We are hosting a collaborative venue to unify these efforts and foster dialogue, which we believe is crucial for advancing the field. The workshop will focus on two primary areas: (1) the retrieval of multimodal content, spanning text, images, audio, video, and multimodal data (e.g., image-language, video-language); and (2) retrieval-augmented generation, with an emphasis on multimodal retrieval and generation. To further this goal, we will host a shared task on event-based video retrieval and understanding, designed to spark interest and facilitate research development in both retrieval and generation. The task's primary retrieval metric, nDCG@10, will be computed over the final ranked lists of videos produced by participating systems.
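For reference, a standard linear-gain formulation of this metric is sketched below; the official scoring script may adopt a different gain or discount convention:

    \mathrm{nDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}, \qquad \mathrm{DCG@10} = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i+1)}

where rel_i is the graded relevance of the video at rank i and IDCG@10 is the DCG@10 of an ideal (relevance-sorted) ranking of the judged videos.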
The workshop will be co-located with ACL 2025 in Vienna, Austria, on July 31st or August 1st (exact date TBD).
Existing news video datasets focus primarily on English news broadcasts. To address this limitation, Sanders et al. introduced MultiVENT, a dataset of multilingual event-centric videos aligned with text documents in five target languages. However, both MultiVENT (2,400 videos) and MSR-VTT (10,000 videos) remain small compared to standard text retrieval collections: HC4, the text corpus used in the 2022 TREC NeuCLIR shared task, contains approximately 6 million documents. To create a challenging and practically useful video retrieval task, we introduced MultiVENT 2.0, a collection of over 217,000 videos. It includes 2,549 event-centric queries over a test collection of 109,800 videos (MultiVENT Test), capturing a diverse range of current events. Preliminary results show that this task poses significant challenges for current state-of-the-art vision-language models.
The shared task will focus on retrieving relevant visual content related to specific current events. Our goals are to evaluate the effectiveness of existing multimodal models (e.g., vision-language models) for retrieving multilingual, event-based visual content; to explore the contributions of different modalities to this task; and to assess how retrieved content influences downstream generation results. Submitted systems will be evaluated in two ways. First, systems will be scored as a standard ranked retrieval task, using established metrics from text-based retrieval such as normalized Discounted Cumulative Gain (nDCG). Second, we propose a pilot evaluation of each system's downstream effectiveness on retrieval-augmented generation, using a standard vision-language model (e.g., GPT-4V) along with both automatic metrics and human evaluations. Evaluation leaderboards will be hosted on Eval.ai.
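As a concrete illustration only, the sketch below computes nDCG@10 for a single query in plain Python using the linear-gain formulation given earlier; the video IDs and variable names are hypothetical, and the official evaluation may instead rely on standard tooling such as trec_eval.

    import math

    def dcg_at_k(gains, k=10):
        # Discounted cumulative gain over the top-k gains, in rank order.
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

    def ndcg_at_k(ranked_video_ids, qrels, k=10):
        # nDCG@k for one query: DCG of the system ranking divided by the ideal DCG.
        gains = [qrels.get(vid, 0) for vid in ranked_video_ids]
        ideal_gains = sorted(qrels.values(), reverse=True)
        idcg = dcg_at_k(ideal_gains, k)
        return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

    # Hypothetical graded judgments and system ranking for one query.
    qrels = {"video_a": 3, "video_b": 2, "video_c": 1, "video_d": 0}
    run = ["video_b", "video_a", "video_x", "video_d"]  # video_x is unjudged, so gain 0
    print(round(ndcg_at_k(run, qrels, k=10), 4))

The per-query scores would then be averaged over all queries to produce a system's leaderboard score.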
Although this is the first iteration of the workshop, MultiVENT 2.0 was developed as the primary evaluation dataset for the 2024 Summer Camp for Applied Language Exploration (SCALE), a 10-week workshop hosted by the Human Language Technology Center of Excellence (HLTCOE) at Johns Hopkins University.
Details regarding how to submit to the shared task will be released soon.
This will be a one-day hybrid workshop to allow remote participation. The morning session will feature our first invited speaker, followed by selected oral paper presentations. In the afternoon, additional speaker presentations will precede an overview of the shared task and results. The day will conclude with oral presentations of shared task submissions, paper and shared task awards, and a final poster session.