We have a Slack channel for authors and participants! Please click here to receive an invite. This will be our primary method of communication for questions and announcements.
We welcome submissions of novel papers to the 1st Workshop on Multimodal Augmented Generation via MultimodAl Retrieval, on topics including (but not limited to):
We invite both long (8-page) and short (4-page) submissions, with unlimited pages for references. Camera-ready versions of accepted papers will be allowed an extra page of content to address reviewer feedback.
Submissions should follow the ACL guidelines. Archival and non-archival submissions are permitted, and all papers submitted before April 1, 2025 will be considered. We are using OpenReview to process submissions, linked here.
While information retrieval systems for text documents have been studied extensively for decades, the landscape has shifted: vast amounts of information today are stored as videos with minimal text metadata. For instance, as of January 2024, YouTube hosted over 14 billion videos. Despite this explosion of multimodal data, there remains a dearth of research on the efficient retrieval, processing, and synthesis of these massive multimodal collections. Existing systems still rely largely on text metadata (e.g., YouTube descriptions), overlooking the rich semantic content embedded in the multimodal data itself.
Individual research groups have independently begun addressing this challenge, leading to parallel yet disconnected efforts to define the research space. We are hosting a collaborative venue to unify these efforts and foster dialogue, which we believe is crucial for advancing the field. The workshop will focus on two primary areas: (1) the retrieval of multimodal content, spanning text, images, audio, video, and multimodal data (e.g., image-language, video-language); and (2) retrieval-augmented generation, with an emphasis on multimodal retrieval and generation. To further this goal, we will host a shared task on event-based video retrieval and understanding, designed to spark interest and facilitate research development in both retrieval and generation. The task's primary retrieval metric, nDCG@10, will be computed over the final ranked lists of videos produced by participating systems.
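For reference, a standard linear-gain formulation of this metric is sketched below; the official scoring script may adopt a different gain or discount convention:

    \mathrm{nDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}, \qquad \mathrm{DCG@10} = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i+1)}

where rel_i is the graded relevance of the video at rank i and IDCG@10 is the DCG@10 of an ideal (relevance-sorted) ranking of the judged videos.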
The workshop will be co-located with ACL 2025 in Vienna, Austria, on July 31st or August 1st (exact date TBD).
Existing news video datasets focus primarily on English news broadcasts. To address this limitation, Sanders et al. introduced MultiVENT, a dataset of multilingual event-centric videos aligned with text documents in five target languages. However, both MultiVENT (2,400 videos) and MSR-VTT (10,000 videos) remain small compared to standard text retrieval collections: HC4, the text corpus used in the 2022 TREC NeuCLIR shared task, contains approximately 6 million documents. To create a challenging and practically useful video retrieval task, we introduced MultiVENT 2.0, a collection of over 217,000 videos. It includes 2,549 event-centric queries over a test collection of 109,800 videos (MultiVENT Test), capturing a diverse range of current events. Preliminary results show that this task poses significant challenges for current state-of-the-art vision-language models.
The shared task will focus on retrieving relevant visual content related to specific current events. Our goals are to evaluate the effectiveness of existing multimodal models (e.g., vision-language models) for retrieving multilingual, event-based visual content; to explore the contributions of different modalities to this task; and to assess how retrieved content influences downstream generation results. Submitted systems will be evaluated in two ways. First, systems will be scored as a standard ranked retrieval task, using established metrics from text-based retrieval such as normalized Discounted Cumulative Gain (nDCG). Second, we propose a pilot evaluation of each system's downstream effectiveness on retrieval-augmented generation, using a standard vision-language model (e.g., GPT-4V) along with both automatic metrics and human evaluations. Evaluation leaderboards will be hosted on Eval.ai.
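As a concrete illustration only, the sketch below computes nDCG@10 for a single query in plain Python using the linear-gain formulation given earlier; the video IDs and variable names are hypothetical, and the official evaluation may instead rely on standard tooling such as trec_eval.

    import math

    def dcg_at_k(gains, k=10):
        # Discounted cumulative gain over the top-k gains, in rank order.
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

    def ndcg_at_k(ranked_video_ids, qrels, k=10):
        # nDCG@k for one query: DCG of the system ranking divided by the ideal DCG.
        gains = [qrels.get(vid, 0) for vid in ranked_video_ids]
        ideal_gains = sorted(qrels.values(), reverse=True)
        idcg = dcg_at_k(ideal_gains, k)
        return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

    # Hypothetical graded judgments and system ranking for one query.
    qrels = {"video_a": 3, "video_b": 2, "video_c": 1, "video_d": 0}
    run = ["video_b", "video_a", "video_x", "video_d"]  # video_x is unjudged, so gain 0
    print(round(ndcg_at_k(run, qrels, k=10), 4))

The per-query scores would then be averaged over all queries to produce a system's leaderboard score.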
Although this is the first iteration of the workshop, MultiVENT 2.0 was developed as the primary evaluation dataset for the 2024 Summer Camp for Applied Language Exploration (SCALE), a 10-week workshop hosted by the Human Language Technology Center of Excellence (HLTCOE) at Johns Hopkins University.
Details regarding how to submit to the shared task will be released soon.
This will be a one-day hybrid workshop to allow remote participation. The morning session will feature our first invited speaker, followed by selected oral paper presentations. In the afternoon, additional speaker presentations will precede an overview of the shared task and results. The day will conclude with oral presentations of shared task submissions, paper and shared task awards, and a final poster session.