Multilingual Multiparty Coreference Resolution (MMC)

MMC is a dataset for multilingual multiparty coreference resolution, released with the paper Multilingual Coreference Resolution in Multiparty Dialogue. It is built from transcripts and subtitles of The Big Bang Theory and Friends, and contains 1,222 multiparty dialogue scenes annotated with coreference in English, 1,215 in Farsi, and 1,215 in Chinese.

You can download it here [.zip (5.5MB)].

The data covers three languages: English, Farsi, and Chinese. Each language's data is split into train/dev/test and stored in CoNLL format. Each file is named {split}.{language}.v4_gold_conll. There are two files for the Chinese test set: test.chinese.v4_gold_conll is the automatic projection from the English data, and test_corrected.chinese.v4_gold_conll is the human-corrected version of that projection.
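As a rough illustration of the naming scheme and file layout, the Python sketch below builds a file name and collects coreference clusters from CoNLL-style lines. It rests on assumptions not stated here: that the files follow the usual CoNLL-2012 conventions (`#begin document` / `#end document` markers, whitespace-separated columns, coreference in the last column written as `(3`, `3)`, `(3)`, joined by `|`), and that language names in file names are lowercase, as in test.chinese.v4_gold_conll. The helpers `split_file_name` and `read_clusters` and the sample lines are hypothetical, not part of the release.

```python
from collections import defaultdict


def split_file_name(split, language):
    """Build a name following the {split}.{language}.v4_gold_conll pattern."""
    return f"{split}.{language}.v4_gold_conll"


def read_clusters(lines):
    """Return one {cluster_id: [(start, end), ...]} dict per document.

    Token indices run over the whole document (they are not reset per
    sentence) and both ends of a span are inclusive.
    Assumes CoNLL-2012-style lines with coreference in the last column.
    """
    docs = []
    clusters = None      # clusters of the document being read, or None
    open_starts = None   # stack of open mention starts per cluster id
    token_idx = 0
    for raw in lines:
        line = raw.strip()
        if line.startswith("#begin document"):
            clusters = defaultdict(list)
            open_starts = defaultdict(list)
            token_idx = 0
            continue
        if line.startswith("#end document"):
            docs.append(dict(clusters))
            clusters = None
            continue
        if not line or clusters is None:
            continue  # blank sentence separator or text outside a document
        coref = line.split()[-1]
        if coref != "-":
            for part in coref.split("|"):
                cluster_id = int(part.strip("()"))
                if part.startswith("("):
                    open_starts[cluster_id].append(token_idx)
                if part.endswith(")"):
                    start = open_starts[cluster_id].pop()
                    clusters[cluster_id].append((start, token_idx))
        token_idx += 1
    return docs


# Made-up sample lines in the assumed layout (not taken from the dataset).
sample = [
    "#begin document (demo); part 000",
    "demo 0 0 Sheldon - - - - - - - (0)",
    "demo 0 1 took - - - - - - - -",
    "demo 0 2 his - - - - - - - (1|(0)",
    "demo 0 3 spot - - - - - - - 1)",
    "#end document",
]
docs = read_clusters(sample)
```

To read a real file, the same function can be given an open file handle, for example `read_clusters(open(split_file_name("test_corrected", "chinese"), encoding="utf-8"))`, adjusting the path to wherever the archive was unpacked.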

To cite:

@misc{zheng-etal-2022-multilingual,
url = {https://arxiv.org/abs/2208.01307},
author = {Zheng, Boyuan and Xia, Patrick and Yarmohammadi, Mahsa and Van Durme, Benjamin},
title = {Multilingual Coreference Resolution in Multiparty Dialogue},
publisher = {arXiv},
year = {2022},
}