RAMS is the dataset associated with the paper Multi-Sentence Argument Linking. It contains 9,124 annotated events from news articles, based on an ontology of 139 event types and 65 roles. In a 5-sentence window around each event trigger, we annotate the closest span for each role. Our code and models are available,* as are the slides for our paper.
Download RAMS 1.0 [current version (March 2023), tar.gz (4.5MB)]
Or, view a Single Example.
* We fixed a bug that affects performance on the Beyond NomBank (BNB) dataset. See the BNB documentation for details.
The data is split into train/dev/test files. Each line in a data file contains a json string. Each json contains:

ent_spans: start and end (inclusive) indices and an event/argument/role string.
evt_triggers: start and end (inclusive) indices and an event type string.
sentences: the document text.
gold_evt_links: a triple of (event, argument, role) following the above formats.
source_url: the source of the text.
split: which data split it belongs to.
doc_key: which individual file it corresponds to (nw_ is prepended to all of them).
All other fields are extraneous; they exist to allow for future iterations of RAMS.
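For concreteness, here is a hand-constructed sketch of a single data line (wrapped here for readability; in the data files each json string occupies one line). The indices, type and role strings, and URL are invented for illustration, and the exact nesting in the released files may differ; the sketch simply mirrors the field descriptions above:

{"ent_spans": [[35, 36, "victim"]],
 "evt_triggers": [[30, 30, "life.die.deathcausedbyviolentevents"]],
 "sentences": [["First", "sentence", "."], ["Second", "sentence", "."]],
 "gold_evt_links": [[[30, 30], [35, 36], "victim"]],
 "source_url": "http://example.com/article",
 "split": "train",
 "doc_key": "nw_example"}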
A scorer is released alongside the data.
The basic use of the scorer is below:
python scorer.py --gold_file <GOLD> --pred_file <PRED> --ontology_file <ONTOLOGY> --do_all
Some notes:

<PRED> can be in one of two formats. In both cases, it contains one json string per line, and that json blob must contain a doc_key. The predictions are given under either:

a gold_evt_links key, like in the gold data. Add the --reuse_gold_format flag when running the scorer.

a predictions key, as in this example: "predictions": [[[70, 70], [63, 63, "victim", 1.0], [58, 58, "place", 1.0]]]. It is a list of event-predictions (in RAMS there is only one). Each event-prediction starts with a [start, end] (inclusive) span for the event at index 0, and a [start, end, label, confidence] at subsequent indices for each argument. (A sketch of producing this format is given after these notes.)
<ONTOLOGY> is used for type-constrained decoding. It is a tsv where the 0th column is the event name, the (2i+1)th column is a role name, and the (2i+2)th column is the count of that role permitted by the event. (An example row is also given after these notes.)
Pass -cd to enable type-constrained decoding; it is off by default.

--do_all prints out metrics (--metrics), metrics by distance (--distance), metrics by role (--role_table), and a csv confusion matrix (--confusion). Individual metrics can be printed with their own flags (in parens).
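As a sketch of the second <PRED> format, the snippet below writes one json blob per document, following the layout described in the notes above. The doc_key, spans, and role labels are invented for illustration:

import json

# Hypothetical predictions for one document. Each event-prediction is the
# [start, end] trigger span followed by one [start, end, label, confidence]
# entry per predicted argument.
predictions_by_doc = {
    "nw_example": [
        [[70, 70],
         [63, 63, "victim", 1.0],
         [58, 58, "place", 1.0]],
    ],
}

with open("predictions.jsonl", "w") as f:
    for doc_key, event_predictions in predictions_by_doc.items():
        # One json string per line; each blob must contain a doc_key.
        blob = {"doc_key": doc_key, "predictions": event_predictions}
        f.write(json.dumps(blob) + "\n")

The resulting file is then passed to the scorer as --pred_file.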
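For <ONTOLOGY>, a single hypothetical tab-separated row following the column layout above could look like the following, permitting one victim and one place argument for the event; consult the ontology file shipped with the release for the actual event names, role names, and counts:

life.die.deathcausedbyviolentevents	victim	1	place	1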
Please contact us if you want to obtain the older versions of the data. We encourage you to use the current version.
RAMS_1.0c.tar.gz [current as of March 2023, 4.5MB] is the current version, with the scorer corrected. Please refer to this as RAMS 1.0. While we had previously noticed that the scorer swapped precision and recall, we failed to copy the fixed scorer into previous releases. The data itself has not changed, and all reported F1 scores are still correct.
RAMS_1.0b.tar.gz [4.5MB] includes LICENSE information as of July 2020 (see below).
RAMS_1.0.tar.gz [4.5MB] contains the same data as the current version above, but its scorer swapped precision and recall.
RAMS_0.9.tar.gz [10MB] was used in an earlier version of the paper and contains human-readable files. We found some overlap between its splits; the current version re-splits the data to remove that overlap.
@inproceedings{ebner-etal-2020-multi,
title={Multi-Sentence Argument Linking},
author={Seth Ebner and Patrick Xia and Ryan Culkin and Kyle Rawlins and Benjamin {Van Durme}},
year={2020},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
}
RAMS 1.0 consists of annotations against paragraph-sized examples drawn from articles distributed publicly on the internet.
We do not own that text, nor do we claim copyright: examples drawn from these articles are meant for research use in algorithmic design.
We release our annotations of the underlying text under CC-BY-SA-4.0.
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
Clearly identify the copyrighted work claimed to be infringed.
Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
And contact the authors.
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the annotations.