ParaBank

The ParaBank project consists of a series of efforts exploring the potential for guided backtranslation for the purpose of paraphrasing with constraints. This work is spiritually connected to prior efforts at JHU in paraphrasing, in particular projects surrounding the ParaPhrase DataBase (PPDB).

The following are brief descriptions of projects under ParaBank, along with associated artifacts.

ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation

Abstract: We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT, we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that ParaBank's paraphrases improve over ParaNMT on both semantic similarity and fluency. Finally, we use ParaBank to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks.

arXiv: https://arxiv.org/abs/1901.03644

ParaBank v1.0 Full (~9 GB)

ParaBank v1.0 Large, 50m pairs (~3 GB)

ParaBank v1.0 Small Diverse, 5m pairs

ParaBank v1.0 Large Diverse, 50m pairs

Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting

Abstract: Lexically-constrained sequence decoding allows for explicit positive or negative phrasebased constraints to be placed on target output strings in generation tasks such as machine translation or monolingual text rewriting. We describe vectorized dynamic beam allocation, which extends work in lexically-constrained decoding to work with batching, leading to a five-fold improvement in throughput when working with positive constraints. Faster decoding enables faster exploration of constraint strategies: we illustrate this via data augmentation experiments with a monolingual rewriter applied to the tasks of natural language inference, question answering and machine translation, showing improvements in all three.

https://www.aclweb.org/anthology/N19-1090

pMNLI : Paraphrase Augmentation of MNLI

Large-scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

Abstract: Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering.We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.

https://www.aclweb.org/anthology/K19-1005

ParaBank v2.0 (~2.3 GB)

Iterative Paraphrastic Augmentation with Discriminative Span Alignment

Abstract: We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually-produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. Based on roughly four days of collecting training data for the alignment model and approximately one day of parallel compute, we automatically generate 495,300 unique (Frame, Trigger) combinations annotated in context, a roughly 50x expansion atop FrameNet v1.7.

TACL: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00380/100783/Iterative-Paraphrastic-Augmentation-with

Augmented FrameNet

Name: framenet-expanded-vers2.0.jsonlines.gz

This file contains an expanded 1,983,680-sentence version of FrameNet generated by applying 10 rounds of iterative paraphrastic augmentation to (almost all) of the roughly 200,000 sentences in the original resource. Each line is a JSON object with the following attributes:

frame_name: The frame to which this sentence belongs.
lexunit_compound_name: The lexical unit in the form of lemma.POS e.g. increase.n.
original_string: The raw FrameNet sentence.
original_trigger_offset: The character-level offset into the raw FrameNet sentence representing the trigger.
original_trigger: The string value of the trigger.
frame_id: The associated frame ID from FrameNet data release v1.7.
lexunit_id: The associated lexical unit ID from FrameNet data release v1.7.
exemplar_id: The associated exemplar ID from FrameNet data release v1.7.
annoset_id: The associated annotation set ID from FrameNet data release v1.7.
outputs: A list containing 10 items, each representing an automatically paraphrased and aligned sentence corresponding to the original FrameNet source sentence.

Each such item is of the form:

output_string: The tokenized automatically-generated paraphrase.
output_trigger_offset: The offset into the paraphrase representing the automatically aligned trigger.
output_trigger: The string value of the automatically aligned trigger in the paraphrase.
pbr_score: The negative log-likelihood of this paraphrase under the paraphrase model.
aligner_score: The probability of this alignment under the alignment model.
iteration: The iteration in which this output was generated (ranges between 1 and 10).
pclassifier_score: Probability of this output under a classifier trained to optimize for high precision of acceptable outputs.
rclassifier_score: Probability of this output under a classifier trained to optimize for high recall of acceptable outputs.

The pclassifier_score may be used to select a smaller, higher quality subset of the full dataset whereas the rclassifier_score may be used to obtain a larger but slightly lower quality subset.

Alignment Dataset

Name: alignment-release.jsonlines.gz

This file contains a 36,417-instance manually annotated dataset for monolingual span alignment. Each data point consists of a natural-language sentence (the source), a span in that sentence, an automatically generated paraphrase (the reference), and a span in the reference with the same meaning as the source-side span. All source sentences are taken from FrameNet v1.7.

Each line is a JSON object with the following attributes:

source_bert_toks: The tokenized source sentence.
source_bert_span: Offset into the source sentence representing a span.
`referencespacytokens``: The tokenized reference sentence.
reference_span: Offset into the reference sentence representing a span.
has_corres: Boolean value representing whether the reference sentence contains a span that corresponds in meaning to the source-side span.
exemplar_id: The associated exemplar ID from FrameNet data release v1.7.