The ParaBank project is a series of efforts exploring guided backtranslation as a means of paraphrasing under constraints. This work follows in the spirit of prior paraphrasing efforts at JHU, in particular the projects surrounding the ParaPhrase DataBase (PPDB).

The following are brief descriptions of projects under ParaBank, along with associated artifacts.

ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation

Abstract: We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT, we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that ParaBank's paraphrases improve over ParaNMT on both semantic similarity and fluency. Finally, we use ParaBank to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks.


ParaBank v1.0 Full (~9 GB)

ParaBank v1.0 Large, 50m pairs (~3 GB)

ParaBank v1.0 Small Diverse, 5m pairs

ParaBank v1.0 Large Diverse, 50m pairs

Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting

Abstract: Lexically-constrained sequence decoding allows for explicit positive or negative phrase-based constraints to be placed on target output strings in generation tasks such as machine translation or monolingual text rewriting. We describe vectorized dynamic beam allocation, which extends work in lexically-constrained decoding to work with batching, leading to a five-fold improvement in throughput when working with positive constraints. Faster decoding enables faster exploration of constraint strategies: we illustrate this via data augmentation experiments with a monolingual rewriter applied to the tasks of natural language inference, question answering and machine translation, showing improvements in all three.

pMNLI: Paraphrase Augmentation of MNLI

Large-scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

Abstract: Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.

ParaBank v2.0 (~2.3 GB)

Iterative Paraphrastic Augmentation with Discriminative Span Alignment

Abstract: We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually-produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. Based on roughly four days of collecting training data for the alignment model and approximately one day of parallel compute, we automatically generate 495,300 unique (Frame, Trigger) combinations annotated in context, a roughly 50x expansion atop FrameNet v1.7.


Augmented FrameNet

Name: framenet-expanded-vers2.0.jsonlines.gz

This file contains an expanded 1,983,680-sentence version of FrameNet, generated by applying 10 rounds of iterative paraphrastic augmentation to almost all of the roughly 200,000 sentences in the original resource. Each line is a JSON object with the following attributes:

Each such item is of the form:

The pclassifier_score may be used to select a smaller, higher-quality subset of the full dataset, whereas the rclassifier_score may be used to obtain a larger but slightly lower-quality subset.
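As a rough illustration of this kind of subset selection, the following Python sketch streams a gzip-compressed JSON-lines file and keeps only the records whose score meets a cutoff. It rests on several assumptions that this page does not confirm: that the score is a numeric field stored at the top level of each JSON object, that higher values indicate higher quality, and that a cutoff such as 0.9 is meaningful. The helper names read_jsonlines_gz and filter_by_score are hypothetical and are not part of the release.

```python
import gzip
import json
from typing import Any, Dict, Iterable, Iterator


def read_jsonlines_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one JSON object per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def filter_by_score(
    records: Iterable[Dict[str, Any]],
    score_key: str,
    threshold: float,
) -> Iterator[Dict[str, Any]]:
    """Keep records whose score_key value is at least threshold.

    Assumption: the score is a top-level numeric field and higher is better.
    Records that lack the field are skipped rather than raising an error.
    """
    for record in records:
        score = record.get(score_key)
        if score is not None and score >= threshold:
            yield record


# Example use (the 0.9 cutoff is an arbitrary, illustrative value):
#   records = read_jsonlines_gz("framenet-expanded-vers2.0.jsonlines.gz")
#   subset = filter_by_score(records, "pclassifier_score", 0.9)
#   print(sum(1 for _ in subset))
```

Because both helpers are generators, the file is processed one line at a time and never has to fit in memory. Passing "rclassifier_score" as the key would, under the same assumptions, select the larger subset instead.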

Alignment Dataset

Name: alignment-release.jsonlines.gz

This file contains a 36,417-instance manually annotated dataset for monolingual span alignment. Each data point consists of a natural-language sentence (the source), a span in that sentence, an automatically generated paraphrase (the reference), and a span in the reference with the same meaning as the source-side span. All source sentences are taken from FrameNet v1.7.

Each line is a JSON object with the following attributes: