Dataset Descriptions



Bioinformatics Datasets

arxiv1810.00826-t1_excerpt.png

[Image source. Click image to open in new window.]


  • MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds with 7 discrete labels. See also Classification Statistics on the MUTAG, ENZYMES Datasets

  • PROTEINS is a dataset where nodes are secondary structure elements (SSEs) and there is an edge between two nodes if they are neighbors in the amino-acid sequence or in 3D space. It has 3 discrete labels, representing helix, sheet or turn.

  • PTC is a dataset of 344 chemical compounds that report carcinogenicity for male and female rats; it has 19 discrete labels.

  • NCI1 is a dataset made publicly available by the National Cancer Institute (NCI): a subset of balanced datasets of chemical compounds screened for the ability to suppress or inhibit the growth of a panel of human tumor cell lines. It has 37 discrete labels.

  • Reference, datasets: Benchmark Data Sets for Graph Kernels (2016)

  • Used in: How Powerful are Graph Neural Networks?
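These four benchmarks are distributed in the plain-text layout documented by the Benchmark Data Sets for Graph Kernels collection referenced above. The minimal sketch below parses that layout and reports per-dataset statistics; the file prefix and paths are assumptions about where the archive was unpacked.

```python
# Minimal sketch: parse a graph-classification benchmark (e.g. MUTAG) from the
# file layout used by the "Benchmark Data Sets for Graph Kernels" collection:
#   <DS>_A.txt                comma-separated edge list (1-based node ids)
#   <DS>_graph_indicator.txt  graph id (1-based) of every node
#   <DS>_graph_labels.txt     class label of every graph
# The prefix/path below is an assumption about where the archive was unpacked.
from collections import defaultdict

def load_tu_dataset(prefix="MUTAG/MUTAG"):
    with open(f"{prefix}_graph_indicator.txt") as f:
        node_graph = [int(line) for line in f]            # node i -> graph id
    with open(f"{prefix}_graph_labels.txt") as f:
        graph_labels = [int(line) for line in f]          # graph id -> class
    edges = defaultdict(list)                             # graph id -> edges
    with open(f"{prefix}_A.txt") as f:
        for line in f:
            u, v = (int(x) for x in line.split(","))
            edges[node_graph[u - 1]].append((u, v))
    return edges, graph_labels

edges, labels = load_tu_dataset()
print(f"{len(labels)} graphs, {len(set(labels))} graph classes, "
      f"{sum(len(e) for e in edges.values())} directed edge entries")
# For MUTAG this should report 188 graphs in 2 classes.
```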


PPI (Protein-Protein Interaction) is a Homo Sapiens PPI network where each label corresponds to a biological state.

  • Source: BioGRID, an interaction repository with data compiled through comprehensive curation efforts. From the BioGRID site: “Our current index is version 3.5.166 and searches 67,477 publications for 1,623,645 protein and genetic interactions, 28,093 chemical associations and 726,378 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats.”

  • Yoshua Bengio and colleagues describe this dataset in Graph Attention Networks (Feb 2018): “We make use of a protein-protein interaction (PPI) dataset that consists of graphs corresponding to different human tissues (Zitnik & Leskovec, 2017). The dataset contains 20 graphs for training, 2 for validation and 2 for testing. Critically, testing graphs remain completely unobserved during training. To construct the graphs, we used the preprocessed data provided by Hamilton et al. (2017). The average number of nodes per graph is 2,372. Each node has 50 features that are composed of positional gene sets, motif gene sets and immunological signatures. There are 121 labels for each node set from gene ontology, collected from the Molecular Signatures Database (Subramanian et al., 2005), and a node can possess several labels simultaneously.”

    arxiv1710.10903-t1-PPI.png

    [Image source. Click image to open in new window.]


    arxiv1809.10341-t1-PPI.png

    [Image source. Click image to open in new window.]


  • Grover and Leskovec, node2vec: Scalable Feature Learning for Networks (Jul 2016), used a subgraph of the PPI network for Homo sapiens: the graph induced by the nodes for which labels could be obtained from the hallmark gene sets, where the labels represent biological states. The subnetwork had 3,890 nodes, 76,584 edges, and 50 different labels.

    Celikkanat and Malliaros [TNE: A Latent Model for Representation Learning on Networks (Oct 2018)] cite this subnetwork, but report (their Table 1) 3,890 nodes, 38,739 edges and 50 clusters (i.e., classes):

    arxiv1810.06917-t1.png

    [Image source. Click image to open in new window.]
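PPI is typically evaluated as multi-label node classification (121 binary labels per node), with results reported as micro-averaged F1. Below is a minimal sketch of that metric using scikit-learn; the label arrays are random placeholders standing in for real ground truth and model output.

```python
# Minimal sketch of micro-averaged F1 for multi-label node classification on
# PPI (121 binary labels per node). y_true / y_pred are random placeholders
# standing in for real labels and model predictions.
import numpy as np
from sklearn.metrics import f1_score

num_nodes, num_labels = 1000, 121          # placeholder node count; 121 labels as in PPI
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(num_nodes, num_labels))
y_pred = rng.integers(0, 2, size=(num_nodes, num_labels))

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
```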



Citation Datasets

arxiv1809.10341-t1.png

[Image source. Click image to open in new window.]



CiteSeer for Entity Resolution.  The CiteSeer for Entity Resolution dataset contains 1,504 machine learning documents with 2,892 author references to 165 author entities. For this dataset, the only attribute information available is the author name. The full last name is always given; in some cases the author’s full first and middle names are given, while in other cases only initials are given.

CiteSeer for Document Classification.  The CiteSeer for Document Classification dataset consists of 3,312 scientific publications classified into one of six classes. The citation network consists of 4,732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3,703 unique words. The README file in the dataset provides more details.
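The CiteSeer document-classification corpus (and Cora, below) is distributed by LINQS as a `<name>.content` file (one line per paper: id, 0/1 word attributes, class label) and a `<name>.cites` file (one citation link per line), as described in the dataset README. A minimal parsing sketch, with file paths assumed to point at the unpacked CiteSeer archive:

```python
# Minimal sketch: parse the LINQS-distributed CiteSeer (or Cora) files.
#   <name>.content : one line per paper -> <paper_id> <0/1 word attrs ...> <class_label>
#   <name>.cites   : one line per link  -> <cited_paper_id> <citing_paper_id>
# Paths are assumptions about where the archive was unpacked.
import numpy as np

def load_linqs(prefix="citeseer/citeseer"):
    ids, feats, labels = [], [], []
    with open(f"{prefix}.content") as f:
        for line in f:
            parts = line.split()
            ids.append(parts[0])
            feats.append([int(x) for x in parts[1:-1]])
            labels.append(parts[-1])
    with open(f"{prefix}.cites") as f:
        edges = [tuple(line.split()) for line in f]
    return ids, np.array(feats), labels, edges

ids, X, y, edges = load_linqs()
print(X.shape, len(set(y)), len(edges))   # expected: (3312, 3703), 6 classes, 4732 links
```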


The Cora dataset [GitHub] consists of 2,708 scientific publications classified into one of seven classes; it was assembled to support research on relational machine learning. The citation network consists of 5,429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1,433 unique words.

  • Schema:

    Cora_schema.png

    [Image source. Click image to open in new window.]


  • Most approaches (papers) report on the small subset of the Cora dataset (above). The original dataset, described by Andrew McCallum et al. in Automating the Construction of Internet Portals with Machine Learning (2000) contained >50,000 computer science research papers. These data are available on McCallum’s data page, and here (csv file).

    Bojchevski and Günnemann [Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking (Feb 2018)] additionally extracted the entire network from the original data and named the two datasets:

    • CORA ($\small \mathcal{N} = 19,793$, $\small \mathcal{E} = 65,311$, $\small \mathcal{D} = 8,710$, $\small \mathcal{K} = 70$) and
    • CORA-ML ($\small \mathcal{N} = 2,995$, $\small \mathcal{E} = 8,416$, $\small \mathcal{D} = 2,879$, $\small \mathcal{K} = 7$)

    where $\small \mathcal{N}$ is the number of nodes, $\small \mathcal{E}$ the number of edges, $\small \mathcal{D}$ the dimensionality of each node's attribute vector, and $\small \mathcal{K}$ the number of classes (categories; compare the MotifNet description of CORA below, which classifies papers into 70 categories). Those datasets are available here: project page | GitHub.

    arxiv1707.03815-f5.png

    [Image source. Click image to open in new window.]


  • That CORA dataset was employed in [MotifNet: A Motif-Based Graph Convolutional Network for Directed Graphs], which further described it:

    “We tested our approach on the directed CORA citation network. The vertices of the CORA graph represent 19,793 scientific papers, and directed edges of the form $\small (i,j)$ represent citation of paper $\small j$ in paper $\small i$. The content of each paper is represented by a vector of 8,710 numerical features (term frequency-inverse document frequency of various words that appear in the corpus), to which we applied PCA taking the first 130 components. The task is to classify the papers into one of the 70 different categories.”
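A hedged sketch of the dimensionality reduction step described in that quote (TF-IDF vectors projected onto their first 130 principal components), using scikit-learn on a small random placeholder in place of the real 19,793 × 8,710 feature matrix:

```python
# Hedged sketch of the preprocessing described above: reduce 8,710-dimensional
# TF-IDF paper features to their first 130 principal components. A small random
# matrix stands in for the real 19,793 x 8,710 feature matrix.
import numpy as np
from sklearn.decomposition import PCA

X_tfidf = np.random.rand(2000, 8710).astype(np.float32)   # placeholder features
X_reduced = PCA(n_components=130).fit_transform(X_tfidf)
print(X_reduced.shape)                                     # (2000, 130)
```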


PubMed Diabetes.  The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF-weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.

  • Dataset (at LINQS): LINQS Statistical Relational Learning Group: Datasets
  • Paper: Query-driven Active Surveying for Collective Classification (2012):

    “In these experiments, we use four real-world networks: Cora, CiteSeer, Wikipedia, and PubMed. … Finally, the PubMed citation network is a set of articles related to diabetes from the PubMed database. Node attributes are TF/IDF-weight word frequencies and the labels specify the type of diabetes addressed in the publication.”

    arxiv1809.02709-t1_descr.png

    [Image source. Click image to open in new window.]

GLUE Dataset

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Apr 2018; updated Sep 2018) [FAQ | Leaderboard]

“For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.”

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have already existed for a number of years, but the purpose of GLUE is to (1) distribute these datasets with canonical Train, Dev, and Test splits, and (2) set up an evaluation server to mitigate issues with evaluation inconsistencies and Test set overfitting. GLUE does not distribute labels for the Test set and users must upload their predictions to the GLUE server for evaluation, with limits on the number of submissions.

  • CoLA. The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not.

  • MNLI. Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task. Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

  • MRPC. Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

  • QNLI. Question Natural Language Inference is a version of the Stanford Question Answering Dataset which has been converted to a binary classification task. The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer.

  • QQP. Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent.

  • RTE. Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data.

  • SST-2. The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment.

  • STS-B. The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources. They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

  • WNLI. Winograd NLI is a small natural language inference dataset. The GLUE webpage notes that there are issues with the construction of this dataset, and every trained system that’s been submitted to GLUE has performed worse than the 65.1% accuracy baseline of predicting the majority class.
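Official GLUE scores come from the evaluation server described above, but the per-task metrics themselves are standard: Matthews correlation for CoLA, Pearson/Spearman correlation for STS-B, F1 plus accuracy for MRPC and QQP, and accuracy for the remaining tasks. A minimal sketch with made-up toy predictions:

```python
# Toy illustration of the per-task GLUE metrics (official scoring happens on
# the GLUE evaluation server; these predictions are made-up placeholders).
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

cola_true, cola_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print("CoLA   Matthews corr.:", matthews_corrcoef(cola_true, cola_pred))

stsb_true, stsb_pred = [4.5, 1.0, 3.2, 2.8], [4.1, 1.5, 3.0, 3.3]
print("STS-B  Pearson r:     ", pearsonr(stsb_true, stsb_pred)[0])
print("STS-B  Spearman rho:  ", spearmanr(stsb_true, stsb_pred)[0])

mrpc_true, mrpc_pred = [1, 1, 0, 1], [1, 0, 0, 1]
print("MRPC   accuracy:      ", accuracy_score(mrpc_true, mrpc_pred))
print("MRPC   F1:            ", f1_score(mrpc_true, mrpc_pred))
```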


Knowledge Graph Datasets

arxiv1703.08098-table2.png

[Image source. Click image to open in new window.]

WN18 is a subset of WordNet (a large lexical knowledge graph whose entities are synsets expressing distinct concepts, connected by lexical relations). It contains 18 relations and 40,943 entities. Most of its 151,442 triples (“generic facts”) consist of hyponym and hypernym relations; for that reason, WN18 tends to follow a strictly hierarchical structure.

  • WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets [source], provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. A synset is a set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

  • “A popular relation prediction dataset for WordNet is the subset curated as WN18, containing 18 relations for about 41,000 synsets extracted from WordNet 3.0. It has been noted that this dataset suffers from considerable leakage: edges from reciprocal relations such as hypernym/hyponym appear in one direction in the training set and in the opposite direction in dev/test. This allows trivial rule-based baselines to achieve high performance.

    To alleviate this concern, Dettmers et al. (Jul 2018) released the WN18RR set, removing seven relations altogether. However, even this dataset retains four symmetric relation types: ‘also see’, ‘derivationally related form’, ‘similar to’, and ‘verb group’. These symmetric relations can be exploited by defaulting to a simple rule-based predictor.” [Source: Section 4.1 in Predicting Semantic Relations using Global Graph Properties; references therein.]
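As a rough illustration of how that leakage can be exploited, the sketch below counts how many test triples have a reciprocal training triple and uses the most common reciprocal relation as a trivial predictor. The file paths and the whitespace-separated head/relation/tail format are assumptions about how the WN18 split files are stored locally.

```python
# Rough sketch of the reciprocal-relation leakage in WN18: count how many test
# triples (h, r, t) can be recovered just by looking up a training triple
# (t, r2, h), where r2 is the relation that most often pairs with r in training.
# File paths and the whitespace-separated "head relation tail" format are
# assumptions about the local copy of the split files.
from collections import Counter, defaultdict

def load_triples(path):
    with open(path) as f:
        return [tuple(line.split()) for line in f]

train = load_triples("wn18/train.txt")
test = load_triples("wn18/test.txt")

rels_between = defaultdict(set)            # (head, tail) -> relations seen in training
for h, r, t in train:
    rels_between[(h, t)].add(r)

reciprocal_votes = defaultdict(Counter)    # r -> counts of relations seen in the reverse direction
for h, r, t in train:
    for r2 in rels_between.get((t, h), ()):
        reciprocal_votes[r][r2] += 1

hits = 0
for h, r, t in test:
    guess = reciprocal_votes[r].most_common(1)
    # a hit: the usual reciprocal partner of r already links t back to h in training
    if guess and guess[0][0] in rels_between.get((t, h), ()):
        hits += 1
print(f"test triples recoverable via the reciprocal rule: {hits}/{len(test)}")
```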


WN18RR corrects flaws in WN18. Dettmers et al. note that “WN18RR reclaims WN18 as a dataset, which cannot easily be completed using a single rule, but requires modeling of the complete knowledge graph,” and recommend that future research avoid FB15k and WN18 in favor of FB15k-237, WN18RR, and YAGO3-10. WN18RR contains 93,003 triples with 40,943 entities and 11 relations.


FB15k is a subset of Freebase which contains 1,345 relations among 14,951 entities. The training set contains 483,142 triples, the validation set 50,000 triples, and the test set 59,071 triples. 454 rules were created for FB15k. A large fraction of content in this knowledge graph describes facts about movies, actors, awards, sports, and sport teams.


FB15k-237  [see also], which corrects errors in FB15k, contains 14,541 entities with 237 different relations. This dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. The knowledge base triples are a subset of the FB15k set. The textual mentions are derived from 200 million sentences from the ClueWeb12 corpus coupled with FACC1 Freebase entity mention annotations.


YAGO (YAGO3) is a large semantic knowledge base, derived from Wikipedia, WordNet, WikiData, GeoNames, and other data sources. Currently, YAGO knows more than 17 million entities (like persons, organizations, cities, etc.) and contains more than 150 million facts about these entities.

YAGO is special in several ways:

  • The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value. [Not every version of YAGO is manually evaluated. Most notably, the version generated by this code may not be the one that we evaluated! Check the versions on the YAGO download page.]

  • YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.

  • YAGO is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.

  • In addition to taxonomy, YAGO has thematic domains such as “music” or “science” from WordNet Domains.

  • YAGO extracts and combines entities and facts from Wikipedias in 10 different languages.

  • “YAGO is a lightweight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the $\small Is-A$ hierarchy as well as non-taxonomic relations between entities (such as $\small hasWonPrize$). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper (YAGO 2007). The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships – and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. …”

  • “YAGO is a large knowledge base that is built automatically from Wikipedia, WordNet and GeoNames. The project combines information from Wikipedia in 10 different languages, thus giving the knowledge a multilingual dimension. It also attaches spatial and temporal information to many facts, and thus allows the user to query the data over space and time. YAGO focuses on extraction quality and achieves a manually evaluated precision of 95%. In this paper (Yago 2016), we explain from a general perspective how YAGO is built from its sources, how its quality is evaluated, how a user can access it, and how other projects utilize it.”



YAGO3-10 (a subset of YAGO3) consists of entities which have a minimum of 10 relations each. It has 123,182 entities and 37 relations. Most of the triples deal with descriptive attributes of people, such as citizenship, gender, and profession.


YAGO37 is extracted from the core facts of YAGO3 and was created by the RUGE authors. It contains 37 relations among 123,189 entities. The training set contains 989,132 triples, the validation set 50,000 triples, and the test set 50,000 triples. 16 rules were created for YAGO37. “All triples are unique and we made sure that all entities/relations appearing in the validation or test sets were occurring in the training set.”


Machine Reading Comprehension Datasets

arxiv1810.13441-t1.png

[Image source. Click image to open in new window.]

DuoRC. In April 2018 IBM Research introduced a new dataset for reading comprehension (DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension)  [project | data]. DuoRC is a large-scale reading comprehension (RC) dataset of 186K human-generated QA pairs created from 7,680 pairs of parallel movie plots taken from Wikipedia and IMDb. By design, DuoRC ensures very little or no lexical overlap between the questions created from one version and the segments containing answers in the other version.


KBMRC dataset. Machine reading comprehension (MRC) requires reasoning about both the knowledge involved in a document and knowledge about the world. However, existing datasets are typically dominated by questions that can be solved well by context matching, which fail to test this capability. Microsoft Research Asia recently published Knowledge Based Machine Reading Comprehension (Sep 2018), which addressed knowledge-based MRC and built a new [unnamed] dataset consisting of 40,047 question-answer pairs. The annotation of this dataset was designed so that successfully answering the questions requires both comprehension of the document and the background knowledge it involves.


MS-MARCO. Microsoft Research recently published S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension, a novel approach to machine reading comprehension on the MS-MARCO dataset that answers a question from multiple passages via an extraction-then-synthesis framework, synthesizing answers from extraction results. Unlike the SQuAD dataset, which aims to answer a question with exact text spans in a passage, the MS-MARCO dataset defines the task as answering a question from multiple passages, and the words in the answer are not necessarily present in the passages.

arxiv1706.04815-table1.png

[Image source. Click image to open in new window.]



MultiRC, from the University of Pennsylvania, is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. The goal of this dataset is to encourage the research community to explore approaches that can do more than sophisticated lexical-level matching. See MultiRC: Reading Comprehension over Multiple Sentences;  [project | code].


NarrativeQA requires understanding of an underlying narrative, asking the reader to answer questions about stories after reading entire books or movie scripts. See The NarrativeQA Reading Comprehension Challenge  [GitHub].


QAngaroo focuses on reading comprehension that requires the gathering of several pieces of information via multiple steps of inference. See Constructing Datasets for Multi-hop Reading Comprehension Across Documents  [project].


RACE is a benchmark dataset for evaluating reading comprehension methods. RACE consists of nearly 28,000 passages and nearly 100,000 questions generated by human experts (English instructors), and covers a variety of topics carefully designed to evaluate students’ ability in understanding and reasoning. In particular, the proportion of questions that require reasoning is much larger in RACE than in other reading comprehension benchmarks, and there is a significant gap between the performance of state-of-the-art models (43%) and ceiling human performance (95%).

arxiv1704.04683-t1.png

[Image source. Click image to open in new window.]

SQuAD. The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (span) from the corresponding reading passage, or the question may be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD2.0 is a challenging natural language understanding task for existing models; its authors released it as the successor to SQuAD1.1 in the hope of encouraging reading comprehension systems that know what they don’t know.
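For intuition, here is a simplified sketch of SQuAD-style scoring with the SQuAD2.0 abstention behaviour described above: token-overlap F1 for answerable questions, and credit for unanswerable questions only when the system predicts an empty answer. The official evaluation script applies fuller answer normalization than this.

```python
# Simplified SQuAD-style scoring with SQuAD2.0 abstention: token-overlap F1 for
# answerable questions; for unanswerable ones (empty gold string) the system
# scores 1.0 only if it also predicts the empty string. The official script
# additionally normalizes articles and punctuation.
from collections import Counter

def squad_f1(prediction, gold):
    if not gold or not prediction:                    # unanswerable / abstention case
        return float(prediction == gold)
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("the Norman conquest", "Norman conquest of England"))  # partial credit (~0.57)
print(squad_f1("", ""))            # correctly abstained on an unanswerable question -> 1.0
print(squad_f1("some guess", ""))  # answered an unanswerable question -> 0.0
```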


TACRED. The TACRED relation extraction dataset was introduced by Zhang et al. in Position-aware Attention and Supervised Data Improve Slot Filling. TACRED is a large (106,264 examples) supervised relation extraction dataset, obtained via crowdsourcing and targeted towards TAC KBP relations. See also Position-aware Self-attention with Relative Positional Encodings for Slot Filling  [code].


WikiHop. “We define a novel RC [reading comprehension] task in which a model should learn to answer queries by combining evidence stated across documents. We introduce a methodology to induce datasets for this task and derive two datasets.

  • The first, WikiHop, uses sets of Wikipedia articles where answers to queries about specific properties of an entity cannot be located in the entity’s article.

  • In the second dataset, MedHop, the goal is to establish drug-drug interactions based on scientific findings about drugs and proteins and their interactions, found across multiple Medline abstracts.

  • For both datasets we draw upon existing Knowledge Bases, Wikidata and Drugbank as ground truth …” See Constructing Datasets for Multi-hop Reading Comprehension Across Documents (Jun 2018).

    arxiv1710.06481-fig1.png

    [Image source. Click image to open in new window.]


    arxiv1710.06481-fig3.png

    [Image source. Click image to open in new window.]



Machine Reasoning Datasets


AI2 Reasoning Challenge (ARC). The Allen Institute for Artificial Intelligence (AI2) presented a question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. See Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge;  [project | code].

  • “Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods. To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, and whose Challenge partition is hard for retrieval and co-occurrence methods. We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD. Progress on ARC would thus be an impressive achievement, given its design, and be a significant step forward for the community.”

  • ARC was recently used in Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Scientific Question Answering by authors at UC San Diego and Microsoft AI Research. Existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. In this paper, the authors proposed a retriever-reader model that learns to attend [via self-attention layers] on essential terms during the question answering process: an essential-term-aware “retriever” first identifies the most important words in a question, then reformulates the query and searches for related evidence, while an enhanced “reader” distinguishes between essential terms and distracting words to predict the answer. On the ARC dataset their model outperformed the existing state of the art [e.g., BiDAF] by 8.1%.


bAbI;  [code] is a set of 20 question answering tasks for testing text understanding and reasoning. The dataset is composed of a set of contexts, with multiple question answer pairs available based on the contexts.

  • “bAbI is a synthetic reading comprehension dataset, created by Facebook AI researchers in 2015. The term synthetic data refers to data that is not extracted from a book or from the internet, but is generated by using a few rules that simulate natural language. This characteristic of bAbI places the weight of the task on the reasoning module rather than the understanding module. Question Answering data sets provide synthetic tasks for the goal of helping to develop learning algorithms for understanding and reasoning.” [Source.]
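To make the “generated by a few rules” point concrete, the sketch below produces a bAbI-style single-supporting-fact example; the templates and entity names are invented for illustration and are not taken from the actual bAbI generator.

```python
# Toy generator for a bAbI-style "single supporting fact" example, illustrating
# how a few rules can produce synthetic QA data. Templates and names are
# invented for illustration; this is not the actual bAbI generator.
import random

PEOPLE = ["Mary", "John", "Sandra", "Daniel"]
PLACES = ["bathroom", "hallway", "kitchen", "garden"]

def generate_example(num_facts=3, seed=None):
    rng = random.Random(seed)
    location = {}
    lines = []
    for i in range(1, num_facts + 1):
        person, place = rng.choice(PEOPLE), rng.choice(PLACES)
        location[person] = place                      # later facts overwrite earlier ones
        lines.append(f"{i} {person} moved to the {place}.")
    person = rng.choice(sorted(location))
    lines.append(f"{num_facts + 1} Where is {person}?\t{location[person]}")
    return "\n".join(lines)

print(generate_example(seed=1))
```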

arxiv1810.12698-t1.png

[Image source. Click image to open in new window.]


arxiv1502.05698-t1.png

[Image source. Click image to open in new window.]


arxiv1502.05698-t2.png

[Image source. Click image to open in new window.]



bAbI-10k English (Weston et al., 2015a) is a synthetic dataset which features 20 different tasks. Each example is composed of a set of facts, a question, the answer, and the supporting facts that lead to the answer. The dataset comes in two sizes, referring to the number of training examples each task has: bAbI-1k and bAbI-10k. See Weston et al. (Facebook AI Research), Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks (Dec 2015).


CLEVR is a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. “We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.”

arxiv1612.06890-f1.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f1.png

[Image source. Click image to open in new window.]



Natural Language Inference Datasets


Breaking NLI is a new natural language inference (NLI) dataset, described by Glockner et al. [Yoav Goldberg] in Breaking NLI Systems with Sentences that Require Simple Lexical Inferences (May 2018) [GitHub]:

  • “We create a new NLI test set that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge. The new examples are simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set is substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences.”

arxiv1805.02266-table1.png

[Image source. Click image to open in new window.]


arxiv1805.02266-table3.png

[Image source. Click image to open in new window.]



MultiNLI corpus, described in Multi-Genre Natural Language Inference, is a newer natural language inference (NLI) corpus than SNLI. MultiNLI is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information: entailment, contradiction and neutral. The MultiNLI corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.

  • Unlike the SNLI corpus, which draws the premise sentence from image captions, MultiNLI consists of sentence pairs from ten distinct genres of both written and spoken English (e.g. fiction, government text or spoken telephone conversations). The dataset is divided into training (392,702 pairs), development (20,000 pairs) and test sets (20,000 pairs).

  • All of the genres are included in the test and development sets, but only five are included in the training set. The development and test datasets have been divided into matched and mismatched, where the former includes only sentences from the same genres as the training data, and the latter includes sentences from the remaining genres not present in the training data.

  • In addition to the training, development and test sets, MultiNLI provides a smaller annotation dataset, which contains approximately 1000 sentence pairs annotated with linguistic properties of the sentences and is split between the matched and mismatched datasets. [The annotated dataset and description of the annotations are available at multinli_1.0_annotations.zip.]

  • This annotation dataset provides a simple way to assess what kind of sentence pairs an NLI system is able to predict correctly and where it makes errors.

  • Source for the points, above: Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture.

  • MultiNLI is modeled after SNLI. The two corpora are distributed in the same formats, and for many applications, it may be productive to treat them as a single, larger corpus.

  • MultiNLI is described in A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (Apr 2017; updated Feb 2018).

    arxiv1704.05426-table1.png

    [Image source. Click image to open in new window.]


  • “The MultiNLI premise sentences are derived from ten sources [genres] of freely available text which are meant to be maximally diverse and roughly represent the full range of American English:

    • FACE-TO-FACE: transcriptions from the Charlotte Narrative and Conversation Collection of two-sided, in-person conversations that took place in the early 2000s;
    • GOVERNMENT: reports, speeches, letters, and press releases from public domain government websites;
    • LETTERS: letters from the Indiana Center for Intercultural Communication of Philanthropic Fundraising Discourse written in the late 1990s-early 2000s;
    • 9/11: the public report from the National Commission on Terrorist Attacks Upon the United States released on July 22, 2004;
    • OUP: five non-fiction works on the textile industry and child development published by the Oxford University Press;
    • SLATE: popular culture articles from the archives of Slate Magazine written between 1996-2000;
    • TELEPHONE: transcriptions from University of Pennsylvania’s Linguistic Data Consortium Switchboard corpus of two-sided, telephone conversations that took place in 1990 or 1991;
    • TRAVEL: travel guides published by Berlitz Publishing in the early 2000s;
    • VERBATIM: short posts about linguistics for non-specialists from the Verbatim archives written between 1990 and 1996; and
    • FICTION: for our tenth genre, we compile several freely available works of contemporary fiction written between 1912 and 2010 spanning various genres including mystery, humor, science fiction, and adventure.”

  • Breaking NLI (above) is newer than MultiNLI, which in turn is newer than SNLI.
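As a quick check of the matched/mismatched genre split described above, one can tally genres per split; the `genre` field name and the file paths below are assumptions about the distributed MultiNLI JSON-lines files.

```python
# Tally the genre distribution of MultiNLI splits to see the matched/mismatched
# division. The "genre" field name and the file paths are assumptions about the
# distributed JSON-lines files.
import json
from collections import Counter

def genre_counts(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["genre"]] += 1
    return counts

print("train:     ", genre_counts("multinli_1.0/multinli_1.0_train.jsonl"))
print("matched:   ", genre_counts("multinli_1.0/multinli_1.0_dev_matched.jsonl"))
print("mismatched:", genre_counts("multinli_1.0/multinli_1.0_dev_mismatched.jsonl"))
```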


SNLI (Stanford Natural Language Inference Corpus). Natural language inference (NLI), also known as “recognizing textual entailment” (RTE), is the task of identifying the relationship (entailment, contradiction, or neutral) that holds between a premise $\small p$ (e.g. a piece of text) and a hypothesis $\small h$. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI) Corpus, contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels “entailment,” “contradiction,” and “neutral,” supporting the task of NLI.


Natural Language Processing Datasets


The Billion Word dataset contains 768M word tokens and has a vocabulary of about 800K word types, which corresponds to words with more than 3 occurrences in the training set.

  • Paper: Chelba et al. [… Tomas Mikolov …], One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling (Google: Dec 2013; updated Mar 2014):

    “We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. …”


Gigaword.  The English Gigaword is a sentence summarization dataset based on Annotated Gigaword (Napoles et al., 2012), consisting of sentence pairs formed from the first sentence of each collected news article and its corresponding headline. Rush et al. (2015) (Jason Weston/Facebook AI Research; Sep 2015) preprocessed the data into 3.8M sentence pairs for training, 8K for validation, and 2K for testing.

  • “For training data for both tasks, we [Rush et al., 2015] utilize the annotated Gigaword data set (Graff et al., 2003; Napoles et al., 2012), which consists of standard Gigaword, preprocessed with Stanford CoreNLP tools (Manning et al., 2014). Our model only uses annotations for tokenization and sentence separation, although several of the baselines use parsing and tagging as well. Gigaword contains around 9.5 million news articles sourced from various domestic and international news services over the last two decades.”

MS-COCO.  COCO is a large-scale object detection, segmentation, and captioning dataset. See Microsoft COCO: Common Objects in Context (Feb 2015). “We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. …”


Penn Treebank (PTB).

  • Preprocessed by Mikolov et al. in Recurrent Neural Network Based Language Model (Sep 2010) [local copy]. Old (still extant: 2018-10-18) project page: RNNLM Toolkit  (search GitHub for “rnnlm toolkit”).

  • From Pointer Sentinel Mixture Models (Sep 2016):

    “In order to compare our model to the many recent neural language models, we conduct word-level prediction experiments on the Penn Treebank (PTB) dataset (Marcus et al., 1993), pre-processed by Mikolov et al. (2010). The dataset consists of 929k training words, 73k validation words, and 82k test words. As part of the pre-processing performed by Mikolov et al. (2010):

    • words were lower-cased,
    • numbers were replaced with $\small N$,
    • newlines were replaced with $\small \langle eos \rangle$, and
    • all other punctuation was removed.

    “The vocabulary is the most frequent 10k words with the rest of the tokens being replaced by an $\small \langle unk \rangle$ token. For full statistics, refer to Table 1 [below].”

    arxiv1609.07843-t1.png

    [Image source. Click image to open in new window.]


    Reasons for a New Dataset [WikiText-103].

    “While the processed version of the Penn Treebank has been frequently used for language modeling, it has many limitations. The tokens in PTB are all lower case, stripped of any punctuation, and limited to a vocabulary of only 10k words. These limitations mean that the PTB is unrealistic for real language use, especially when far larger vocabularies with many rare words are involved. Fig. 3 [source] illustrates this using a Zipfian plot over the training partition of the PTB. The curve stops abruptly when hitting the 10k vocabulary. Given that accurately predicting rare words, such as named entities, is an important task for many applications, the lack of a long tail for the vocabulary is problematic.”
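A rough sketch of Mikolov-style preprocessing in the spirit of the steps listed above (lower-casing, numbers mapped to N, punctuation stripped, a 10k-word vocabulary with <unk>); this approximates, rather than reproduces, the original preprocessing script.

```python
# Rough approximation of the Mikolov-style PTB preprocessing summarized above:
# lower-case, map numbers to N, drop punctuation, end each line with <eos>,
# keep the 10k most frequent words and map the rest to <unk>. This is an
# illustration, not the original preprocessing script.
import re
from collections import Counter

def normalize(line):
    line = line.lower()
    line = re.sub(r"\d+(\.\d+)?", "N", line)    # numbers -> N
    line = re.sub(r"[^\w\s]", " ", line)        # drop punctuation
    return line.split() + ["<eos>"]             # newline -> <eos>

corpus = ["The index fell 2.5 % to 2,569 points.",
          "Analysts expect further declines."]
tokens = [tok for line in corpus for tok in normalize(line)]
vocab = {w for w, _ in Counter(tokens).most_common(10000)}
print([tok if tok in vocab else "<unk>" for tok in tokens])
```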


SciERC. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction  [data/code] introduced a joint multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. They created “SciERC,” a dataset that included annotations for all three tasks, and developed a unified framework called Scientific Information Extractor (SciIE) with shared span representations. [SciIE was able to automatically organize the extracted information from a large collection of scientific articles into a knowledge graph. …]


SciTail. The SciTail dataset is an NLI dataset created from multiple-choice science exams consisting of 27k sentence pairs. Each question and the correct answer choice have been converted into an assertive statement to form the hypothesis. The dataset is divided into training (23,596 pairs), development (1,304 pairs) and test sets (2,126 pairs). Unlike the SNLI and MultiNLI datasets, SciTail uses only two labels: entailment and neutral.


WikiText-103. The training data of WikiText-103 comprises about 100M tokens and a vocabulary of around 260K, corresponding to types with more than 3 occurrences in the training data. The dataset is composed of shuffled Wikipedia articles where the context carries across sentences.

arxiv1609.07843-t1.png

[Image source. Click image to open in new window.]


  • Paper: Merity et al. [Richard Socher | MetaMind/Salesforce] Pointer Sentinel Mixture Models (Sep 2016):

    “… While the processed version of the Penn Treebank has been frequently used for language modeling, it has many limitations. The tokens in PTB are

    • all lower case,
    • stripped of any punctuation, and
    • limited to a vocabulary of only 10k words.

    “These limitations mean that the PTB is unrealistic for real language use, especially when far larger vocabularies with many rare words are involved. Fig. 3 [source] illustrates this using a Zipfian plot over the training partition of the PTB. The curve stops abruptly when hitting the 10k vocabulary. Given that accurately predicting rare words, such as named entities, is an important task for many applications, the lack of a long tail for the vocabulary is problematic.”

    Construction and Preprocessing

    “We selected articles only fitting the ‘Good’ or ‘Featured’ article criteria specified by editors on Wikipedia. These articles have been reviewed by humans and are considered well written, factually accurate, broad in coverage, neutral in point of view, and stable. This resulted in 23,805 Good articles and 4,790 Featured articles. The text for each article was extracted using the Wikipedia API. Extracting the raw text from Wikipedia mark-up is nontrivial due to the large number of macros in use. These macros are used extensively and include metric conversion, abbreviations, language notation, and date handling.

    “Once extracted, specific sections which primarily featured lists were removed by default. Other minor bugs, such as sort keys and Edit buttons that leaked in from the HTML, were also removed. Mathematical formulae and $\small \LaTeX$ code were replaced with $\small \langle formula \rangle$ tokens. Normalization and tokenization were performed using the Moses tokenizer (Koehn et al., 2007), slightly augmented to further split numbers $\small (8,600 → 8 @,@ 600)$ and with some additional minor fixes. Following Chelba et al. (2013) a vocabulary was constructed by discarding all words with a count below 3. Words outside of the vocabulary were mapped to the $\small \langle unk \rangle$ token, also a part of the vocabulary.

    “To ensure the dataset is immediately usable by existing language modeling tools, we have provided the dataset in the same format and following the same conventions as that of the PTB dataset.”
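Two of the preprocessing details quoted above are easy to illustrate: the number splitting (8,600 → 8 @,@ 600) and the count-3 vocabulary threshold with <unk> mapping. The sketch below approximates both; it is not the original WikiText pipeline.

```python
# Approximation of two WikiText preprocessing details quoted above: splitting
# numbers around commas (8,600 -> 8 @,@ 600) and discarding words that occur
# fewer than 3 times, replacing them with <unk>. Not the original pipeline.
import re
from collections import Counter

def split_numbers(text):
    return re.sub(r"(\d),(\d)", r"\1 @,@ \2", text)

print(split_numbers("The population grew from 8,600 to 1,250,000."))
# -> "The population grew from 8 @,@ 600 to 1 @,@ 250 @,@ 000."

tokens = "a a a b b c".split()
counts = Counter(tokens)
vocab = {w for w, c in counts.items() if c >= 3}
print([w if w in vocab else "<unk>" for w in tokens])   # ['a', 'a', 'a', '<unk>', '<unk>', '<unk>']
```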


Other Datasets


Bio-1kb  |  Bio-1kb-Hic.  Although it gives the wrong citation (Belkin and Niyogi, 2001, which long predates the Hi-C method), Adaptive Edge Features Guided Graph Attention Networks gives a good description of this dataset:

  • “One advantage of our EGAT model is to utilize the real-valued edge weights as features rather than simple binary indicators. To test the effectiveness of edge weights, we conduct experiments on a biological network: Bio1kb-Hic (Belkin and Niyogi 2001), which is a chromatin interaction network collected by chromosome conformation capture (3C) based technologies with 1kb resolution. The nodes in the network are genetic regions, and edges are strengths of interactions between two genetic regions. The dataset contains 7,023 nodes and 3,087,019 edges. The nodes are labeled with 18 categories, which indicate the genetic functions of genetic regions. There are no attributes associated with the nodes. We split the nodes into 3 subsets with sizes 75%, 15% and 15% for training, validation and testing, respectively. Note that the genetic interactions measured by 3C-based technologies are very noisy. Therefore, the Bio1kb-Hic dataset is highly dense and noisy. To test the noise-sensitivity of the algorithms, we generate a denoised version of the dataset by setting edge weights less than 5 to be 0. In the denoised version, the number of edges is reduced from 3,087,019 to 239,543 – which is much less dense.”

  • Datasets:

    “We applied NE to Hi-C [chromosome conformation capture] interaction networks. Hi-C is a 3C-based technology that allows measurement of pairwise chromatin interaction frequencies within a cell population. Hi-C read data can be thought of as a network where genomic regions are nodes and the normalized read counts mapped to two bins are weighted edges. Visual inspection of the Hi-C contact matrix before and after the Hi-C network is denoised using NE reveals an enhancement of edges within each community and sharper boundaries between communities (figure below). This improvement is particularly clear for the 5kb resolution data, where communities that were visually undetectable in the raw data become clear after denoising with NE.”
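The denoised Bio1kb-Hic variant described earlier is produced by simply zeroing edge weights below 5. A minimal sketch of that thresholding on a sparse placeholder matrix (the real contact matrix is 7,023 × 7,023 with roughly 3.1M weighted edges):

```python
# Minimal sketch of the denoising step described above for Bio1kb-Hic: zero out
# edge weights below 5, which collapses the dense, noisy interaction network to
# a much sparser one. A random sparse matrix stands in for the real contact matrix.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
W = sparse.random(7023, 7023, density=0.01, random_state=0,
                  data_rvs=lambda n: rng.integers(1, 20, n)).tocsr()   # placeholder weights in [1, 19]
W_denoised = W.copy()
W_denoised.data[W_denoised.data < 5] = 0
W_denoised.eliminate_zeros()
print(f"edges before: {W.nnz:,}   after thresholding at 5: {W_denoised.nnz:,}")
```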