Dataset Descriptions

Last modified: 2019-06-14


Bioinformatics Datasets



  • MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds with 7 discrete labels. See also Classification Statistics on the MUTAG, ENZYMES Datasets

  • PROTEINS is a dataset where nodes are secondary structure elements (SSEs) and there is an edge between two nodes if they are neighbors in the amino-acid sequence or in 3D space. It has 3 discrete labels, representing helix, sheet or turn.

  • PTC is a dataset of 344 chemical compounds that reports carcinogenicity for male and female rats; it has 19 discrete labels.

  • NCI1 is a dataset made publicly available by the National Cancer Institute (NCI): a balanced subset of chemical compounds screened for the ability to suppress or inhibit the growth of a panel of human tumor cell lines, with 37 discrete labels.

  • Reference, datasets: Benchmark Data Sets for Graph Kernels (2016)

  • Used in: How Powerful are Graph Neural Networks?
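These graph-classification benchmarks are typically distributed as flat text files: a dataset-wide edge list plus per-node graph indicators and per-graph class labels (the layout used by the Benchmark Data Sets for Graph Kernels collection referenced above). A sketch of the grouping step, run on toy data rather than the real files:

```python
from collections import defaultdict

def load_graphs(edge_pairs, graph_indicator, graph_labels):
    """Split a dataset-wide edge list into one edge list per graph.

    edge_pairs      -- [(u, v), ...] with 1-indexed, dataset-wide node ids
    graph_indicator -- graph_indicator[i-1] is the graph id of node i
    graph_labels    -- graph_labels[g-1] is the class label of graph g
    """
    graphs = defaultdict(list)
    for u, v in edge_pairs:
        g = graph_indicator[u - 1]
        assert graph_indicator[v - 1] == g, "an edge must stay within one graph"
        graphs[g].append((u, v))
    return {g: {"edges": e, "label": graph_labels[g - 1]}
            for g, e in graphs.items()}

# Toy example: two graphs -- a triangle (class 1) and a single edge (class 0).
edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
indicator = [1, 1, 1, 2, 2]        # nodes 1-3 in graph 1, nodes 4-5 in graph 2
labels = [1, 0]
data = load_graphs(edges, indicator, labels)
print(len(data[1]["edges"]), data[2]["label"])   # -> 3 0
```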

Protein-Protein Interaction Databases

PPI (Protein-Protein Interaction) is a Homo sapiens PPI network where each label corresponds to a biological state.

  • Source: BioGRID. BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Our current index is version 3.5.166 and searches 67,477 publications for 1,623,645 protein and genetic interactions, 28,093 chemical associations and 726,378 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats.

  • Yoshua Bengio and colleagues describe this dataset in Deep Graph Infomax (Feb 2018): “We make use of a protein-protein interaction (PPI) dataset that consists of graphs corresponding to different human tissues (Zitnik & Leskovec, 2017). The dataset contains 20 graphs for training, 2 for validation and 2 for testing. Critically, testing graphs remain completely unobserved during training. To construct the graphs, we used the preprocessed data provided by Hamilton et al. (2017). The average number of nodes per graph is 2,372. Each node has 50 features that are composed of positional gene sets, motif gene sets and immunological signatures. There are 121 labels for each node set from gene ontology, collected from the Molecular Signatures Database (Subramanian et al., 2005), and a node can possess several labels simultaneously.”



  • Grover and Leskovec, node2vec: Scalable Feature Learning for Networks (Jul 2016) used a subgraph of the PPI network for Homo sapiens: the graph induced by the nodes for which they could obtain labels from the hallmark gene sets, which represent biological states. The subnetwork had 3,890 nodes, 76,584 edges, and 50 different labels.

    Celikkanat and Malliaros [TNE: A Latent Model for Representation Learning on Networks (Oct 2018)] cite this subnetwork, but report (their Table 1) 3,890 nodes, 38,739 edges and 50 clusters (i.e., classes):



  • Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets (Jun 2019) [code]  “… we introduce the Biomedical Language Understanding Evaluation (BLUE) benchmark to facilitate research in the development of pre-training language representations in the biomedicine domain. The benchmark consists of five tasks with ten datasets that cover both biomedical and clinical texts with different dataset sizes and difficulties. We also evaluate several baselines based on BERT and ELMo and find that the BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes achieves the best results. We make the datasets, pre-trained models, and codes publicly available here.”
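Note that the PPI node-classification task described above is multi-label: a node can carry several of the 121 labels at once, so results are usually reported as micro-averaged $\small F_1$, with true/false positives and negatives pooled over every (node, label) pair. A minimal pure-Python sketch on toy label vectors:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-label predictions.

    y_true, y_pred -- lists of equal-length 0/1 label vectors.
    Counts are pooled over every (example, label) pair before computing F1.
    """
    tp = fp = fn = 0
    for t_vec, p_vec in zip(y_true, y_pred):
        for t, p in zip(t_vec, p_vec):
            tp += t and p
            fp += (not t) and p
            fn += t and (not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Two toy nodes, three labels each.
truth = [[1, 0, 1], [0, 1, 0]]
preds = [[1, 0, 0], [0, 1, 1]]
print(round(micro_f1(truth, preds), 3))   # -> 0.667
```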



Citation Datasets



CiteSeer for Entity Resolution.  The CiteSeer for Entity Resolution dataset contains 1,504 machine learning documents with 2,892 author references to 165 author entities. For this dataset, the only attribute information available is the author name: the full last name is always given, while in some cases the full first and middle names are given and in others only initials.

CiteSeer for Document Classification.  The CiteSeer for Document Classification dataset consists of 3,312 scientific publications classified into one of six classes. The citation network consists of 4,732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3,703 unique words. The README file in the dataset provides more details.
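The 0/1-valued word vectors used by CiteSeer (and Cora, below) are straightforward to construct; a minimal sketch with a toy four-word dictionary:

```python
def binary_bow(document_tokens, vocabulary):
    """0/1 word vector: component i is 1 iff vocabulary[i] occurs in the document."""
    present = set(document_tokens)
    return [1 if word in present else 0 for word in vocabulary]

vocab = ["graph", "kernel", "neural", "bayesian"]
print(binary_bow("neural networks on a graph".split(), vocab))  # -> [1, 0, 1, 0]
```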

The Cora dataset [GitHub] consists of 2,708 scientific publications classified into one of seven classes; it was created to support research in relational machine learning. The citation network consists of 5,429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1,433 unique words.

  • Schema:



  • Most approaches (papers) report on the small subset of the Cora dataset (above). The original dataset, described by Andrew McCallum et al. in Automating the Construction of Internet Portals with Machine Learning (2000) contained >50,000 computer science research papers. These data are available on McCallum’s data page, and here (csv file).

    Bojchevski and Günnemann [Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking (Feb 2018)] additionally extracted the entire network from the original data, naming the two datasets:

    • CORA ($\small \mathcal{N} = 19,793$, $\small \mathcal{E} = 65,311$, $\small \mathcal{D} = 8,710$, $\small \mathcal{K} = 70$) and
    • CORA-ML ($\small \mathcal{N} = 2,995$, $\small \mathcal{E} = 8,416$, $\small \mathcal{D} = 2,879$, $\small \mathcal{K} = 7$)

    where $\small \mathcal{N}$ is the number of nodes, $\small \mathcal{E}$ the number of edges, $\small \mathcal{D}$ the dimensionality of each node’s attribute vector, and $\small \mathcal{K}$ the number of classes. Those datasets are available here: project page  |  GitHub.



  • That CORA dataset was employed in [MotifNet: A Motif-Based Graph Convolutional Network for Directed Graphs], which further described it:

    “We tested our approach on the directed CORA citation network. The vertices of the CORA graph represent 19,793 scientific papers, and directed edges of the form $\small (i,j)$ represent citation of paper $\small j$ in paper $\small i$. The content of each paper is represented by a vector of 8,710 numerical features (term frequency-inverse document frequency of various words that appear in the corpus), to which we applied PCA taking the first 130 components. The task is to classify the papers into one of the 70 different categories.”
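The preprocessing MotifNet describes (TF-IDF term weighting followed by PCA) can be sketched with NumPy. This is an illustrative reimplementation, not the authors' code, and the toy count matrix below stands in for the 19,793 × 8,710 Cora feature matrix:

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weighting of a (documents x vocabulary) count matrix."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                       # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))   # inverse document frequency
    return tf * idf

def pca(x, k):
    """Project rows of x onto their first k principal components via SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
counts = rng.integers(0, 4, size=(6, 20))   # 6 toy "papers", 20-word vocabulary
features = pca(tfidf(counts), k=3)          # MotifNet kept the first 130 components
print(features.shape)                       # -> (6, 3)
```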

PubMed Diabetes.  The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database, pertaining to diabetes, classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

  • Dataset (at LINQS): LINQS Statistical Relational Learning Group: Datasets
  • Paper: Query-driven Active Surveying for Collective Classification (2012):

    “In these experiments, we use four real-world networks: Cora, CiteSeer, Wikipedia, and PubMed. … Finally, the PubMed citation network is a set of articles related to diabetes from the PubMed database. Node attributes are TF/IDF-weight word frequencies and the labels specify the type of diabetes addressed in the publication.”



GLUE Dataset

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Apr 2018; updated Sep 2018) [FAQ  |  Leaderboard]

“For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.”

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have already existed for a number of years, but the purpose of GLUE is to (1) distribute these datasets with canonical Train, Dev, and Test splits, and (2) set up an evaluation server to mitigate issues with evaluation inconsistencies and Test set overfitting. GLUE does not distribute labels for the Test set and users must upload their predictions to the GLUE server for evaluation, with limits on the number of submissions.

  • CoLA. The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not.

  • MNLI. Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task. Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

  • MRPC. Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

  • QNLI. Question Natural Language Inference is a version of the Stanford Question Answering Dataset which has been converted to a binary classification task. The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer.

  • QQP. Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent.

  • RTE. Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data.

  • SST-2. The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment.

  • STS-B. The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources. They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

  • WNLI. Winograd NLI is a small natural language inference dataset. The GLUE webpage notes that there are issues with the construction of this dataset, and every trained system that’s been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class.

  • GLUE Leaderboard. The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:

    • A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
    • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
    • A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.


    GLUE leaderboard scores: 2019-02-28.

  • SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (May 2019).  “The GLUE benchmark, introduced one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently come close to the level of non-expert humans, suggesting limited headroom for further research. This paper recaps lessons learned from the GLUE benchmark and presents SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. SuperGLUE is available here.”
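GLUE scores each task with a task-appropriate metric; CoLA, for example, is scored with the Matthews correlation coefficient (MCC), which stays informative under the heavy class imbalance of acceptability judgments. A minimal pure-Python sketch, on toy labels:

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels.

    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect);
    returns 0.0 when any marginal count is zero, as the GLUE convention does.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]), 3))   # -> 0.577
```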


Leaderboards

It can be difficult to keep up with the pace of development in NLP/ML. Fortunately, several leaderboards track progress.

  • AI2 Reasoning Challenge (ARC)
  • CoQA
  • GLUE
  • Natural Language Processing [various: click on topic for Leaderboard; maintained by Sebastian Ruder]
  • Natural Questions [Google Research]
  • SQuAD

  • CommonsenseQA, a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 9,500 questions with one correct answer and two distractor answers.

  • Second Conversational Intelligence Challenge (ConvAI2):  paper  |  project  |  leaderboard. “There are currently few datasets appropriate for training and evaluating models for non-goal-oriented dialogue systems (chatbots); and equally problematic, there is currently no standard procedure for evaluating such models beyond the classic Turing test. The aim of our competition is therefore to establish a concrete scenario for testing chatbots that aim to engage humans, and become a standard evaluation tool in order to make such systems directly comparable. …”

  • DAWNBench.  DAWNBench is a benchmark suite for end-to-end deep learning training and inference. Computation time and cost are critical resources in building deep models, yet many existing benchmarks focus solely on model accuracy. DAWNBench provides a reference set of common deep learning workloads for quantifying training time, training cost, inference latency, and inference cost across different optimization strategies, model architectures, software frameworks, clouds, and hardware.

    • [2019-03-04] The first iteration of DAWNBench is over, and the competition results and key takeaways have been finalized. However, we are still curious to see how well people can do on this benchmark and are now accepting rolling submissions. The original results before the April 20, 2018 deadline are archived for reference. For a more comprehensive benchmark, please consider submitting to the updated MLPerf benchmark.

    • Question Answering on SQuAD.  Objective: time taken to train a question answering model to a $\small F_1$ score of 0.75 or greater on the SQuAD development dataset.

    • FastFusionNet: New State-of-the-Art for DAWNBench SQuAD (Mar 2019) [code].  “In this technical report, we introduce FastFusionNet, an efficient variant of FusionNet. FusionNet is a high performing reading comprehension architecture, which was designed primarily for maximum retrieval accuracy with less regard towards computational requirements. For FastFusionNets we remove the expensive CoVe layers and substitute the BiLSTMs with far more efficient SRU layers. The resulting architecture obtains state-of-the-art results on DAWNBench while achieving the lowest training and inference time on SQuAD to-date. The code is available at GitHub.”



  • Language Modelling on Hutter Prize

  • QAngaroo Leaderboards

  • Salesforce WikiSQL Challenge. WikiSQL is a large crowd-sourced dataset, based on Wikipedia, for developing natural language interfaces to relational databases; the challenge requires systems to answer natural language questions from the dataset.

  • Salesforce decaNLP Challenge. The Natural Language Decathlon is a multitask challenge that spans ten tasks: question answering (SQuAD), machine translation (IWSLT), summarization (CNN/DM), natural language inference (MNLI), sentiment analysis (SST), semantic role labeling (QA‑SRL), zero-shot relation extraction (QA‑ZRE), goal-oriented dialogue (WOZ), semantic parsing (WikiSQL), and commonsense reasoning (MWSC). Each task is cast as question answering, which makes it possible to use our new Multitask Question Answering Network (MQAN). This model jointly learns all tasks in decaNLP without any task-specific modules or parameters in the multitask setting.

  • Stanford SNLI Leaderboard


Knowledge Graph Datasets

WN18 is a subset of WordNet (a large lexical knowledge graph where entities are synonyms which express distinct concepts and relations) which consists of 18 relations and 40,943 entities (“generic facts”). Most of the 151,442 triples consist of hyponym and hypernym relations and, for such a reason, WN18 tends to follow a strictly hierarchical structure.

  • WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. A synset is a set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

  • “A popular relation prediction dataset for WordNet is the subset curated as WN18, containing 18 relations for about 41,000 synsets extracted from WordNet 3.0. It has been noted that this dataset suffers from considerable leakage: edges from reciprocal relations such as hypernym/hyponym appear in one direction in the training set and in the opposite direction in dev/test. This allows trivial rule-based baselines to achieve high performance.

    To alleviate this concern, Dettmers et al. (Jul 2018) released the WN18RR set, removing seven relations altogether. However, even this dataset retains four symmetric relation types: ‘also see’, ‘derivationally related form’, ‘similar to’, and ‘verb group’. These symmetric relations can be exploited by defaulting to a simple rule-based predictor.” [Source: Section 4.1 in Predicting Semantic Relations using Global Graph Properties; references therein.]
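The leakage described above can be made concrete: a trivial baseline marks a test triple as true whenever its reverse, under a known reciprocal or symmetric relation, appears in the training set. A sketch with toy triples (the relation names here are illustrative, not the actual WN18 vocabulary):

```python
from collections import defaultdict

def build_reverse_index(train_triples):
    """Map (head, tail) entity pairs to the relations seen between them in training."""
    seen = defaultdict(set)
    for h, r, t in train_triples:
        seen[(h, t)].add(r)
    return seen

def leaky_baseline(test_triple, seen, inverse_of):
    """Predict a test triple as true iff its reverse appears in training.

    inverse_of maps a relation to its known reciprocal (e.g. the
    hypernym/hyponym pair in WN18); a symmetric relation maps to itself.
    """
    h, r, t = test_triple
    return inverse_of.get(r) in seen.get((t, h), set())

train = [("dog", "hyponym", "animal"), ("cat", "similar_to", "feline")]
seen = build_reverse_index(train)
inverse_of = {"hypernym": "hyponym", "hyponym": "hypernym",
              "similar_to": "similar_to"}
# The held-out hypernym edge is recoverable from the training hyponym edge:
print(leaky_baseline(("animal", "hypernym", "dog"), seen, inverse_of))  # -> True
```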

WN18RR corrects flaws in WN18. As Dettmers et al. put it, WN18RR “reclaims WN18 as a dataset, which cannot easily be completed using a single rule, but requires modeling of the complete knowledge graph.” WN18RR contains 93,003 triples with 40,943 entities and 11 relations. The same authors note: “For future research, we recommend against using FB15k and WN18 and instead recommend FB15k-237, WN18RR, and YAGO3-10.”

FB15k is a subset of Freebase which contains 1,345 relations among 14,951 entities. The training set contains 483,142 triples, the validation set 50,000 triples, and the test set 59,071 triples. 454 rules were created for FB15k. A large fraction of content in this knowledge graph describes facts about movies, actors, awards, sports, and sport teams.

FB15k-237  [see also], which corrects errors in FB15k, contains about 14,541 entities with 237 different relations. This dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. The knowledge base triples are a subset of the FB15k set. The textual mentions are derived from 200 million sentences from the ClueWeb12 corpus coupled with FACC1 Freebase entity mention annotations.
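FB15k-style splits are commonly distributed as tab-separated (head, relation, tail) lines, and embedding models then work over integer ids. A minimal sketch of that indexing step (the Freebase-style identifiers below are made up for illustration):

```python
def index_triples(lines):
    """Turn tab-separated (head, relation, tail) lines into integer-id triples.

    Entities and relations receive ids in order of first appearance,
    the form most embedding models (TransE, ComplEx, ...) consume.
    """
    ent, rel, triples = {}, {}, []
    for line in lines:
        h, r, t = line.rstrip("\n").split("\t")
        triples.append((ent.setdefault(h, len(ent)),
                        rel.setdefault(r, len(rel)),
                        ent.setdefault(t, len(ent))))
    return ent, rel, triples

lines = ["/m/abc\t/film/actor\t/m/def",
         "/m/def\t/award/winner\t/m/xyz"]
ent, rel, triples = index_triples(lines)
print(len(ent), len(rel), triples[0])   # -> 3 2 (0, 0, 1)
```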

YAGO (YAGO3) is a large semantic knowledge base, derived from Wikipedia, WordNet, WikiData, GeoNames, and other data sources. Currently, YAGO knows more than 17 million entities (like persons, organizations, cities, etc.) and contains more than 150 million facts about these entities.

YAGO is special in several ways:

  • The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value. [Not every version of YAGO is manually evaluated. Most notably, the version generated by this code may not be the one that we evaluated! Check the versions on the YAGO download page.]

  • YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.

  • YAGO is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.

  • In addition to taxonomy, YAGO has thematic domains such as “music” or “science” from WordNet Domains.

  • YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.

  • “YAGO is a lightweight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the $\small Is-A$ hierarchy as well as non-taxonomic relations between entities (such as $\small hasWonPrize$). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper (YAGO 2007). The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships – and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. …”

  • “YAGO is a large knowledge base that is built automatically from Wikipedia, WordNet and GeoNames. The project combines information from Wikipedia in 10 different languages, thus giving the knowledge a multilingual dimension. It also attaches spatial and temporal information to many facts, and thus allows the user to query the data over space and time. YAGO focuses on extraction quality and achieves a manually evaluated precision of 95%. In this paper (Yago 2016), we explain from a general perspective how YAGO is built from its sources, how its quality is evaluated, how a user can access it, and how other projects utilize it.”


YAGO3-10 (a subset of YAGO3) consists of entities which have a minimum of 10 relations each. It has 123,182 entities and 37 relations. Most of the triples deal with descriptive attributes of people, such as citizenship, gender, and profession.

YAGO37, created by the RUGE authors, is extracted from the core facts of YAGO3 and consists of 37 relations among 123,189 entities. The training set contains 989,132 triples, the validation set 50,000 triples, and the test set 50,000 triples. 16 rules were created for YAGO37. “All triples are unique and we made sure that all entities/relations appearing in the validation or test sets were occurring in the training set.”

Machine Reading Comprehension Datasets




CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 9,500 questions with one correct answer and two distractor answers.



CoQA dataset, leaderboard.  CoQA (Stanford University’s Conversational Question Answering Challenge) is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced as “coca.” CoQA contains 127,000+ questions with answers collected from 8000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers.

The unique features of CoQA include:

  • the questions are conversational;
  • the answers can be free-form text;
  • each answer also comes with an evidence subsequence highlighted in the passage; and,
  • the passages are collected from seven diverse domains.

CoQA has a lot of challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension [dataset]

  • “We present DREAM, the first dialogue-based multiple-choice reading comprehension dataset. Collected from English-as-a-foreign-language examinations designed by human experts to evaluate the comprehension level of Chinese learners of English, our dataset contains 10,197 multiple-choice questions for 6,444 dialogues.

    “In contrast to existing reading comprehension datasets, DREAM is the first to focus on in-depth multi-turn multi-party dialogue understanding. DREAM is likely to present significant challenges for existing reading comprehension systems: 84% of answers are non-extractive, 85% of questions require reasoning beyond a single sentence, and 34% of questions also involve commonsense knowledge.

    “We apply several popular neural reading comprehension models that primarily exploit surface information within the text and find them to, at best, just barely outperform a rule-based approach. …”

DROP: Discrete Reasoning Over Paragraphs.  “With system performance on existing reading comprehension benchmarks nearing or surpassing human performance, we need a new, hard dataset that improves systems’ capabilities to actually read paragraphs of text. DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.”

  • Paper:  DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (Mar 2019).  “Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% $\small F_1$ on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% $\small F_1$.”

  • project  |  demo
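The $\small F_1$ figures quoted for DROP (and for SQuAD) are token-overlap scores between predicted and gold answer strings. A simplified sketch of that metric (whitespace tokenization only; the official evaluation scripts additionally normalize case, punctuation, and articles, and take the maximum over multiple gold answers):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Bag-of-tokens F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    # Multiset intersection: shared tokens, counted with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("touchdown passes in 2004", "three touchdown passes"), 2))  # -> 0.57
```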



DuoRC. In April 2018 IBM Research introduced a new dataset for reading comprehension (DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension)  [project  |  data]. DuoRC is a large scale reading comprehension (RC) dataset of 186K human-generated QA pairs created from 7,680 pairs of parallel movie plots taken from Wikipedia and IMDb. By design, DuoRC ensures very little or no lexical overlap between the questions created from one version and segments containing answers in the other version.

HotpotQA / leaderboard.  HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It is collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

KBMRC dataset. Machine reading comprehension (MRC) requires reasoning about both the knowledge involved in a document and knowledge about the world. However, existing datasets are typically dominated by questions that can be well solved by context matching, which fail to test this capability. Microsoft Research Asia recently published Knowledge Based Machine Reading Comprehension (Sep 2018), which addressed knowledge-based MRC, and built a new [unnamed] dataset consisting of 40,047 question-answer pairs. The annotation of this dataset was designed so that successfully answering the questions required understanding as well as the knowledge involved in a document.

MS-MARCO. Microsoft Research recently published S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension, a novel approach to machine reading comprehension for the MS-MARCO dataset that aimed to answer a question from multiple passages via an extraction-then-synthesis framework to synthesize answers from extraction results. Unlike the SQuAD dataset that aims to answer a question with exact text spans in a passage, the MS-MARCO dataset defines the task as answering a question from multiple passages and the words in the answer are not necessary in the passages.



  • MS MARCO V2 Leaderboard. First released at NeurIPS 2016, the MS MARCO dataset was an ambitious, real-world machine reading comprehension dataset. Based on feedback from the community, we designed and released the V2 dataset and its related challenges.

MultiRC, from the University of Pennsylvania, is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. The goal of this dataset is to encourage the research community to explore approaches that can do more than sophisticated lexical-level matching. See MultiRC: Reading Comprehension over Multiple Sentences;  [project  |  code].

NarrativeQA requires understanding of an underlying narrative, asking the reader to answer questions about stories by reading entire books or movie scripts. See The NarrativeQA Reading Comprehension Challenge  [GitHub].



QAngaroo focuses on reading comprehension that requires the gathering of several pieces of information via multiple steps of inference. “We define a novel RC [reading comprehension] task in which a model should learn to answer queries by combining evidence stated across documents. We introduce a methodology to induce datasets for this task and derive two datasets.”



  • Paper: Constructing Datasets for Multi-hop Reading Comprehension Across Documents (Oct 2017; updated Jun 2018)

    “Most reading comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence – effectively performing multi-hop (alias multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information, as providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 42.9% compared to human performance at 74.0% – leaving ample room for improvement.”

  • Project: “We have created two new Reading Comprehension datasets focussing on multi-hop (alias multi-step) inference. Several pieces of information often jointly imply another fact. In multi-hop inference, a new fact is derived by combining facts via a chain of multiple steps. Our aim is to build Reading Comprehension methods that perform multi-hop inference on text, where individual facts are spread out across different documents. The two QAngaroo datasets provide a training and evaluation resource for such methods.”

  • Task Overview: “In our task, the goal is to answer text understanding queries by combining multiple facts that are spread across different documents. In each sample, a query is given about a collection of documents. The goal is to identify the correct answer among a set of given type-consistent answer candidates. The candidates – including the correct answer – are mentioned in the documents. We also provide a masked version of both datasets, where candidates are replaced by random placeholder tokens. More details on the rationale behind this can be found in the paper.” Both datasets draw upon existing Knowledge Bases, Wikidata and Drugbank as ground truth …

  • QAngaroo Leaderboards

  • Datasets:

    • WikiHop. The first of the two datasets is open-domain and based on Wikipedia articles; the goal is to recover Wikidata information by hopping through documents. WikiHop uses sets of Wikipedia articles where answers to queries about specific properties of an entity cannot be located in the entity’s article. In the example given in the paper, a chain of relevant documents leads to the correct answer for the query.



    • MedHop. With the same format as WikiHop, this dataset is based on research paper abstracts from PubMed, and the queries are about interactions between pairs of drugs. The correct answer has to be inferred by combining information from a chain of reactions of drugs and proteins. In MedHop the goal is to establish drug-drug interactions based on scientific findings about drugs and proteins and their interactions, found across multiple Medline abstracts.


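The QAngaroo files are JSON arrays of samples; below is a minimal reader sketch, assuming the field names of the published release (`query`, `candidates`, `supports`, `answer`). The sample content itself is invented for illustration.

```python
import json

# A WikiHop-style sample (abbreviated). The field names follow the
# published QAngaroo JSON release; the text content is a made-up toy.
sample = {
    "id": "WH_train_0",
    "query": "country_of_citizenship juan rossell",
    "candidates": ["cuba", "spain"],
    "supports": [
        "Juan Rossell is a Cuban volleyball player ...",
        "Cuba is a country comprising the island of Cuba ...",
    ],
    "answer": "cuba",
}

def load_samples(json_text):
    """Parse a QAngaroo-style JSON array, checking that each answer
    is among the type-consistent candidates."""
    samples = json.loads(json_text)
    for s in samples:
        assert s["answer"] in s["candidates"], s["id"]
    return samples

samples = load_samples(json.dumps([sample]))
```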

RACE is a dataset for benchmark evaluation of methods in the reading comprehension task. RACE consists of nearly 28,000 passages and nearly 100,000 questions generated by human experts (English instructors), and covers a variety of topics carefully designed to evaluate students’ ability in understanding and reasoning. In particular, the proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state of the art models (43%) and ceiling human performance (95%).



SQuAD. The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD2.0 is a challenging natural language understanding task for existing models, and we release SQuAD2.0 to the community as the successor to SQuAD1.1. We are optimistic that this new dataset will encourage the development of reading comprehension systems that know what they don’t know.
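The SQuAD2.0 JSON nests articles, paragraphs, and question-answer items, with an `is_impossible` flag marking the unanswerable questions. A minimal sketch of walking that structure (the record below is a toy example, not real SQuAD data):

```python
# Toy SQuAD2.0-style record; the nesting (data -> paragraphs -> qas)
# and the is_impossible flag follow the official v2.0 JSON schema.
squad = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "SQuAD2.0 adds unanswerable questions.",
            "qas": [
                {"id": "q1", "question": "What does SQuAD2.0 add?",
                 "is_impossible": False,
                 "answers": [{"text": "unanswerable questions",
                              "answer_start": 14}]},
                {"id": "q2", "question": "Who wrote the passage?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

def count_questions(dataset):
    """Return (answerable, unanswerable) counts over a SQuAD2.0 dict."""
    answerable = unanswerable = 0
    for article in dataset["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable
```

A system that always answers would get every `is_impossible` item wrong, which is exactly the behavior SQuAD2.0 was built to penalize.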

  • SQuAD2.0 Leaderboard. BERT models massively dominate this challenge, occupying virtually all of the top ranked leaderboard positions, including (2019-03-05) the 20 top positions (sharing some of those ranks with other architectures).

  • Adversarial Examples for Evaluating Reading Comprehension Systems (Stanford University: Jul 2017) [code  |  worksheets  |  discussion]:  “… we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of 75% $\small F_1$ score to 36%; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to 7%. …”
    “… we sought to understand the factors that influence whether the model will be robust to adversarial perturbations on a particular example. First, we found that models do well when the question has an exact $\small n$-gram match with the original paragraph. Figure 3 plots the fraction of examples for which an $\small n$-gram in the question appears verbatim in the original passage; this is much higher for model successes. For example, 41.5% of *BiDAF Ensemble* successes had a 4-gram in common with the original paragraph, compared to only 21.0% of model failures.”
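The lexical-overlap analysis described above is easy to approximate; a small sketch, using whitespace tokenization as a stand-in for the authors' preprocessing:

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def has_ngram_match(question, passage, n=4):
    """True if any question n-gram appears verbatim in the passage,
    a rough proxy for the 4-gram overlap statistic quoted above."""
    q = question.lower().split()
    p = passage.lower().split()
    return bool(ngrams(q, n) & ngrams(p, n))
```

Partitioning a model's successes and failures by this predicate reproduces the kind of comparison shown in the paper's Figure 3.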

TACRED. The TACRED relation extraction dataset was introduced by Zhang et al. in their paper, Position-aware Self-attention with Relative Positional Encodings for Slot Filling  [code]. See also Position-aware Attention and Supervised Data Improve Slot Filling. TACRED is a large (106,264 examples) supervised relation extraction dataset, obtained via crowdsourcing and targeted towards TAC KBP relations.

Winograd Schema Challenge.

The Winograd Schema Challenge (WSC; Wikipedia description) poses a set of multiple-choice questions that have a particular form. Two examples follow; the second, from which the WSC gets its name, is due to Terry Winograd.

    The trophy would not fit in the brown suitcase because it was too big (small).  What was too big (small)?
    Answer 0: the trophy
    Answer 1: the suitcase

    The town councilors refused to give the demonstrators a permit because they feared (advocated) violence.  Who feared (advocated) violence?
    Answer 0: the town councilors
    Answer 1: the demonstrators

The answers to the questions (in the above examples, 0 for each sentence if the first of the alternative words is used; 1 if the parenthesized word is used) are expected to be obvious to a layperson. A human answering these questions correctly would likely use his knowledge about the typical size of objects and his ability to do spatial reasoning to solve the first example, and his knowledge about how political demonstrations unfold and his ability to do interpersonal reasoning to solve the second. Due to the wide variety of commonsense knowledge and commonsense reasoning that would presumably be used by humans to solve Winograd Schema problems, it was proposed during Commonsense-2013 that the Winograd Schema Challenge could be a promising method for tracking progress in automating commonsense reasoning. …

The following paper used the Winograd Schema Challenge dataset.

Google Brain’s A Simple Method for Commonsense Reasoning (Jun 2018) [code  |  slides  |  local copy  |  discussion  |  discussion] presented a simple method for commonsense reasoning with neural networks, using unsupervised learning. Key to the method was the use of an array of large RNN language models that operated at word or character level, trained on a massive amount of unlabeled data, to score multiple choice questions posed by commonsense reasoning tests.
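The substitution-and-scoring scheme can be sketched as follows: substitute each candidate for the pronoun, score the resulting sentence with a language model, and pick the higher-scoring candidate. A real system uses large pretrained LMs; the toy co-occurrence scorer below is purely illustrative and is not the authors' model.

```python
# Hand-picked toy co-occurrence weights, standing in for an LM's
# judgment of how plausible each word pairing is.
TOY_COOCCUR = {("trophy", "big"): 5, ("suitcase", "big"): 1,
               ("trophy", "small"): 1, ("suitcase", "small"): 5}

def toy_lm_score(sentence):
    """Illustrative stand-in for a language-model score: sum the
    co-occurrence weights over all word pairs in the sentence."""
    words = sentence.lower().replace(".", "").split()
    return sum(TOY_COOCCUR.get((a, b), 0) for a in words for b in words)

def resolve(sentence_template, pronoun, candidates, score=toy_lm_score):
    """Substitute each candidate for the pronoun and return the one
    whose completed sentence the scorer prefers."""
    scored = [(score(sentence_template.replace(pronoun, c)), c)
              for c in candidates]
    return max(scored)[1]

answer = resolve(
    "The trophy would not fit in the brown suitcase because PRONOUN was too big.",
    "PRONOUN", ["the trophy", "the suitcase"])
```

Swapping “big” for “small” in the template flips the preferred candidate, mirroring the special-word behavior discussed below.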



  • “A unique feature of Winograd Schema questions is the presence of a special word that decides the correct reference choice. In the above example, ‘big’ is this special word. When ‘big’ is replaced by ‘small’, the correct answer switches to ‘the suitcase’. Although detecting this feature is not part of the challenge, further analysis shows that our system successfully discovers this special word to make its decisions in many cases, indicating a good grasp of commonsense knowledge.”

  • This paper was subsequently savaged in an October 2018 commentary, A Simple Machine Learning Method for Commonsense Reasoning? A Short Commentary on Trinh & Le (2018):

    “A Concluding Remark. The data-driven approach in AI has without a doubt gained considerable notoriety in recent years, and there are a multitude of reasons that led to this fact. While the data-driven approach can provide some useful techniques for practical problems that require some level of natural language processing (text classification and filtering, search, etc.), extrapolating the relative success of this approach into problems related to commonsense reasoning, the kind that is needed in true language understanding, is not only misguided, but may also be harmful, as this might seriously hinder the field, scientifically and technologically.”

Machine Reasoning Datasets

AI2 Reasoning Challenge (ARC). The Allen Institute for Artificial Intelligence (AI2) presented a question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. The ARC dataset contains 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. See Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge;  [project  |  code].

  • “Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods. To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, and whose Challenge partition is hard for retrieval and co-occurrence methods. We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD. Progress on ARC would thus be an impressive achievement, given its design, and be a significant step forward for the community.”

  • ARC was recently used in Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Scientific Question Answering (Aug 2018) by authors at UC San Diego and Microsoft AI Research. Existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. The authors proposed a retriever-reader model that learns, via self-attention layers, to attend on essential terms during question answering: an essential-term-aware “retriever” first identifies the most important words in a question, then reformulates the queries and searches for related evidence, while an enhanced “reader” distinguishes between essential terms and distracting words to predict the answer. On the ARC dataset their model outperformed the existing state of the art (e.g., BiDAF) by 8.1%.

bAbI  [code] is a set of 20 question answering tasks for testing text understanding and reasoning. The dataset is composed of a set of contexts, with multiple question-answer pairs available based on the contexts.

  • “bAbI is a synthetic reading comprehension dataset, created by Facebook AI researchers in 2015. The term synthetic data refers to data that is not extracted from a book or from the internet, but is generated by using a few rules that simulate natural language. This characteristic of bAbI places the weight of the task on the reasoning module rather than the understanding module. Question Answering data sets provide synthetic tasks for the goal of helping to develop learning algorithms for understanding and reasoning.” [Source.]



bAbI-10k English (Weston et al., 2015a) is a synthetic dataset which features 20 different tasks. Each example is composed of a set of facts, a question, the answer, and the supporting facts that lead to the answer. The dataset comes in two sizes, referring to the number of training examples each task has: bAbI-1k and bAbI-10k. See Weston et al. (Facebook AI Research), Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks (Dec 2015).
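A bAbI story is a block of numbered lines, with question lines carrying tab-separated question, answer, and supporting-fact indices; a minimal parser sketch:

```python
def parse_babi(lines):
    """Parse bAbI-format lines into (facts, question, answer, supports)
    examples. Each line starts with an index; index 1 resets the story.
    Question lines are tab-separated: question, answer, supporting ids."""
    examples, story = [], []
    for line in lines:
        idx, _, text = line.strip().partition(" ")
        if int(idx) == 1:
            story = []
        if "\t" in text:
            question, answer, supports = text.split("\t")
            supporting = [int(i) for i in supports.split()]
            examples.append((list(story), question, answer, supporting))
        else:
            story.append((int(idx), text))
    return examples

sample = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary?\tbathroom\t1",
]
examples = parse_babi(sample)
```

The supporting-fact indices (here, fact 1) are what let models be trained and evaluated on the reasoning chain, not just the final answer.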

CLEVR is a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. “We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.”



Natural Questions Dataset

Introduced in A BERT Baseline for the Natural Questions (Google Research; Jan 2019, updated Mar 2019) [code  |  Natural Questions dataset  |  Natural Questions leaderboard].

  • Natural Questions dataset [Google Research]:

    “A core goal in artificial intelligence is to build systems that can read the web, and then answer complex questions about any topic. These question-answering (QA) systems could have a big impact on the way that we access information. Furthermore, open-domain question answering is a benchmark task in the development of Artificial Intelligence, since understanding text and being able to answer questions about it is something that we generally associate with intelligence.

    “To help spur development in open-domain question answering, we have created the Natural Questions (NQ) corpus, along with a challenge website based on this data. The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions should read an entire page to find the answer, cause NQ to be a more realistic and challenging task than prior QA datasets.

    “To view some examples, please go to the visualization page. For a full description of the methodology used to create the corpus, see Natural Questions: A Benchmark for Question Answering Research.”



Natural Language Inference Datasets

Breaking NLI is a new natural language inference (NLI) dataset, described by Glockner et al. [Yoav Goldberg] in Breaking NLI Systems with Sentences that Require Simple Lexical Inferences (May 2018) [GitHub]:

  • “We create a new NLI test set that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge. The new examples are simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set is substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences.”
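The “differ by at most one word” property of the Breaking NLI examples is easy to check; a sketch, assuming simple whitespace tokenization:

```python
def differs_by_at_most_one_word(a, b):
    """True if two equal-length sentences differ in at most one token,
    mirroring how the Breaking NLI test sentences relate to SNLI
    training sentences."""
    ta, tb = a.lower().split(), b.lower().split()
    if len(ta) != len(tb):
        return False
    return sum(x != y for x, y in zip(ta, tb)) <= 1
```

That such minimal perturbations break SNLI-trained systems is the paper's central point: the models are not learning the lexical and world knowledge the single swapped word requires.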



MultiNLI corpus, described in Multi-Genre Natural Language Inference, is a newer natural language inference (NLI) corpus than SNLI. MultiNLI is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information: entailment, contradiction and neutral. The MultiNLI corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.

  • Unlike the SNLI corpus, which draws the premise sentence from image captions, MultiNLI consists of sentence pairs from ten distinct genres of both written and spoken English (e.g. fiction, government text or spoken telephone conversations). The dataset is divided into training (392,702 pairs), development (20,000 pairs) and test sets (20,000 pairs).

  • All of the genres are included in the test and development sets, but only five are included in the training set. The development and test datasets have been divided into matched and mismatched, where the former includes only sentences from the same genres as the training data, and the latter includes sentences from the remaining genres not present in the training data.

  • In addition to the training, development and test sets, MultiNLI provides a smaller annotation dataset, which contains approximately 1000 sentence pairs annotated with linguistic properties of the sentences and is split between the matched and mismatched datasets. [The annotated dataset and description of the annotations are available at multinli_1.0_]

  • This annotation dataset provides a simple way to assess what kind of sentence pairs an NLI system is able to predict correctly and where it makes errors.

  • Source for the points, above: Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture.

  • MultiNLI is modeled after SNLI. The two corpora are distributed in the same formats, and for many applications, it may be productive to treat them as a single, larger corpus.

  • MultiNLI is described in A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (Apr 2017; updated Feb 2018).



  • “The MultiNLI premise sentences are derived from ten sources [genres] of freely available text which are meant to be maximally diverse and roughly represent the full range of American English:

    • FACE-TO-FACE: transcriptions from the Charlotte Narrative and Conversation Collection of two-sided, in-person conversations that took place in the early 2000s;
    • GOVERNMENT: reports, speeches, letters, and press releases from public domain government websites;
    • LETTERS: letters from the Indiana Center for Intercultural Communication of Philanthropic Fundraising Discourse written in the late 1990s-early 2000s;
    • 9/11: the public report from the National Commission on Terrorist Attacks Upon the United States released on July 22, 2004;
    • OUP: five non-fiction works on the textile industry and child development published by the Oxford University Press;
    • SLATE: popular culture articles from the archives of Slate Magazine written between 1996-2000;
    • TELEPHONE: transcriptions from University of Pennsylvania’s Linguistic Data Consortium Switchboard corpus of two-sided, telephone conversations that took place in 1990 or 1991;
    • TRAVEL: travel guides published by Berlitz Publishing in the early 2000s;
    • VERBATIM: short posts about linguistics for non-specialists from the Verbatim archives written between 1990 and 1996; and
    • FICTION: for our tenth genre, we compile several freely available works of contemporary fiction written between 1912 and 2010 spanning various genres including mystery, humor, science fiction, and adventure.”

  • Breaking NLI follows MultiNLI, which in turn follows SNLI.
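MultiNLI is distributed as JSON Lines; a sketch of partitioning examples into matched and mismatched subsets by genre, assuming the `gold_label` / `genre` / `sentence1` / `sentence2` fields of the released files (the records below are invented toy examples):

```python
import json

# Five of MultiNLI's ten genres appear in training; examples from the
# remaining genres form the "mismatched" evaluation sets.
TRAIN_GENRES = {"fiction", "government", "slate", "telephone", "travel"}

records = [json.dumps(r) for r in (
    {"gold_label": "entailment", "genre": "fiction",
     "sentence1": "A dog runs.", "sentence2": "An animal moves."},
    {"gold_label": "contradiction", "genre": "letters",
     "sentence1": "He gave money.", "sentence2": "He gave nothing."},
)]

def split_matched(jsonl_lines, train_genres=TRAIN_GENRES):
    """Partition examples into matched (training genres) and
    mismatched (held-out genres), as in the MultiNLI dev/test design."""
    matched, mismatched = [], []
    for line in jsonl_lines:
        ex = json.loads(line)
        (matched if ex["genre"] in train_genres else mismatched).append(ex)
    return matched, mismatched

matched, mismatched = split_matched(records)
```

Reporting accuracy separately on the two partitions is what turns MultiNLI into a cross-genre generalization test.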

SNLI (Stanford Natural Language Inference) Corpus. Natural language inference (NLI), also known as “recognizing textual entailment” (RTE), is the task of identifying the relationship (entailment, contradiction, and neutral) that holds between a premise $\small p$ (e.g. a piece of text) and a hypothesis $\small h$. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI) Corpus, contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels “entailment,” “contradiction,” and “neutral,” supporting the task of NLI.

Natural Language Processing Datasets

The Billion Word dataset contains 768M word tokens and has a vocabulary of about 800K word types, which corresponds to words with more than 3 occurrences in the training set.

  • Paper: Chelba et al. [… Tomas Mikolov …], One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling (Google: Dec 2013; updated Mar 2014):

    “We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. …”

Gigaword.  The English Gigaword is a sentence summarization dataset based on Annotated Gigaword (Napoles et al., 2012) and consists of sentence pairs: the first sentence of each collected news article and its corresponding headline. The data were preprocessed by Rush et al. (2015) (Jason Weston/Facebook AI Research; Sep 2015), yielding 3.8M sentence pairs for training, 8K for validation and 2K for testing.

  • “For training data for both tasks, we [Rush et al. (2015)] utilize the annotated Gigaword data set (Graff et al., 2003; Napoles et al., 2012), which consists of standard Gigaword, preprocessed with Stanford CoreNLP tools (Manning et al., 2014). Our model only uses annotations for tokenization and sentence separation, although several of the baselines use parsing and tagging as well. Gigaword contains around 9.5 million news articles sourced from various domestic and international news services over the last two decades.”

MS-COCO.  COCO is a large-scale object detection, segmentation, and captioning dataset. See Microsoft COCO: Common Objects in Context (Feb 2015). “We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. …”

Penn Treebank (PTB).

  • Preprocessed by Mikolov et al. in Recurrent Neural Network Based Language Model (Sep 2010) [local copy]. Old (still extant: 2018-10-18) project page: RNNLM Toolkit  (search GitHub for “rnnlm toolkit”).

  • From Pointer Sentinel Mixture Models (Sep 2016):

    “In order to compare our model to the many recent neural language models, we conduct word-level prediction experiments on the Penn Treebank (PTB) dataset (Marcus et al., 1993), pre-processed by Mikolov et al. (2010). The dataset consists of 929k training words, 73k validation words, and 82k test words. As part of the pre-processing performed by Mikolov et al. (2010):

    • words were lower-cased,
    • numbers were replaced with $\small N$,
    • newlines were replaced with $\small \langle eos \rangle$, and
    • all other punctuation was removed.

    “The vocabulary is the most frequent 10k words with the rest of the tokens being replaced by an $\small \langle unk \rangle$ token. For full statistics, refer to Table 1 of the paper.”
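The preprocessing steps listed above can be sketched in a few lines (the regular expressions are illustrative approximations, not Mikolov et al.'s exact scripts):

```python
import re

def ptb_preprocess(line, vocab):
    """Sketch of Mikolov-style PTB preprocessing: lower-case, map
    numbers to N, strip remaining punctuation, replace out-of-vocabulary
    words with <unk>, and append <eos> for the line ending."""
    line = line.lower()
    line = re.sub(r"\d+(?:[.,]\d+)*", "N", line)   # numbers -> N
    line = re.sub(r"[^\w\s]", " ", line)           # drop other punctuation
    tokens = [t if (t == "N" or t in vocab) else "<unk>"
              for t in line.split()]
    return tokens + ["<eos>"]

# Tiny illustrative vocabulary standing in for the 10k-word PTB list.
vocab = {"the", "market", "fell", "points"}
tokens = ptb_preprocess("The market fell 508.3 points, traders said.", vocab)
```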


SciERC. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction  [data/code] introduced a joint multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. They created “SciERC,” a dataset that included annotations for all three tasks, and developed a unified framework called Scientific Information Extractor (SciIE) with shared span representations. [SciIE was able to automatically organize the extracted information from a large collection of scientific articles into a knowledge graph. …]

SciTail. The SciTail dataset is an NLI dataset created from multiple-choice science exams consisting of 27k sentence pairs. Each question and the correct answer choice have been converted into an assertive statement to form the hypothesis. The dataset is divided into training (23,596 pairs), development (1,304 pairs) and test sets (2,126 pairs). Unlike the SNLI and MultiNLI datasets, SciTail uses only two labels: entailment and neutral.

WikiText-103. The training data of WikiText-103 comprises about 100M tokens and a vocabulary of around 260K, corresponding to types with more than 3 occurrences in the training data. The dataset is composed of shuffled Wikipedia articles where the context carries across sentences.



  • Paper: Merity et al. [Richard Socher | MetaMind/Salesforce], Pointer Sentinel Mixture Models (Sep 2016):

    “… While the processed version of the Penn Treebank has been frequently used for language modeling, it has many limitations. The tokens in PTB are

    • all lower case,
    • stripped of any punctuation, and
    • limited to a vocabulary of only 10k words.

    “These limitations mean that the PTB is unrealistic for real language use, especially when far larger vocabularies with many rare words are involved. Fig. 3 illustrates this using a Zipfian plot over the training partition of the PTB. The curve stops abruptly when hitting the 10k vocabulary. Given that accurately predicting rare words, such as named entities, is an important task for many applications, the lack of a long tail for the vocabulary is problematic.”

    Construction and Preprocessing

    “We selected articles only fitting the ‘Good’ or ‘Featured’ article criteria specified by editors on Wikipedia. These articles have been reviewed by humans and are considered well written, factually accurate, broad in coverage, neutral in point of view, and stable. This resulted in 23,805 Good articles and 4,790 Featured articles. The text for each article was extracted using the Wikipedia API. Extracting the raw text from Wikipedia mark-up is nontrivial due to the large number of macros in use. These macros are used extensively and include metric conversion, abbreviations, language notation, and date handling.

    “Once extracted, specific sections which primarily featured lists were removed by default. Other minor bugs, such as sort keys and Edit buttons that leaked in from the HTML, were also removed. Mathematical formulae and $\small \LaTeX$ code, were replaced with $\small \langle formula \rangle$ tokens. Normalization and tokenization were performed using the Moses tokenizer (Koehn et al., 2007), slightly augmented to further split numbers $\small (8,600 → 8 @,@ 600)$ and with some additional minor fixes. Following Chelba et al. (2013) a vocabulary was constructed by discarding all words with a count below 3. Words outside of the vocabulary were mapped to the $\small \langle unk \rangle$ token, also a part of the vocabulary.

    “To ensure the dataset is immediately usable by existing language modeling tools, we have provided the dataset in the same format and following the same conventions as that of the PTB dataset.”
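The count-threshold vocabulary construction described above can be sketched as:

```python
from collections import Counter

def build_vocab(tokens, min_count=3):
    """WikiText-style vocabulary: following Chelba et al. (2013),
    discard all words with a count below min_count; out-of-vocabulary
    tokens then map to <unk>."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add("<unk>")
    return vocab

def apply_vocab(tokens, vocab):
    """Replace every out-of-vocabulary token with <unk>."""
    return [t if t in vocab else "<unk>" for t in tokens]

# Toy corpus: only "the" occurs 3+ times, so it alone survives.
corpus = "the cat sat on the mat the cat ran the dog".split()
vocab = build_vocab(corpus)
```

Unlike PTB's fixed 10k list, this count threshold preserves the long tail of rare-but-repeated words, which is the point of the WikiText design.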

Other Datasets

Bio-1kb  |  Bio-1kb-Hic  Although it gives the wrong citation (Belkin and Niyogi, 2001 – which ridiculously predates the Hi-C method), Adaptive Edge Features Guided Graph Attention Networks gives a good description of this dataset:

  • “One advantage of our EGAT model is to utilize the real-valued edge weights as features rather than simple binary indicators. To test the effectiveness of edge weights, we conduct experiments on a biological network: Bio1kb-Hic (Belkin and Niyogi 2001), which is a chromatin interaction network collected by chromosome conformation capture (3C) based technologies with 1kb resolution. The nodes in the network are genetic regions, and edges are strengths of interactions between two genetic regions. The dataset contains 7,023 nodes and 3,087,019 edges. The nodes are labeled with 18 categories, which indicate the genetic functions of genetic regions. There are no attributes associated with the nodes. We split the nodes into 3 subsets with sizes 75%, 15% and 15% [sic] for training, validation and testing, respectively. Note that the genetic interactions measured by 3C-based technologies are very noisy. Therefore, the Bio1kb-Hic dataset is highly dense and noisy. To test the noise-sensitivity of the algorithms, we generate a denoised version of the dataset by setting edge weights less than 5 to be 0. In the denoised version, the number of edges is reduced from 3,087,019 to 239,543 – which is much less dense.”

  • Datasets:

    “We applied NE to Hi-C [chromosome conformation capture] interaction networks. Hi-C is a 3C-based technology that allows measurement of pairwise chromatin interaction frequencies within a cell population. Hi-C read data can be thought of as a network where genomic regions are nodes and the normalized read counts mapped to two bins are weighted edges. Visual inspection of the Hi-C contact matrix before and after the Hi-C network is denoised using NE reveals an enhancement of edges within each community and sharper boundaries between communities [figure omitted]. This improvement is particularly clear for the 5kb resolution data, where communities that were visually undetectable in the raw data become clear after denoising with NE.”
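The edge-weight thresholding used to build the denoised Bio1kb-Hic variant amounts to dropping weak edges; a sketch on a toy weighted edge list:

```python
def denoise_edges(weighted_edges, threshold=5):
    """Drop edges with weight below the threshold, mirroring the
    denoised Bio1kb-Hic variant (weights < 5 set to 0, i.e. removed)."""
    return [(u, v, w) for (u, v, w) in weighted_edges if w >= threshold]

# Toy (node, node, weight) triples standing in for Hi-C contact counts.
edges = [(0, 1, 12.0), (0, 2, 2.5), (1, 2, 5.0), (2, 3, 0.7)]
kept = denoise_edges(edges)
```

On the real dataset this same cut reduces the edge count from 3,087,019 to 239,543.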