Copyright Notice

Copyright © 2018-present, Victoria A. Stuart

The material contained in this review, held in a username and password protected directory, is private and confidential. Time-limited access is granted to individuals for review purposes only, to whom Dr. Victoria A. Stuart, Ph.D. ( has directly granted access.

Please do not copy or distribute this work.

[Table of Contents]


The past several years have seen stunning advances in machine learning (ML) and natural language processing (NLP). In this TECHNICAL REVIEW I survey leading approaches to ML and NLP applicable to the biomedical domain, with a particular focus on:

  • construction of structured knowledge stores (commonsense knowledge):

    • textual knowledge stores (ideal for storing reference material)

    • knowledge graphs (ideal for storing relational data)

  • natural language understanding:

    • word embeddings (applicable to representation learning, relation extraction/link prediction, language understanding, and knowledge discovery):

    • natural language models (to better understand and leverage natural language and text);

    • natural language inference (recognizing textual entailment: identifying the relationship between a premise and a hypothesis)

    • reading comprehension (ability to process text, understand its meaning, and integrate it with preexisting knowledge);

    • commonsense reasoning (ability to make presumptions about the type and essence of ordinary situations);

    • question answering and recommendation (addressing user inquiry);

  • information overload (including text classification and summarization);

  • transfer learning and multi-task learning (leveraging existing knowledge and models for new tasks);

  • explainable/interpretable models (rendering machine learning decisions/output transparent to human understanding);

  • in silico modeling (using computer models to model biochemical, biomolecular, pharmacological and physiological processes).

Particular attention will be placed on advances in NLP and ML that are applicable to biomolecular and biomedical studies, and clinical science.

Please note the following.

  • The terms “machine learning” and “neural networks” are used interchangeably, and for convenience I will often repeat key abbreviation definitions in various subsections.

  • I will paraphrase references inline with relevant URLs provided, generally forgoing the use of author names etc. but occasionally mentioning key individuals and dates.

  • Internal references to other parts of this REVIEW (and my Glossary) are presented as green hyperlinks.

  • Please refer to my Glossary (and this glossary) for descriptions of some of the terms and concepts used in this REVIEW.

A recent review (Stanford University, July 2018) that provides an excellent introduction and overview of many of the topics discussed below is Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities.

[Table of Contents]


Natural language processing (NLP), a branch of machine learning (ML), is foundational to all information extraction and natural language tasks. Recent reviews of NLP relevant to this TECHNICAL REVIEW include:

Regarding the latter review, note my comments in the reddit thread Recent Trends in Deep Learning Based Natural Language Processing, which indicates an “issue” regarding any review (or proposed work) in the NLP and machine learning domains: the extraordinarily rapid rates of progress. During the course of preparing this REVIEW, highly-relevant literature and developments appeared almost daily on, RSS feeds, and other sources. I firmly believe that this rapid pace of progress represents outstanding research opportunities rather than barriers (e.g., proposing ML research that may quickly become “dated”).

Lastly, high-profile Ph.D. student/blogger Sebastian Ruder actively tracks progress in numerous subdomains in the NLP domain at NLP Progress  (alternate link).

Basic steps associated with NLP include text retrieval and preprocessing steps, including:

Additional NLP preprocessing steps may be included (or some of the steps above may be omitted), and the order of some of those steps may vary slightly.

Some recent ML approaches to NLP tasks include:

Cross-sentence n-ary relation extraction detects relations among n entities across multiple sentences. Typical methods formulate an input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The current state of the art method splits the input graph into two DAG, adopting a DAG-structured LSTM for each. Though being able to model rich linguistic knowledge by leveraging graph edges, important information can be lost in the splitting procedure. Song et al. (August 2018: N-ary Relation Extraction using Graph State LSTM  [code]) proposed a graph-state LSTM model, which used a parallel state to model each word, recurrently enriching state values via message passing. Compared with DAG LSTM, their graph LSTM kept the original graph structure, and sped up computation by allowing more parallelization. For example, given

“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the 858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.”

… their model conveyed the fact that cancers caused by the 858E mutation in the EGFR gene can respond to the anticancer drug gefitinib: the three entity mentions appeared in separate sentences yet formed a ternary relation. On a standard benchmark, their model outperformed a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state-of-the-art system of Peng et al. (2017) by 1.2%.

Song et al.’s code was an implementation of Peng et al.’s Cross-Sentence N-ary Relation Extraction with Graph LSTMs (different authors/project; project/code]), modified with regard to the edge labels (discussed by Song et al. in their paper).

[Table of Contents]


Machine learning is particularly well suited to assisting and even supplanting many standard NLP approaches (for a good review see Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities). Language models for example, provide improved understanding of the semantic content and latent (hidden) relationships in documents. Machine based natural language understanding (NLU) is a fundamental requirement for robust, human level performance in tasks such as information retrieval, text summarization, question answering, textual entailment, sentiment analysis, reading comprehension, commonsense reasoning, recommendation, etc.

Advances in NLU offer tremendous promise for the analysis of biomedical and clinical text, which due to the use of technical, domain-specific jargon is particularly challenging for traditional NLP approaches. Some of these challenges and difficulties are described in the August 2018 post NLP’s Generalization Problem, and How Researchers are Tackling It  [discussion].

Recent developments in NLP and ML that I believe are particularly important to advancing NLU include:

  • understanding the susceptibility of QA systems to adversarial challenge;

  • the development of deeply-trained language models;

  • transfer learning and multi-task learning;

  • reasoning over graphs;

  • the development of more advanced memory and attention-based architectures; and,

  • incorporating external memory mechanisms (e.g. a differentiable neural computer, which is essentially an updated version of a neural Turing machine: see this discussion).

The use of memory (the ability to recall previous facts and knowledge) is a key requirement to NLU, and commonsense and relational reasoning (the process of understanding the ways in which entities are connected and using this understanding to accomplish some higher order goal). Working memory is an essential component of reasoning – the process of forming an answer a new question by manipulating previously acquired knowledge (From Machine Learning to Machine Reasoning).

Computational approaches to memory may be broadly classified as

  • internal, e.g. transient “short-term” memories algorithmically generated within RNN, LSTM, dynamic memory networks, and self-attention modules; and,

  • external, e.g., fixed, long term memory stored in knowledge bases and knowledge graphs.

For example, recurrent neural networks (RNN), long-short-term memory (LSTM), dynamic memory networks (DMN), etc. can serve as working memory in summarization, question answering and other tasks (all discussed below).

DeepMind’s recent paper Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies addressed preserving and reusing past knowledge (memory) via unsupervised representation learning using a variational autoencoder: VASE (Variational Autoencoder with Shared Embeddings). VASE automatically detected shifts in data distributions and allocated spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting: “… thanks to learning a generative model of the observed environments, we can prevent catastrophic forgetting by periodically “hallucinating” (i.e. generating samples) from past environments using a snapshot of VASE, and making sure that the current version of VASE is still able to model these samples. A similar “dreaming” feedback loop was used in Lifelong Generative Modeling

  • For similar, prior work by other authors (cited) that also used a variational autoencoder, see Lifelong Generative Modeling.

  • As noted, each of the papers cited above addressed the issue of catastrophic forgetting. Interestingly, the Multitask Question Answering Network (MQAN), described in Richard Socher’s “decaNLP/MQAN” paper, attained robust multitask learning, performing nearly as well or better in the multitask setting as in the single task setting for each task despite being capped at the same number of trainable parameters in both. … This suggested that MQAN successfully used trainable parameters more efficiently in the multitask setting by learning to pack or share parameters in a way that limited catastrophic forgetting.

Google Brain’s A Simple Method for Commonsense Reasoning  [code] presented a simple method for commonsense reasoning with neural networks, using unsupervised learning. Key to the method was the use of an array of large RNN language models that operated at word or character level, trained on a massive amount of unlabeled data, to score multiple choice questions posed by commonsense reasoning tests.

A Simple Neural Network Module for Relational Reasoning  [DeepMind blog;  non-author code here and here;  discussion herehere and here] by DeepMind described “Relation Networks,” a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning including visual question answering, text-based question answering using the bAbI suite of tasks, and complex reasoning about dynamic physical systems. They showed that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with relational networks, to implicitly discover and learn to reason about entities and their relations.

[Table of Contents]

Word Embeddings

Embeddings (word embeddings) is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from a vocabulary are mapped to vectors of real numbers (Sebastian Ruder provides a good overview; see also this excellent post, Introduction to Word Embeddings). Conceptually it involves a mathematical embedding from a sparse, highly dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Mikolov (2013) Efficient Estimation of Word Representations in Vector Space].

Word embeddings are widely used in predictive NLP modeling, particularly in deep learning applications (Word Embeddings: A Natural Language Processing Crash Course). Word embeddings enable the identification of similarities between words and phrases, on a large scale, based on their context. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically; e.g.,

    vBerlin - vGermany + vFrance ~ vParis
    vking - vman + vwoman ~ vqueen.

The original work for generating word embeddings was presented by Bengio et al. in 2003 (A Neural Probabilistic Language Model (which builds on his 2001 (NIPS 2000) “feature vectors” paper A Neural Probabilistic Language Model), who trained them in a neural language model together with the model’s parameters.

Despite the assertion by Sebastian Ruder in An Overview of Word Embeddings and their Connection to Distributional Semantic Models that Bengio coined the phrase “word embeddings” in his 2003 paper, the term “embedding” does not appear in that paper. The Abstract does state the concept, however: “We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.”. The correct attribution is likely Bengio’s similarly-named 2006 paper Neural Probabilistic Language Models, which states (bottom of p. 162): “Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes.” The full reference is: Y. Bengio et al. (2006) Neural Probabilistic Language Models. StudFuzz 194:137-186.

Collobert and Weston demonstrated the power of pre-trained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pre-trained embeddings (Efficient Estimation of Word Representations in Vector Space).

Likewise – I’m being rather critical here – the 2008 Collobert and Weston paper, above, mentions “embedding” [but not “word embedding”, and cites Bengio’s 2001 (NIPS 2000) paper], while Mikolov’s 2013 paper does not mention “embedding” and cites Bengio’s 2003 paper.

For a theoretical discussion of word vectors, see Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline  [pdfcodediscussion], which is a critique/extension of A Latent Variable Model Approach to PMI-based Word Embeddings. In addition to proposing a new generative model – a dynamic version of the log-linear topic model of Mnih and Hinton (2007) [Three New Graphical Models for Statistical Language Modelling] – the paper provided a theoretical justification for nonlinear models like PMI, word2vec, and GloVe. It also helped explain why low dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013)  [see the algebraic examples, above]. Experimental support was provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.

Related, Sebastion Ruder recently provided a summary of ACL 2018 highlights, including a subsection entitled Understanding Representations: “It was very refreshing to see that rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture.”

Word embeddings are a particularly striking example of learning a representation, i.e. representation learning (Bengio et al., Representation Learning: A Review and New Perspectives;  see also the excellent blog posts Deep Learning, NLP, and Representations by Chris Olah, and An introduction to representation learning by Michael Alcorn). Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. While traditional unsupervised learning techniques are staples of machine learning, representation learning has emerged as an alternative approach to feature extraction (An Introduction to Representation Learning). In representation learning, features are extracted from unlabeled data by training a neural network on a secondary, supervised learning task. Word2vec is a good example of representation learning, simultaneously learning several language concepts:

  • the meanings of words;
  • how words are combined to form concepts (i.e., syntax); and,
  • how concepts relate to the task at hand.

[Table of Contents]

Addressing Hypernymy and Polysemy with Word Embeddings

Word embeddings have many uses in NLP. For example, polysemy – words or phrases with different, but related, meanings [e.g. “Washington” may refer to “Washington, DC” (location) or “George Washington” (person)] – pose one of many challenges to NLP. Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal.

In Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation  [code] the authors offered a solution to the polysemy problem. They proposed a novel embedding method specifically designed for named entity disambiguation that jointly mapped words and entities into the same continuous vector space. Since similar words and entities were placed close to one another in vector space in this model, the similarity between any pair of items (e.g. words, entities, and a word and an entity) could be measured by simply computing their cosine similarity.

Though not cited in that paper, the code for that work (by coauthor and Studio Ousia employee Ikuya Yamada) was made available on GitHub in the Wikipedia2Vec repository; the Wikipedia2Vec project page contains the pretrained embeddings (models).

A probabilistic extension of fastText, Probabilistic FastText for Multi-Sense Word Embeddings (discussed elsewhere in this REVIEW) can produce accurate representations of rare, misspelt, and unseen words. Probabilistic FastText achieved state of the art performance on benchmarks that measure ability to discern different meanings. The proposed model was the first to achieve multi-sense representations while having enriched semantics on rare words:

  • “Our multimodal word representation can also disentangle meanings, and is able to separate different senses in foreign polysemies. In particular, our models attain state-of-the-art performance on SCWS, a benchmark to measure the ability to separate different word meanings, achieving 1.0% improvement over a recent density embedding model W2GM (Athiwaratkun and Wilson, 2017). To the best of our knowledge, we are the first to develop multi-sense embeddings with high semantic quality for rare words.”

  • “… we show that our probabilistic representation with subword mean vectors with the simplified energy function outperforms many word similarity baselines and provides disentangled meanings for polysemies.”

  • “We show that our embeddings learn the word semantics well by demonstrating meaningful nearest neighbors. Table 1 shows examples of polysemous words such as ‘rock’, ‘star’, and ‘cell’. Table 1 shows the nearest neighbors of polysemous words. We note that subword embeddings prefer words with overlapping characters as nearest neighbors. For instance, ‘rock-y’, ‘rockn’, and ‘rock’ are both close to the word ‘rock’. For the purpose of demonstration, we only show words with meaningful variations and omit words with small character-based variations previously mentioned. However, all words shown are in the top-100 nearest words. We observe the separation in meanings for the multi-component case; for instance, one component of the word ‘bank’ corresponds to a financial bank whereas the other component corresponds to a river bank. The single-component case also has interesting behavior. We observe that the subword embeddings of polysemous words can represent both meanings. For instance, both ‘lava-rock’ and ‘rock-pop’ are among the closest words to ‘rock’.”

Wasserstein is All you Need  [discussion] proposed a unified framework for building unsupervised representations of individual objects or entities (and their compositions), by associating with each object both a distributional as well as a point estimate (vector embedding). Their method gives a novel perspective for building rich and powerful feature representations that simultaneously capture uncertainty (via a distributional estimate) and interpretability (with the optimal transport map). Among their various applications (e.g. entailment detection; semantic similarity), they proposed to represent sentences as probability distributions to better capture the inherent uncertainty and polysemy, arguing that histograms (or probability distributions) over embeddings allows the capture of more of this information than point-wise embeddings, alone. They discuss hypernymy detection in Section 7; for this purpose, they relied on a recently proposed model that which explicitly modeled what information is known about a word by interpreting each entry of the embedding as the degree to which a certain feature is present.

  • “While existing methods represent each entity of interest (e.g., a word) as a single point in space (e.g., its embedding vector), we here propose a fundamentally different approach. We represent each entity based on the histogram of contexts (co-occurring with it), with the contexts themselves being points in a suitable metric space. This allows us to cast the distance between histograms associated with the entities as an instance of the optimal transport problem [see Section 3 for a background on optimal transport]. For example, in the case of words as entities, the resulting framework then intuitively seeks to minimize the cost of moving the set of contexts of a given word to the contexts of another [note their Fig, 1]. Note that the contexts here can be words, phrases, sentences, or general entities co-occurring with our objects to be represented, and these objects further could be any type of events extracted from sequence data …”

  • Regarding semantic embedding, or word sense disambiguation (not explicitly discussed in the paper), their Fig.2 [Illustration of three words, each with their distributional estimates (left), as well as the point estimates of the relevant contexts (middle), as well as joint representation (right)] is very interesting: words in vector space, along with a histogram of their probability distributions over those embedded spaces.

  • “Software Release. We plan to make all our code (for all these parts) and our pre-computed histograms (for the mentioned datasets) publicly available on GitHub soon.”

Related to polysemy and named entity disambiguation is word sense disambiguation (WSD). Learning Graph Embeddings from WordNet-based Similarity Measures (discussed elsewhere in this REVIEW]) described a new approach, path2vec, for learning graph embeddings that relied on structural measures of node similarities for generation of training data. Evaluations of the proposed model on semantic similarity and WSD tasks showed that path2vec yielded state of the art results.

In 2018 Ruslan Salakhutdinov and colleagues proposed a probabilistic graphical model (discussed elsewhere in this REVIEW) that leveraged a topic model to design a WSD system (WSD-TM) that scaled linearly with the number of words in the context (Knowledge-based Word Sense Disambiguation using Topic Models). Their logistic normal topic model – a variant of latent Dirichlet allocation in which the topic proportions for a document were replaced by WordNet synsets (sets of synonyms) – incorporated semantic information about synsets as its priors. WSD-TM outperformed state of the art knowledge-based WSD systems.

Early in 2018 deeply trained language models such as ELMo (Deep Contextualized Word Representations;  discussed elsewhere in this REVIEW) offered another approach to solve the polysemy problem.

[Table of Contents]

Applications of Embeddings in the Biological Sciences

While predicting protein 3D structure from primary amino acid sequences has been a long-standing objective in bioinformatics, definite solutions remain to be found  [discussion]. The most reliable approaches currently available involve homology modeling, which allows assigning a known protein structure to an unknown protein, provided that there is detectable sequence similarity between the two. When homology modeling is not viable de novo techniques, based on physical-based potentials or knowledge-based potentials, are needed. Unfortunately proteins are very large molecules, and the huge amount of available conformations, even for relatively small proteins, makes it prohibitive to fold them even on customized computer hardware.

To address this challenge, knowledge based potentials can be learned from statistics or machine learning methods to infer useful information from known examples of protein structures. This information can be used to constrain the problem, greatly reducing the amount of samples that need to be evaluated when dealing exclusively with physics-based potentials. Multiple sequence alignments (MSA) consists of aligned sequences homologous to the target protein, compressed into position-specific scoring matrices (PSSM, also called sequence profiles) using the fraction of occurrences of different amino acids in the alignment for each position in the sequence. More recently, contact map prediction methods have been at the center of renewed interest; however, their impressive performance is correlated with the amount of sequences in the MSA, and is not as reliable when few sequences are related to the target.

rawMSA: proper Deep Learning makes protein sequence profiles and feature extraction obsolete introduced a new approach, called rawMSA, for the de novo prediction of structural properties of proteins. The core idea behind rawMSA was to borrow from the word2vec word embedding from Mikolov et al. (Efficient Estimation of Word Representations in Vector Space), which they used to convert each character (amino acid residue) in the MSA into a floating point vector of variable size, thus representing the residues by the structural property they were trying to predict. Test results from deep neural networks based on this concept showed that rawMSA matched or outperformed the state of the art on three tasks: predicting secondary structure, relative solvent accessibility, and residue-residue contact maps.

[Table of Contents]

Memory-Based Architectures

Internal Memory Mechanisms

PUFFIN Long short-term memory (LSTM), a type of recurrent neural network (RNN) capable of learning long-term dependencies, are commonly employed for memory purposes in the various models discussed in this REVIEW. While on the surface LSTM based approaches generally appear to perform well for memory and recall, upon deeper inspection they can also display significant limitations. For example, my cursory PRELIMINARY RESULT using the BiDAF/SQuAD question answering online demo found that the BiDAF model (discussed elsewhere in this REVIEW) performed well on some queries, but failed on other semantically and syntactically identical questions (e.g. with changes in character case, or punctuation), as well as queries on entities not present in the text. While BiDAF a employed hierarchical multi-stage process consisting of six layers (character embedding, word embedding, contextual embedding, attention flow, modeling and output layers), it employed GloVe pre-trained word vectors for the word embedding layer to map each word to a high-dimensional vector space (a fixed embedding of each word). This led me to suspect that the shallow embeddings encoded in the GloVe pre-trained word vectors failed to capture the nuances of processed text.

[excerpted/paraphrased from NLP’s ImageNet moment has arrived]:

“Pretrained word vectors have brought NLP a long way. Proposed in 2013 as an approximation to language modeling, word2vec found adoption through its efficiency and ease of use … word embeddings pretrained on large amounts of unlabeled data via algorithms such as word2vec and GloVe are used to initialize the first layer of a neural network, the rest of which is then trained on data of a particular task. … Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.

“Word2vec and related methods are shallow approaches that trade expressivity for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words. This is the core aspect of language understanding, and it requires modeling complex language phenomena such as compositionality, polysemy, anaphora, long-term dependencies, agreement, negation, and many more. It should thus come as no surprise that NLP models initialized with these shallow representations still require a huge number of examples to achieve good performance.

“At the core of the recent advances of ULMFiT, ELMo, and the OpenAI Transformer is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.”

Those issues were perhaps also suggestive of shortcomings in the BiDAF encoding, decoding, and/or attention modules. Recent discussions by Stanford University researchers (Adversarial Examples for Evaluating Reading Comprehension Systems  [codediscussion]) are highly apropos to this issue, motivating University of Chicago/Google/Google Brain researchers (Did the Model Understand the Question?  [code]). Adversarial challenges to SQuAD1.1 (e.g., by adding adversarially inserted sentences to text passages without changing the correct answer or misleading humans) easily distracted recurrent neural network/attentional-based algorithms like BiDAF and LSTM, leading to incorrect answers. Additionally, although deep learning networks were quite successful overall, they often ignored important question terms and were easily perturbed by adversarial-modified content – again giving incorrect answers. PUFFIN Unquestionably, LSTM based language models have been important drivers of progress in NLP (On the State of the Art of Evaluation in Neural Language ModelsRegularizing and Optimizing LSTM Language Models  [code]; for an older but superb review see Andrej Karpathy’s The Unreasonable Effectiveness of Recurrent Neural Networks). However, as these models are increasingly employed for textual summarization, QA, NLU, natural language inference, and commonsense reasoning tasks NLP researchers are also increasingly questioning both the relevance and performance of RNN/LSTM for learning natural language. Sebastian Ruder included these comments in his recent summary of ACL 2018 highlights:

Another way to gain a better understanding of a [NLP] model is to analyze its inductive bias. The “Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP” (RELNLP) sought to explore how useful it is to incorporate linguistic structure into our models. One of the key points of Chris Dyer’s talk during the workshop was whether RNNs have a useful inductive bias for NLP. In particular, he argued that there are several pieces of evidence indicating that RNNs prefer sequential recency, namely:

  • Gradients become attenuated across time. LSTMs or GRUs may help with this, but they also forget.
  • People have used training regimes like reversing the input sequence for machine translation.
  • People have used enhancements like attention to have direct connections back in time.
  • For modeling subject-verb agreement, the error rate increases with the number of attractors (Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies)

According to Chomsky, sequential recency is not the right bias for learning human language. RNNs thus don’t seem to have the right bias for modeling language, which in practice can lead to statistical inefficiency and poor generalization behaviour. Recurrent neural network grammars, a class of models that generates both a tree and a sequence sequentially by compressing a sentence into its constituents, instead have a bias for syntactic (rather than sequential) recency (Recurrent Neural Network Grammars). However, it can often be hard to identify whether a model has a useful inductive bias. For identifying subject-verb agreement, Chris hypothesizes that LSTM language models learn a non-structural “first noun” heuristic that relies on matching the verb to the first noun in the sentence. In general, perplexity (and other aggregate metrics) are correlated with syntactic/structural competence, but are not particularly sensitive at distinguishing structurally sensitive models from models that use a simpler heuristic.

Understanding the failure modes of LSTMs

Better understanding representations was also a theme at the Representation Learning for NLP workshop. During his talk, Yoav Goldberg detailed some of the efforts of his group to better understand representations of RNNs. In particular, he discussed recent work on extracting a finite state automaton from an RNN in order to better understand what the model has learned (Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples). He also reminded the audience that LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data. Even when a model has been trained using a domain-adversarial loss to produce representations that are invariant of a certain aspect, the representations will be still slightly predictive of said attribute. It can thus be a challenge to completely remove unwanted information from encoded language data and even seemingly perfect LSTM models may have hidden failure modes. On the topic of failure modes of LSTMs, a statement that also fits well in this theme was uttered by this year’s recipient of the ACL lifetime achievement award, Mark Steedman. He asked ‘LSTMs work in practice, but can they work in theory?’

A UC-Berkeley paper by John Miller and Moritz Hardt, When Recurrent Models Don’t Need To Be Recurrent  [author’s discussiondiscussion), studied the gap between recurrent and feed-forward models trained using gradient descent. They proved that stable RNN are well approximated by feed-forward networks for the purpose of both inference and training by gradient descent. If the recurrent model is stable (meaning the gradients can not explode), then the model can be well-approximated by a feed-forward network for the purposes of both inference and training. In other words, they showed feed-forward and stable recurrent models trained by gradient descent are equivalent in the sense of making identical predictions at test-time. [Of course, not all models trained in practice are stable: they also gave empirical evidence the stability condition could be imposed on certain recurrent models without loss in performance.]

Recurrent models feature flexibility and expressivity that come at a cost. Empirical experience shows these [RNN] models are often more delicate to tune and more brittle to train than standard feed-forward architectures. Recurrent architectures can also introduce significant computational burden compared with feed-forward implementations. In response to these shortcomings, a growing line of empirical research succeeds in replacing recurrent models effectively by feed-forward models in important applications, including translation, speech synthesis, and language modeling. In contrast to an RNN, the limited context of a feed-forward model means that it cannot capture patterns that extend more than k steps.

Although it appears trainability and parallelization for feed-forward models comes at the price of reduced accuracy, there have been several recent examples showing that feed-forward networks can actually achieve the same accuracies as their recurrent counterparts on benchmark tasks, including language modeling, machine translation, and speech synthesis. With regard to language modeling – in which the goal is to predict the next word in a document given all of the previous words – feed-forward models make predictions using only the k most recent words, whereas recurrent models can potentially use the entire document. The gated-convolutional language model is a feed-forward autoregressive model that is competitive with large LSTM baseline models. Despite using a truncation length of k=25, the model outperforms a large LSTM on the Wikitext-103 benchmark, which is designed to reward models that capture long-term dependencies. On the Billion Word Benchmark, the model is slightly worse than the largest LSTM, but is faster to train and uses fewer resources. This is perplexing, since recurrent models seem to be more powerful a priori.

When Recurrent Models Don’t Need To Be Recurrent coauthor John Miller continues his discussion in his related blog post:

One explanation for this phenomenon is given by Dauphin et al. (Language Modeling with Gated Convolutional Networks):

“The unlimited context offered by recurrent models is not strictly necessary for language modeling.”

In other words, it’s possible you don’t need a large amount of context to do well on the prediction task on average. Recent theoretical work offers some evidence in favor of this view (Prediction with a Short Memory). Another explanation is given by Bai et al. (An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling):

“The ‘infinite memory’ advantage of RNNs is largely absent in practice.”

As Bai et al. report, even in experiments explicitly requiring long-term context, RNN variants were unable to learn long sequences. On the Billion Word Benchmark, an intriguing Google Technical Report suggests an LSTM n-gram model with n=13 words of memory is as good as an LSTM with arbitrary context (N-gram Language Modeling using Recurrent Neural Network Estimation). This evidence leads us to conjecture:

Recurrent models trained in practice are effectively feed-forward.

This could happen either because truncated backpropagation time cannot learn patterns significantly longer than k steps, or, more provocatively, because models trainable by gradient descent cannot have long-term memory.

We know very little about how neural language models (LM) use prior linguistic context. A recent paper by Dan Jurafsky and colleagues at Stanford University, Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context investigated the role of context in a LSTM based LM, through ablation studies. On two standard datasets (Penn Treebank and WikiText-2) they found that the model was capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model was highly sensitive to the order of words within the most recent sentence, but ignored word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. They further found that the neural caching model (Improving Neural Language Models with a Continuous Cache) especially helped the LSTM copy words from within this distant context.

Paraphrased from Improving Neural Language Models with a Continuous Cache: the neural cache model augments neural language models with a longer-term memory that dynamically updates the word probabilities based on the long-term context. The neural cache stores the previous hidden states in memory cells for use as keys to retrieve their corresponding (next) word. A neural cache can be added on top of a pretrained language model at negligible cost.

External Memory Mechanisms

Relational database management systems (RDBMS), textual knowledge stores (TKS) and knowledge graphs (KG) represent well known examples of external knowledge stores, that may be leveraged as permanent external memory architectures suitable for NLP and ML.

The MemN2N architecture (End-To-End Memory Networks  [code;  non-author code herehere and here;  discussion here and here]) is a recurrent attention model over an external memory.

“Our model takes a discrete set of inputs x1, …, xn that are to be stored in the memory, a query q, and outputs an answer a. Each of the xi, q, and a contains symbols coming from a dictionary with V words. The model writes all x to the memory up to a fixed buffer size, and then finds a continuous representation for the x and q. The continuous representation is then processed via multiple hops to output a. This allows backpropagation of the error signal through multiple memory accesses back to the input during training.”

  • “… The entire set of {xi} are converted into memory vectors {mi} of dimension d computed by embedding each xi in a continuous space, in the simplest case, using an embedding matrix A (of size d × V). …”  ←  i.e., the vectorized input is stored as external memory

The model involved multiple computational steps (termed “hops”) per output symbol. In this RNN architecture, the recurrence read from a possibly large external memory multiple times before outputting a symbol. For question answering MemN2N was competitive with memory networks but with less supervision; for LM MemN2N demonstrated performance comparable to RNN and LSTM on the Penn TreeBank and Text8 datasets. In both cases they showed that the key concept of multiple computational hops yielded improved results. Unlike a traditional RNN, the average activation weight of memory positions during the memory hops did not decay exponentially: it had roughly the same average activation across the entire memory (Fig. 3 in their paper), which may have been the source of the observed improvement in LM.

“We also vary the number of hops and memory size of our MemN2N, showing the contribution of both to performance; note in particular that increasing the number of hops helps. In Fig. 3, we show how MemN2N operates on memory with multiple hops. It shows the average weight of the activation of each memory position over the test set. We can see that some hops concentrate only on recent words, while other hops have more broad attention over all memory locations, which is consistent with the idea that successful language models consist of a smoothed n-gram model and a cache. Interestingly, it seems that those two types of hops tend to alternate. Also note that unlike a traditional RNN, the cache does not decay exponentially: it has roughly the same average activation across the entire memory. This may be the source of the observed improvement in language modeling.

LSTM were also used in Augmenting End-to-End Dialog Systems with Commonsense Knowledge, which investigated the impact of providing commonsense knowledge about concepts (integrated as external memory) on human-computer conversation. Their method was based on a NIPS 2015 workshop paper, Incorporating Unstructured Textual Knowledge Sources into Neural Dialogue Systems, which described a method to leverage additional information about a topic using a simple combination of hashing and TF-IDF to quickly identify the most relevant portions of text from the external knowledge source, based on the current context of the dialogue. In that work, three recurrent neural networks (RNNs) were trained: one to encode the selected external information, one to encode the context of the conversation, and one to encode a response to the context. Outputs of these modules were combined to produce the probability that the response was the actual next utterance given the context. The model implemented in Augmenting End-to-End Dialog Systems with Commonsense Knowledge likewise integrated a large commonsense knowledge base (ConceptNet, in one example) into end-to-end conversational models. Although very confusingly/poorly described, the basic approach was as follows.

  • Concepts (keywords) were derived from each message (inputted sentence) using simple n-gram matching, where every n-gram in the concept was considered as a potential concept. If the n-gram was a key in a dictionary of the concepts derived from ConceptNet, the corresponding value or assertion, a (concept-relation-concept) relation triple, was added to the assertions dataset as external knowledge.

  • The authors employed a Tri-LSTM encoder: LSTM were used to encode message, response and commonsense assertions, where LSTM weights for message and response were tied. A component of the Tri-LSTM encoder (Fig. 3) served as the memory module, encoding all commonsense assertions. Their experiments suggested that their knowledge-augmented models were superior to their knowledge-free counterparts.

    Critique: this was a clever, but complicated, approach. It is mentioned here for its use of knowledge graphs as an external memory source in LSTM-based approaches. The approach was prone to noise due to uncertainties inherent in ConceptNet 5 where many of the assertions were considered “false,” “nonsense” or “vague” by human evaluators:

    False: 34/535 (6.4%);Nonsense: 50/535 (9.4%); Vague: 15/535 (2.8%); Don’t Know: 19/535 (3.6%); Sometimes: 117/535 (21.9%); True: 300/535 (56.1%); Total: 535.

    Source: Table 6.2 (p. 173) in Speer and Havasi (2012) “ConceptNet 5: A Large Semantic Network for Relational Knowledge” (Chapter 6 in Theory and Applications of Natural Language Processing  [local copy).

    Also, this work is conceptually identical to Yoshua Bengio’s earlier and technically superior work (A Neural Knowledge Language Model  [OpenReview], which is not cited. In that work, Bengio and colleagues proposed a neural knowledge language model which combined the symbolic knowledge provided by a knowledge graph with the expressive power of RNN language models. Like the “assertions” in Augmenting End-to-End Dialog Systems with Commonsense Knowledge, in Bengio’s work facts were represented as (subject, relationship, object) triples, a global vocabulary of frequent terms was maintained, and string matching was employed to align words in the topic description with their corresponding facts in the topic knowledge. The input consisting of a word and a fact went into LSTM; the LSTM output together with the knowledge context generated a fact key. Using the fact key, the fact embedding was retrieved from the topic knowledge memory.

In a similar approach, A Knowledge Enhanced Generative Conversational Service Agent described a system that consisted of a knowledge extractor, a knowledge encoder, and a knowledge enhanced seq2seq model including a query encoder and a decoder for response generation. The knowledge extractor was designed to obtain dialog-related knowledge from the web in a given dialog history including the latest query. In that phase, keywords from the dialog history by a keyword extraction component (RAKE; see also my PRELIMINARY WORK), which were used to capture external related knowledge from the web (the top-K most relevant results from/ranked by a search engine). A convolutional neural network (CNN) based knowledge encoder then extracted features of the obtained knowledge. In the knowledge enhanced seq2seq model, the encoder adopted a LSTM to project the dialog history and the user input into a real-valued vector, followed by the decoder which generated response based on the dialog history vector and the knowledge features.

Critique: this is a moderately clever idea, albeit prone to noise (acknowledged in their Conclusions), and other deficiencies.

Regarding external memory structures, note also the extensive work in the NLP domain regarding neural Turing machines  (NTM), and – to a lesser extent – differentiable neural computers  (DNC). For a slightly dated (current to ~2017) summary listing of NTM and DNC, see my web page (this is a huge file: on slow connections, wait for the page to fully load). Here are some papers that implement external memory structures.

  • Tracking the World State with Recurrent Entity Networks  [OpenReview; non-author code here and here], by Jason Weston and Yann LeCun, introduced the Recurrent Entity Network (EntNet). EntNet was equipped with a dynamic long-term memory, which allowed it to maintain and update a representation of the state of the world as it received new data. For language understanding tasks, it could reason on the fly as it read text, not just when it was required to answer a question or respond, as was the case for the MemN2N memory network (Weston’s End-To-End Memory Networks, discussed elsewhere in this REVIEW). Like a neural Turing machine or differentiable neural computer, EntNet maintained a fixed size memory and could learn to perform location and content-based read and write operations. However, unlike those models, it had a simple parallel architecture in which several memory locations could be updated simultaneously. EntNet set a new state of the art on the bAbI tasks, and was the first method to solve all the tasks in the 10k training examples setting. Weston and LeCun also demonstrated that EntNet could solve a reasoning task which required a large number of supporting facts, which other methods were not able to solve, and could generalize past its training horizon.

  • Survey of Reasoning using Neural Networks provided an excellent survey (including background summaries) of neural network approaches to reasoning and inference with a focus on the need for memory networks (e.g. the MemN2N end-to-end memory network, discussed elsewhere in this TECHNICAL REVIEW) and large external memories. Among the algorithms surveyed and compared are a LSTM, a NTM with a LSTM controller, and a NTM with a feedforward controller, demonstrating the superior performance of the NTM over the LSTM.

  • Robust and Scalable Differentiable Neural Computer for Question Answering  [code] – discussed above – was designed as a general problem solver which could be used in a wide range of tasks. Their GitHub repository contains an implementation of a Advanced Differentiable Neural Computer (ADNC), providing more robust and scalable use in Question Answering. The ADNC was applied to the 20 bAbI QA tasks with state of the art mean results, and to the CNN Reading Comprehension Task with passable results, without any adaptation or hyper-parameter tuning. The author’s Master’s Thesis contains additional detail.

[Table of Contents]

Attention and Working Memory

Jason Weston (Facebook AI Research)’s highly-cited End-To-End Memory Networks (discussed elsewhere in this TECHNICAL REVIEW) introduced the MemN2N neural architecture, a recurrent attention model over a possibly large external memory. The architecture was trained end-to-end and hence required significantly less supervision during training; the flexibility of the model allowed them to apply it to tasks as diverse as synthetic question answering and to language modeling. In both cases they showed that the key concept of multiple computational hops yielded improved results. [The MemN2N model involves multiple computational steps (termed “hops”) per output symbol.] Unlike a traditional RNN, the memory cache did not decay exponentially: it had roughly the same average activation across the entire memory (Fig. 3 in their paper), which may have been the source of the observed improvement in language modeling.

A recent paper from DeepMind (Relational Recurrent Neural Networks  [code; discussion here and here] is also of interest with regard to language modeling and reasoning over natural language text. While memory-based neural networks model temporal data by leveraging an ability to remember information for long periods, it is unclear whether they also have an ability to perform complex relational reasoning with the information they remember. In this paper the authors first confirmed their intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected (i.e., tasks involving relational reasoning). They then improved upon these deficits by using a new memory module, a Relational Memory Core (RMC; Fig. 1 in that paper), which showed large gains in reinforcement domains including language models (Sections 4.3 and 5.4).

  • Critique: while the DeepMind RMC model combined features of dynamic memory with an attention mechanism similar to Jason Weston’s DMN+ model (cited), they neither discuss nor compare the two models. Disappointingly, the DeepMind paper lacks ablation studies or other work to better understand their model: “… we cannot necessarily make any concrete claims as to the causal influence of our design choices on the model’s capacity for relational reasoning, or as to the computations taking place within the model and how they may map to traditional approaches for thinking about relational reasoning. Thus, we consider our results primarily as evidence of improved function - if a model can better solve tasks that require relational reasoning, then it must have an increased capacity for relational reasoning, even if we do not precisely know why it may have this increased capacity.

  • The RMC module employs multi-head dot product attention (MHDPA; also referred to as “self-attention”); this is a Google’s rebranding of their “Transformerseq2seq self-attention mechanism, first described in Attention Is All You Need.

  • An aside regarding this DeepMind Relational Recurrent Neural Networks paper: another DeepMind paper, Relational Deep Reinforcement Learning  [discussion] (released at the same time) introduced an approach for deep reinforcement learning that improved upon the efficiency, generalization capacity, and interpretability of conventional approaches through structured perception and relational reasoning. It used the same computationally efficient self-attention model (MHDPA, i.e. the Transformer model in Attention Is All You Need) to iteratively reason about the relations between entities in a scene and to guide a model-free policy. In these models entity-entity relations are explicitly computed when considering the messages passed between connected nodes of the graph (i.e. the relations between entities in a scene). MHDPA computes interactions between those entities (attention weights); an (underlying) graph defines the path to a solution, with the attention weights driving the solution. [Very cool.]

An Interpretable Reasoning Network for Multi-Relation Question Answering  [code] is another very interesting paper which addressed multi-relation question answering via elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. They presented a novel Interpretable Reasoning Network (IRN) model that employed an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decided which part of an input question should be analyzed at each hop, for which the reasoning module predicted a knowledge base relation (relation triple) that corresponded to the current parsed result. The predicted relation was used to update the question representation as well as the state of the reasoning module, and helped the model to make the next-hop reasoning. At each hop, an entity was predicted based on the current state of the reasoning module.

This model (IRN) yielded state of the art results on two datasets. More interestingly and different from previous models, *IRN offered traceable and observable intermediate predictions (see their Fig. 3), facilitating reasoning analysis and failure diagnosis (thereby also allowing manual manipulation in answer prediction). Whereas single-relation questions such as “How old is Obama?” can be answered by finding one fact triple in knowledge base/graph (this task has been widely studied), this work addressed multi-relation QA. Reasoning over multiple fact triples was required to answer multi-relation questions such as “Name a soccer player who plays at forward position at the club Borussia Dortmund.”, where more than one entity and relation are mentioned. On the datasets evaluated, *IRN outperformed other baseline models such as Weston’s MemN2N model (see Table 2 in the IRN paper). Through vector (space) representation, IRN could also establish reasonable mappings between knowledge base relations and natural language, such as linking “profession” to words like “working”, “profession”, and “occupation” (see their Table 4), which addresses the issue of out-of-vocabulary (OOV) words.

Working memory is an essential component of reasoning – the process of forming an answer a new question by manipulating previously acquired knowledge (From Machine Learning to Machine Reasoning). Memory modules are often implemented as a set of memory slots without explicit relational exchange of content, which does not naturally match multi-relational domains in which data is structured. Relational dynamic memory networks designed a new model, Relational Dynamic Memory Network (RDMN), to fill this gap. The memory could have single or multiple components, each of which realized a multi-relational graph of memory slots. The memory, dynamically updated in the reasoning process, was controlled by a central controller. The architecture is shown in their Fig. 1 (RDMN with single component memory): at the first step, the controller reads the query; the memory is initialized by the input graph, one node embedding per memory cell. Then during the reasoning process, the controller iteratively reads from and writes to the memory. Finally, the controller emits the output. RDMN performed well on several domains, including molecular bioactivity and chemical reactions.

  • This work builds on preliminary work, described in Graph Memory Networks for Molecular Activity Prediction  [non-author code].

  • Their Discussion provides an excellent summary (paraphrased here) that is relevant to this REVIEW:

    “The problem studied in this paper belongs to a broader program known as machine reasoning: unlike the classical focus on symbolic reasoning, here we aim for a learnable neural reasoning capability. We wish to emphasize that RDMN is a general model for answering any query about graph data. While the evaluation in this paper is limited to function calls graph, molecular bioactivity and chemical reaction, RDMN has a wide range of potential applications. For example, a drug (query) may act on the network of proteins as a whole (relational memory). In recommender systems, user can be modeled as a multi-relational graph (e.g., network between purchased items, and network of personal contacts); and query can be anything about them (e.g., preferred attributes or products). Similarly in healthcare, patient medical record can be modeled as multi-relational graphs about diseases, treatments, familial and social contexts; and query can be anything about the presence and the future of health conditions and treatments.

Collectively, the works discussed above indicate that:

  • shallow-trained models based on RNN (and perhaps also the deeply-trained Transformer LM) are likely to continue to perform poorly for NLU, without further improvement. A similar conclusion on the deeply-trained LSTM-based LM (ELMo and ULMFiT) awaits further research. In ELMo each token representation is a function of the entire input sentence, which may overcome the limitations of previous word embeddings where each word is usually modeled as an average of their multiple contexts. ELMo embeddings provide a better understanding of the contextual meaning of a word, as opposed to traditional word embeddings that are not only context-independent but have a very limited definition of context.

  • at lower model complexity and at lower computational cost, feed-forward networks achieve results on par with or better than recurrent networks;

  • gaps remain in our understanding of feed-forward networks, recurrent networks, language models, and their effects on natural language understanding;

  • other approaches may be needed, such as Richard Socher’s (SalesForce) Dynamic Memory Network (DMN+: Ask Me Anything: Dynamic Memory Networks for Natural Language ProcessingDynamic Memory Networks for Visual and Textual Question Answering), Jason Weston’s (Facebook AI Research) end-to-end memory networks (MemN2N: End-To-End Memory Networks), or DeepMind’s Relational Memory Core (RMC: Relational recurrent neural networks). Memory-augmented neural networks such as MemN2N solve a compartmentalization problem with a slot-based memory matrix but may have a harder time allowing memories to interact/relate with one another once they are encoded, whereas LSTM pack all information into a common hidden memory vector, potentially making compartmentalization and relational reasoning more difficult (Relational recurrent neural networks).

On the latter point, for an excellent discussion of attention vs. memory, see Denny Britz’ Attention and Memory in Deep Learning and NLP. Also, Attention in Long Short-Term Memory Recurrent Neural Networks discusses a limitation of the LSTM-based encode-decoder architectures (fixed-length internal representations of the input sequence) that attention mechanisms overcome, allowing the network to learn where to pay attention in the input sequence for each item in the output sequence. Particularly relevant to this REVIEW are the examples of attention in textual entailment (drawn from the DeepMind paper Reasoning about Entailment with Neural Attention  [non-author code here and here]) and text summarization (drawn from Jason Weston’s A Neural Attention Model for Abstractive Sentence Summarization) – the benefits of which are immediately obvious.

Also relevant to this discussion, researchers at the P.G. Allen School (University of Washington) in Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum discussed LSTM vs. self attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTM: LSTM are a hybrid of S-RNN (simple RNN) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Thus, the LSTM gates themselves are powerful recurrent models that provide more representational power than previously realized. They noted that first of all the LSTM weights are vectors, while attention typically computes scalar weights; i.e., a separate weighted sum is computed for every dimension of the LSTM’s memory cell. Second, the weighted sum is accumulated with a dynamic program. This enables a linear rather than quadratic complexity in comparison to self-attention, but reduces the amount of parallel computation. This accumulation also creates an inductive bias of attending to nearby words, since the weights can only decrease over time. Finally, attention has a probabilistic interpretation due to the softmax normalization, while the sum of weights in LSTM can grow up to the sequence length. In variants of the LSTM that tie the input and forget gate, such as coupled-gate LSTM and GRU, the memory cell instead computes a weighted average with a probabilistic interpretation. These variants compute locally normalized distributions via a product of sigmoids rather than globally normalized distributions via a single softmax. They conclude:

  • “Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicate that LSTMs suffer little to no performance loss when removing the S-RNN. This provides evidence that the gating mechanism is doing the heavy lifting in modeling context. We further ablate the recurrence in each gate and find that this incurs only a modest drop in performance, indicating that the real modeling power of LSTMs stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allows us to mathematically relate LSTMs and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enables the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.”

In the recent language modeling domain, whereas ELMo employs stacked Bi-LSTM, and ULMFiT employs stacked LSTM (with no attention, shortcut connections or other sophisticated additions), the OpenAI Transformer is a simple network architecture based solely on attention mechanisms that entirely dispenses with recurrence and convolutions. The OpenAI Transformer LM is entirely attention-based yet attains state of the art results. OpenAI Transformer is based on Google’s Transformer, introduced in Attention Is All You Need and discussed in the Google AI blog post Transformer: A Novel Neural Network Architecture for Language Understanding (Aug 2017). OpenAI Transformer surpassed the state of the art on neural machine translation tasks, and generalized well to other tasks. The Transformer model, based entirely on attention, replaced RNN with a multi-head attention that consisted of multiple attention layers. Due to the absence of recurrent layers in the model, Transformer trained significantly faster and outperformed all previously reported ensembles (Attention Is All You Need).

  • Multi-head dot product attention (MHDPA; also referred to as “self-attention”) is Google’s rebranding of their “Transformerseq2seq self-attention mechanism, first described in Attention Is All You Need.

  • Alexander Rush at HarvardNLP provides an excellent web page, The Annotated Transformer, complete with discussion and code (an “annotated” version of the paper in the form of a line-by-line implementation) [papercode]!

In July 2018, nearly a year after they introduced their Transformer architecture, Google Brain/DeepMind released an updated “Universal Transformer” version (Universal Transformers), discussed in the Google AI blog post Moving Beyond Translation with the Universal Transformer [Aug 2018;  discussion]:

  • “In Universal Transformers  [code, described in Tensor2Tensor for Neural Machine Translation] we extend the standard Transformer to be computationally universal (Turing complete) using a novel, efficient flavor of parallel-in-time recurrence which yields stronger results across a wider range of tasks. We built on the parallel structure of Transformer to retain its fast training speed, but we replaced Transformer’s fixed stack of different transformation functions with several applications of a single, parallel-in-time recurrent transformation function (i.e. the same learned transformation function is applied to all symbols in parallel over multiple processing steps, where the output of each step feeds into the next). Crucially, where an RNN processes a sequence symbol-by-symbol (left to right), Universal Transformer processes all symbols at the same time (like the Transformer), but then refines its interpretation of every symbol in parallel over a variable number of recurrent processing steps using self-attention. This parallel-in-time recurrence mechanism is both faster than the serial recurrence used in RNN, and also makes the Universal Transformer more powerful than the standard feedforward Transformer. [ … snip … ]
    [Source: Moving Beyond Translation with the Universal Transformer]

    • The performance benchmarks for Universal Transformer on the bAbI dataset (especially the more difficult “10k examples”: Table 1 in that paper) are particularly impressive (note also the MemN2N comparison). Appendix C shows the bAbI attention visualizations, of which the last example is particularly impressive (requiring three supportive facts to solve).

Google AI followed their Universal Transformers paper in August 2018 with Character-Level Language Modeling with Deeper Self-Attention [discussion], which showed that a deep (64-layer) Transformer model with fixed context outperformed RNN variants by a large margin, achieving state of the art on two popular benchmarks.

  • LSTM and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts.

  • While code is not yet released (2018-08-16), it will likely appear in Google’s TensorFlow tensor2tensor GitHub repository, “home” of their Transformer code.

  • For reference, in Learning to Generate Reviews and Discovering Sentiment OpenAI also trained a byte (character)-level RNN-based language model (a single layer multiplicative LSTM with 4096 units trained for a single epoch on an Amazon product review dataset), that even with data-parallelism across 4 Pascal Titan X GPU for which training took approximately one month.

    I am unsure how long it takes to train Google’s Transformer algorithm. However, RNN/CNN handle input sequences sequentially word-by-word which is an obstacle to parallelization; Transformer achieves parallelization by replacing recurrence with attention and encoding the symbol position in the sequence, which leads to significantly shorter training times [Source;  this GitHub Issue discusses parallelization over GPU and training times, etc. (results are GPU-number and batch size dependent). The Annotated Transformer also discusses this: under their setup, (8 NVIDIA P100 GPU; parametrization; …) they trained the base models for a total of 100,000 steps (12 hrs); big models were trained for 300,000 steps (3.5 days).].

Like Google, Facebook AI Research has also developed a seq2seq based self-attention mechanism to model long-range context (Hierarchical Neural Story Generation  [code/pretrained modelsdiscussion], applied to story generation. They found that standard seq2seq models applied to hierarchical story generation were prone to degenerating into language models that paid little attention to the writing prompt (a problem noted in other domains, such as dialogue response generation). To improve the relevance of the generated story to its prompt, they introduced a fusion mechanism (Cold Fusion: Training Seq2Seq Models Together with Language Models) where their model was trained on top of an pretrained seq2seq model. [For the first time, they showed that fusion mechanisms could help seq2seq models build dependencies between their input and output.] To improve over the pretrained model, the second model had to focus on the link between the prompt and the story. Since existing convolutional architectures only encode a bounded amount of context, they introduced a novel gated self-attention mechanism that allowed the model to condition on its previous outputs at different time-scales (i.e., to model long-range context). Similar to Vaswani et al. (2017) [Google’s Transformer (Attention Is All You Need)], they used multi-head attention to allow each head to attend to information at different positions. However, the queries, keys and values in their model were not given by linear projections (as in Section 3.2.2 in the Transformer paper), but by more expressive gated deep neural nets with Gated Linear Unit activations: gating lent the self-attention mechanism crucial capacity to make fine-grained selections.

Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding proposed a new self-attention mechanism for sentence embedding, Dynamic Self-Attention (DSA). They designed DSA by modifying dynamic routing in capsule networks for use in NLP. DSA attends to informative words with a dynamic weight vector, achieving new state of the art results among sentence encoding methods on the Stanford Natural Language Inference (SNLI) dataset – with the least number of parameters – while showing comparative results in Stanford Sentiment Treebank (SST) dataset. With the dynamic weight vector, the self attention mechanism could be furnished with flexibility, rendering it more effective for sentence embedding.

  • Capsule networks (briefly discussed elsewhere in this review; Understanding Capsule Network Architecture provides a good overview) were introduced by Geoff Hinton in 2018 in Dynamic Routing Between Capsules.

  • In capsule network, dynamic routing iteratively computes weights over inputs by the inner product between inputs and a weighted sum of inputs. Varying with the inputs, the weighted sum detects informative inputs; therefore it can be interpreted as a dynamic weight vector from the perspective of self-attention. “We expect the dynamic weight vector to give rise to flexibility in self-attention since it can adapt to given sentences even after training.”

Learning to Compose Neural Networks for Question Answering [codeauthor discussion) presented a compositional, attentional model for answering questions about a variety of world representations including images and structured knowledge bases. The model used natural language strings to automatically assemble neural networks from a collection of composable modules. Parameters for these modules were learned jointly with network-assembly parameters via reinforcement learning, with only (world, question, answer) triples as supervision. This approach, termed a “Dynamic Neural Model Network,” achieved state of the art results on benchmark datasets in both visual and structured domains. The model translates from questions to dynamically assembled neural networks, then applies these networks to world representations (images or knowledge bases) to produce answers. The model has two components, trained jointly: a collection of neural “modules” that can be freely composed, and a network layout predictor that assembles modules into complete deep networks tailored to each question (see their Figure 1). Training data consists of (world, question, answer) triples: the approach requires no supervision of network layouts. They achieved state of the art performance on two markedly different question answering tasks: questions about natural images, and another with more compositional questions about United States geography.

In NLP parts of speech (POS), content words are words that name objects of reality and their qualities. They signify actual living things (dog, cat, etc.), family members (mother, father, sister, etc.), natural phenomena (snow, Sun, etc.) common actions (do, make, come, eat, etc.), characteristics (young, cold, dark, etc.), etc. They consist mostly of nouns, lexical verbs and adjectives, but certain adverbs can also be content words. Content words contrast with function words, which are words that have very little substantive meaning and primarily denote grammatical relationships between content words, such as prepositions (in, out, under, etc.), pronouns (I, you, he, who, etc.), conjunctions (and, but, till, as, etc.), etc.

Most models based on the seq2seq model with an encoder-decoder framework are equipped with an attention mechanism, like Google’s Transformer mechanism. However, these conventional attention mechanisms treat the decoding at each time step equally with the same matrix, which is problematic since the softness of the attention for different types of words (e.g. content words and function words) should differ. Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation  [code: not yet available, 2018-08-24] addressed this issue, proposing a new model with a mechanism called Self-Adaptive Control of Temperature (SACT) to control the softness of attention by means of an “attention temperature.” They set a temperature parameter which could be learned by the model based on the attentions in the previous decoding time steps, as well as the output of the decoder at the current time step. With the temperature parameter, the model was able to automatically tune the degree of softness of the distribution of the attention scores. Specifically, the model could learn a soft distribution of attention weights which was more uniform for generating function words, and a hard distribution which is sparser for generating content words. In a neural machine translation task, they showed that SACT could attend to the most relevant elements in the source-side contexts, generating translation of high quality.

[Table of Contents]

Attention: Miscellaneous Applications

Although this is more NLP task related, I wanted to group the following content close to the language model “attentional mechanisms” discussion in my “Attention Based Mechanisms” subsection. Recent applications of Google *Transformer (or related attentional architectures) relevant to this REVIEW include their use in “slot filling” (relation extraction), question answering, and document summarization.*

Position-aware Self-attention with Relative Positional Encodings for Slot Filling described how to apply self-attention with relative positional encodings to the task of relation extraction. Their model relied only on attention: no recurrent or convolutional layers were used. The authors (Bilan and Roth) employed Google’s Transformer seq2seq model (Attention Is All You Need), also known as as multi-head dot product attention (MHDPA) or “self-attention.” Despite citing Zhang et al.’s (Christopher Manning; Stanford University) Position-aware Attention and Supervised Data Improve Slot Filling and using the TACRED relation extraction dataset introduced by Zhang et al. in that paper, Bilan and Roth claim “To the best of our knowledge, the transformer model has not yet been applied to relation classification as defined above (as selecting a relation for two given entities in context).” – a rather tenuous claim. Furthermore, they provide no code, while Zhang et al. release their code, here, and include ablation studies in their work. The attention mechanism used by Zhan et al. differs significantly from the Google Transformer model in their use of the summary vector and position embeddings, and the way their attention weights were computed. While Zhang et al.’s F1 scores (their Table 4) were slightly lower than Bilan and Roth’s (their Table 1) on the TACRED dataset, the ensemble model in the former had the best scores. Sample relations extracted from a sentence are shown in Fig. 1 in Zhang et al.

Discussed elsewhere in this REVIEW, Bidirectional Attention Flow for Machine Comprehension introduced the BiDAF framework, a multi-stage hierarchical process that used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization. BiDAF was subsequently used in QA4IE: A Question Answering based Framework for Information Extraction, a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, also using a knowledge base (Wikipedia Ontology) for entity recognition.

Related to attention-based relation extraction, Neural Architectures for Open-Type Relation Argument Extraction extended and redefined the problem of slot filling to the task of Open-type Relation Argument Extraction (ORAE): given a corpus, a query entity Q and a knowledge base relation (e.g.,”Q authored notable work with title X”), the model had to extract an argument of “non-standard entity type” (i.e., entities that cannot be extracted by a standard named entity tagger) from the corpus – hence, “open-type argument extraction.” This work also employed the Transformer described in Attention Is All You Need, used as a multi-headed self-attention mechanism in their encoders for computing sentence representations suitable for argument extraction. Their approach to ORAE had two conceptual advantages. First, it was more general than slot-filling as it was also applicable to non-standard named entity types that could not be dealt with previously. Second, while the problem they defined was more difficult than standard slot filling, they eliminated an important source of errors: tagging errors that propagate throughout the pipeline and that are notoriously hard to correct downstream. A wide range of neural network architectures to solve ORAE were examined, each consisting of a sentence encoder, which computed a vector representation for every sentence position, and an argument extractor, which extracted the relation argument from that representation. The combination of a RNN encoder with a CRF extractor gave the best results, +7% absolute F-measure better than a previously proposed adaptation of a state of the art question answering model (BiDAF). [“The dataset and code will be released upon publication.” – not available, 2018-08-17.]

Discussed earlier and briefly mentioned here, Generating Wikipedia by Summarizing Long Sequences] by Google Brain considered English Wikipedia as a supervised machine learning task for multi-document summarization. They used extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. They modified the Transformer architecture (Attention Is All You Need) to only consist of a decoder, which performed better in the case of longer input sequences compared to RNN and Transformer encoder-decoder models. They showed that their modeling improvements allowed them to generate entire Wikipedia articles.

Because the amount of text in input reference documents can be very large (see Table 2) it was infeasible to train an end-to-end abstractive model given the memory constraints of current hardware. Hence, they first coarsely selected a subset of the input using extractive summarization. The second stage involved training an abstractive model that generated the Wikipedia text while conditioning on this extraction. This two-stage process was inspired by by how humans might summarize multiple long documents: first highlighting pertinent information, then conditionally generating the summary based on the highlights.

Hierarchical Bi-directional Attention-based RNNs for Supporting Document Classification on Protein-protein Interactions Affected by Genetic Mutations  [code] leveraged word embeddings trained on PubMed abstracts. The authors argued that as the titles of the papers usually contains important information that is more salient than a typical sentence in the abstract. They therefore proposed a shortcut connection that integrated the title vector representation directly to the final feature representation of the document. They concatenated the sentence vector that represented the title and the vectors of the abstract to the document feature vector used as input to the task classifier. This system ranked first among the Document Triage Task of the BioCreative VI Precision Medicine Track.


The “spirit” of the BioCreative VI Track 4: Mining protein interactions and mutations for precision medicine (PM) is (bolded emphasis mine)

“The precision medicine initiative (PMI) promises to identify individualized treatment depending on a patients’ genetic profile and their related responses. In order to help health professionals and researchers in the precision medicine endeavor, one goal is to leverage the knowledge available in the scientific published literature and extract clinically useful information that links genes, mutations, and diseases to specialized treatments. … Understanding how allelic variation and genetic background influence the functionality of these pathways is crucial for predicting disease phenotypes and personalized therapeutical approaches. A crucial step is the mapping of gene products functional regions through the identification and study of mutations (naturally occurring or synthetically induced) affecting the stability and affinity of molecular interactions.”

Against those criteria and despite the title of this paper and this excerpt from the paper,

“In order to incorporate domain knowledge in our system, we annotate all biomedical named entities namely genes, species, chemical, mutations and diseases. Each entity mention is surround by its corresponding tags as in the following example: Mutations in <species>human</species> <gene>EYA1</gene> cause <disease>branchio-oto-renal (BOR) syndrome</disease> …”

… there is no evidence that mutations (i.e. genomic variants) were actually tagged. Mutations/variants are not discussed, nor is there any mention of “mutant” or “mutation” in their GitHub repository/code nor the parent repo.

Richard Socher and colleagues (SalesForce) also used shortcut connections [i.e., “residual layers” (Deep Residual Learning for Image Recognition)] in their paper A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks from higher layers to lower layers (lower-level task predictions) – in their tasks to reflect linguistic hierarchies.

[Table of Contents]

Language Models

A particularly exciting recent advance in NLP is the development of pretrained language models such as ELMo (released in February 2018 by Allen NLP), ULMFiT (May 2018 by and Aylien Ltd.), and OpenAI Transformer (June 2018 by OpenAI). Those papers demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks.

For an introductory discussion of the potential impact of those models, see Sebastian Ruder’s July 2018 blog post NLP’s ImageNet Moment has Arrived (Ruder is a coauthor on the ULMFiT paper), and the excellent YouTube video mentioned here  [local copy].

ImageNet [datasetcommunity challengesdiscussion  (local copy)] and related image classification, segmentation, and captioning challenges have had an enormous impact on the advancement of computer vision, deep learning, deep learning architecture, the use of pretrained models, transfer learning, attentional mechanisms, etc. – as well as identifying gaps in our understanding of that success (which has led to demystify how those deep neural networks classified images; explained very well in this excellent video, and other issues including the vulnerability of deep learning to the impact of adversarial attacks. It is anticipated that deeply trained language models may offer at least some degree of parallel impact in the NLP domain.

Aside: the availability of pretrained models is an important and often necessary advance in machine learning, as many of the current processing tasks in image processing and NLP language modeling are extremely computationally intensive. For example:

ELMo  [project;  tutorials here and here;  discussion here and here] models both the complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy: words or phrases with different, but related, meanings). These word vectors are learned functions of the internal states of a deep bidirectional language model (Bi-LSTM), which is pretrained on a large text corpus. Unlike most widely used word embeddings, ELMo word representations are deep, in that they are a function of the internal, hidden layers of the bi-directional Language Model (biLM), which provides a very rich representation about the tokens. More specifically, ELMo learns a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer. These word vectors can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis. The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions.

In Universal Language Model Fine-tuning for Text Classification  [project/code;  code here  herehere and here;  discussion herehereherehereherehere and here] Jeremy Howard and Sebastian Ruder at described their Universal Language Model Fine-tuning (ULMFiT) language model, a transfer learning method that can be applied to any task in NLP [but as of July 2018 they had only studied its use in classification tasks], as well as key techniques for fine-tuning a language model. They also provided the fastai.text and fastai.lm_rnn modules necessary to train/use their ULMFiT models, ULMFiT significantly outperformed the state of the art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, ULMFiT matched the performance of training from scratch on 100x more data.

OpenAI’s “Finetuned Transformer LM” (Improving Language Understanding by Generative Pre-Training)  [projectcode;  discussion: here and here] demonstrated that large gains on diverse natural language understanding (NLU) tasks such as textual entailment, question answering, semantic similarity assessment, and document classification could be realized by a two stage training procedure: generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, they made use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. The results provided a convincing example that pairing supervised learning methods with unsupervised pre-training works very well. They demonstrated the effectiveness of their approach on a wide range of benchmarks for NLU. Their general task-agnostic model outperformed discriminatively trained models that used architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied (see the table at Improving Language Understanding with Unsupervised Learning). For instance, they achieved absolute improvements of 8.9% on commonsense reasoning, 5.7% on question answering, and 1.5% on textual entailment (natural language inference).

  • The architecture employed in Improving Language Understanding by Generative Pre-Training, Finetuned Transformer LM, is Google’s Transformer seq2seq self-attention mechanism (later rebranded by Google as multi-head dot product attention (MHDPA; also referred to as “self-attention”), first described in Google’s Attention Is All You Need paper. The Transformer model has been shown to perform strongly on various tasks such as machine translation, document generation, and syntactic parsing. This model choice provided OpenAI with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, they utilized task-specific input adaptations derived from traversal-style approaches, which processed structured text input as a single contiguous sequence of tokens. As they demonstrated in their experiments, these adaptations enabled them to fine-tune effectively with minimal changes to the architecture of the pretrained model. OpenAI’s Finetuned Transformer LM model largely followed the original Transformer work: they trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads) …

  • The Transformer architecture does not use RNN (LSTM), relying instead on the use of the self-attention mechanism. The OpenAI authors (Improving Language Understanding by Generative Pre-Training) assert that in contrast to the LSTM models employed in ELMo and ULMFiT, the use of LSTM in those LM restricts their prediction ability to a short range; in contrast, OpenAI’s choice of Transformer networks allowed them to capture longer-range linguistic structure. Regarding better understand why LM pre-training of Transformers is effective, they hypothesized that the underlying generative model learns to perform many of the evaluated tasks in order to improve its language modeling capability and that the more structured attentional memory of the Transformer assisted in transfer, compared to LSTM.

Neural language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. In The Importance of Generation Order in Language Modeling Google Brain studied the influence of token generation order on model quality via a novel two-pass language model that produced partially-filled sentence “templates” and then filled in missing tokens. The most effective strategy generates function words in the first pass followed by content words in the second.

The Fine Tuning Language Models for Multilabel Prediction GitHub repository lists recent, leading language models – for which they examine the ability to use generative pre-training with language modeling objectives across a variety of languages for improving language understanding. Particular interest is spent on transfer learning to low-resource languages, where label data is scare.

[Table of Contents]

Probing the Effectiveness of Pretrained Language Models

Contextual word representations derived from pre-trained bidirectional language models (biLM) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks, including question answering, entailment and sentiment classification, constituency parsing, named entity recognition, and text classification. However, many questions remain as to how and why these models are so effective. An extremely interesting paper (August 2018) from the Allen Institute for Artificial Intelligence, Dissecting Contextual Word Embeddings: Architecture and Representation  [note also the Appendices], presented a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. They showed there is a tradeoff between speed and accuracy, but all architectures learned high quality contextual representations that outperformed word embeddings for four challenging NLP tasks (natural language inference/textual entailment; semantic role labeling; constituency parsing; named entity recognition).

That study also showed that Deep biLM learned a rich hierarchy of contextual information, both at the word and span level, that was captured in three disparate types of network architectures (LSTM, CNN, or self attention). In every case, the learned representations represented a rich hierarchy of contextual information throughout the layers of the network in an analogous manner to how deep CNNs trained for image classification learn a hierarchy of image features (Zeiler and Fergus, 2014). For example, they showed that in contrast to traditional word vectors which encode some semantic information, the word embedding layer of deep biLM focused exclusively on word morphology. Moving upward in the network, the lowest contextual layers of biLM focused on local syntax, while the upper layers could be used to induce more semantic content such as within-sentence pronominal coreferent clusters. They also showed that the biLM activations could be used to form phrase representations useful for syntactic tasks. Together, these results suggest that large scale biLM, independent of architecture, are learning much more about the structure of language than previous appreciated.

Deep RNNs Encode Soft Hierarchical Syntax evaluated how well a simple feed-forward classifier could detect syntax features (part of speech tags as well as various levels of constituent labels) from the word representations produced by the RNN layers from deep NLP models trained on the tasks of dependency parsing, semantic role labeling, machine translation, and language modeling. They demonstrated that deep RNN trained on NLP tasks learned internal representations that captured soft hierarchical notions of syntax across different layers of the model (i.e., the representations taken from deeper layers of the RNNs perform better on higher-level syntax tasks than those from shallower layers), without explicit supervision. These results provided some insight as to why deep RNNs are able to model NLP tasks without annotated linguistic features. ELMo, for example, represents each word using a task-specific weighted sum of the language models hidden layers; i.e., rather than using only the top layer, ELMo selects which of the language models internal layers contain the most relevant information for the task at hand.

Evaluation of Sentence Embeddings in Downstream and Linguistic Probing Tasks surveyed recent unsupervised word embedding models, including fastText (discussed above under “Text Classification”), ELMo and other models. They noted that two main challenges exist when learning high-quality representations: they should capture semantic and syntax, and the different meanings the word can represent in different contexts (polysemy). ELMo, which performed very well among the models surveyed, addressed both of these issues. [Likewise, the best-performing parser in Constituency Parsing with a Self-Attentive Encoder  [code] used a factored self-attentive encoder over ELMo word representations (pretrained word embeddings).] As in fastText, ELMo breaks the tradition of word embeddings by incorporating sub-word units, but ELMo has also some fundamental differences with previous shallow representations such as fastText or Word2Vec. ELMo uses a deep representation by incorporating internal representations of the LSTM network, therefore capturing the meaning and syntactical aspects of words. Since ELMo is based on a language model, each token representation is a function of the entire input sentence, which can overcome the limitations of previous word embeddings where each word is usually modeled as an average of their multiple contexts. ELMo embeddings provide a better understanding of the contextual meaning of a word, as opposed to traditional word embeddings that are not only context-independent but have a very limited definition of context.

[Table of Contents]

RNN, CNN, or Self-Attention?

In the course of writing this REVIEW and in my other readings I often encountered discussions of RNN vs. CNN. vs. self-attention architectures in regard to NLP and language models. Here, I summarize some of those observations; green-colored URL are internal hyperlinks to discussions of those items, elsewhere in this REVIEW.

  • Dissecting Contextual Word Embeddings: Architecture and Representation discussed contextual word representations derived from pre-trained bidirectional language models (biLM), showing that Deep biLM learned a rich hierarchy of contextual information that was captured in three disparate types of network architectures: LSTM, CNN, or self attention. In every case, the learned representations represented a rich hierarchy of contextual information throughout the layers of the network.

  • Recently, non-recurrent architectures (convolutional; self-attentional) have outperformed RNN in neural machine translation. CNN and self-attentional networks can connect distant words via shorter network paths than RNN, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument has not been tested empirically, nor have alternative explanations for their strong performance been explored in-depth. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures hypothesized that the strong performance of CNN and self-attentional networks could also be due to their ability to extract semantic features from the source text, and they evaluated RNN, CNN and self-attention networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Their experimental results showed that self-attentional networks and CNN do not outperform RNN in modeling subject-verb agreement over long distances, and that self-attentional networks perform distinctly better than RNN and CNN on word sense disambiguation.

  • Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum discussed LSTM vs. self attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTM: LSTM are a hybrid of a simple RNN (S-RNN) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicated that LSTM suffer little to no performance loss when removing the S-RNN. This provided evidence that the gating mechanism was doing the heavy lifting in modeling context. They further ablated the recurrence in each gate and found that this incurred only a modest drop in performance, indicating that the real modeling power of LSTM stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allowed them to mathematically relate LSTM and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enabled the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.

  • Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral difference between convolutional and recurrent neural networks, they generated adversarial examples to confuse the model and compare to human performance. They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation function and formed majority-vote ensembles of the nine models with the highest validation accuracy. The RNN-LSTM ensemble achieved a new state of the art that more than five percentage points above the previous best result, and the RNN-LSTM aggregation function was superior to aggregation via CNN, improving the validation accuracy by 1.5 percentage points. In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

  • The architecture proposed in QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension did not require RNN: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD1.1 dataset their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data. However, another model, Reinforced Mnemonic Reader for Machine Reading Comprehension performed as well as QANet. Reinforced Mnemonic Reader, based on a Bi-LSTM, is an enhanced attention reader – suggesting that the improvements in these models is due to the attention mechanisms, vs. RNN or CNN architectures.

  • While RNN are a cornerstone in learning latent representations from long text sequences, a purely convolutional and deconvolutional autoencoding framework may be employed, as described in Deconvolutional Paragraph Representation Learning. That paper addressed the issue that the quality of sentences during RNN-based decoding (reconstruction) decreased with the length of the text. Compared to RNN, their framework was better at reconstructing and correcting long paragraphs. Note Table 1 in their paper (showing paragraphs reconstructed by LSTM and CNN, as well as the vastly superior BLEU/ROUGE scores in Table 2); there is also additional NLP-related LSTM vs. CNN discussion in this Hacker News thread.

  • Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition compared the use of LSTM based and CNN based character level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus showed that the use of either type of character level word embeddings in conjunction with the BiLSTM-CRF models led to comparable state of the art performance. However, the models using CNN based character level word embeddings had a computational performance advantage, increasing training time over word based models by 25% while the LSTM based character level word embeddings more than doubled the required training time.

  • Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers studied the role of linguistic context in predicting quantifiers (“few”, “all”). Overall, LSTM were the best-performing architectures, with CNN showing some potential in the handling of longer sequences.

[Table of Contents]

Question Answering and Reading Comprehension

Question answering (QA), the identification of short accurate answers to users questions presented in natural language, has numerous applications in the biomedical and clinical sciences including directed search, interactive learning and discovery, clinical decision support, and recommendation. Due to the large size of the biomedical literature and a lack of efficient searching strategies, researchers and medical practitioners often struggle to obtain available information available that is necessary for their needs. Moreover, even the most sophisticated search engines are not intelligent enough to interpret clinicians questions. Thus, there is an urgent need for information retrieval systems that accepts queries in natural language and returns accurate answers quickly and efficiently.

Question answering (a natural language understanding problem) and reading comprehension (the task of answering a natural language question about a paragraph) are of considerable interest in NLP motivating, for example, Human-Computer Question Answering Competition (in the NIPS 2017 Competition Track) and in the BioNLP domain the BioASQ Challenge. Unlike generic text summarization, reading comprehension systems could answer targeted questions about specific documents, efficiently extracting facts and insights ([How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks)(

The Stanford Question Answering Dataset (SQuAD: developed at Stanford University) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment or span of text from the corresponding reading passage (or, the question might be unanswerable). There has been a rapid progress on the SQuAD dataset, and early in 2018 engineered systems started achieving and surpassing human level accuracy on the SQuAD1.1 Leaderboard  (discussion: AI Outperforms Humans in Question Answering: Review of three winning SQuAD systems). SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written by crowdworkers to look adversarially similar to answerable ones (ACL 2018 Best Short Paper: Know What You Don’t Know: Unanswerable Questions for SQuAD;  [project/code]). To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Carnegie Mellon University/Google Brain’s QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension”  [OpenReviewdiscussion/TensorFlow implementationcode] proposed a method (QANet) that did not require RNN: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD dataset (SQuAD1.1: see the leaderboard), their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data. More significantly to this discussion, on the adversarial SQuAD test set (Adversarial Examples for Evaluating Reading Comprehension Systems) QANet achieved significantly improved F1 scores (a F1 score is the harmonic mean of precision and recall) compared to BiDAF and other models (their Table 6), demonstrating the robustness of QANet to adversarial examples.

Another model, Reinforced Mnemonic Reader for Machine Reading Comprehension  [non-author implementations: MnemonicReader | MRC | MRC-models] performed as well as QANet (compared in the QANet paper, QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension), outperforming previous systems by over 6% in terms of both Exact Match (EM) and F1 metrics on two adversarial SQuAD datasets. Reinforced Mnemonic Reader, based on Bi-LSTM, is an enhanced attention reader with two main contributions: a reattention mechanism, introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures; and, a dynamic-critical reinforcement learning approach, to address the convergence suppression problem that exists in traditional reinforcement learning methods.

IBM Research recently introduced a new dataset for reading comprehension (DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension)  [projectdata]. DuoRC is a large scale reading comprehension (RC) dataset of 186K human-generated QA pairs created from 7,680 pairs of parallel movie plots taken from Wikipedia and IMDb. By design, DuoRC ensures very little or no lexical overlap between the questions created from one version and segments containing answers in the other version. Essentially, this is a paraphrase dataset, that should be very useful for training reading comprehension models. For example, the authors observed that state of the art neural reading comprehension models that achieved near human performance on the SQuAD dataset exhibited very poor performance on the DuoRC dataset (F1 scores of 37.42% on DuoRC vs. 86% on SQuAD), opening research avenues wherein DuoRC could complement other RC datasets to explore novel neural approaches for studying language understanding. Likewise, DuoRC (introduced in April 2018) could be useful in sentence embedding approaches to natural language tasks (machine translation; document classification; sentiment analysis; etc.); for example, the Conclusions in Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition (August 2018) stated: “The main limitation of the current work is that there are insufficient paraphrase sentences for training the models.

In a very thorough and thoughtful analysis, Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension  [code] proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral difference between convolutional and recurrent neural networks, they generated adversarial examples to confuse the model and compare to human performance. Highlights from this work are [substantially] paraphrased here:

  • They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation function and formed majority-vote ensembles of the nine models with the highest validation accuracy.

  • All the hierarchical single and ensemble models outperformed the previous state of the art on both the validation and test sets. With a test accuracy of 85.12, the RNN-LSTM ensemble achieved a new state of the art that is more than five percentage points above the previous best result. Furthermore, the RNN-LSTM aggregation function is superior to aggregation via CNNs, improving the validation accuracy by 1.5 percentage points.

  • The hierarchical structure was crucial for the model’s success. Adding it to the CNN that operates only at word level caused a pronounced improvement on the validation set. It seems to be the case that the hierarchical structure helps the model to gain confidence, causing more models to make the correct prediction.

  • The sentence attention allowed them to get more insight into the models’ inner state. For example, it allowed them to check whether the model actually focused on relevant sentences in order to answer the questions. Both model variants [CNN; RNN-LSTM] paid most attention to the relevant plot sentences for 70% of the cases. Identifying the relevant sentences was an important success factor: relevant sentences were ranked highest only in 35% of the incorrectly solved questions.

  • Textual entailment was required to solve 60% of the questions …

  • The process of elimination and heuristics proved essential to solve 44% of the questions …

  • Referential knowledge was presumed in 36% of the questions …

  • Furthermore, it was apparent that many questions expected a combination of various reasoning skills.

  • In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

  • Finally, their intensive analysis on the differences between the model and human inference suggest that both models seem to learn matching patterns to select the right answer rather than performing plausible inferences as humans do. The results of these studies also imply that other human like processing mechanism such as referential relations, implicit real world knowledge, i.e., entailment, and answer by elimination via ranking plausibility Hummel and Holyoak, 2005 should be integrated in the system to further advance machine reading comprehension.

Collectively, those results indicate the difficulty in achieving robust reading comprehension, and the need to develop new models that understand language more precisely. Addressing this challenge will require employing more difficult datasets (like SQuAD2.0) for various tasks, evaluation metrics that can distinguish real intelligent behavior from shallow pattern matching, and the development of more sophisticated models that understand language at a deeper level.

  • The need for more challenging datasets is echoed in Sebastian Ruder’s ACL 2018 Highlights summary, in the “Creating harder datasets” subsection:

    In order to evaluate under such settings, more challenging datasets need to be created. Yejin Choi argued during the RepL4NLP panel discussion (a summary can be found here) that the community pays a lot of attention to easier tasks such as SQuAD or bAbI, which are close to solved. Yoav Goldberg even went so far as to say that “SQuAD is the MNIST of NLP”. Instead, we should focus on solving harder tasks and develop more datasets with increasing levels of difficulty. If a dataset is too hard, people don’t work on it. In particular, the community should not work on datasets for too long as datasets are getting solved very fast these days; creating novel and more challenging datasets is thus even more important. Two datasets that seek to go beyond SQuAD for reading comprehension were presented at the conference:

    QAngaroo (Constructing Datasets for Multi-hop Reading Comprehension Across Documents  [project]) focuses on reading comprehension that requires to gather several pieces of information via multiple steps of inference.

    NarrativeQA (The NarrativeQA Reading Comprehension Challenge  [GitHub]) requires understand of an underlying narrative by asking the reader to answer questions about stories by reading entire books or movie scripts.

    Richard Socher also stressed the importance of training and evaluating a model across multiple tasks during his talk during the Machine Reading for Question Answering workshop. In particular, he argues that NLP requires many types of reasoning, e.g. logical, linguistic, emotional, etc., which cannot all be satisfied by a single task.

  • Read + Verify: Machine Reading Comprehension with Unanswerable Questions proposed a novel read-then-verify system that combined a base neural reader with a sentence-level answer verifier trained to further validate if the predicted answer is entailed by input snippets. They also augmented their base reader with two auxiliary losses to better handle answer extraction and no-answer detection respectively, and investigated three different architectures for the answer verifier. On the SQuAD2.0 dataset their system achieved a F1 score of 74.8 on the development set (ca. August 2018), outperforming the previous best published model by more than 7 points (and the best reported model by ~3.5 points (2018-08-20: SQuAD2.0 Leaderboard).

In addition to SQuAD2.0 and DuoRC, other recent datasets related to question-answering and reasoning include:

  • Facebook AI Research’s bAbI project;  [code], a set of 20 tasks for testing text understanding and reasoning described in detail in the paper Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks;

  • University of Pennsylvania’s MultiRC: Reading Comprehension over Multiple Sentences;  [projectcode], a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph, the goal of this dataset is to encourage the research community to explore approaches that can do more than sophisticated lexical-level matching.

  • Allen Institute for Artificial Intelligence (AI2)’s Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge;  [projectcode], which presented a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI (Stanford Natural Language Inference Corpus). As noted in their Conclusions:

    “Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods. To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, and whose Challenge partition is hard for retrieval and co-occurence methods. We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD. Progress on ARC would thus be an impressive achievement, given its design, and be significant step forward for the community.

    • ARC was recently used in Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Scientific Question Answering by authors at UC San Diego and Microsoft AI+R. Existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. In this paper, the authors proposed a retriever-reader model that learned to attend on [via self-attention layers] essential terms during the question answering process via an essential-term-aware “retriever” which first identified the most important words in a question, then reformulated the queries and searches for related evidence, and an enhanced “reader” to distinguish between essential terms and distracting words to predict the answer. On the ARC dataset their model outperformed the existing state of the art [e.g. BiDAF] by 8.1%.

Among the many approaches to QA applied to textual sources, the attentional long short-term memory (LSTM)-based and Bi-LSTM, memory-based implementations of Richard Socher (SalesForce) are particularly impressive:

  • Ask Me Anything: Dynamic Memory Networks for Natural Language Processing  (note Fig. 3) introduced the dynamic memory network (DMN), a neural network architecture that processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process that allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. [For an good overview of the DMN approach, see slides 39-47 in Neural Architectures with Memory.]

  • Based on analysis of DMN (above), Richard Socher/MetaMind (later acquired by SalesForce) proposed several improvements to the DMN memory and input modules; the new DMN+ model (Dynamic Memory Networks for Visual and Textual Question Answering;  [discussion]) improved the state of the art on visual and text question answering datasets, without supporting fact supervision. Non-author DMN+ code available on GitHub includes a Theano (Improved-Dynamic-Memory-Networks-DMN-plus) and TensorFlow (Dynamic-Memory-Networks-in-TensorFlow) implementations.

  • Dynamic Coattention Networks for Question Answering;  [non-author code, on SQuAD2.0] introduced the Dynamic Coattention Network (DCN) for QA. DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both, then a dynamic pointing decoder iterates over potential answer spans. This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers. On the Stanford question answering dataset, a single DCN model improved the previous state of the art from 71.0% F1 to 75.9%, while a DCN ensemble obtained a 80.4% F1 score.

  • Efficient and Robust Question Answering from Minimal Context Over Documents studied the minimal context required to answer a question and found that most questions in existing datasets can be answered with a small set of sentences. The authors proposed a simple sentence selector to select the minimal set of sentences to feed into the QA model, which allowed the system to achieve significant reductions in training (up to 15 times) and inference times (up to 13 times), with accuracy comparable to or better than the state of the art on SQuAD, NewsQA, TriviaQA and SQuAD-Open. Furthermore, the approach was more robust to adversarial inputs.

Understandably, Socher’s work has generated much interest among NLP and ML practitioners, leading to the acquisition of his startup, MetaMind, by Salesforce for $32.8 million in 2016 (Salesforce Reveals It Spent $75 Million on the Three Startups It Bought Last Quarter | Salesforce just bought a machine learning startup that was backed by its CEO Marc Benioff). While those authors will not release the code (per a comment by Richard Socher on reddit), using the search term “1506.07285” there appears to be four repositories on GitHub that attempt to implement his Ask Me Anything: Dynamic Memory Networks for Natural Language Processing paper, while a GitHub search for “dynamic memory networks” or “DMN+” gives numerous repositories.

Aside: in this April 2018 60 Minutes interview, SalesForce CEO Marc Benioff speaks passionately of women’s rights, gender equality, and other issues that speak very highly of the stewardship of that organization, and hence their influence in the NLP domain via Socher and his colleagues (since the DeepMind acquisition by SalesForce, Socher continues to regularly publish highly impressive work in this domain).

A system that contemporaneously surpassed Socher’s highly performing DMN+ model on the bAbI dataset was the MemN2N architecture, described by Jason Weston (Facebook AI Research) in his highly-cited End-To-End Memory Networks paper (see the “E2E column” in Table 2 in that paper)  [code;  non-author code herehere and here;  discussion here and here]. MemN2N, a recurrent attention model over a possibly large external memory, was trained end-to-end and hence required significantly less supervision during training, making it more generally applicable in realistic settings. The flexibility of the MemN2N model allowed the authors to apply it to tasks as diverse as synthetic question answering (QA) and language modeling (LM). For QA the approach was competitive with memory networks but with less supervision; for LM their approach demonstrated performance comparable to RNN and LSTM on the Penn TreeBank and Text8 datasets.

A precaution with high-performing but heavily engineered systems is domain specificity: How well do those models transfer to other applications? I encountered this issue in my PRELIMINARY WORK (not shown) when I carefully examined the Turku Event Extraction System (TEES 2.2: Biomedical Event Extraction for Diverse Corpora; discussed below). TEES preforms well but was heavily engineered to perform well in the various BioNLP Challenge tasks in which it participated. Likewise, a June 2018 comment in the AllenNLP GitHub repository, regarding end-to-end memory networks, is of interest:

“Why are you guys not using Dynamic Memory Networks in any of your QA solutions? → I’m not a huge fan of the models called “memory networks” - in general they are too tuned to a completely artificial task, and they don’t work well on real data. I implemented the “end-to-end memory network”, for instance, and it has three separate embedding layers (which is absolutely absurd if you want to apply it to real data). @DeNeutoy implemented the DMN+. It’s not as egregious as the E2EMN [end-to-end memory network], but still, I’d look at actual papers, not blogs, when deciding what methods actually work. E.g., are there any memory networks on the SQuAD leaderboard ( On the TriviaQA leaderboard? On the leaderboard of any recent, popular dataset? To be fair, more recent “memory networks” have modified their architectures so they’re a lot more similar to things like the gated attention reader, which has actually performed well on real data. But, it sure seems like no one is using them to accomplish state of the art QA on real data these days.”

I believe that the “gated attention reader” mentioned in that comment (above) refers to Gated-Attention Readers for Text Comprehension [code] by Ruslan Salakhutdinov and colleagues. That work, in turn, employed an attention mechanism that was introduced into the NLP domain in 2014 by Yoshua Bengio and colleagues in their very highly cited Neural Machine Translation by Jointly Learning to Align and Translate paper. Their model consisted of a forward and backward pair of RNN (BiRNN) for the encoder, and a decoder that emulated searching through a source sentence while decoding a translation. The use of a BiRNN allowed the annotation of each word to summarize not only the preceding words, but also the following words. For the activation function of a RNN, they used a gated hidden unit, an alternative to the conventional simple units such as an element-wise tanh. This gated unit is similar to a LSTM unit, sharing with it the ability to better model and learn long-term dependencies. This mechanism calculated how much attention the network should give to each source word to generate a specific translated word. The context vector calculated by the attention mechanism mimics the syntactic skeleton of the input sentence precisely given a sufficient number of examples. Recent work (Inducing Grammars with and for Neural Machine Translation) suggests that incorporating explicit syntax alleviates the burden of modeling grammatical understanding and semantic knowledge from the model. By letting the decoder have an attention mechanism, Bengio an colleagues relieved the encoder from the burden of having to encode all information in the source sentence into a fixed length vector; with this new approach, the information could be spread throughout the sequence of annotations, which could be selectively retrieved by the decoder accordingly.

In 2015 the Allen Institute for Artificial Intelligence launched a Kaggle contest on automated question answering, The Allen AI Science Challenge: Is Your Model Smarter Than an 8th Grader?. One of the top-placing teams, 23rd among 799 teams on the Kaggle Leaderboard for this competition, implemented a Dynamic Memory Network based on Ask Me Anything: Dynamic Memory Networks for Natural Language Processing; they described their solution in their blog post Implementing Dynamic Memory Networks, as well as a follow-on post, Playground for bAbI tasks  [code].

The Allen Institute for Artificial Intelligence’s Bidirectional Attention Flow for Machine Comprehension introduced the Bi-Directional Attention Flow (BIDAF) framework (projectcodedemo], a multi-stage hierarchical process that represented context at different levels of granularity and used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization. That open sourced work/code was subsequently used in QA4IE: A Question Answering based Framework for Information Extraction  (note Table 7; code], a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, along with a knowledge base (Wikipedia Ontology) for entity recognition. Note also that QA4IE processed entire documents as a whole, instead of separately processing individual sentences. QA4IE was outperformed by BiDAF on the SQuAD dataset, because QA4IE was designed to produce sequence answers in IE settings. Conversely, QA4IE outperformed QA systems across 6 datasets in IE settings (refer to Tables 3 and 4 in QA4IE: A Question Answering based Framework for Information Extraction). A major difference between QA settings and IE settings is that in QA settings each query corresponds to an answer, while in the QA4IE framework the QA model take a candidate entity-relation (or entity-property) pair as the query and it needs to tell whether an answer to the query can be found in the input text.

In other work relating to Bi-LSTM-based question answering, IBM Research and IBM Watson published a paper, Improved Neural Relation Detection for Knowledge Base Question Answering, which focused on relation detection via deep residual Bi-LSTM networks to compare questions and relation names. The approach broke the relation names into word sequences for question-relation matching, built both relation level and word level relation representations, used deep BiLSTMs to learn different levels of question representations in order to match the different levels of relation information, and finally used a residual learning method for sequence matching. This made the model easier to train and resulted in more abstract (deeper) question representations, thus improving hierarchical matching. Several non-Microsoft implementations are available on GitHub (machine-comprehension; machine-reading-comprehension; and most recently, MSMARCO).

Making Neural QA as Simple as Possible but not Simpler introduced “FastQA,” a simple, context/type matching heuristic for extractive question answering. The paper posited that two simple ingredients are necessary for building a competitive QA system: awareness of the question words while processing the context, and a composition function (such as recurrent neural networks) which goes beyond simple bag-of-words modeling. In follow-on work, these authors applied FastQA to the biomedical domain (Neural Domain Adaptation for Biomedical Question Answering;  [code]). Their system – which did not rely on domain-specific ontologies, parsers or entity taggers – achieved state of the art results on factoid questions, and competitive results on list questions.

A recent review, Comparative Analysis of Neural QA Models on SQuAD, reported that models based on a gated attention mechanism (“R-Net ”), or a GRU (“DocQA”), performed well across a variety of tasks. Microsoft Research recently published S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension, a novel approach to machine reading comprehension for the MS-MARCO dataset that aimed to answer a question from multiple passages via an extraction-then-synthesis framework to synthesize answers from extraction results. Unlike the SQuAD dataset that aims to answer a question with exact text spans in a passage, the MS-MARCO dataset defines the task as answering a question from multiple passages and the words in the answer are not necessary in the passages. The Microsoft Research approach employed bidirectional gated recurrent units (BiGRU) instead of RNN (Bi-LSTM). The answer extraction model was first employed to predict the most important sub-spans from the passage as evidence, and the answer synthesis model took the evidence as additional features along with the question and passage to further elaborate the final answers. They built the answer extraction model for single passage reading comprehension, and proposed an additional task of ranking the single passages to help in answer extraction from multiple passages. An earlier paper from Microsoft Research (Hierarchical Attention Networks for Document Classification;  [non-author code here and here] also employed a (bidirectional) GRU.

The examples shown in Table 1 in the Microsoft Research paper (S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension) are highly relevant to this REVIEW: the evidence passages used to synthesize the answers are presented, making this less of a “black box” approach and a potentially useful approach to evidence based biomedical question answering.

Likewise (regarding evidence based answering), textual entailment with neural attention methods could also be applied; for example, as described in DeepMind’s Reasoning about Entailment with Neural Attention  [non-author code here and here].

Dynamic Coattention Networks for Question Answering  [non-author code: uses SQuAD2.0] introduced the Dynamic Coattention Network (DCN) for QA. DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both; then, a dynamic pointing decoder iterates over potential answer spans. This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers. On the Stanford question answering dataset, a single DCN model improved the previous state of the art from 71.0% F1 to 75.9%, while a DCN ensemble obtained a 80.4% F1 score.

The Natural Language Decathlon: Multitask Learning as Question Answering  [codeproject] by Richard Socher and colleagues at SalesForce described a NLP challenge spanning 10 tasks (question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution). Many of those tasks are relevant to querying and processing output from my PROPOSED textual knowledge store. In that “decaNLP” paper Socher and colleagues also introduced a new Multitask Question Answering Network (MQAN) model [code here and here] that jointly learned all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. The MQAN model takes in a question and context document, encodes both with a Bi-LSTM, uses dual coattention to condition representations for both sequences on the other, compresses all of this information with another two Bi-LSTMs, applies self-attention to collect long-distance dependency, and then uses a final two Bi-LSTMs to get representations of the question and context. The multi-pointer-generator decoder uses attention over the question, context, and previously outputted tokens to decide whether to copy from the question, copy from the context, or generate the answer from a limited vocabulary. MQAN showed improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification.

Robust and Scalable Differentiable Neural Computer for Question Answering  [code] is designed as a general problem solver which can be used in a wide range of tasks. Their GitHub repository contains a implementation of an Advanced Differentiable Neural Computer (ADNC) for a more robust and scalable usage in question answering. The ADNC was applied to the 20 bAbI QA tasks, with state of the art mean results, and to the CNN Reading Comprehension Task with passable results without any adaptation or hyper-parameter tuning. Coauthor Jörg Franke’s Master’s Thesis contains additional detail.

(Studio Ousia’s Quiz Bowl Question Answering System  [commercial enterprise, no code release; slidesmedia] extended their earlier model (Fig. 2 in Representation Learning of Entities and Documents from Knowledge Base Descriptions) for question answering, as shown in Fig. 2 in this paper (Studio Ousia’s Quiz Bowl Question Answering System). This latter work very convincingly won the Human-Computer Question Answering Competition (HCQA) at NIPS 2017: Studio Ousia’s system scored more than double the combined human team score (465 to 200 points).

  • A minor critique of Studio Ousia’s Quiz Bowl Question Answering System: this paper states “The representations of words and entities are trained by jointly optimizing the following three sub-models: … [and] the knowledge base graph model, which learns to estimate neighboring entities given the target entity in the internal link graph between entities in Wikipedia.” However, I believe that they are referring to the embedded vector representations drawn from Wikipedia (including it’s base graph model), in their Wikipedia2Vec pretrained embeddings.

  • Aside: the ephemeral “Wikipedia knowledge graph” is mentioned/used in Wikipedia Knowledge Graph with DeepDive (the DeepDive project is now deprecated in favor of Snorkel. The Wikipedia knowledge graph is also mentioned in a Neo4j blog post, Building the Wikipedia Knowledge Graph in Neo4j. While I have not worked with those data, from various descriptions it appears to be a graph implicit in the Wikipedia data/API access – the DBpedia graph, more or less..

The highly-performing embedding approach described in Studio Ousia’s Quiz Bowl Question Answering System was very impressive, given that it could “reason” over passages such as the one shown in Table 1 in that paper. Pretrained on Wikipedia, it may be reasonably suitable to my work in the biomedical domain, since Wikipedia contains much biochemical, biomedical, genetics/genomics, clinical and other life sciences material.

Discussed elsewhere in this REVIEW, Question Answering by Reasoning Across Documents with Graph Convolutional Networks introduced a method (Entity-GCN) which reasons on information spread within/across documents, framed as an inference problem on a graph. Their approach differed from BiDAF and FastQA, which merely concatenate all documents into a single long text and train a standard reading comprehension model. Instead, they framed question answering as an inference problem on a graph representing the document collection.

[Table of Contents]

Natural Language Inference

Natural language inference (NLI), also known as “recognizing textual entailment” (RTE), is the task of identifying the relationship (entailment, contradiction, and neutral) that holds between a premise p (e.g. a piece of text) and a hypothesis h. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI) Corpus, contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of NLI. A newer, Multi-Genre Natural Language Inference (MultiNLI) corpus is also available: a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The MultiNLI corpus is modeled on the SNLI corpus, but differs in that covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. NLI was one of the 10 tasks proposed in The Natural Language Decathlon: Multitask Learning as Question Answering  [code; project], a NLP challenge spanning 10 tasks introduced by Richard Socher and colleagues at Salesforce.

Google’s (A Decomposable Attention Model for Natural Language Inference likewise proposed a simple neural architecture for natural language inference that used attention to decompose the problem into subproblems that could be solved separately, thus making it trivially parallelizable. Their use of attention was purely based on word embeddings, essentially consisting of feed-forward networks that operated largely independently of word order. On the Stanford Natural Language Inference (SNLI) dataset, they obtained state of the art results with almost an order of magnitude fewer parameters than previous work, without relying on any word-order information. The approach outperformed considerably more complex neural methods aiming for text understanding, suggesting that – at least for that task – pairwise comparisons are relatively more important than global sentence-level representations. That said, that same model (plus also one based on a Bi-LSTM-based single sentence-encoding model without attention, and a hybrid TreeLSTM-based and Bi-LSTM-based model with an inter-sentence attention mechanism to align words across sentences) all performed poorly on a new NLI test set (Breaking NLI Systems with Sentences that Require Simple Lexical Inferences). The new examples are simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set was substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences. That finding recalls my earlier discussion on adversarial challenges to BiDAF/SQuAD-based QA.

  • Regarding NLI and attention, Interpreting Recurrent and Attention-Based Neural Models: a Case Study on Natural Language Inference examined explaining deep learning based models through a case study on a popular neural model for NLI. They proposed interpreting intermediate layers of NLI models by visualizing the saliency of attention and LSTM gating signals, and presented several examples for which their methods were able to reveal interesting insights and identify the critical information contributing to the model decisions.

At the sentence level, InferSent from Facebook AI Research in Supervised Learning of Universal Sentence Representations from Natural Language Inference Data  [code;  discussion: A Walkthrough of InferSent - Supervised Learning of Sentence Embeddings, and reddit] showed how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference (SNLI) datasets could consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks, indicating the suitability of NLI for transfer learning to other NLP tasks. InferSent is an interesting approach by the simplicity of its architecture: it uses the SNLI dataset (a set of of 570K pairs of sentences labeled with 3 categories: neutral, contradiction and entailment; to train a classifier on top of a sentence encoder. Both sentences are encoded using the same encoder (a Bi-LSTM with a max-pooling operator), while the classifier is trained on a pair representation constructed from the two sentence embeddings.

  • Aside: for a long time, supervised learning of sentence embeddings was thought to give lower-quality embeddings than unsupervised approaches but this assumption has recently been overturned, in part following the publication of the “InferSent ” paper, which stated:

    “Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks.”

  • InferSent is an interesting approach by the simplicity of its architecture, a bi-directional LSTM complete with a max-pooling operator as sentence encoder. InferSent uses the Sentence Natural Language Inference dataset (a set of of 570k pairs of sentences labeled with 3 categories: neutral, contradiction and entailment) to train the classifier on top of the sentence encoder. Both sentences are encoded using the same encoder while the classifier is trained on a pair representation constructed from the two sentence embeddings.

Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture  [code] yielded state of the art results for SNLI sentence encoding-based models and the SciTail dataset, and provided strong results for the MultiNLI dataset. [The SciTail dataset is an NLI dataset created from multiple-choice science exams consisting of 27k sentence pairs. Each question and the correct answer choice have been converted into an assertive statement to form the hypothesis.] The sentence embeddings could be utilized in a wide variety of transfer learning tasks, outperforming InferSent on 7/10 and SkipThought on 8/9 SentEval sentence embedding evaluation tasks. Furthermore, their model beat the InferSent in 8/10 recently published SentEval probing tasks designed to evaluate the ability of sentence embeddings to capture some of the important linguistic properties of sentences.

“The success of the proposed hierarchical architecture raises a number of additional interesting questions. First, it would be important to understand what kind of semantic information the different layers are able to capture. Second, a detailed and systematic comparison of different hierarchical architecture configurations, combining BiLSTM and max pooling in different ways, could lead to even stronger results, as indicated by the results we obtained on the SciTail dataset with the modified 4-layered model. Also, as the sentence embedding approaches for NLI focus mostly on the sentence encoder, we think that more should be done to study the classifier part of the overall NLI architecture. There is not enough research on classifiers for NLI and we hypothesize that further improvements can be achieved by a systematic study of different classifier architectures, starting from the way the two sentence embeddings are combined before passing on to the classifier.”

Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. Bridging Knowledge Gaps in Neural Entailment via Symbolic Models focused on filling these knowledge gaps in the Science Entailment task, by leveraging an external structured knowledge base (KB) of science facts. Their architecture (NSnet) combined standard neural entailment models with a knowledge lookup module. To facilitate this lookup, they proposed a fact-level decomposition of the hypothesis, and verifying the resulting sub-facts against both the textual premise and the structured KB. NSnet learned to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperformed a simpler combination of the two predictions by 3% and the base entailment model by 5%.

[Table of Contents]


The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time (Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning). Even within the more focused health, technical and scientific domains we face a never ending, daily onslaught of new information and knowledge from which we must filter out the non-relevant information, seeking to retain (or hoping to find again) knowledge that is relevant to us.

Information overload is characterized by the difficulty of understanding an issue and effectively making decisions when one has too much information about that issue. In our infocentric world, we have an increasing dependency on relevant, accurate information that is buried in the avalanche of continuously generated information. Coincident with information overload is the phenomenon of attention overload: we have limited attention and we’re not always sure where to direct it. It can be difficult to limit how much information we consume when there’s always something new waiting for a click; before we know it, an abundance of messy and complex information has infiltrated our minds. If our processing strategies don’t keep pace, our online explorations create strained confusion instead of informed clarity. Hence, More information is not necessarily better. [For more discussion, see the first part of the blog post Information Overload, Fake News, and Invisible Gorillas.]  When Choice is Demotivating: Can One Desire Too Much of a Good Thing? [pdf] discussed findings from 3 experimental studies that starkly challenged the implicit assumption that having more choices is more intrinsically motivating than having fewer choices. Those experiments, which were conducted in both field and laboratory settings, showed that people are more likely to make purchases or undertake optional classroom essay assignments when offered a limited array of 6 choices, rather than a more extensive array of 24 or 30 choices. Moreover, participants reported greater subsequent satisfaction with their selections and wrote better essays when their original set of options had been limited.

Information overload is a long-standing issue: in her 2010 book Too Much To Know: Managing Scholarly Information before the Modern Age, Harvard Department of History Professor Ann Blair argued that the early modern methods of selecting, summarizing, sorting, and storing text (the 4S’s) are at the root of the techniques we use today in information management. The construction of a well-crafted biomedical textual knowledge store (TKS) – with a focus on high quality, high impact material – partly addresses the issue of information overload. TKS provide a curated source of preselected and stored textual material, upon which programmatic approaches such as text mining and data visualization provide a more focused, deeper understanding. The application of advanced NLP and ML methods applied to TKS will assist the processing (e.g. clustering; dimensionality reduction; ranking; summarization; etc.) and understanding of PubMed/PubMed Central (PM/PMC) and other textual sources based on user-defined interests. In this regard, techniques such as active learning, analyses of user patterns and preferences could assist refined queries to external sources (PubMed and other online search engines), and the processing of those new data, in an increasingly accurate and focused iterative approach. The incorporation of vector space approaches and other “fuzzy” search paradigms could incidentally assist knowledge discovery. Algorithms and software acting as intelligent agents could automatically, autonomously and adaptively scan PM/PMC and other sources for new knowledge in multiple subject/topic areas; for example, monitoring the biomedical literature for new information relevant to genomic variants.

The new information that is retrieved from TKS may also be cross-queried against knowledge graphs to recover additional knowledge and discover new relations. The increasing availability of personalized genomic information and other personalized medical information will drive a demand for access to high quality TKS and methods to efficiently query and process those knowledge stores. Intelligently designed graphical user interfaces could allow the querying and viewing of those data (text; graphs; pathways and networks; images; etc.), per the users wishes.

[Table of Contents]

Text Classification

There is an increasing need for semisupervised and unsupervised tools that can pre-process, analyse and classify raw text to extract interpretable content; specifically, identifying topics and content-driven groupings of articles. One approach to information overload is to classify the documents, to grouping related information. Accurate document classification is a key component to ensuring quality of any digital library: the non-classification of documents impedes systems and hence users from finding useful information.

The Word Mover’s Distance (WMD), introduced in From Word Embeddings To Document Distances [codetutorialtutorial: Python implementation with WMD paper coauthor Matt Kusner)], a novel distance function between text documents, is based on results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. The WMD distance measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of another document. Although two documents may not share any words in common, WMD can still measure the semantical similarity by considering their word embeddings, while other bag-of-words or term frequency-inverse document frequency (TF-IDF) methods only measure the similarity by the appearance of words. The WMD metric has no hyperparameters and is straight-forward to implement. Furthermore, on eight real world document classification data sets, in comparison with seven state-of-the-art baselines, the WMD metric demonstrated unprecedented low k-nearest neighbor document classification error rates.

In the biomedical domain, Bridging the Gap: Incorporating a Semantic Similarity Measure for Effectively Mapping PubMed Queries to Documents presented a query-document similarity measure motivated by WMD that relied on neural word embeddings to compute the distance between words. Unlike other similarity measures, their (WMD) method relied on neural word embeddings to compute the distance between words, which helped identify related words when no direct matches were found between a query and a document (e.g., as shown in Fig. 1 in that paper).

Aside: these authors also recently published PubMed Phrases, an Open Set of Coherent Phrases for Searching Biomedical Literature, a set of 705,915 PubMed Phrases.

In (Representation Learning of Entities and Documents from Knowledge Base Descriptions  [code] the authors described TextEnt, a neural network model that learned distributed representations of entities and documents directly from a knowledge base. Given a document in a knowledge base consisting of words and entity annotations, they trained their model to predict the entity that the document describes and map the document and its target entity close to each other in a continuous vector space. Their model (Fig. 2 in that paper) was trained using a large number of documents extracted from Wikipedia. TextEnt (which performed somewhat better than their Wikipedia2Vec baseline model) used the last, fully-connected layer to classify documents into a set of pretrained classes (their Table 3) [similar to how the fully-connected layer in various ImageNet image classification models classified images into predefined categories (and for which, removing the last fully-connected layer enabled the use of those models for transfer learning, as described in the Stanford cs231n course page Transfer Learning)].

Recent word embedding methods such as word2vec (Efficient Estimation of Word Representations in Vector Space) and fastText are capable of learning semantic meaning and similarity between words in an entirely unsupervised manner using a contextual window and doing so much faster than previous methods. [For an interesting discussion, see Word2Vec and FastText Word Embedding with Gensim.] FastText, an extension to Word2Vec proposed by Facebook AI Research in 2016 (Bag of Tricks for Efficient Text Compressing Text Classification Models), is a library for efficient learning of word representations and sentence classification. The model is an unsupervised learning algorithm for obtaining vector representations for words. Whereas word2vec cannot classify text, fastText can. FastText learns word embeddings in a manner very similar to word2vec except fastText enriches word vectors with subword information using character n-grams of variable length; for example, the tri-grams for the word “apple” are “app”, “ppl”, and “ple” (ignoring the starting and ending of boundaries of words). The word embedding vector for “apple” will be the sum of all of these n-grams. These character n-grams allow the algorithm to identify prefixes, suffixes, stems, and other phonological, morphological and syntactic structure in a manner that does not rely on words being used in similar context and thus being represented in similar vector space. After training rare words can now be properly represented, since it is highly likely that some of their n-grams also appear in other words. FastText represents an out-of-vocabulary medical term as the normalized sum of the vector representations of its trigrams.

  • Regarding fastText and out-of-vocabulary (OOV) words, note that text inputted into fastText is lowercased.

  • The effectiveness of word embeddings for downstream NLP tasks is limited by OOV words, for which embeddings do not exist. Mimicking Word Embeddings using Subword RNNs  [code presented MIMICK, an approach to generating OOV word embeddings compositionally by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK did not require retraining on the original word embedding corpus; instead, learning is performed at the type level.

    Critique: while an interesting approach; it’s a shame the published paper (pdf) is rather poorly-written, particularly viz-a-viz their discussion of fastText and the presentation of the data in their Table 3, which is very confusing.

  • Sebastian Ruder also discussed OOV; re: Mimicking Word Embeddings using Subword RNNs he stated “Another interesting approach for generating OOV word embeddings is to train a character-based model to explicitly re-create pre-trained embeddings (Pinter et al., 2017). This is particularly useful in low-resource scenarios, where a large corpus is inaccessible and only pre-trained embeddings are available.”

FastText was compared to other approaches on the classification of PubMed abstracts in Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research, where it performed very well. Interestingly, the fastText embeddings learnt on the entire English Wikipedia worked very well in that task. The diverse topics covered by Wikipedia may provide a rich corpus from which to learn text semantics. In addition, Wikipedia likely contains documents related to biomedical research such that the vocabulary is not as limited compared to Freebase and GoogleNews corpora. Performance using the GoogleNews embeddings is comparable to Pubmed and Pubmed+Wiki. This result suggests that learning embeddings in a domain-specific corpus is not a requirement for success in these tasks. That conclusion was echoed in A Comparison of Word Embeddings for the Biomedical Natural Language Processing, which among its conclusions found that the word embeddings trained on biomedical domain corpora do not necessarily have better performance than those trained on other general domain corpora for any downstream biomedical NLP tasks.

A recent, probabilistic extension of fastText, Probabilistic FastText for Multi-Sense Word Embeddings  [code, discussed elsewhere in this REVIEW)] produced accurate representations of rare, misspelt, and unseen words. Probabilistic FastText outperformed both fastText, which has no probabilistic model, and dictionary-level probabilistic embeddings, which do not incorporate subword structures, on several word-similarity benchmarks, including English rare word and foreign language datasets. It also achieved state of the art performance on benchmarks that measure ability to discern different meanings. The proposed model was the first to achieve multi-sense representations while having enriched semantics on rare words.

Work extending the concept of word embeddings to sentence, paragraph and document embeddings was introduced by Quoc V. Le and Tomas Mikolov (Google) as Paragraph Vectors in Distributed Representations of Sentences and Documents  [media: A gentle introduction to Doc2Vec], commonly known as doc2vec. There has some controversy as to whether doc2vec could outperform centroid methods, and others struggled to reproduce those results, leading Lau and Baldwin in An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation to perform an extensive comparison between various document embedding methods across different domains. They found that doc2vec performed robustly when using models trained on large external corpora, and could be further improved by using pretrained word embeddings. The general consensus was that different methods are best suited for different tasks; for example, centroids performed well on tweets, but are outperformed on longer documents. [For good discussions of various approaches, see Document Embedding and The Current Best of Universal Word Embeddings and Sentence Embeddings.]

  • An independent implementation for vector representation of documents, Doc2VecC (Efficient Vector Representation for Documents through Corruption; [code]) represented each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. Doc2VecC produced significantly better word embeddings than word2vec: the simple model architecture introduced by Doc2VecC matched or out-performed the state of the art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enabled training on billions of words per hour on a single machine; at the same time, the model was very efficient in generating representations of unseen documents at test time.

Doc2vec was used in a recent paper from Imperial College London, Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning  [code here and here], which described a methodology that brought together powerful vector embeddings from NLP with tools from graph theory (that exploited diffusive dynamics on graphs) to reveal natural partitions across scales. Their framework used doc2vec to represent text in vector form, then applied a multi-scale community detection method (Markov Stability) to partition a similarity graph of document vectors. The method allowed them to obtain clusters of documents with similar content, at different levels of resolution, in an unsupervised manner. An analysis of a corpus of 9,000 news articles showed consistent groupings of documents according to content without a priori assumptions about the number or type of clusters to be found.

In other work related to sentence embeddings, Unsupervised Learning of Sentence Representations Using Sequence Consistency from IBM Research proposed a simple yet powerful unsupervised method to learn universal distributed representations of sentences by enforcing consistency constraints on sequences of tokens, applicable to the classification of text and transfer learning. Their ConsSent model was compared to unsupervised methods (including GloVe, fastText, ELMo, etc.) and supervised methods (including InferSent, etc.) on a classification transfer task in their Table 1, where ConsSent performed very well, overall.

Aside: IBM Research also recently introduced DuoRC, a dataset for reading comprehension (RC) described in DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. DuoRC consists of 186K human-generated QA pairs of parallel movie plots, that contains very little or no lexical overlap (a paraphrase dataset, essentially).
Coincidentally (a few months later) Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition stated: “The main limitation of the current work is that there are insufficient paraphrase sentences for training the models.

Sentence embedding is an important research topic in NLP: it is essential to generate a good embedding vector that fully reflects the semantic meaning of a sentence in order to achieve an enhanced performance for various NLP tasks. Although two sentences may employ different words or different structures, people will recognize them as the same sentence as long as the implied semantic meanings are highly similar. Hence, a good sentence embedding approach should satisfy the property that if two sentences have different structures but convey the same meaning (i.e., paraphrase sentences), then they should have the same (or at least similar) embedding vectors. Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition, inspired by human language recognition, proposed the concept of “semantic coherence,” which should be satisfied for good sentence embedding methods: similar sentences should be located close to each other in the embedding space. P-thought was designed as a dual generation model, which received a single sentence as input and generated both the input sentence and its paraphrase sentence, simultaneously. Given a (sentence, paraphrase) sentence tuple, it should be possible to generate both the sentence itself and its paraphrase sentence from the representation vector of an input sentence. They employed a seq2seq structure with a gated recurrent unit (GRU) cell for the P-thought model. The encoder transformed the sequence of words of an input sentence into a fixed-sized representation vector, whereas the decoder generated the target sentence based on the given sentence representation vector. The proposed P-thought model had two decoders: when the input sentence was given, the first decoder, named “auto-decoder,” generated the input sentence as-is. The second decoder, named “paraphrase-decoder,” generated the paraphrase sentence of the input sentence.

P-thought pursued maximal semantic coherence during training. Compared to a number of baselines (their Table 3: bag of words, sent2vec, etc.) on the MS-COCO dataset, InferSent and P-thought far surpassed the other models, with P-thought slightly outperforming InferSent. In the case of P-thought with a one-layer Bi-RNN, the P-coherence value was comparable to that of InferSent (0.7454 and 0.7432, respectively); P-thought with a two layer forward RNN gave a score of 0.7899. Whereas P-thought with a two layer Bi RNN gave a much higher P-coherence score (0.9725), this was an over-training artefact. The main limitation of that current work was that there were insufficient paraphrase sentences for training the models: P-thought models with more complex encoder structures tended to overfit the MS-COCO datasets. Although this problem could be resolved by acquiring more paraphrase sentences, it was not easy for these authors to obtain a large number of paraphrase sentences. [In that regard, note my comments, above, on the DuoRC dataset, which over which *P-thought could be trained.*]

Google’s Semi-Supervised Sequence Learning  [code] used two approaches to improve sequence learning with long short-term memory (LSTM) recurrent networks. They first predicted what came next in a sequence via conventional NLP, and then used a sequence autoencoder which read the input sequence into a vector and predicted the input sequence again. These two algorithms could be used as an unsupervised pretraining step for a later supervised sequence learning algorithm. An important result from their experiments was that using more unlabeled data from related tasks in the pretraining improved the generalization (e.g. classification accuracy) of a subsequent supervised model. This was equivalent to adding substantially more labeled data, supporting the thesis that it is possible to use unsupervised learning with more unlabeled data to improve supervised learning. They also found that after being pretrained with the two approaches, LSTM are more stable and generalize better. Thus, this paper showed that it is possible to use LSTM for NLP tasks such as document classification, and that a language model or a sequence autoencoder can help stabilize the learning in LSTM. On five benchmarks, the LSTM reached or surpassed the performance levels of all previous baselines. In addition to the Google-provided code, the text-classification-models-tf repository also provides Tensorflow implementations of various text classification models. Another repository by that GitHub user, Transfer Learning for Text Classification with Tensorflow, provides a TensorFlow implementation of semi-supervised learning for text classification – an implementation of Google’s Semi-supervised Sequence Learning paper.

In Google Brain/OpenAI’s Adversarial Training Methods for Semi-Supervised Text Classification  [code | non-author codemedia], adversarial training provided a means of regularizing supervised learning algorithms while virtual adversarial training was able to extend supervised learning algorithms to the semi-supervised setting. However, both methods required making small perturbations to numerous entries of the input vector, which was inappropriate for sparse high-dimensional inputs such as one-hot word representations. The authors extended adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieved state of the art results on multiple benchmark semi-supervised and purely supervised tasks: the learned word embeddings were of higher quality, and the model was less prone to overfitting while training.

Hierarchical approaches to document classification include HDLTex: Hierarchical Deep Learning for Text Classification  [code]. Recently the performance of traditional supervised classifiers has degraded as the number of documents has increased, because accompanying the growth in the number of documents is an increase in the number of categories. This paper approached this problem differently from current document classification methods that view the problem as multiclass classification, instead performing “hierarchical classification.” Traditional multi-class classification techniques work well for a limited number classes, but performance drops with increasing number of classes, as is present in hierarchically organized documents. Hierarchical deep learning solves this problem by creating neural architectures that specialize deep learning approaches for their level of the document hierarchy. HDLTex employed stacks of deep learning architectures (RNN, CNN) to provide specialized understanding at each level of the document hierarchy. Testing on a data set of documents obtained from the Web of Science showed that combinations of RNN at the higher level and DNN or CNN at the lower level produced accuracies consistently higher than those obtainable by conventional approaches using naïve Bayes or a support vector machine (SVM).

Hierarchical Attention Networks for Document Classification  [non-author code] described a hierarchical attention network for document classification. Their model had two distinctive characteristics: a hierarchical structure that mirrored the hierarchical structure of documents, and it had two levels of attention mechanisms (word-level and sentence-level), enabling it to attend differentially to more and less important content when constructing the document representation. Experiments on 6 large scale text classification tasks demonstrated that the proposed architecture outperformed previous methods by a substantial margin. Visualization of the attention layers illustrated that the model selected qualitatively informative words and sentences.

Recent work from Stanford University (Training Classifiers with Natural Language Explanations  [codeworksheetauthor’s blog postdemo videodiscussion]) proposed a framework (BabbleLabble) for training classifiers in which annotators provided natural language explanations for each labeling decision. A semantic parser converted those explanations into programmatic labeling functions that generated noisy labels for an arbitrary amount of unlabeled data, which was used to train a classifier. On three relation extraction tasks, users were able to train classifiers with comparable F1 scores 5-100x faster by providing explanations instead of just labels. Based on F1 scores, a classifier trained with BabbleLabble achieved the same accuracy as a classifier trained only with end labels, using between 5x to 100x fewer examples. On the spouse task, 30 explanations were worth around 5,000 labels; on the disease task 30 explanations were worth around 1,500 labels; and on the protein task, 30 explanations were worth around 175 labels.

  • This project is part of the Snorkel project (a training data creation and management system focused on information extraction), the successor to the now deprecated DeepDive project. Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

[Table of Contents]

Text Summarization

Approximately 1.28 million articles were added to PubMed in 2017, including ~0.36 million full-text articles added to PubMed Central, at the rate of ~3,485 new articles per day (queried 2018-06-29; see also my blog post). Of those, ~122,381 included the word “cancer” in the title or abstract, i.e. ~335 papers/day (PubMed query 2017[dp] AND cancer[tiab] executed 2018-06-29; note the capitalized Boolean). Narrowing the search to 2017[dp] AND 'breast cancer'[tiab] or 2017[dp] AND 'acute myeloid leukemia'[tiab] returned 16,706 and 2,030 articles (45.77 and 5.56 articles/day), respectively. The following command-line query shows the numbers of PubMed publications per indicated year (queried on the indicated date: PubMed continually adds older, previously non-indexed articles):

With those data in ~/pm.dat, gnuplot -p -e 'plot "~/pm.dat" notitle' gives this plot:


Accurate text classification and summarization is needed help address, in part, the information overload arising from the enormous volume and growth of the PM/PMC biomedical literature. Existing tools for searching and classifying PM/PMC documents exist (e.g. the NCBI PubMed search engine itself, the BEST Biomedical Entity Search Tool, etc.), but what is needed are methods to refine those classifications through faceted searches over previously returned results, and visualizations. Arguably more important, given the sheer volume of the PM/PMC literature, are methods to summarize that literature.

Text summarization generally falls into one of two algorithmic approaches: extractive summarization, which summarizes text by copying parts of the input, and abstractive summarization systems, which generate new phrases (possibly rephrasing or using words that were not in the original text). Abstractive summarization tends to be more concise than extractive summarization, which also tends to be more repetitive and burdened with non-relevant text. However, extractive summarization is easier to implement, and can provide unaltered evidentiary snippets of text.

Extractive Summarization

The Word Mover’s Distance (WMD; described in the preceding subsection) was applied the extractive summarization task in Efficient and Effective Single-Document Summarizations and A Word-Embedding Measurement of Quality  (note Fig. 2). WMD uses word2vec (Tomas Mikolov et al.: Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality) as a word embedding representation method. WMD measures the dissimilarity between two documents and calculates the minimum cumulative distance to “travel” from the embedded words of one document to the other.

Data-driven Summarization of Scientific Articles [datasetsslides] likewise applied WMD to the biomedical domain, comparing that approach to others in a very interesting and revealing study on extractive and abstractive summarization. The examples presented in their Tables 4 and 5 demonstrated very clearly the differences in extractive summarization over full length articles. While the results for title generation were promising, the models struggled with generating the abstract, highlighting the necessity for developing novel models capable of efficiently dealing with long input and output sequences, while at the same time preserving the quality of generated sentences.

Ranking Sentences for Extractive Summarization with Reinforcement Learning  [codelive demo] conceptualized extractive summarization as a sentence ranking task. While many extractive summarization systems are trained using cross-entropy loss in order to maximize the likelihood of the ground-truth labels, they do not necessarily learn to rank sentences based on their importance due to the absence of a ranking-based objective. [Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1.] In this paper the authors argued that models trained on cross-entropy training are prone to generating verbose summaries with unnecessarily long sentences and redundant information. They proposed overcoming these difficulties by globally optimizing the ROUGE evaluation metric and learning to rank sentences for summary generation through a reinforcement learning objective. Their neural summarization model, REFRESH (REinFoRcement Learning-based Extractive Summarization) consisted of a hierarchical document encoder and a hierarchical sentence extractor. During training, it combined the maximum-likelihood cross-entropy loss with rewards from policy gradient reinforcement learning to directly optimize the evaluation metric relevant for the summarization task. The model was applied to the CNN and DailyMail datasets on which it outperformed baseline, state of the art extractive and abstractive systems when evaluated automatically and by humans. They showed that their global optimization framework rendered extractive models better at discriminating among sentences for the final summary, and that the state of the art abstractive systems evaluated lagged behind the extractive ones, when the latter are globally trained.

Abstractive Summarization

In Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond  [non-author code], researchers at IBM Watson and the Université de Montréal modeled abstractive text summarization using attentional encoder-decoder recurrent neural networks. That approach was extended (different authors) by Richard Socher and colleagues at SalesForce in A Deep Reinforced Model for Abstractive Summarization, which described a sophisticated, highly performant reinforcement learning-based system for abstractive text summarization that set the state of the art in this domain (An Algorithm Summarizes Lengthy Text Surprisingly Well | Your TL;DR by an AI: A Deep Reinforced Model for Abstractive Summarization). Socher’s work used an attention mechanism and a new (machine) learning objective to address the “repeating phrase” problem, via:

  • an intra-temporal attention mechanism in the bidirectional long short-term memory (Bi-LSTM) encoder that records previous attention weights for each of the input tokens (words), while a sequential intra-attention model in the LSTM decoder takes into account which words have already been generated by the decoder: i.e. an encoder-decoder network; and,

  • a new objective function that combines the maximum-likelihood cross-entropy loss, used in prior work with rewards from policy gradient reinforcement learning, to reduce exposure bias.

The encoder-decoder allowed the model to generate new words that were not part of the input article, while the copy-mechanism allowed the model to copy over important details from the input even if these symbols were rare in the training corpus. At each decoding step the intra-temporal attention function attended over specific parts of the encoded input sequence in addition to the decoder’s own hidden state and the previously-generated word. This kind of attention prevented the model from attending over the sames parts of the input on different decoding steps. Intra-temporal attention can reduce the amount of repetitions when attending over long documents. While this intra-temporal attention function ensured that different parts of the encoded input sequence were used, their decoder could still generate repeated phrases based on its own hidden states, especially when generating long sequences. To prevent that, the authors incorporated more information about the previously decoded sequence into the decoder. To generate a token [word], the decoder used either a token-generation softmax layer or a pointer mechanism to copy rare or unseen tokens from the input sequence. [In this regard, note that the probabilistic fastText algorithm could also deal with rare and out-of-vocabulary (OOV) words.] A switch function decided at each decoding step whether to use the token generation, or the pointer.

  • As a proprietary system code for this work is not available, but there are four Python implementations available on GitHub (keyphrase search “Deep Reinforced Model for Abstractive Summarization”), as well as an OpenNMT implementation, that also links to GitHub.

  • Follow-on work by Socher and colleagues includes Improving Abstraction in Text Summarization which proposed two techniques to improve the level of abstraction of generated summaries. First, they decomposed the decoder into a contextual network that retrieved relevant parts of the source document, and a pretrained language model that incorporates prior knowledge about language generation. The contextual network had the sole responsibility of extracting and compacting the source document, whereas the language model is responsible for the generation of concise paraphrases. Second, they proposed a novelty metric that was optimized directly through policy learning (a reinforcement learning reward) to encourage the generation of novel phrases (summary abstraction) .

In related work from Peking University, Global Encoding for Abstractive Summarization  [code] recently developed a model with an encoder similar to that in the Socher/SalesForce approach, employing a Bi-LSTM decoder that generated summary words. Their approach differed from Socher’s method in that they fed the encoder output at each time step to a convolutional gated unit, that with a self-attention mechanism allowed the encoder output at each time step to become new representation vector, with further connection to the global source-side information. Self-attention encouraged the model to learn long-term dependencies, without creating much computational complexity. Since the convolutional module could extract n-gram features of the whole source text and self-attention learned the long-term dependencies among the components of the input source text, the gate (based on the generation from the CNN and self-attention module for the source representations from the RNN encoder) could perform global encoding on the encoder outputs. Based on the output of the CNN and self-attention, the logistic sigmoid function outputted a vector of value between 0 and 1 at each dimension. If the value was close to 0, the gate removed most of the information at the corresponding dimension of the source representation, and if it was close to 1 it reserved most of the information. The model thus performed neural abstractive summarization through a global encoding framework, which controlled the information flow from the encoder to the decoder based on the global information of the source context, generating summaries of higher quality while reducing repetition.

Christopher Manning’s group at Stanford University, in collaboration with Google Brain, also employed pointer-generator networks (used by Socher/Salesforce, above) in their well-cited abstractive summarization method, Get to the Point: Summarization with Pointer-Generator Networks. Coauthor Abigail See discussed this work in her excellent post Taming Recurrent Neural Networks for Better Summarization. This approach first used a hybrid pointer-generator network which could copy words from the source text via pointing, that aided accurate reproduction of information while retaining the ability to produce novel words through the generator. The approach then used “coverage” to keep track of what had been summarized, which discouraged repetition. [“Coverage” refers to the coverage vector (Modeling Coverage for Neural Machine Translation), which keeps track of the attention history.]

Although a commercial (closed source) project, the blog article Machine-Generated Knowledge Bases described an abstractive summarization approach that was applied to create “missing” biographies that should exist in Wikipedia, in an interesting product demonstration. That approach could assist with addressing the information overload associated with the volume of the PubMed/PubMed Central literature; the tools they used (TLDR: Re-Imagining Automatic Text SummarizationBuilding seq-to-seq Models in TensorFlow (and Training Them Quickly] could be implemented relatively easily via the approaches described in this REVIEW. Much of that work, for example, is based on See and Manning’s Get To The Point: Summarization with Pointer-Generator Networks approach, discussed in the preceding paragraph; consequently, approaches to reimplement/extend’s Quicksilver abstractive summarization project via seq2seq models with attention are well in hand.

A very interesting and promising project from Google Brain, Generating Wikipedia by Summarizing Long Sequences [codeOpenReviewmedia], considered English Wikipedia as a supervised machine learning task for multi-document summarization where the input was comprised of a Wikipedia topic (title of article) and a collection of non-Wikipedia reference documents, and the target was the Wikipedia article text. They described the first attempt to abstractively generate the first section (lead) of Wikipedia articles conditioned on reference text. They used extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. In addition to running strong baseline models on the task, they modified the Transformer architecture (Vaswani et al. (2017) Attention Is All You Need) to only consist of a decoder, which performed better in the case of longer input sequences compared to RNN and Transformer encoder-decoder models. They showed that their modeling improvements allowed them to generate entire Wikipedia articles.

Another very interesting paper (August 2018), Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization  [dataset, code, and demo: not available, 2018-08-28] introduced extreme summarization (XSum), a new single-document summarization task which did not favor extractive strategies and called for an abstractive modeling approach. The idea was to create a short, one-sentence news summary answering the question “What is the article about?”. Their novel abstractive model, conditioned on the article’s topics, was based entirely on CNN. They demonstrated that this architecture captured long-range dependencies in a document and recognized pertinent content, outperforming an oracle extractive system and state of the art abstractive approaches when evaluated automatically, and by humans. The example illustrated in their Fig. 1 is very impressive, indeed (note. e.g., the substitution of “a small recreational plane” with “light”):

A man and a child have been killed after a light aircraft made an emergency landing on a beach in Portugal.

Authorities said the incident took place on Sao Joao beach in Caparica, south-west of Lisbon. The National Maritime Authority said a middle-aged man and a young girl died after they were unable to avoid the plane.
[… 6 sentences/139 words not shown here …]
Other reports said the victims had been sunbathing when the plane made its emergency landing.
[… 4 sentences/67 words not shown here …]
Video footage from the scene carried by local broadcasters showed a small recreational plane parked on the sand, apparently intact and surrounded by beachgoers and emergency workers.
[… Last 2 sentences/19 words not shown here.]

As can be seen, the summary is very different from a headline; it draws on information interspersed in various parts of the document and displays multiple levels of abstraction including paraphrasing, fusion, synthesis, and inference. This work built upon a dataset for the proposed task by harvesting online news articles that often included a first- sentence summary. They further proposed a novel deep learning model for the extreme summarization task: unlike most existing abstractive approaches which relied on RNN based encoder-decoder architectures, they presented a topic-conditioned neural model which was based entirely on CNN. Convolution layers captured long-range dependencies between words in the document more effectively than RNN, allowing it to perform document-level inference, abstraction, and paraphrasing. The convolutional encoder associated each word with a topic vector (capturing whether it was representative of the document’s content), while the convolutional decoder conditioned each word prediction on a document topic vector.

In summary, while reasonably-effective methods exist for extractive summarization, abstractive summarization represents an active research area, for which rudimentary code is available on GitHub (a GitHub search with the search phrases “abstractive summarizer” and “abstractive summarization” returns numerous repositories).

[Table of Contents]


[Table of Contents]

Knowledge Graph Construction and Completion

Knowledge graphs (relational property graphs) model information in the form of entities (nodes/vertices) and the relationships between them (edges); for a formal definition see the “Problem Definition” in Discriminative Predicate Path Mining for Fact Checking in Knowledge Graphs, and for a general overview see Towards a Definition of Knowledge Graphs. A noteworthy feature of knowledge graphs is the excellent performance of traversal-type queries across entities of diverse types within them. Such queries can be challenging to realize in relational databases because of the cost of computing statements (joins) across different tables: see, e.g., Making Sense of Graph Databases Part 4 - Why are Graph Databases More Efficient than RDMS on Connected DataComparison of Relational Databases and Graph Databases, and search for “joins” in this Neo4j Ebook: The Definitive Guide to Graph Databases for the RDBMS Developer. Related to this point, Cypher queries in Neo4j are easier to write and understand than complex SQL queries in RDBMS, especially those involving multiple join statements (p. 23 in The Definitive Guide to Graph Databases for the RDBMS Developer).

Knowledge graphs (KG) provide semantically structured information that is interpretable by computers – an important property to building more intelligent machines, as well as an important step to transforming text-based knowledge stores and search engines into semantically-aware question answering services (A Review of Relational Machine Learning for Knowledge Graphs). Knowledge graphs are applicable to a broad spectrum of problems ranging from biochemistry to recommender systems; for example, question answering, structured search, exploratory search, and digital assistants. Examples of the use of KG in the biomedical domain include:

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction  [data/code] introduced a joint multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. They created “SciERC,” a dataset that included annotations for all three tasks, and developed a unified framework called Scientific Information Extractor (SciIE) with shared span representations. The multi-task setup reduced cascading errors between tasks and leveraged cross-sentence relations through coreference links. Their multi-task model was better at predicting span boundaries and outperformed previous state of the art scientific IE systems on entity and relation extraction, without using any hand-engineered features or pipeline processing. SciIE was able to automatically organize the extracted information from a large collection of scientific articles into a knowledge graph. Our analysis shows the importance of coreference links in making a dense, useful graph. Experiments showed that their multi-task model outperformed previous models in scientific information extraction without using any domain-specific features. We further show that the framework supported construction of a scientific knowledge graph, which they used to analyze information in scientific literature.

[Table of Contents]

Statistical Relational Learning

Graphs represent a recent and exciting extension of machine learning (for a good review see The Knowledge Graph as the Default Data Model for Learning on Heterogeneous Knowledge). Although a typical knowledge graph may contain millions of entities and billions of relational facts, it is usually far from complete. Determining the validity of information in a knowledge graph and filling in its missing parts are crucial tasks for many researchers and practitioners. Traditional fact checking by experts and analysts cannot keep pace with the volume of newly created information. It is therefore important and necessary to enhance our ability to computationally determine whether statements of fact are true or false. This problem may be viewed as a link-prediction task. Knowledge graph completion also aims at predicting relations between entities under supervision of the existing knowledge graph, which is an important supplement to relation extraction from plain texts. To address those challenges, a number of knowledge graph completion methods have been developed using link prediction and low-dimensional graph embeddings.

A Review of Relational Machine Learning for Knowledge Graphs provides an excellent introduction to machine learning and knowledge graphs, with a focus on statistical relational learning (SRL). The authors demonstrated how SRL can be used in conjunction with machine reading and information extraction methods to automatically build KG. In SRL, the representation of an object can contain its relationships to other objects. The data is in the form of a graph, consisting of nodes (entities) and labelled edges (relationships between entities). The main goals of SRL include prediction of missing edges, prediction of properties of nodes, and clustering nodes based on their connectivity patterns. These tasks arise in many settings such as analysis of social networks and biological pathways.

Aside: here, SRL represents Statistical Relational Learning (sometimes called Relational Machine Learning, RML), a subdiscipline of machine learning that is concerned with domain models that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in statistical relational learning use (a subset of) first-order logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming. Statistical relational learning the field is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and first-order probabilistic languages (emphasizing the key properties of the languages with which models are represented).

Statistical relational learning should not be confused with Semantic Role Labeling (sometimes also called shallow semantic parsing), which is also abbreviated SRL. Semantic role labeling is a process in NLP that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles (i.e., automatically finding the semantic roles of each argument of each predicate in a sentence).

An excellent follow-on article (different authors) to A Review of Relational Machine Learning for Knowledge Graphs is On Embeddings as Alternative Paradigm for Relational Learning, which systematically compares knowledge graph embedding (KGE) and logic-based statistical relational learning methods on standard relational classification and clustering tasks. A strong advantage of KGE is their scalability, at the expense of their black-box nature and limited reasoning capabilities.

“Ontology reasoning” is the ephemeral name given to reasoning over ontologies and knowledge bases: deriving facts that are not explicitly expressed in an ontology or in knowledge base. Ontology reasoning describes problem settings where the rules for conducting reasoning are specified alongside with the actual information (see Fig. 1 in Ontology Reasoning with Deep Neural Networks). In 2017 Hohenecker and Lukasiewicz (University of Oxford) presented a novel approach to ontology reasoning that was based on deep learning rather than logic-based formal reasoning (Deep Learning for Ontology Reasoning). They introduced a new model for statistical relational learning built upon deep recursive neural networks which easily competed with existing logic-based reasoners on the task of ontology reasoning on “RDFox” (an in-memory RDF triple store, i.e. graph, being developed as a commercial product). Their follow-on work (2018), Ontology Reasoning with Deep Neural Networks devised a novel model that was closely coupled to symbolic reasoning methods, and thus able to learn how to effectively perform basic ontology reasoning. Tests showed that their model learned to perform precise reasoning on a number of diverse inference tasks that required comprehensive deductive proficiencies; furthermore, the suggested model suffered much less from different obstacles that prohibit symbolic reasoning.

[Table of Contents]

Knowledge Graph Embedding

In a radically different approach from the symbolic methods employed in statistical relational learning, knowledge graph embedding (KGE) methods aim at representing entities and relations in a knowledge graph as points or vectors in a continuous vector space (Euclidean space) – simplifying manipulation, while preserving the inherent structure of the KG (for recent reviews see A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications and Knowledge Graph Embedding: A Survey of Approaches and Applications). KGE approaches are the current dominating methodology for knowledge graph link prediction (On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols). Such a methodology, fundamentally based on distributed representations, has not only proved to be effective for KG link prediction, but has also helped to improve our understanding and engineering of knowledge representation.

The underlying concept of KGE is that in a knowledge graph each entity can be regarded as a point in a continuous vector space while relations can be modelled as translation vectors (Expeditious Generation of Knowledge Graph Embeddings). The generated vector representations can be used by machine learning algorithms to accomplish a specific tasks. Restated in On Embeddings as Alternative Paradigm for Relational Learning, KGE aim to represent instances and their relationships as vectors and/or matrices in the Euclidean space. The hope is that the geometry of the embedding space would resemble the structure of the data (for example) by keeping the instances participating in the same relationships close in the Euclidean space. This in turn allows one to apply standard propositional (logic-based) machine learning tools and retain their scalability, while at the same time preserving certain properties of structured relational data.

KGE has proven to be very effective for the task of knowledge graph completion (A Review of Relational Machine Learning for Knowledge Graphs; KGE is reviewed in An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion), where the goal is to identify missing links in the existing knowledge graph. KGE has also proven to be scalable to very large knowledge graphs.

Aside – regarding the discussion above:

  • A recent Microsoft/Google paper, Link Prediction using Embedded Knowledge Graphs, discusses the differences between symbolic and vector space in KG completion; instead of sampling branching paths in the symbolic space of the original knowledge graph, they perform short sequences of interactive lookup operations in the vector space of an embedded knowledge graph.

  • Euclidean space is a mathematical construct that encompasses the line, the plane, and three-dimensional space as special cases. Its elements are called vectors. Vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars. Vectors can be understood in various ways: as arrows, as quantities with magnitude and direction, as displacements, or as points. Euclidean vectors are an example of a vector space. They represent physical quantities such as forces: any two forces (of the same type) can be added to yield a third, and the multiplication of a force vector by a real multiplier is another force vector. In the same vein, but in a more geometric sense, vectors representing displacements in the plane or in three-dimensional space also form vector spaces. Vectors are abstract mathematical objects with particular properties, which in some cases can be visualized as arrows. However, vectors in vector spaces do not necessarily have to be arrow-like objects. Vector spaces are the subject of linear algebra and are well characterized by their dimension, which, roughly speaking, specifies the number of independent directions in the space.

  • KG are generally represented by symbolic triples – (source, relation, target), i.e. (subject, predicate, object) – whereas KGE methods attempt to represent those symbols (nodes and edges) with their corresponding source, relation, and target vectors – amenable to mathematical processing.

A typical KG is represented as a graph whose nodes are entities and edges are relations between heads and tails (Knowledge Graph Embedding with Multiple Relation Projections). While this raw representation is adequate to store known knowledge, relating distant entities requires expensive graph traversal, possibly through multiple paths. Thus knowledge graph completion calls for learning of a new representation that supports scalable reasoning. The most successful approach thus far is through embedding entities and relations into a continuous vector space, which naturally lends itself to simple algebraic manipulations. A well known method is TransE (Translating Embeddings for Modeling Multi-relational Data), which embeds entities and relations into the same space where the difference between head and tail is approximately the relation [TransE adopts the principle of geometric translation, formally as h + r ≈ t]. While this embedding permits very simple translation-based relational inference, it is too restrictive in dealing with 1-to-N, N-to-1 and N-to-N relations. Furthermore, the representation of these models is still not semantically interpretable, which may be a major flaw and harm the potential applications (KSR: A Semantic Representation of Knowledge Graph within a Novel Unsupervised Paradigm). In the traditional methods it is almost impossible to exactly extract the specific semantics from geometric points; for the example of TransE, a representation of “Table” as (0.82, 0.51, …) hardly tells us anything meaningful – such a table being furniture, not an animal, etc. Without semantics, the gap between knowledge and language remains, limiting the task of knowledge applications and natural language understanding (NLU).

One effective solution is to consider two separate embedding spaces for entities and relations (Knowledge Graph Embedding with Multiple Relation Projections). Entities are then mapped into the relation space using relation-specific projections , such as those in TransR (Learning Entity and Relation Embeddings for Knowledge Graph Completion). This mapping strategy, however, causes critical drawbacks. First, when the number of relations is large, the whole projection matrices are expensive to model. Second, treating each relation separately does not account for the latent structure in the relation space, leading to waste of resources. An example of such a latent structure is the correlation between relations “nationality” and “place-of-birth”, as the latter may infer about the former.

(Knowledge Graph Embedding with Multiple Relation Projections) proposed a new translation-based method called TransF, which was inspired by TransR but did not suffer from these problems. Under TransF, projection matrices are members of a matrix space spanned by a fixed number of matrix bases. A relation-specific projection matrix is characterized by a relation-specific coordinate in the space. Put in other way, the relation projection tensor is factorized into product of a relation coordinate matrix and a basis tensor. Hence, TransF is much more efficient and robust than TransR.

[Table of Contents]

Multi-View Graph Embedding

It is relatively easy to embed metabolic pathways (or experiments, etc.) as a graph structures. As metabolism is a dynamic, temporal process the graph should also reflect/represent metabolic states (metabolite concentrations; health status: healthy, diabetic, breast cancer; etc.) – perhaps through different graph representations. Likewise, metadata encoded in a graph representing an experiment could reflect changes in experimental variables over time, or in response to different experimental conditions (controls vs. experiments).

Cellular growth and development provides an excellent example: a totipotent fertilized egg or a pluripotent stem cell can give rise to any other cell type – blood, liver, heart, brain, etc. – each arising from differential expression of the identical genome encoded within these cells (DNA Methylation in Stem Cell Renewal and MultipotencyEpigenetic Regulation of Pluripotency and DifferentiationGermline and Pluripotent Stem CellsEpigenetics of Reprogramming to Induced Pluripotency). This cellular differentiation largely arises due to selective epigenetic modification of the genome during growth and development.

It would be exceptionally useful to be able to access multiple “views” of those types of graphs, depending on the metadata and embeddings.

Recent work in knowledge graph embedding such as TransE, TransH and TransR have shown competitive and promising results in relational learning. Non-Parametric Estimation of Multiple Embeddings for Link Prediction on Dynamic Knowledge Graphs proposed a novel extension [Parallel Universe TransE (puTransE)] of the translational embedding model to solve three main problems of the current models:

  • translational models are highly sensitive to hyperparameters such as margin and learning rate;

  • the translation principle only allows one spot in vector space for each golden triplet; thus, congestion of entities and relations in vector space may reduce precision;

  • current models are not able to handle dynamic data, especially the introduction of new unseen entities/relations or removal of triplets.

The proposed approach explicitly generated multiple embedding spaces via semantically and structurally aware triplet selection schemes, and non-parametrically estimated the energy score of a triplet. The intuition for the approach was that in every parallel universe embedding space, a constraint on triplets was imposed in terms of count and diversity such that each embedding space observed the original knowledge graph from a different view. The proposed puTransE approach was simple, robust and parallelizable. Experimental results showed that puTransE outperformed TransE and many other embedding methods for link prediction on knowledge graphs on both public benchmark dataset and a real world dynamic dataset.

KSR: A Semantic Representation of Knowledge Graph within a Novel Unsupervised Paradigm proposed a semantic representation method for knowledge graph (KSR) which imposed a two-level hierarchical generative process that globally extracted many aspects and then locally assigns a specific category in each aspect for every triple. Notably, they introduced a “knowledge features” (i.e. views for clustering) to describe some knowledge semantic aspects, such as being a university or not (University), geographical position (Location), etc. By gathering the information of clusters in each view/feature, the semantic representations are formed. The first-level process generated knowledge features (views) with different semantics such as University and Location, while the second-level process grouped the entities/relations/triples according to the corresponding semantic features (views). Finally, by summarizing the cluster identifications within each feature/view, KSR constructed the semantic representation of the knowledge elements. For example, for “Tsinghua University” a Yes category (cluster identification) is assigned to the University feature, while the Beijing category/cluster identification is assigned to the Location feature. By exploiting multi-view clustering, this knowledge was semantically organized as Tsinghua University = (University:Yes, Location:Beijing).

Multi-view Clustering with Graph Embedding for Connectome Analysis from the University of Illinois at Chicago addressed multi-view clustering on graph instances with their model Multi-view Clustering framework on graph instances with Graph Embedding (MCGE), in which they modeled multi-view graph data as tensors and applied tensor factorization to learn the multi-view graph embeddings thereby capturing the local structure of graphs. They built an iterative framework incorporating multi-view graph embedding into the multi-view clustering task on graph instances, jointly performing multi-view clustering and multi-view graph embedding simultaneously. The multi-view clustering results were used for refining the multi-view graph embedding, and the updated multi-view graph embedding results further improved the multi-view clustering. Extensive experiments on two brain network datasets (HIV and Bipolar) demonstrated the superior performance of the proposed MCGE approach in multi-view connectome analysis for clinical investigation and application. In a simple example (their Fig. 1) of a simple two-view example of the MCGE problem, given fMRI and DTI brain networks of five subjects, MCGE aims to learn multi-view graph embedding for each of them, and cluster these subjects into different groups based on the obtained multi-view graph embeddings. In the HIV and bipolar brain network examples (their Fig. 4 and 5, respectively), differential brain region clustering maps (where each color-coded node represented a brain region and each edge indicated the correlation between two brain regions) of normal controls and patients could be compared. For example, nodes of the normal brain network were well grouped into several clusters, while nodes in the HIV brain network were less coherent. Additionally, for the normal control edges within each cluster were much more intense than the edges across different clusters.

  • This work was cited in earlier work (also in 2017) by the same authors, Multi-view Graph Embedding with Hub Detection for Brain Network Analysis. In that paper, they proposed incorporating the hub detection task into the multi-view graph embedding framework so that the two tasks could benefit each other, via an auto-weighted framework of Multi-view Graph Embedding with Hub Detection (MVGE-HD) for brain network analysis (again, HIV and bipolar brain networks). The MVGE-HD framework learned a unified graph embedding across all the views while reducing the potential influence of the hubs on blurring the boundaries between node clusters in the graph, thus leading to a clear and discriminative node clustering structure for the graph.

  • Follow-on work (2018) by these authors, Multi-View Multi-Graph Embedding for Brain Network Clustering Analysis proposed Multi-view Multi-graph Embedding (M2E) by stacking multi-graphs into multiple partially-symmetric tensors and using tensor techniques to simultaneously leverage the dependencies and correlations among multi-view and multi-graph brain networks. Although there has been work on single-graph embedding and multi-view learning, there has been no embedding method available which enabled preserving multi-graph structures (i.e. taking multiple graphs as input) on multiple views. The goal of M2E is to find low-dimensional representations from multi-view multi-graph data which reveal patterns and structures among the brain networks. Experiments (again) on real HIV and bipolar disorder brain network datasets demonstrated the superior performance of M2E on clustering brain networks by leveraging the multi-view multi-graph interactions.

    • Critique: While these authors survey clustering accuracy of a number of baselines against M2E (“We compare the proposed M2E with eight other methods for multi-view clustering on brain networks.”), in a rather suspect omission they do not include their earlier MCGE or MVGE-HD work (cited in that paper as Ma et al. 2017a,b) among these models.

Existing approaches to learning node representations usually study networks with a single type of proximity between nodes, which defines a single view of a network. However, in reality there usually exists multiple types of proximities between nodes, yielding networks with multiple views. An Attention-based Collaboration Framework for Multi-View Network Representation Learning  [code] studied learning node representations for networks with multiple views, which aimed to infer robust node representations across different views. They proposed a multi-view representation learning approach (MVE), which promoted the collaboration of different views, letting them vote for robust representations. During the voting process, an attention mechanism was introduced, which enabled each node to focus on the most informative views. Experimental results on real-world networks showed that the proposed approach outperformed state of the art approaches for network representation learning (node classification; link prediction) with a single or multiple views.

mvn2vec: Preservation and Collaboration in Multi-View Network Embedding focused on the characteristics that were specific and important in embedding multi-view networks. They identified two characteristics, “preservation” and “collaboration” and explored the feasibility of achieving better embedding quality by simultaneously modeling preservation and collaboration. As shown in their Fig. 1a, a multi-view network consists of multiple network views, where each view corresponds to a type of edge, and all views share the same set of node.

  • This paper, and An Attention-based Collaboration Framework for Multi-View Network Representation Learning, above are conceptually similar (compare Fig. 1 in each paper). Dr. Jiawei Han, University of Chicago at Urbana-Champaign, is also a coauthor on both papers.

  • Collaboration: In some datasets, edges between the same pair of nodes may be observed in different views due to shared latent reasons. For instance, if nodes in two views may complement (interact with) each other in various social media contexts, embedding them jointly may potentially yield better results than embedding them independently.

Preservation: On the other hand, it is possible for different network views to have different semantic meanings; it is also possible that a portion of nodes have completely disagreeing edges in different views since edges in different views are formed due to distinct latent reasons. For example, professional relationships may not align well with friendships. If we embed the profession and the friendship views into the same embedding space, the embedding fails to preserve the unique information carried by the different network views. The authors refer to the need for preserving unique information carried by different views as “preservation.”

  • It is also possible for preservation and collaboration to co-exist in the same multi-view network.

In a multi-view graph embedding, each node is assigned a vector that incorporates data from all views of the graph (for this paragraph see the Introduction in Multi-View Graph Embedding Using Randomized Shortest Paths and the references cited therein). Simple methods to create multi-view embeddings include combining multiple views of the graph into one graph using an AND/OR aggregation of the edge sets and embedding the resulting single graph, or embedding each view independently and concatenating the different embeddings obtained for each node. More sophisticated algorithms have been developed based on matrix factorization, tensor factorization, and spectral embedding. Many of these algorithms focus on clustering multi-view graphs, a specific application thereof. High clustering accuracy indicates a good embedding since relative similarity between nodes should be correctly reflected in the embedding. … . Ideally, graph embeddings should preserve the distances between the nodes in their respective node embeddings.

Multi-View Graph Embedding Using Randomized Shortest Paths  [experimental results proposed a generalized distance on multi-view graphs called the Common Randomized Shortest Path Dissimilarity (C-RSP), based on the randomized shortest path (RSP) dissimilarity measure on single-view graphs. This algorithm generated a dissimilarity measure between nodes by minimizing the expected cost of a random walk between any two nodes across all views of a multi-view graph, in doing so encoding both the local and global structure of the graph. This leads to more accurate graph embeddings, resulting in better visualization and high clustering accuracy.

Unsupervised Multi-view Nonlinear Graph Embedding addressed “multi-view” graph embedding: given a graph with node features they aimed to learn a network embedding (for each node from its network features) and a content embedding (from its content features) simultaneously for each node in an unsupervised manner. Generally, network structure and node features are considered as two different “views” for a node in the graph; in this work the authors allowed the network and content “views” to reinforce one another, to obtain better graph embeddings. They proposed a simple, effective unsupervised Multi-viEw nonlineaR Graph Embedding (MERGE) model. Inspired by structural deep network embedding, MERGE encoded the nonlinearity of the network/content by taking the network/content features as input, and then applying a deep autoencoder to learn a nonlinear network/content embedding for each node. MERGE also preserved the second-order proximity by extending DeepWalk to the multi-view setting: MERGE extended DeepWalk enabling one node’s embedding to interpret its “neighbor” node’s content embedding, enforcing cross-instance-cross-view consistency. MERGE consistently outperformed the state of the art baselines over five public datasets including PubMed (their Table 2) with O(|V| complexity, with one third the training data.

End-to-End Multi-View Networks for Text Classification  [non-author code here and herediscussion] from National Research Council Canada proposed a “multi-view network” for text classification. It is outperformed by CoVe in Socher’s Learned in Translation: Contextualized Word Vectors (their Table 4).

[Table of Contents]

Link prediction, the discovery of relations such as (subject, relation, object) triples, is tremendously useful for knowledge graph construction/embedding/completion, fact checking, and knowledge discovery.

Embeddings of knowledge graphs have received significant attention due to their excellent performance for tasks like link prediction and entity resolution. In Complex and Holographic Embeddings of Knowledge Graphs: A Comparison Théo Trouillon and Maximilian Nickel provided a comparison of two state of the art knowledge graph embeddings: ComplEx and HolE. They briefly reviewed both models, discussing how their scoring functions were equivalent, and then analyzed the discrepancy of results reported in the original articles – showing experimentally that they are likely due to the use of different loss functions. They also discussed advantages and disadvantages of both models and under which conditions one would be preferable to the other.


  • Although they do not label the models in their Table 1, from their discussion of the loss models, “Margin” refers to their HolE model, whereas “Neg-LL” refers to Trouillon et al.’s ComplEx model.

  • The filtered MRR value (0.541) that they report for HolE on the FB15k dataset in this 2017 paper differs from the value (0.524) that they reported in their earlier 2016 and 2017 papers.

  • Likewise, the filtered MRR value (0.639) that they report for ComplEx on the FB15k dataset in this 2017 paper differs from the value (0.692) that they reported in their earlier 2016 and 2017 papers.

  • Excellent non-author code, discussion and examples for HolE (Holographic Embeddings of Knowledge Graphs  [author’s code]) – by Maximilian Nickel, Lorenzo Rosasco and Tomaso Poggio at MIT – is found in this GitHub repository. Holographic embeddings are an approach to generating entity embeddings from a list of (head, tail, relation) triples like (“Jeff”, “Amazon”, “employer”) and (“Zuck”, “Palo Alto”, “location”). Embeddings can be used as lossy (but memory-efficient) inputs to other machine learning models, used directly in triple inference (i.e. knowledge base completion, or link prediction) by evaluating the likelihood of candidate triples, or to search for associated entities using k-nearest neighbors. For example, a search from the embedding representing the entity “University of California, Berkeley” yields the associated entities “UC Irvine”, “Stanford University”, “USC”, “UCLA”, and “UCSD” (see the figure in the GitHub repo).

Improving Knowledge Graph Embedding Using Simple Constraints  [code] investigated the potential of using very simple constraints to improve KG embedding. They examined non-negativity constraints on entity representations and approximate entailment constraints on relation representations (hence, “ComplEx-NNE+AER” – an extension of the original ComplEx, described by Trouillon et al. in Complex Embeddings for Simple Link Prediction and their follow-on paper Knowledge Graph Completion via Complex Tensor Factorization). Non-negativity constraints on entity representations help learn compact and interpretable representations for entities, whereas approximate entailment constraints on relation representations further encode regularities of logical entailment between relations into their distributed representations. The constraints impose prior beliefs upon the structure of the embedding space, without negative impacts on efficiency or scalability. For each relation, it assumes a score matrix whose sign matrix is partially observed. The entries corresponding to factual and nonfactual triples have the sign 1 and -1, respectively.


  1. Regarding non-negativity constraints, for technical reasons the variables of linear programs must always take non-negative values; for example, the linear inequalities x ≥ 0 and y ≥ 0 specify that you cannot produce a negative number of items.

  2. Regarding the approximate entailment constraints, these authors mean an ordered pair of relations such that the former approximately entails the latter – e.g., BornInCountry and Nationality – stating that a person born in a country is very likely, but not necessarily, to have a nationality of that country.]

Analogical Inference for Multi-Relational Embeddings  [code] imposes analogical properties on embeddings. Each relation is represented as a normal matrix, which are constrained to satisfy commutativity properties. ANALOGY represents a novel framework for optimizing the latent representations with respect to the “analogical” properties of the embedded entities and relations for knowledge base completion. Analogical inference posits that if subsets of entities and relations are analogous in systems A and B, then the unobserved triples in B could be inferred by mirroring their counterparts in A. The proposed approach obtained state of the art results on two popular benchmark datasets, outperforming a large number of strong baselines in most cases.

Knowledge Graph Embedding with Iterative Guidance from Soft Rules  [code] presented a novel approach for knowledge graph embedding (KGE) combined with guidance from soft logic rules, called Rule-Guided Embedding (RUGE), that provided state of the art results in KG link prediction (compared e.g. to ComplEx: their Table 3). This work builds on previous work by these authors (that provides additional information): Jointly Embedding Knowledge Graphs and Logical Rules (which introduced KALE, an approach that learned entity and relation embeddings by jointly modeling knowledge and logic), and Knowledge Base Completion using Embeddings and Rules (which employed an integer linear programming approach plus logic rules that leveraged prior knowledge in the KG to greatly reduce the solution space during inference, i.e. link prediction).

PredPath”, described in Discriminative Predicate Path Mining for Fact Checking in Knowledge Graphs  [code] presented a discriminative path-based method for fact checking in KG that incorporated connectivity, type information, and predicate interactions. Given a statement in the form of a (subject, predicate, object) triple – for example, (Chicago, capitalOf, Illinois) – their approach mined discriminative paths that alternatively defined the generalized statement (U.S. city, predicate, U.S. state) and used the mined rules to evaluate the veracity of that statement.

Follow-on work by the PredPath authors, ProjE: Embedding Projection for Knowledge Graph Completion  [code] presented a shared variable neural network model that filled in missing information in a KG by learning joint embeddings of the KG entities and edges (collectively calculating the scores of all candidate triples). ProjE had a parameter size smaller than 11 out of 15 existing methods while performing 37% better than the current best method on standard datasets. They also showed, via a fact checking task, that ProjE was capable of accurately determining the veracity of many declarative statements.

In July 2017 Convolutional 2D Knowledge Graph Embeddings  [code] introduced ConvE, a deeper, multi-layer convolutional network model for link prediction. ConvE was highly parameter efficient, yielding the same performance as DistMult and R-GCN [*R-GCN is discussed below] with 8x and 17x fewer parameters.

“Analysis of our model suggests that it is particularly effective at modelling nodes with high indegree – which are common in highly-connected, complex knowledge graphs such as Freebase and YAGO3. In addition, it has been noted that the WN18 and FB15k datasets suffer from test set leakage, due to inverse relations from the training set being present in the test set – however, the extent of this issue has so far not been quantified. We find this problem to be severe: a simple rule-based model can achieve state-of-the-art results on both WN18 and FB15k.”

ConvE was the inspiration (different authors) for the August 2018 paper Hypernetwork Knowledge Graph Embeddings  [code], which used a hypernetwork architecture to generate convolutional layer filters specific to each relation and apply those filters to the subject entity embeddings. Their model (HypER) simplified the entity and relation embedding interactions introduced by ConvE while outperforming all previous approaches to link prediction across all standard link prediction datasets. HypER used a hypernetwork to generate weights of convolutional filters for each relation. [A hypernetwork is an approach of having one network generate weights for another network, which can be used to provide weight-sharing across layers and to dynamically synthesize weights based on an input.] In this work, they generated relation-specific weights to process input entities, and also obtained multi-task knowledge sharing across different relations in the knowledge graph. Whereas ConvE had a global common set of 2-dimensional convolutional filters (suggesting the presence of a 2D structure in word embeddings) that reshaped and concatenated subject entity and relation embeddings, HypER used a hypernetwork to generate 1-dimensional relation-specific filters to process the unadjusted subject entity embeddings, thereby simplifying the interaction between subject entity and relation embeddings. Thus, instead of extracting a limited number of both entity-related and relation-related features from the dimensions around the concatenation point, HypER covered all subject entity embedding dimensions by sliding the relation-dependent convolutional filters over the whole entity embedding.

In August 2018 a new model for knowledge graph completion, CapsE, was introduced in A Capsule Network-based Embedding Model for Knowledge Graph Completion and Search Personalization. CapsE employed a capsule network (discussed elsewhere in this review) to model (subject, relation, object) relationship triples. CapsE obtained state of the art link prediction results for knowledge graph completion on two benchmark datasets: WN18RR and FB15k-237. Comparing Table 2 in that paper to Table 4 in On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols appears to confirm CapsE as a state of the art model, the first to apply capsule networks to knowledge graph completion and search personalization.

Predicting Semantic Relations using Global Graph Properties combined global and local properties of semantic graphs through the framework of Max-Margin Markov Graph Models (M3GM), a novel extension of Exponential Random Graph Model (ERGM) that scales to large multi-relational graphs. They demonstrated how such global modeling improves performance on the local task of predicting semantic relations between synsets, yielding new state of the art results on the WN18RR dataset, a challenging version of WordNet link prediction in which “easy” reciprocal cases were removed. In addition, the M3GM model identified multirelational motifs that were characteristic of well-formed lexical semantic ontologies.

One-Shot Relational Learning for Knowledge Graphs  [code] by the University of California (Santa Barbara) and IBM Research observed that long-tail relations are common in KGs (in other words, they have very few instances) and those newly added relations often do not have many known triples for training. In this work, they aimed at predicting new facts under a challenging setting where only one training instance was available. They proposed a one-shot relational learning framework, which utilized the knowledge extracted by embedding models and learned a matching metric by considering both the learned embeddings and one-hop graph structures. Empirically, their model yielded considerable performance improvements over existing embedding models, and also eliminated the need of retraining the embedding models when dealing with newly added relations. These authors also prepared two new datasets, for their work.

[paraphrased:] “Existing benchmarks for knowledge graph completion, such as FB15k-237 and YAGO3-10 are small subsets of real-world KGs. These datasets consider the same set of relations during training and testing and often include sufficient training triples for every relation. To construct datasets for one-shot learning, we go back to the original KGs and select those relations that do not have too many triples as one-shot task relations. We refer the rest of the relations as background relations, since their triples provide important background knowledge for us to match entity pairs. Our first dataset is based on NELL, a system that continuously collects structured knowledge by reading webs. We take the latest dump and remove those inverse relations. We select the relations with less than 500 but more than 50 triples 3 as one-shot tasks. To show that our model is able to operate on large-scale KGs, we follow the similar process to build another larger dataset based on Wikidata. The dataset statistics are shown in Table 1. Note that the Wiki-One dataset is an order of magnitude larger than any other benchmark datasets in terms of the numbers of entities and triples.”

My Table 1, below, summarizes MRR scores for link prediction for selected, best-performing embedding models ca. mid-2018. Around that time, ComplEx was used by Facebook AI Research as the state of the art for comparison of various models in Canonical Tensor Decomposition for Knowledge Base Completion  [code], which framed knowledge base completion as a 3rd-order binary tensor completion problem.

Table 1. Link prediction results: comparison of various embedding models (filtered MRR metric1)
1. This list is current to approximately late August, 2018.
2. I did not exhaustively search the literature, only the papers mentioned in this subsection.
3. From those papers, I selected ("cherry picked") the data from some of the better-performing models discussed in this subsection, and from those again selected those models that are discussed herein. I also only collected/reported the MRR scores, as I wanted an uncluttered comparison of those models.
4. References cited may not be the primary source (refer to the references cited therein).
1Filtered MRR: filtered mean reciprocal rank [refer here] for a definition.
    WN18 is a subset of WordNet which consists of 18 relations and 40,943 entities ("generic facts"). Most of the 151,442 triples consist of hyponym and hypernym relations and, for such a reason, WN18 tends to follow a strictly hierarchical structure.

    WN18RR corrects flaws in WN18:
      "WN18RR reclaims WN18 as a dataset, which cannot easily be completed using a single rule, but requires modeling of the complete knowledge graph. WN18RR contains 93,003 triples with 40,943 entities and 11 relations. For future research, we recommend against using FB15k and WN18 and instead recommend FB15k-237, WN18RR, and YAGO3-10."

      "A popular relation prediction dataset for WordNet is the subset curated as WN18, containing 18 relations for about 41,000 synsets extracted from WordNet 3.0. It has been noted that this dataset suffers from considerable leakage: edges from reciprocal relations such as hypernym/hyponym appear in one direction in the training set and in the opposite direction in dev/test. This allows trivial rule-based baselines to achieve high performance. To alleviate this concern, Dettmers et al. (2018) released the WN18RR set, removing seven relations altogether. However, even this dataset retains four symmetric relation types: 'also see', 'derivationally related form', 'similar to', and 'verb group'. These symmetric relations can be exploited by defaulting to a simple rule-based predictor." [Source: Section 4.1 in Predicting Semantic Relations using Global Graph Properties; references therein.]
    FB15k is a subset of Freebase which contains about 14,951 entities with 1,345 different relations. A large fraction of content in this knowledge graph describes facts about movies, actors, awards, sports, and sport teams.

    FB15k-237  (see also), which corrects errors in FB15k, contains about 14,541 entities with 237 different relations. "This dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. The knowledge base triples are a subset of the FB15k set. The textual mentions are derived from 200 million sentences from the ClueWeb12 corpus coupled with FACC1 Freebase entity mention annotations."

    YAGO3-10 is a subset of YAGO3 which consists of entities which have a minimum of 10 relations each. It has 123,182 entities and 37 relations. Most of the triples deal with descriptive attributes of people, such as citizenship, gender, and profession.

    YAGO37 is extracted from the core facts of YAGO3, containing 37 relations and 123,189 entities. This dataset was created by the RUGE authors: "The FB15K data set consists of 1,345 relations and 14,951 entities among them. The training set contains 483,142 triples, the validation set 50,000 triples, and the test set 59,071 triples. 454 rules are created for FB15K. The YAGO37 data set consists of 37 relations and 123,189 entities among them. The training set contains 989,132 triples, the validation set 50,000 triples, and the test set 50,000 triples. 16 rules are created for YAGO37. All triples are unique and we made sure that all entities/relations appearing in the validation or test sets were occurring in the training set."
Model [ref] FB15k FB15k-237 WN18 WN18RR YAGO3-10 YAGO37
ANALOGY [1] 0.723 0.211 0.942 0.391 0.257
ANALOGY [7] 0.725 0.942
CapsE [8] 0.538 0.391
ComplEx [1] 0.716 0.206 0.942 0.390 0.266
ComplEx [10] 0.692 0.247 0.941 0.440 0.340
ComplEx [11] 0.69 0.240 0.941 0.444
ComplEx [5] 0.692 0.941
ComplEx [6] 0.639 0.941
ComplEx-NNE+AER [3] 0.803 0.943
ConvE [10] 0.657 0.325 0.943 0.430 0.440
ConvE [4] 0.745 0.942
ConvE [8] 0.316 0.460
ConvE [9] 0.301 0.342
ConvE [11] 0.745 0.301 0.942 0.342
DistMult [10] 0.654 0.241 0.822 0.430 0.340
DistMult [11] 0.35 0.241 0.83 0.425
DistMult [3] 0.644 0.365
DistMult [5] 0.654 0.822
HolE [6] 0.541 0.938
HolE [5] 0.524 0.938
HypER [2] 0.762 0.341 0.951 0.463
M3GM [12] 0.4983
ProjE [1] 0.588 0.249 0.820 0.367 0.470
R-GCN [10] 0.696 0.248 0.814
R-GCN+ [9] 0.696 0.249 0.819
RUGE [3] 0.768 0.431
TransE [1] 0.456 0.219 0.584 0.191 0.151
TransE [3] 0.400 0.303
TransE [5] 0.380 0.454
TransE [8] 0.294 0.226
TransE [9] 0.463 0.294 0.495 0.226
TransE [12] 0.4659
TransF [11] 0.564 0.286 0.856 0.505

From the table above we can see that there is considerable variation among the filtered MRR values for link prediction by the various models on the specified datasets. Per the footnotes in that table – although there are fewer comparisons at present – we should probably assign greater performance to models that score well on the more challenging datasets:

  • HypER (which reassuringly also scored the best among the models on WN18) on the WN18RR dataset;

  • CapsE on the FB15k-237 dataset;

  • RUGE on the YAGO37 dataset, which contains more relation types than YAGO3 or YAGO3-10.

In an interesting approach, Finding Streams in Knowledge Graphs to Support Fact Checking  [code] viewed a knowledge graph as a “flow network” and knowledge as a fluid, abstract commodity. They showed that computational fact checking of a (subject, predicate, object) triple then amounts to finding a “knowledge stream” that emanates from the subject node and flows toward the object node through paths connecting them. Evaluations revealed that this network-flow model was very effective in discerning true statements from false ones, outperforming existing algorithms on many test cases. Moreover, the model was expressive in its ability to automatically discover useful path patterns and relevant facts that may help human fact checkers corroborate or refute a claim.

Predictive Network Representation Learning for Link Prediction presented the Predictive Network Representation Learning (PNRL) model, to solve the structural link prediction problem. The proposed model defined two learning objectives: observed structure preservation, and hidden link prediction. [Network representation learning models learn the latent representations of nodes, which can embed the rich structural information into the latent space. Most of these models are learned in an unsupervised manner.] To integrate the two objectives in a unified model, they developed an effective sampling strategy to select certain edges in a given network as assumed hidden links and regard the rest network structure as observed when training the model. By jointly optimizing the two objectives, the model not only enhanced the predictive ability of node representations, but also learned additional link prediction knowledge in the representation space.

An interesting approach to KG construction involves the Biological Expression Language (BEL), which represents findings in the life sciences in a computable form. Biological statements in BEL are represented as (subject, predicate, object) triples, where the subject is a BEL term, the predicate is a biological relationship that connects the subject with the object (which can be a BEL term or a BEL statement). Hence, the knowledge captured by BEL statements can be represented as a graph. A BEL term represent the abundance of a biological entity, or a biological process such as a Gene Ontology entry or a disease. Each entity is described in an existing namespace (describing, for example, it’s source database, e.g. ChEBI), or a user-defined namespace. A fixed set of causal, correlative and other relationships link the entities: each statement can be associated with metadata that contextualizes it, for example qualifying it to be true only in specific tissues.

While BEL is an active research area with Tracks in the BioCreative VI Community Challengesp (see also) and preceding BioCreative Tracks, the approach appears to have gained little traction outside that community. For example, only a handful of BEL-related papers are present in PubMed that are not focused on the development of BEL, itself. For more information on BEL, see:

[Table of Contents]

Knowledge Discovery

KG are ideally-suited for knowledge discovery. Notable examples in the biomedical domain include the following papers.

In Exploiting Semantic Patterns Over Biomedical Knowledge Graphs for Predicting Treatment and Causative Relations  [pdf], the authors first built a large knowledge graph of biomedical relations obtained from the National Library of Medicine (NLM)’s Unified Medical Language System Metathesaurus (UMLS). They then took a different approach to predicting potential relations between arbitrary pairs of biomedical entities. They refrained from NLP approaches that looked at individual sentences to extract a potential relation, instead exploiting semantic path patterns over this graph to build models for specific predicates. Instead of looking at what a particular sentence conveys, they modeled their prediction problem at a global level, outputting probability estimates of whether a pair of entities participated in a particular relation. A different binary classification model was trained for each predicate. While the approach was demonstrated using the “TREATS” and “CAUSES” predicates drawn from the UMLS Metathesaurus SemMedDB (Semantic Medline Database), their method also generalized to other predicates (such as “DISRUPTS” and “PREVENTS”), and could also complement other lexical and syntactic pattern-based distant supervision approaches for relation extraction.

MOLIERE: Automatic Biomedical Hypothesis Generation System  [projectcode] is a system that can identify connections within biomedical literature. MOLIERE utilized a multi-modal/multi-relational network of biomedical objects (papers; keywords; genes; proteins; diseases; diagnoses) extracted from PubMed. MOLIERE finds the shortest path between two query keywords in the KG, and extends this path to identify a significant set of related abstracts (which, due to the network construction process, share common topics). Topic modeling, performed on these documents using PLDA+, returns a set of plain text topics representing concepts that likely connect the queried keywords, supporting hypothesis generation (for example, on historical findings MOLIERE showed the implicit link between Venlafaxine and HTR1A, and the involvement of DDX3 on Wnt signaling).

Beyond Word Embeddings: Learning Entity and Concept Representations from Large Scale Knowledge Bases  [code/data (the URL for the code is broken)] described work that jointly learned concept vectors from a textual knowledge base (Wikipedia) and a graphical knowledge base (Probase), demonstrating superior results (their Table 1) on analogical reasoning and concept categorization. The authors employed the skip-gram model to seamlessly learn from the knowledge in Wikipedia text and Probase concept graph, in an unsupervised approach to argument-type identification for neural semantic parsing.

In Learning a Health Knowledge Graph from Electronic Medical Records  [data], maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier, and a Bayesian network using noisy-OR gates. Logistic regression, widely used for binary classification, was chosen as an example of a well-established machine learning classifier with interpretable parameters which is frequently used for modeling binary outcomes. Naive Bayes was chosen as it provides a baseline of what can be inferred from simple pairwise co-occurrences. Noisy-OR was chosen as an example of a probabilistic model that jointly models diseases and symptoms; similar models have successfully been used in previous medical diagnosis applications. Parameters for all three models were learned using maximum likelihood estimation. A graph of disease-symptom relationships was elicited from the learned parameters, and the constructed knowledge graphs were evaluated and validated, with permission, against Google’s manually-constructed knowledge graph and against expert physician opinions. The noisy-OR model significantly outperformed the other tested models, producing a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation.

  • Careful reading of that paper and its Appendix, which describes the assumptions associated with each of those three models, clarifies the approach. Diseases (and their effects: symptoms) were separately modeled by logistic regression and naive Bayes; additionally, in naive Bayes, all symptoms were assumed to be conditionally independent from one another. However, the noisy-OR model jointly models all diseases and symptoms [wherein the central assumption of independence of effects (symptoms) is maintained]. Noisy-OR is a conditional probability distribution that describes the causal mechanisms by which parent nodes (e.g. diseases) affect the states of children nodes (e.g. symptoms). For example, given that patients tend to present with few diseases, the presence of one disease typically lowers the probability of others. In the logistic regression and naive Bayes models, however, since each disease is modeled separately (resulting in an assumption of independence between diseases), the presence of one disease does not (for example) rule out or lessen the probability of another disease (i.e. diagnosis).

  • Author David Sontag also evaluates noisy-or maximum likelihood estimation in Clinical Tagging with Joint Probabilistic Models. See also Sontag’s paper, Unsupervised Learning of Noisy-Or Bayesian Networks.

  • Code for Learning a Health Knowledge Graph from Electronic Medical Records is not available. However, the noisy-OR model was also employed in a 2009 paper, Biomedical Discovery Acceleration, with Applications to Craniofacial Development which stated (paraphrased):

    • “One of the most popular methods to combine individual reliabilities is to assume independence of experts (naive Bayes assumption) and compute the integrated likelihood P for each relationship using the Noisy-OR function … The Noisy-OR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support. This property is especially relevant in biology, where it is often difficult to identify false negatives; a given assertion is strengthened by additional information but unlike the case for estimating the reliability of an expert on the whole, an individual assertion is not penalized for lack of additional evidence. Moreover, since the experts are assumed to be independent, experts can be removed from or added to the analysis without excessive re-computation.”
  • Their approach is implemented in a tool, “Hanalyzer” [project page/code here and herevideoCytoscape plugin], which was used to reason over gene expression data and the biomedical data, creating hypotheses regarding the roles of genes never previously characterized as involved in craniofacial development.

Regarding natural language understanding and reasoning, a recent review/perspective paper (Relational Inductive Biases, Deep Learning, and Graph Networks  [media; critique] – jointly from DeepMind, Google Brain, MIT and the University of Edinburgh – described how graph networks provide a straightforward interface for manipulating structured knowledge and producing structured behaviors. The paper argued that their Graph Networks could effectively support two critical human-like capabilities: relational reasoning (i.e. drawing logical conclusions of how different objects and things relate to one another), and combinatorial generalization (i.e. constructing new inferences, predictions, and behaviors from known building blocks). Their Graph Networks framework defined a class of functions for relational reasoning over graph structured representations. They discussed how Graph Networks could support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning. While “natural language understanding” is not discussed, per se, it is likely that NLU will be integral to the relational reasoning paradigm that is described in that paper.

[Table of Contents]

Graph Structure

Here, I summarize work which leverages the representational and topological structure of knowledge graphs, that can be used to learn latent relationships or aid KG construction (e.g., through link prediction).

Network embedding methods can be divided into two categories (all discussed below – at times excerpted from Improved Semantic-Aware Network Embedding with Fine-Grained Word Alignment):

  • methods that solely rely on structure (e.g. vertex information), such as DeepWalk, LINE, node2vec, etc.; and,

  • methods that leverage both the structure the network and the semantic information associated with its vertices, such as TADW (Text-Associated DeepWalk), CENE (Content-Enhanced Network Embedding), Context-Aware Network Embedding (CANE), and Word-Alignment-based Network Embedding (WANE).

Similar to the Skip-Gram concept (Mikolov et al. (2013) originally introduced for learning word embeddings, DeepWalk generalized language modeling and unsupervised feature learning from sequences of words to graphs, using local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences (DeepWalk: Online Learning of Social Representations). In other words, DeepWalk learns latent representations of vertices in a network in a continuous vector space via truncated random walks. Likewise, LINE proposed a principled objective to explicitly capture first-order and second-order proximity information from the vertices of a network (LINE: Large-scale Information Network Embedding). Subsequently, node2vec introduced a biased random walk procedure to generate the neighborhood for a vertex, which inferred the node representations by maximizing the likelihood of preserving the local context information of vertices (node2vec: Scalable Feature Learning for Networks).

The DeepWalk approach is very popular; some recent applications of DeepWalk in the biomedical domain include the following works.

  • In the biomedical domain, graph structure was leveraged by Alshahrani et al. (Neuro-Symbolic Representation Learning on Biological Knowledge Graphs), who developed a method for feature learning on biological knowledge graphs. Their method combined symbolic methods (specifically, knowledge representation using symbolic logic and automated reasoning) with neural networks to generate vector representations (embeddings) of the nodes in those graphs, as well as the kinds of relations that existed to neighboring nodes. To learn those representations, they repeatedly performed random walks from each node in the KG, using the resulting walks as sentences within a corpus, and applied a word2vec skip-gram model (DeepWalk) to learn the embeddings for each node. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. These embeddings were applied to the prediction of edges in the KG, applicable to tasks such as the prediction of biological function, finding candidate genes of diseases, and identifying protein-protein interactions or drug-target relations.

  • An extension of Alshahrani’s work, Neuro-symbolic representation learning on biological knowledge graphs  (second preceding paragraph), was described by Agibetov and Samwald in Fast and Scalable Learning of Neuro-Symbolic Representations of Biomedical Knowledge. They showed how to train fast (less than one minute, vs. hours previously) log-linear neural embeddings (representations) of the entities, that were used as inputs for ML classifiers enabling important tasks such as biological link prediction. Classifiers were trained by concatenating learned entity embeddings to represent entity relations, and training classifiers on the concatenated embeddings to discern true relations from automatically generated negative examples. Their simple embedding methodology greatly improved on classification error compared to previously published state of the art results, yielding increased F-measure and ROC AUC scores for the most difficult biological link prediction problem. Their embedding approach was also much more economical in terms of embedding dimensions (d=50 vs. d=512), and naturally encoded the directionality of the asymmetric biological relations, that could be controlled by the order with which the embeddings were concatenated.

  • Notably, in the work described by Agibetov and Samwald Fast and Scalable Learning of Neuro-Symbolic Representations of Biomedical Knowledge in the preceding paragraph, there was no need for a sophisticated labeled DeepWalk, since all considered biological relations have clear non-overlapping domain and range separations (and therefore the whole knowledge graph can be treated as an untyped directed graph, where there is no ambiguity in the semantics of any relation). Instead of using DeepWalk, they trained faster and more economical log-linear neural embeddings using Jason Weston’s (Facebook AI Research) StarSpace: Embed All the Things!  [code]. StarSpace is a general-purpose neural embedding model that aims at learning entities, each of which is described by a set of discrete features (bag-of-features) coming from a fixed-length dictionary. StarSpace can solve a wide variety of problems including labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings. The classifiers used by Agibetov and Samwald Fast and Scalable Learning of Neuro-Symbolic Representations of Biomedical Knowledge (previous paragraph) were trained on concatenated embeddings of entities (nodes), which were obtained from the flattened graphs for each biomedical relations using StarSpace.

  • DeepWalk was recently applied to the prediction of microRNA-disease associations, by calculating similarities within a miRNA-disease association network (Predicting MicroRNA-Disease Associations using Network Topological Similarity Based on DeepWalk). This approach showed superior predictive performance for 22 complex diseases, with area under the ROC curve scores ranging from 0.805 to 0.937, using five-fold cross-validation. In addition, case studies on breast, lung and prostate cancer further justified the use the method for the discovery of latent miRNA-disease pairs. DeepWalk has also been used to predict novel drug-target associations (Deep Mining Heterogeneous Networks of Biomedical Linked Data to Predict Novel Drug-Target Associations).

struc2vec: Learning Node Representations from Structural Identity is yet another vector space-based model, which like DeepWalk and node2vec learned latent representations for the structural identity of nodes. While DeepWalk and node2vec failed to capture the notion of structural identity, struc2vec excelled on this task, even when the original network was subject to strong random noise (random edge removal). Fig. 3 in the struct2vec paper showed superior node representations for the toy Karate network, compared to DeepWalk and node2vec.

Another recent improvement over node2vec, graph2vec: Learning Distributed Representations of Graphs  [code] learned data-driven distributed representations of arbitrary sized graphs, rather than learning distributed representations of graph substructures. graph2vec’s embeddings were learned in an unsupervised manner and were task agnostic. Hence, those representations could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Experiments on several benchmark and large real-world datasets showed that graph2vec achieved significant improvements in classification and clustering accuracies over substructure representation learning approaches. For example, compared to node2vec, graph2vec resulted in increased accuracy on five chemoinformatic and bioinformatic domains (see results in Subsection 5.1 and Table 2 in the graph2vec paper; datasets used there are described/available herehere and here). Comparisons of graph2vec to struct2vec are not available at this time; however, the approach and performance of struct2vec over node2vec are clearly evident in Figs. 2, 3 and 7 in the struct2vec paper.

Syntree2Vec - An Algorithm to Augment Syntactic Hierarchy into Word Embeddings  [code] aimed to include syntactic knowledge – here, preserving the word order during sampling – into word embeddings via a graph based embedding algorithm inspired from node2vec. Using SyntaxNet ( SyntaxNet Models for the CoNLL 2017 Shared Task;  Google AI blog: Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source) for syntactic/dependency parsing, word embeddings were generated by optimizing a graph based objective function that maximized the probability of preserving a network neighborhood. A semi-guided second order random walk procedure then sampled network neighborhood of nodes. In an example of their graph creation (their Figs. 1, 2), the Authors parsed two similar sentences and concatenated the two graphs, that preserved the word order among the syntactic parse trees. Syntree2vec performed similarly to word2vec and node2vec (perplexity scores), albeit with training times that scaled with the data size (unlike word2vec; Fig. 3) for the data sizes evaluated (Table 1). [Data were drawn from a 2008 Wikipedia data dump.]

Deep Neural Networks for Learning Graph Representations provided an alternative deep learning approach to learning graph representations (especially word representations), via a random surfing model to directly capture graph structural information. Stacked denoising autoencoders were used to extract complex features and to model non-linearities, generating informative representations. The learned representations could be regarded as input features that could be fed into other tasks, such as unsupervised and supervised classification. Note Fig. 3 in their paper, which is reminiscent of Fig. 3 in in the struct2vec paper, and the comparison of each of those approaches in those figures to DeepWalk.

However, the various graph structure algorithms discussed above generally ignore rich heterogeneous information associated with vertices (Improved Semantic-Aware Network Embedding with Fine-Grained Word Alignment). Models that learn semantic-aware network embeddings include the following.

Despite showing improvement over structure-only models, those semantic-aware methods could not capture word-level alignment information – important for inferring the relationship between node pairs. For a wide range of applications, vertices in a network are typically accompanied by rich textual information such as user profiles, paper abstracts, etc. Improved Semantic-Aware Network Embedding with Fine-Grained Word Alignment focused on incorporating semantic features into network embeddings by matching important words between text sequences for all pairs of vertices. They introduced a word-by-word alignment framework, Word-Alignment-based Network Embedding (WANE), that measured the compatibility of embeddings between word pairs, and then adaptively accumulated those alignment features with an aggregation function. Evaluations on three real world benchmarks for downstream tasks including link prediction and multi-label vertex classification demonstrated that their model outperformed state of the art network embedding methods by a large margin. [” Our code will be released to encourage future research.” ]

[Table of Contents]

Graph Convolutional Neural Networks

Graph neural networks (GNN) have revolutionized the field of graph representation learning through effectively learned node embeddings, achieving state of the art results in tasks such as node classification and link prediction. However, current GNN methods are inherently flat and do not learn hierarchical representations of graphs – a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph. “DiffPool” is a differentiable graph pooling module that can generate hierarchical representations of graphs, and can be combined with various graph neural network architectures in an end-to-end fashion (Hierarchical Graph Representation Learning with Differentiable Pooling). DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer. Experimental results showed that combining existing GNN methods with DiffPool yielded an average improvement of 5-10% accuracy on graph classification benchmarks compared to existing pooling approaches, achieving a new state of the art on four out of five benchmark data sets.

An important early paper that applied a neural network to a graph structure is Kipf and Weller’s Semi-Supervised Classification with Graph Convolutional Networks, which introduced graph convolutional neural networks (GCN). Thomas Kipf provides an excellent introduction to GCN to GCN in his blog post “Graph Convolutional Networks”. When Work Matters: Transforming Classical Network Structures to Graph CNN also provides a recent review of GCN, as well as transformation of the basic structures of ResNet, Inception and DenseNet as GCN. GCN provide a scalable approach for semi-supervised learning on graph structured data based on an efficient variant of convolutional neural networks which operate directly on graphs. GCN have been used to explore graph structure, e.g. web-scale recommender systems (Graph Convolutional Neural Networks for Web-Scale Recommender Systems). Notably in that work:

  • Here we describe a large-scale deep recommendation engine that we developed and deployed at Pinterest. We develop a data-efficient Graph Convolutional Network (GCN) algorithm PinSage, which combines efficient random walks and graph convolutions to generate embeddings of nodes (i.e., items) that incorporate both graph structure as well as node feature information. Compared to prior GCN approaches, we develop a novel method based on highly efficient random walks to structure the convolutions and design a novel training strategy which relies on harder-and-harder training examples to improve robustness and convergence of the model. … The core idea behind GCNs is to learn how to iteratively aggregate feature information from local graph neighborhoods using neural networks (Fig. 1). Here a single ‘convolution’ operation transforms and aggregates feature information from a node’s one-hop graph neighborhood, and by stacking multiple such convolutions information can be propagated across far reaches of a graph. Unlike purely content-based deep models (e.g., recurrent neural networks), GCNs leverage both content information as well as graph structure. GCN-based methods have set a new standard on countless recommender system benchmarks.

Neural Graph Machines: Learning Neural Networks Using Graphs  [pdfnon-author codecritical review], from the University of Cambridge and Google Research, discusses and implements various neural graphical models. That work generalised previous literature on graph-augmented training of neural networks, enabling it to be applied to multiple neural architectures (feed-forward neural networks, CNN and LSTM) and a wide range of graphs. The new objective allowed the neural networks to harness both labeled and unlabeled data by allowing the network to train using labeled data as in the supervised setting, and biasing the network to learn similar hidden representations for neighboring nodes on a graph (in the same vein as label propagation). The proposed joint training approach convincingly outperformed many existing methods on a wide range of tasks (multi-label classification on social graphs, news categorization, document classification and semantic intent classification), with multiple forms of graph inputs (including graphs with and without node-level features) and using different types of neural networks.

Large-Scale Learnable Graph Convolutional Networks  [code] proposed the Learnable Graph Convolutional Layer (LGCL), which automatically selected a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into grid-like structures in 1D format, thereby enabling the use of regular convolutional operations on generic graphs. To enable model training on large-scale graphs, they proposed a subgraph training method to reduce the excessive memory and computational resource requirements suffered by prior methods on graph convolutions on large-scale graph data. Experiments on node classification tasks demonstrated consistently better performance on the Cora, Citeseer and Pubmed citation networks, and protein interaction network datasets. The proposed subgraph training strategy was also more efficient than prior approaches.

Uses of GCN include:

  • Modeling Relational Data with Graph Convolutional Networks  [code (Thomas Kipf); code], which introduced relational graph convolutional networks (R-GCN) and applied them to two standard knowledge base completion tasks: link prediction (the recovery of missing facts, i.e. subject-predicate-object triples), and entity classification (the recovery of missing entity attributes). R-GCN are related to a recent class of neural networks operating on graphs, and are developed specifically to deal with the highly multi-relational data characteristic of knowledge bases.

  • FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling  [code] extended the previous work, R-GCN (Modeling Relational Data with Graph Convolutional Networks), above – which was originally designed to be learned with the presence of both training and test data; moreover, for which the recursive neighborhood expansion across layers posed time and memory challenges for training with large, dense graphs. To relax the requirement of simultaneous availability of test data the newer model, FastGCN, interprets graph convolutions as integral transformations of embedding functions under probability measures. That interpretation allows the use of Monte Carlo approaches to consistently estimate the integrals, that in turn leads to a batched training scheme. FastGCN is efficient for training and generalizes well for inference. FastGCN was discussed in A Scalable Deep Learning Approach for Massive Graphs, which stated:

    • A challenge for GCN is that for a network with multiple layers, the neighborhood is quickly expanded, involving many vectors to be summed together, for even just a single node. Such a computation is prohibitively costly for graphs with a large number of nodes. where the neighborhood expands rapidly, making computation very costly. To resolve this challenge, the authors propose a novel approach (FastGCN), validated through mathematical proofs and experimental results, that suggests that it suffices to gather the information of only a handful of random entities in each neighborhood expansion. This substantial reduction in neighborhood size renders the same quality of prediction as state of the art deep neural networks but cuts training cost by orders of magnitude (e.g., 10x to 100x less computation and resource time), leading to appealing scalability.
  • Disease Prediction using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer’s Disease used GCN to represent populations as a sparse graph where its nodes were associated with imaging-based feature vectors, while phenotypic information was integrated as edge weights.

  • Towards Gene Expression Convolutions using Gene Interaction Graphs, a rather poorly-written paper from Yoshua Bengio’s group which nonetheless shows high-level interest in this domain, applied GCN to the analysis of RNA-Seq gene expression data from The Cancer Genome Atlas (TCGA). They explored the use of GCN with dropout, and gene embeddings (to utilize the graph information), with perhaps the main finding reported being that while the GCN approach provided an advantage for particular tasks in a low data regime, it was very dependent on the quality of the graph used. That conclusion echoes my long-standing emphasis on data quality, described in my “Data Quality” subsection elsewhere in this REVIEW.

In 2011 Geoff Hinton, building on his earlier thinking spanning decades (Understanding Hinton’s Capsule Networks. Part I: Intuition), introduced his thoughts on “capsules” in his paper Transforming Auto-Encoders. The use of capsules introduced a simple and elegant new building block that could be used in deep learning to better model hierarchical relationships inside of internal knowledge representation of a neural network. A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part (Dynamic Routing Between Capsules). That latter paper introduced capsule networks, which are also discussed in Understanding Capsule Network Architecture.

In Graph Capsule Convolutional Neural Networks  [code: “Our GCAPS-CNN code and data will be made available at Github.” – not available, 2018-08-14)], the authors build on both GCN and Hinton’s capsules to tackle some of the basic weaknesses in GCN, proposing a Graph Capsule Network (GCAPS-CNN) model designed to solve especially graph classification problem which current GCNN (graph CNN) models find challenging. Through extensive experiments, they showed that their proposed graph capsule network significantly outperformed existing state of the art deep learning methods and graph kernels on graph classification benchmark datasets – for example, the bioinformatics datasets summarized in their Table 1. Their graph capsule network model captured more local structure information than traditional GCNN, and could provide much richer representation of individual graph nodes (or for the whole graph), preserving more information in a local pool.

[Table of Contents]

Graph Attention Networks

Graph-structured data arise naturally in many different application domains. By representing data as graphs we can capture entities (nodes) and their relationships (edges). Useful insights can be derived from graph-structured data, demonstrated by an ever-growing body of work focused on graph mining. However, in the real-world, graphs can be both large (with many complex patterns) and noisy, which poses problems for effective graph mining. An effective way to deal with this issue is to incorporate “attention” into graph mining solutions. An attention mechanism allows a method to focus on task-relevant parts of the graph, helping it to make better decisions. Attention Models in Graphs: A Survey provides a recent focused survey of the literature on the emerging field of graph attention models.

Graph attention networks  [code;  non-author code here and here] are neural network architectures that operate on graph structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions (GCN, above) or their approximations. The GAT models described in this paper (by Yoshua Bengio and colleagues) achieved or matched state of the art results across four established transductive and inductive graph benchmarks: the Cora, CiteSeerX and Pubmed citation network datasets, as well as a protein-protein interaction dataset (in which test graphs remained unseen during training). Note Figure 2 in that paper, showing a t-SNE plot of the computed feature representations of a pretrained GAT model’s first hidden layer on the Cora dataset. GAT appears to offer better results than GCN (above) and GraphSAGE (below).


Discussed in Graph Attention Networks, graph attentional layers in GAT directly address several issues affecting prior approaches to modeling graph structured data with neural networks:

  • Computational efficiency: self-attentional layer operations can be parallelized across all edges, and the computation of output features can be parallelized across all nodes. The time complexity of a single GAT attention head computing F' features may be expressed as O(|V|FF' + |E|F'), where F is the number of input features, and |V| and |E| are the numbers of nodes and edges in the graph, respectively. This complexity is on par with the baseline methods such as graph convolutional networks (GCN).

  • As opposed to GCN, GAT stack graph layers in which nodes are able to attend over their neighborhood features, which enables (implicitly) specifying different weights to different nodes in a neighborhood, enabling a leap in model capacity without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. Furthermore, analyzing the learned attentional weights may lead to benefits in interpretability, as was the case in the machine translation domain.

  • The attention mechanism is applied in a shared manner to all edges in the graph, and therefore it does not depend on upfront access to the global graph structure or (features of) all of its nodes (a limitation of many prior techniques). This has several desirable implications: the graph is not required to be undirected, and it makes the GAT technique directly applicable to inductive learning – including tasks where the model is evaluated on graphs that are completely unseen during training.

  • Unlike previous methods which sampled a fixed-size neighborhood of each node to keep its computational footprint consistent (therefore not allowing access to the entirety of the neighborhood while performing inference), GAT works with the entirety of the neighborhood (at the expense of a variable computational footprint, which is still on-par with methods like the GCN) and does not assume any ordering within it.

  • The authors were also able to produce a version of the GAT layer that leverages sparse matrix operations, reducing the storage complexity to linear in the number of nodes and edges and enabling the execution of GAT models on larger graph datasets.

Deeply Learning Molecular Structure-Property Relationships using Graph Attention Neural Network applied GAT to the study of molecular structure-property relationships, which are key to molecular engineering for materials and drug discovery. The authors showed that GAT could greatly improve performance of the deep learning for chemistry, distinguishing atoms in different environments and extracting important structural features determining target properties (such as molecular polarity, solubility, and energy). Interestingly, it identified two distinct parts of molecules; as a result, it could accurately predict molecular properties. Moreover, the resultant latent space was well-organized such that molecules with similar properties were closely located, which is critical for successful molecular engineering.

Aside: returning to the graph convolutional networks domain, Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts [discussion] explored the performance of deep learning methods in the context of drug discovery, comparing machine learned features against the domain expert engineered features that are mainstream in the pharmaceutical industry.

In an innovative approach, Attention-based Graph Neural Network for Semi-supervised Learning  [non-author code] proposed a novel graph neural network that removed all the intermediate fully-connected layers, and replaced the propagation layers with attention mechanisms that respected the structure of the graph. The attention mechanism allowed them to learn a dynamic and adaptive local summary of the neighborhood to achieve more accurate predictions. Using this approach, the authors attained state of the art results on classification of a PubMed citation network (shown in Table 2 in that paper). Notably, as shown in Fig. 2, AGNN correctly classified three target nodes in the test set that were misclassified by a graph convolutional network (GCN), and shows how the local attention network weighted the contribution of its local neighborhood.

[Table of Contents]

Question Answering Over Knowledge Graphs

As with textual sources, knowledge graphs (which provide well structured relational information between entities, and allow inference over indirect facts) may be utilized for questioning answering. Recent neural network-based approaches to knowledge graph based QA include:

  • attention-based, sequence-to-sequence models; e.g., A Joint Model for Question Answering and Question Generation, which used a sequence-to-sequence framework that encoded the document and generated a question given an answer, and an answer given a question;

  • Simple Question Answering by Attentive Convolutional Neural Network, which focused on answering single-relation factoid questions over a knowledge base (graph: Freebase). Each question could acquire the answer from a single fact as subject-predicate-object triples via a two-step pipeline: entity linking and fact selection. In fact selection, the subject entity in a fact candidate was matched with the entity mention in the question by a character-level convolutional neural network (char-CNN), while the predicate in that fact with was matched with the question by a word-level CNN (word-CNN);

  • Neural Network-Based Question Answering Over Knowledge Graphs on Word and Character Level, which trained a neural network for answering simple questions in an end-to-end manner, leaving all decisions to the model. It learned to rank subject-predicate pairs to enable the retrieval of relevant facts, given a question. The network contains a nested word/character-level question encoder that allows it to handle out-of-vocabulary and rare word problems, while still being able to exploit word-level semantics;

  • Variational Reasoning for Question Answering with Knowledge Graph, which provided a unified deep learning architecture and an end-to-end variational learning algorithm that can handle noise in questions, and learn multi-hop reasoning simultaneously;

  • EARL: Joint Entity and Relation Linking for Question Answering Over Knowledge Graphs, which performed entity linking and relation linking as a joint task;

  • Andrew McCallum et al. (Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning  [code] proposed the MINERVA algorithm, that addressed the difficult and practical task of answering questions where only one entity of the relation was known (versus finding paths connecting a given pair of entities). Since random walks are impractical in a setting with combinatorially many destinations from a start node, they presented a neural reinforcement learning approach that learned how to navigate the graph conditioned on the input query to find predictive paths. Empirically, this approach significantly outperformed prior methods on several datasets. The model was robust, and could learn long chains-of-reasoning (~93% accuracy on 8-10 path length: note their Fig. 2); moreover it did not require pretraining or initial supervision.

  • An implementation of a tumor-based knowledge graph is presented in “Semantically Linking In silico Cancer Models](, which introduced a graph-based model for semantically linking computational cancer models via domain graphs to better understand and explore combinations of models spanning multiple biological scales.

    • While I find their implementation to be awkwardly conceptualized and described, it nonetheless provides examples of Cypher-based question answer queries that are relevant to this REVIEW (note that I implemented my KG in Neo4j, which uses the Cypher graph query language).

An interesting graph-based approach to question answering is the use of subgraphs. Work from the Allen Institute for AI, Answering Complex Questions Using Open Information Extraction, used a recently proposed support graph optimization framework for QA to develop a new inference model for Open IE. Their model significantly outperformed a state of the art structured solver on complex questions of varying difficulty, while also removing the reliance on manually curated knowledge. Their system, TupleINF – inspired by their earlier “TableILP” work Question Answering via Integer Programming over Semi-Structured Knowledge  [code] – used Integer Linear Programming (ILP) to solve QA over complex structures that require multi-fact reasoning. [Note that integer linear programming differs from the also-abbreviated inductive logic programming, which is discussed elsewhere in this REVIEW regarding Induction of Non-Monotonic Logic Programs to Explain Boosted Tree Models Using LIME.] An integer linear program has the same form as a linear program; the only difference is that the feasible solutions are restricted to be integer vectors, i.e. x ∈ ℤⁿ instead of x ∈ ℝⁿ. Similar to TableILP, they viewed the QA task as searching for a graph (support graph search) that best connects the terms in the question (qterms) with an answer choice via the knowledge (see Fig. 1 in their paper for a simple illustrative example). The qterms, answer choices, and tuples fields form the set of possible vertices, V, of the support graph. Edges connecting qterms to tuple fields and tuple fields to answer choices form the set of possible edges, E. The support graph, G(V,E), is a subgraph of G(V,E) where V and E denote “active” nodes and edges, respectively. Similar to TableILP, they scored the support graph based on the weight of the active nodes and edges.

The approach of using subgraphs and Integer Linear Programming (ILP) in QA was also described on slides 15-25 in Reasoning-driven Question Answering by University of Pennsylvania Ph.D. candidate Daniel Khashabi, at a Stanford NLP seminar (June 2018). The work summarized in that presentation is described in the UPenn/Allen Institute for Artificial Intelligence paper Question Answering as Global Reasoning over Semantic Abstractions  [code], which introduced their SemanticILP system. They claimed SemanticILP to be the first system that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pretrained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, they translated QA into a search for an optimal subgraph that satisfied certain global and local properties. SemanticILP outperformed baseline models BiDAF, an information retrieval solver, and TupleINF on QA tasks on 4th and 8th grade science exams and a biology exam (Tables 4 and 5 in their paper). Thus on two domains with very different characteristics, the graph-based SemanticILP outperformed traditional and neural network models.

Most research in reading comprehension has focused on answering questions based on individual documents or even single paragraphs. Question Answering by Reasoning Across Documents with Graph Convolutional Networks introduced a method which integrates and reasons relying on information spread within documents and across multiple documents. They framed this task as an inference problem on a graph, with mentions of entities as nodes and edges encoding relations between different mentions. Graph convolutional networks (GCN) were applied to these graphs and trained to perform multi-step reasoning. Their Entity-GCN method achieved state of the art results on the WikiHop dataset.

“The methods reported in Welbl et al. (2017) approach the task by merely concatenating all documents into a single long text and training a standard reading comprehension model, namely, BiDAF and FastQA. Instead, we frame question answering as an inference problem on a graph representing the document collection. Nodes in this graph correspond to named entities in the document whereas edges encode simple relations between them (e.g., cross-document and within-document coreferences). We assume that the chains of reasoning can be captured by propagating information, first extracted by a deep contextualized word representation such as ELMo, along edges in this graph using a graph convolutional network (GCN). Our Entity-GCN model substantially outperforms previous systems. Our ablation studies confirm that we benefit both from using contextualized word embeddings and from modeling relations between entities.

[Table of Contents]

Graph Representation Learning

Representation learning, variously mentioned above, is a class of machine learning techniques that autonomously learn to derive latent features from raw data. Representation learning of knowledge graphs (graph representation learning) aims to project both entities and relations into a continuous low-dimensional space (Representation Learning of Knowledge Graphs with Entity Descriptions).  Exploring the Semantic Content of Unsupervised Graph Embeddings: an Empirical Study reviews graph embedding techniques, demonstrating that topological features are approximated by the embedding space, allowing key insight into how graph embeddings create good representations. [“Big-O” complexity is also discussed (e.g., DeepWalk is θ(|V|)].

Representation Learning on Graphs: Methods and Applications  (presented by coauthor Jure Leskovec as a keynote speech at the 4th IEEE BigData Conference (Boston, MA: December 2017; slides) provides a superb overview of machine learning on graphs and graph representation learning. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. GraphSAGE, discussed in the following paragraph, is discussed on pp. 12-13.

Inductive Representation Learning on Large Graphs  (also by Jure Leskovec, Stanford University)  [projectcode] introduced “GraphSAGE,” a general, inductive framework which leveraged node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, they learned a function that generated embeddings by sampling and aggregating features from a node’s local neighborhood. Relevant to the biomedical domain, the authors classified the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and showed that their algorithm generalized to completely unseen graphs using a multi-graph dataset of protein-protein interactions.

  • Follow-on work by MIT (Representation Learning on Graphs with Jumping Knowledge Networks) explored a “jumping knowledge networks” architecture that for each node flexibly leveraged different neighborhood ranges to enable better structure-aware representation. In a number of experiments on social, bioinformatics and citation networks they demonstrated state of the art performance. Furthermore, combining the jumping knowledge framework with models like Graph Convolutional Networks (GCN), GraphSAGE, and Graph Attention Networks (GAT) consistently improved the performance of those models.

Graph representation learning methods can be classified into two categories: generative models that learn the underlying connectivity distribution in the graph, and discriminative models that predict the probability of edge existence between a pair of vertices. “GraphGAN” unifies these two classes of methods in an adversarial approach, in which the generative model and discriminative model play a game-theoretical minimax game for node classification and link prediction and recommendation (GraphGAN: Graph Representation Learning with Generative Adversarial Nets; see also KBGAN: Adversarial Learning For Knowledge Graph Embeddings). Specifically, for a given vertex (node), the generative model tries to fit its underlying true connectivity distribution over all other vertices and produces “fake” samples to fool the discriminative model, while the discriminative model tries to detect whether the sampled vertex is from ground truth or generated by the generative model. With the competition between these two models, both of them can alternately and iteratively boost their performance. It has been demonstrated (GraphGAN: Graph Representation Learning with Generative Adversarial Nets) that graph representation learning with generative adversarial nets achieves substantial gains over state of the art baselines in a variety of applications including link prediction, node classification, and recommendation.

Disease Related Knowledge Summarization Based on Deep Graph Search  [code] (note Figs. 2, 3) presented a graph-based attention model (GRAM), which addressed two challenges of deep learning applied to healthcare-related knowledge graphs: data insufficiency (insufficient sample size), and interpretation (representations learned by deep learning methods should align with medical knowledge). GRAM supplemented electronic health records with the hierarchical information inherent in medical ontologies (represented as a directed acyclic graph). The ontological KG provided a multi-resolution view of medical concepts upon which GRAM analyzed health records via recurrent neural networks to learn accurate and interpretable representations of medical concepts. GRAM chooses a weighted average of ancestors of a medical concept, and trains the entire process with a predictive model in an end-to-end fashion.

A new approach for learning graph embeddings that relies on structural measures of node similarities for generation of training data was presented in Learning Graph Embeddings from WordNet-based Similarity Measures  [codedatasets]. Their path2vec model learned node embeddings that were able to approximate a given measure, such as the shortest path distance (node similarities, based on network structure) or any other. In contrast to existing approaches, path2vec does not rely on random walks or autoencoder architectures (link prediction) and provides a way to supply additional information for the target measure, not only the graph structure itself. Thus, in applications one could replace path-based measures with these node embeddings, gaining significant speedup in yielding distances and nearest neighbors. path2vec can be trained on arbitrary graph measures and is not restricted to the shortest path or to only tree-structured graphs. Evaluations of the proposed model on semantic similarity and word sense disambiguation tasks (using WordNet as the source of gold similarities) showed that path2vec yielded state of the art results. Notably, the model was computationally efficient, orders of magnitude faster than the direct computation of graph distances: in the WordNet semantic similarity task the learned embeddings trained 3 or 4 orders of magnitude faster than other embedding approaches, depending on the similarity measures used (see Table 1 in their paper).

  • Ruslan Salakhutdinov and colleagues proposed a probabilistic graphical model that leveraged a topic model to design a WSD system that scaled linearly with the number of words in the context (Knowledge-based Word Sense Disambiguation using Topic Models). Their logistic normal topic model – a variant of Latent Dirichlet Allocation in which the topic proportions for a document were replaced by WordNet synsets (sets of synonyms) – incorporated semantic information about synsets as its priors. They used a non-uniform prior for the synset distribution over words to model the frequency of words within a synset, and they modeled the relationships between synsets by using a logistic-normal prior for drawing the synset proportions of the document. This made their model similar to the correlated topic model, with the difference that their priors were fixed, not learned. In particular, the values of those priors were determined using the knowledge from WordNet. They illustrated the benefit of using the whole document as the context for disambiguation. The proposed model, WSD-TM, outperformed state of the art knowledge-based WSD systems.

    • WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. A synset is a set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

    • Excerpted from the paper:

      Consider an excerpt from the SensEval 2 dataset shown in Figure 3:

    “Medical scientists are starting to uncover a handful of genes which if damaged unleash the chaotic growth of cells that characterizes cancer. Scientists say the discovery of these genes in recent months is painting a new and startling picture of how cancer develops. An emerging understanding of the genes is expected to produce an array of new strategies for future cancer treatment and prevention. That is for the future. Already scientists are developing tests based on the newly identified genes that for the first time can predict whether an otherwise healthy individual is likely to get cancer. “It’s a super-exciting set of discoveries,” says Bert Vogelstein, a Johns Hopkins University researcher.”

    Highlighted [bolded] words clearly indicate that the domain of the document is biology. While most of these words are monosemous, let’s consider disambiguating the word “cell”, which is highly polysemous, having 7 possible [WordNet] senses as shown in Figure 5.

    WordNet Search: “cell”:

        S: (n) cell (any small compartment) “the cells of a honeycomb”
        S: (n) cell ((biology) the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in higher plants and animals)
        S: (n) cell, electric cell (a device that delivers an electric current as the result of a chemical reaction)
        S: (n) cell, cadre (a small unit serving as part of or as the nucleus of a larger political movement)
        S: (n) cellular telephone, cellular phone, cellphone, cell, mobile phone (a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver)
        S: (n) cell, cubicle (small room in which a monk or nun lives)
        S: (n) cell, jail cell, prison cell (a room where a prisoner is kept)

    As shown in Table 3, the correct sense of “cell” (“cell#2”) has the highest similarity with senses of three monosemous words “scientist”, “researcher” and “protein”. The word “cell” occurs 21 times in the document, and several times the other words in the sentence are not adequate to disambiguate it. Since our method uses the whole document as the context, words such as “scientists”, “researchers” and “protein” help in disambiguating “cell”, which is not possible otherwise.

  • Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs, discussed elsewhere in this REVIEW, also employed WordNet synsets for word sense disambiguation (e.g., their Table 4).

Deep Graphs, which applied recurrent neural networks (RNN) to graphs and networks, relies on the notion that many graph algorithms can be expressed as iterative vertex updates, where the features of each vertex (or edge) are computed as a function of the neighboring vertex and edge features. Therefore, training and deployment are both efficient, since the cost is O(\|E\| + \|V\|), where E and V are the sets of edges and vertices respectively. The vertex features can be used to perform multiple tasks simultaneously, e.g. in a multitask learning setting, and the proposed algorithm is able to learn the functions that generate these features and the functions that perform these tasks, jointly. Furthermore, it is able to learn representations for graph vertices and edges that can, subsequently, be used by other algorithms to perform new tasks, such as in transfer learning. Finally, apart from the graph structure, this algorithm is also able to use attributes of the vertices and edges of the graph, rendering it even more powerful in cases where such attributes are available. The authors claimed that Deep Graphs was able to outperform all competing methods for vertex classification tasks by a significant amount (they achieved an AUC of 0.98 while the best competing method achieved 0.67). Furthermore, Deep Graphs could learn to compute PageRank and HITS scores by only applying a vertex update equation 6 times, as opposed to hundreds or thousands of iterations which the original algorithms require. Fig. 1 in that paper shows the node, edge and embedded features that may be be computed/inferred.

The use of constraint graphs for finding optimal Bayesian networks (graphs) is an intuitive method for incorporating rich prior knowledge into the structure learning tasks (Finding the Optimal Bayesian Network Given a Constraint Graph  [code]. The Jupyter notebook provided by these authors provides toy examples, including a three-layer network with:

  • health symptoms (low energy; bloating; loss of appetite; vomiting; abdominal cramps) on the bottom layer;
  • diseases (ovarian cancer; lactose intolerance; pregnancy) on the middle layer; and,
  • genetic tests on the top layer

… for three different genetic mutations (BRCA1, BRCA2, and LCT). The edges in this graph were constrained such that symptoms were explained by diseases, and diseases could be partially explained by genetic mutations. There were no edges from diseases to genetic conditions, and no edges from genetic conditions to symptoms. The authors defined a “constraint graph” which comprised three nodes; “symptoms”, “diseases”, and “genetic mutations”. There was a directed edge from genetic mutations to diseases, and a directed edge from diseases to symptoms. This specified that genetic mutations could be parents to diseases, and diseases to symptoms. Priors were simply programmed Python expressions from the constraint graph, and a resultant learned Bayesian network was returned. Relevant to this REVIEW, this approach could enable very interesting explorations of the KG, for example the discovery of latent/learned embeddings.

While RNN are a cornerstone in learning latent representations from long text sequences, a purely convolutional and deconvolutional autoencoding framework may be employed, as described in Deconvolutional Paragraph Representation Learning  [code here and here]. That paper addressed the issue that the quality of sentences during RNN-based decoding (reconstruction) decreased with the length of the text. Compared to RNN, their framework was better at reconstructing and correcting long paragraphs (note Table 1 in their paper, showing paragraphs reconstructed by LSTM and CNN, as well as the vastly superior BLEU/ROUGE scores in Table 2). There is also additional NLP-related LSTM vs. CNN discussion in this Hacker News thread.

[Table of Contents]

Graph Generation Through Generative Adversarial Networks

Related to graph representation learning (above) is the building of graphs through Generative Adversarial Networks (GAN) or sequence models (e.g., RNN) [discussion]. Generative models for real-world graphs have important applications in many domains, including modeling physical and social interactions, discovering new chemical and molecular structures, and constructing knowledge graphs.

DeepMind’s Learning Deep Generative Models of Graphs introduced the “DeepGMG” approach for learning generative models over graphs, which could capture both their structure and attributes. Their approach provided a general approach for learning generative models over arbitrary graphs (capable of generating arbitrary graphs by sequentially adding nodes and edges), and opened new directions for moving away from restrictions of vector- and sequence-like knowledge representations, toward more expressive and flexible relational data structures. There are two major distinctions between the proposed graph model and typical sequence models like LSTMs: model architecture (this model is graph-structured with structured memory, while the LSTM stores information in flat vectors), and graph generating grammar (this model uses the generic graph generating decision sequence for generation, while the LSTM uses a linearization of the graphs, which typically encodes knowledge specific to a domain). For example, for molecules SMILES strings provide a way to linearize a molecules chemical structure (“graph”) into a string, addressing practical issues like the ordering of nodes. [The Simplified Molecular-Input Line-Entry System (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.]

NetGAN: Generating Graphs via Random Walks  [code] posed the problem of graph generation as learning the distribution of biased random walks over the input graph. The proposed model was based on a stochastic neural network that generated discrete output samples and was trained using the Wasserstein GAN objective. NetGAN was able to produce graphs that exhibited well-known network patterns without explicitly specifying them in the model definition. At the same time, NetGAN exhibited strong generalization properties, highlighted by its competitive link prediction performance despite not being trained specifically for this task.

Modeling and generating graphs is fundamental for studying networks in biology and other sciences; however, modeling complex distributions over graphs and then efficiently sampling from these distributions is challenging due to the non-unique, high-dimensional nature of graphs and the complex, non-local dependencies that exist between edges in a given graph. GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models  [slidescode] described a deep autoregressive model that addressed those challenges and approximated any distribution of graphs with minimal assumptions about their structure. GraphRNN learned to generate graphs by training on a representative set of graphs and decomposes the graph generation process into an iterative sequence of node and edge formations, conditioned on the preceding graph structure.

NetGAN focuses on learning an implicit generative model from a single large graph (e.g. a large social network), whereas GraphRNN learns from many small graphs, such as molecules. Since NetGAN has a single training graph, it formulates the problem as learning a distribution of random walks over the graph, which also solves scaling challenges. [Source: NetGAN author Aleksandar Bojchevski on reddit.]

[Table of Contents]

Subgraphs: Higher-Order Organization of Complex Networks

Analogous to clustering (dimensionality reduction) approaches applied to textual sources, content within knowledge graphs may also be classified and clustered. Many networks exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. In Higher-Order Organization of Complex Networks  [pdfprojectslidesmedia] Jure Leskovec and colleagues at Stanford University developed a generalized framework for clustering networks on the basis of higher-order connectivity patterns, at the level of small network subgraphs. [See also other work by Leskovec, described in the “Graph Representation Learning” subsection of this review.] That project focused on finding higher-order organization of complex networks at the level of small network subgraphs (motifs), by clustering the nodes of a graph based on motifs instead of edges via a motif-based spectral clustering method. The approach is well explained and illustrated on their project page, for example using a “bifan” motif to find clusters in networks such as C. elegans (explained in more detail in Fig. 2 in the paper), or coherent feedforward loop motifs in the S. cerevisiae transcriptional regulation network.

Likewise, portions of graphs may also be classified, as described in MotifNet: A Motif-Based Graph Convolutional Network for Directed Graphs, a graph CNN approach. Broadly speaking, there are two classes of graph CNN formulations: spatial approaches (which generalize subgraph regions by constructing a local system of weights on the graph), and spectral approaches (which use the analogy between the eigenfunctions of the graph Laplacian and the classical Fourier transform, and define a convolution-like operation in the spectral domain). One of the key drawbacks of spectral CNN is the explicit assumption of an undirected graph, leading to a symmetric Laplacian matrix with orthogonal eigendecomposition. This paper introduced MotifNet, a graph CNN for directed graphs that used convolution-like anisotropic graph filters bases on local subgraph structures (motifs). An attention mechanism allowed MotifNet to generalize some standard graph CNN models without significantly increasing the model complexity. The approach was tested on the directed CORA citation network (Automating the Construction of Internet Portals with Machine LearningDeep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking), achieving superior classification accuracy compared to ChebNet (Semi-Supervised Classification with Graph Convolutional Networksdiscussion).

  • Note that the 2018 “MotifNet” paper does not cite earlier work (2017; different authors), Motif-based Convolutional Neural Network on Graphs  [code] – perhaps due to this statement in the 2017 paper: “We exclude spectral methods as they have been shown inferior to GCN.“  Motif-based Convolutional Neural Network on Graphs (2017) introduced a generalization of CNNs to graphs with irregular linkage structures, especially heterogeneous graphs with typed nodes and schemas. They proposed a novel spatial convolution operation to model the key properties of local connectivity and translation invariance, using high-order connection patterns or motifs (Motif-CNN). Like the subsequent MotifNet paper, Motif-CNN employed an attention model to combine the features extracted from multiple patterns, thus effectively capturing high-order structural and feature information. Experiments on semi-supervised node classification on real-world social networks and multiple representative heterogeneous graph datasets indicated significant gains (6-21%) over existing graph CNNs and other state of the art techniques.

[Table of Contents]

Text Grounding: Mapping Text to Knowledge Graphs

In NLP, the ability to code text into an entity of a knowledge graph finds applications in tasks such as question answering and information retrieval, or any task that involves some form of mapping a definition to a term (Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs and references therein). Further, matching text to graphs can be invaluable in providing solutions to domain-specific challenges, for example medical concept normalisation (disambiguation) and identification of adverse drug reactions.

Augmenting KG with external resources such as textual knowledge stores and ontologies is another obvious approach for providing explanatory material to KG. This is particularly important when machine learning approaches are employed in the biomedical and clinical sciences domains, where high precision is imperative for supporting, not distracting practitioners (What Do We Need to Build Explainable AI Systems for the Medical Domain?”), and it is crucial to underpin machine output with reasons that are human-verifiable. [Hype surrounding IBM Watson and its failures in medical use offers an exemplary cautionary tale: relevant commentary herehereherehereherehere (local copy) and here.] Paraphrased from What Do We Need to Build Explainable AI Systems for the Medical Domain?”:

  • “The only way forward seems to be the integration of both knowledge-based and neural approaches to combine the interpretability of the former with the high efficiency of the latter. To this end, there have been attempts to retrofit neural embeddings with information from knowledge bases as well as to project embedding dimensions onto interpretable low-dimensional sub-spaces. More promising, in our opinion, is the use of hybrid distributional models that combine sparse graph-based representations with dense vector representations and link them to lexical resources and knowledge bases. Here a hybrid human-in-the-loop approach can be beneficial, where not only the machine learning models for knowledge extraction are supported and improved over time, the final entity graph becomes larger, cleaner, more precise and thus more usable for domain experts. Contrary to classical automatic machine learning, human-in-the-loop approaches do not operate on predefined training or test sets, but assume that human expert input regarding system improvement is supplied iteratively.”

Regarding the “hybrid human-in-the-loop approach” mentioned in What Do We Need to Build Explainable AI Systems for the Medical Domain?”, the Never-Ending Language Learner [NELLproject] occasionally interacts with human trainers, mostly for negative feedback identifying NELL’s incorrect beliefs (see Section 7 and Fig. 7 in that paper), though this human-machine interaction has been decreasing as NELL gets more accurate.

Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs  [code to be released at [SIPHS](] addresses the problem of mapping natural language text to knowledge base entities. This paper detailed a model for efficiently mapping unrestricted text at the level of phrases and sentences to the entities of a knowledge base (KB) - a task also referred to as text grounding or normalisation. The model aimed at characterising short focused texts, such as definitions or tweets; for example, given a biomedical KB a tweet of the form “Can’t sleep, too tired to think straight” would be mapped to the entity “Insomnia.” The process of text-to-entity mapping was treated as a transformation from a textual vector space, to a KB vector space created from a graph and populated by vectors representing entities. The mapping process was approached as a composition of a phrase or a sentence into a point in a multi-dimensional entity space obtained from a knowledge graph. The compositional model was an LSTM equipped with a dynamic disambiguation mechanism on the input word embeddings (a Multi-Sense LSTM or MS-LSTM), addressing polysemy issues. Further, the knowledge base space was prepared by collecting random walks from a graph enhanced with textual features, which acted as a set of semantic bridges between text and knowledge base entities. The ideas of this work were demonstrated on large-scale text-to-entity mapping and entity classification tasks, with state of the art results.

[Table of Contents]


Although multitask learning and transfer learning have similarities, they are not the same. Transfer learning only aims at achieving high performance in the target task by transferring knowledge from the source task, while multitask learning tries to learn the target and the source tasks simultaneously. For an excellent overview of transfer learning see Sebastian Ruder’s Transfer LearningMore Effective Transfer Learning for NLP also provides a concise introduction to language models and transfer learning. For a review of deep transfer learning, see A Survey on Deep Transfer Learning.

Transfer Learning (a subfield of which is domain adaptation) is the reuse of a pretrained model on a new problem. In transfer learning, the knowledge of an already trained machine learning model is applied to a different but related problem. For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects like sunglasses. With transfer learning, we basically try to exploit what has been learned in one task to improve generalization in another. We transfer the weights that a network has learned at Task A to a new Task B. The general idea is to use knowledge, that a model has learned from a task where a lot of labeled training data is available, in a new task where we don’t have a lot of data. Instead of starting the learning process from scratch, you start from patterns that have been learned from solving a related task.

Whereas in transfer learning we mainly care about doing well on our target task or domain, in multitask learning the objective is to do well on all available tasks (An Overview of Multi-Task Learning in Deep Neural Networks). Alternatively, we can also use the knowledge acquired by learning from related tasks to do well on a target. Crucially, in contrast to transfer learning, some labeled data is usually assumed for each task. In addition, models are trained jointly on source and target task data, which is not the case for all transfer learning scenarios. However, even if target data is not available during training, insights about tasks that are beneficial for multitask learning can still inform transfer learning decisions.

A 2016 paper by Socher and colleagues (SalesForce), A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks  [OpenReviewcodenon-author code; online demo (not working, 2018-08-16)], described a multitask NLP model. In that earlier work they introduced a joint many-task model together with a strategy for successively growing its depth (stacked Bi-LSTM layers: POS tagging in the first Bi-LSTM layer; word chunking in layer 2; dependency parsing in layer 3; semantic relatedness and textual entailment in layers 4 and 5, respectively) to solve increasingly complex tasks. Higher layers included shortcut connections [i.e., “residual layers” (Deep Residual Learning for Image Recognition)] to lower-level task predictions to reflect linguistic hierarchies. They used a simple regularization term to allow for optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference of the other tasks. Their single end-to-end model obtained state of the art or competitive results [late 2016] on five different tasks from tagging, chunking, parsing, relatedness, and entailment tasks. The inference procedure of their model began at the lowest level and worked upward to higher layers and more complex tasks; their model handled the five different tasks in the order of POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment. Their hypothesis was that it was important to start from the lower-level and gradually move to higher-level tasks. For example, introducing multi-task syntactic supervision (e.g., POS: part-of-speech tags) at the lower levels of a deep LSTM can improve overall performance of higher level tasks such as dependency parsing.

Table 1 in A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks shows the very interesting results on the test sets for the five tasks (tagging; chunking; parsing; relatedness; entailment) described in that paper. The column “Single” showed the results of handling each task separately using single-layer Bi-LSTMs, while the column “JMT all” showed the results of their Joint Many-Task (JMT) model. The single task settings used only the annotations of their own tasks; for example, when handling dependency parsing as a single task the POS and chunking tags were not used. All results of the five tasks were improved in the JMT model, which showed that model could handle the five different tasks in a single model. The JMT model also allowed the authors to access arbitrary information learned from the different tasks. To use the model just as a POS tagger, one could use only the first Bi-LSTM layer. Table 1 also showed the results of five subsets of the different tasks. For example, in the case of “JMT ABC”, only the first three layers of the Bi-LSTMs were used to handle the three tasks. In the case of “JMT DE”, only the top two layers were used as a two-layer Bi-LSTM, by omitting all information from the first three layers. The results of the closely-related tasks (“AB”, “ABC”, and “DE”) showed that the JMT model improved both of the high-level and low-level tasks. The results of “JMT CD” and “JMT CE” showed that the parsing task could be improved by the semantic tasks. Other results suggested that the Bi-LSTMs efficiently captured global information necessary for dependency parsing. Moreover, their single task [dependency parsing] result already achieved high accuracy, without the POS and chunking information. Those insights also provided insight into the effectiveness of the layers in ELMo’s bidirectional language model (biLM: Deep contextualized word representations). Section 5.3 in that that paper explored the different types of contextual information captured in biLMs and used two intrinsic evaluations to show that syntactic information was better represented at lower layers, while semantic information was captured at higher layers.

Computer vision has benefited from initializing multiple deep layers with weights pretrained on large supervised training sets like ImageNet, whereas NLP typically sees initialization of only the lowest layer of deep models with pretrained word vectors. A 2017 paper by Socher and colleagues (SalesForce), Learned in Translation: Contextualized Word Vectors  [codeauthor’s blog] introduced an approach for transferring knowledge from an encoder pretrained on machine translation to a variety of downstream NLP tasks. They used a deep LSTM encoder from an attentional seq2seq model trained for machine translation to contextualize word vectors. They showed that adding these context vectors (CoVe) improved performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis, question classification, entailment, and question answering. The ability to share a common representation of words in the context of sentences that include them could further improve transfer learning in NLP.

The Natural Language Decathlon: Multitask Learning as Question Answering  [codeproject], also by Richard Socher and colleagues at SalesForce, described the “decaNLP” challenge which spans 10 tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. The decaNLP tasks were cast as question answering over a context: in the multitask setting, their Multitask Question Answering Network (MQAN) model [code here and here] jointly learned all the tasks without any task-specific modules or parameters.

With regard to multitask learning, MQAN makes use of a novel dual coattention and multi-pointer-generator decoder to multitask across all tasks in decaNLP. The Natural Language Decathlon: Multitask Learning as Question Answering demonstrated that jointly training a MQAN on all tasks with the right anti-curriculum strategy could achieve performance comparable to that of ten separate MQAN, each trained separately. As stated in the paper (paraphrased here): “… multitask learning has been successful when models are able to capitalize on relatedness amongst tasks while mitigating interference from dissimilarities. When tasks are sufficiently related, they can provide an inductive bias that forces models to learn more generally useful representations. By unifying tasks under a single perspective, it is possible to explore these relationships.

The “decaNLP/MQAN” paper offered some insight into how MQAN attained robust multitask learning. Some key observations from the paper in this regard include the following:

  • MQAN was able to perform nearly as well or better in the multitask setting as in the single task setting for each task despite being capped at the same number of trainable parameters in both (see Table 2 in that paper). A collection of MQAN trained for each task individually would use far more trainable parameters than a single MQAN trained jointly on decaNLP. This suggests that MQAN successfully uses trainable parameters more efficiently in the multitask setting by learning to pack or share parameters in a way that limits catastrophic forgetting.

  • MQAN trained on decaNLP learns to generalize beyond the specific domains for any one task, while also learning representations that make learning completely new tasks easier.

  • In an analysis of how MQAN chooses to output answer words:  “The models reliance on the question pointer for SST [the Stanford Sentiment Treebank, i.e. the “sentiment analysis” task] (see Figure 3) allows it to copy different, but related class labels with little confusion. This suggests these multitask models are more robust to slight variations in questions and tasks and can generalize to new and unseen classes. These results demonstrate that models trained on decaNLP have potential to simultaneously generalize to out-of-domain contexts and questions for multiple tasks and even adapt to unseen classes for text classification. This kind of zero-shot domain adaptation in both input and output spaces suggests that the breadth of tasks in decaNLP encourages generalization beyond what can be achieved by training for a single task.”

  • MQAN trained on decaNLP is the first, single model to achieve reasonable performance on such a wide variety of complex NLP tasks without task-specific modules or parameters, with little evidence of catastrophic interference, and without parse trees, chunks, POS tags, or other intermediate representations.

  • Appendix D discusses the round-robin curriculum learning: “Our results demonstrate that training the MQAN jointly on all tasks with the right anti-curriculum strategy can achieve performance comparable to that of ten separate MQANs, each trained separately.” [See Table 2 and Section 4.2 in that paper.] Finding that anti-curriculum learning benefited models in the decaNLP also validated intuitions outlined in [Caruana, 1997]: tasks that are easily learned may not lead to development of internal representations that are useful to other tasks. Our results actually suggest a stronger claim: including easy tasks early on in training makes it more difficult to learn internal representations that are useful to other tasks. [ … snip … ] By ordering the tasks differently, it is possible to improve performance on some of the tasks but that improvement is not without a concomitant drop in performance for others.”

Like its application to multitask learning, MQAN pretrained on decaNLP also showed improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. Though not explicitly designed for any one task, MQAN proved to be a strong model in the single-task setting as well, achieving state of the art results on the semantic parsing component of decaNLP (i.e., transfer learning). Section 5 of that paper summarized the state of the art (mid-2018) for transfer learning in NLP (paraphrased here):

  • “Most success in making use of the relatedness between natural language tasks stem from transfer learning. Word2Vec, skip-thought vectors and GloVe yield pretrained embeddings that capture useful information about natural language. The embeddings, intermediate representations, and weights of language models can be transferred to similar architectures and classification tasks. Intermediate representations from supervised machine translation models improve performance on question answering, sentiment analysis, and natural language inference. Question answering datasets support each other as well as entailment tasks, and high-resource machine translation can support low-resource machine translation. This work shows that the combination of MQAN and decaNLP makes it possible to transfer an entire end-to-end model that can be adapted for any NLP task cast as question answering.

Multi-task Learning for Universal Sentence Embeddings: A Thorough Evaluation using Transfer and Auxiliary Tasks provided a recent analysis of multitask learning in which they investigated the effectiveness of multitask learning (MTL) for training generalizable sentence representations, by evaluating on both source tasks and 10 different transfer tasks. Experiments on two categories of MTL frameworks demonstrated that multitask learning outperformed single task learning both on the source tasks and most of the transfer tasks. In addition, they analyzed what linguistic information was captured by the sentence representations. Like the other studies (this subsection, above) they showed that joint learning of multiple tasks resulted in better generalizable sentence representations, by conducting extensive experiments and analysis comparing the multitask and single-task learned sentence encoders. Multitask learning helped to embed better semantic information in the sentence representations compared to single task learning. In addition, a comparison of multitask sentence encoders with contextualized word representations showed that combining both of them could further boost the performance of transfer learning.

Virtually all systems trained on one dataset have trouble when applied to datasets that differ even slightly: even switching from Wall Street Journal (WSJ) text to New York Times text [i.e., transfer learning] can hurt parsing performance slightly. It is anticipated that deeply-trained language models (described further above) will improve transfer and multi-task learning. ELMo and OpenAI Transformer represent language models in the multitask learning domain, while ULMFiT represents a language model in the transfer learning domain (where the LM is first trained on a general-domain corpus to capture general features of the language in different layers, then fine-tuned on the target task). Regarding ELMo, a very interesting paper from the Allen Institute for AI, Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples  [code here and heredemo] used ELMo to train a span-based parser with partial annotations. First, they showed that recent advances in word representations greatly diminished the need for domain adaptation [transfer learning] when the target domain was syntactically similar to the source domain. As evidence, they trained a parser on the WSJ alone that achieved over 90% F1 on the Brown corpus, a standard benchmark used to assess WSJ-trained parsers outside of the newswire domain. For more syntactically distant domains, they provided a simple way to adapt a parser using only dozens of partial annotations. For instance, they increased the percentage of error-free geometry-domain parses in a held-out set from 45% to 73% using approximately five dozen training examples. In the process, they demonstrated a new state of the art single model result on the WSJ test set of 94.3%, an absolute increase of 1.7% over the previous state of the art of 92.6%. On a similar experiment using biomedical and chemistry text, they partially annotated 134 sentences and randomly split them into BiochemTrain (72 sentences) and BiochemDev (62 sentences). In BiochemTrain, they made an average of 4.2 constituent declarations per sentence (and no non-constituent declarations). Again, they started with a RSP parser trained on WSJ-Train, and fine-tuned it on minibatches containing annotations from 50 randomly selected WSJ-Train sentences, plus all of BiochemTrain. As with the geometry domain, we got significant improvements using only dozens of partially annotated training sentences.

Word-Level Loss Extensions for Neural Temporal Relation Classification  [code] employed pretrained word embeddings for the extraction of narrative containment relations (CR) from clinical text. The aim of CR extraction is, given events A and B, if event A is temporally contained in event B (i.e. if event A happens within the time span of event B: temporal relation extraction). The patient time-line is crucial for making a good patient prognosis and clinical decision support. In this work the authors proposed a neural RC model that learned its word representations jointly on the main task (supervised, on labeled data) and on the auxiliary task (unsupervised, on unlabeled data) in a multi-task setting – ensuring that the embeddings contained valuable information for the main task while leveraging the unlabeled data for more general feature learning. The proposed models used only unlabeled data and a general (news, out of domain) part of speech tagger as external resources. This work showed that training the word-level representations jointly on its main task plus an auxiliary objective resulted in better representations for classification, compared to using pre-trained variants. It also constituted a new state of the art for temporal relation extraction on the THYME dataset (a temporally annotated corpus of clinical notes in the colon cancer domain), even without dedicated clinical preprocessing.

[Table of Contents]


We need to be able to trust and understand how results are generated in NLP and ML based models. Recent, relevant discussions include:

Knowledge graph models attain state of the art accuracy in knowledge base completion, but their predictions are notoriously hard to interpret. The OpenKE project  [code] is an open-source framework for knowledge embedding based on the TensorFlow machine learning platform. A very interesting, heavily modified fork of of the OpenKE GitHub repository is XKE (Explaining Knowledge Embedding models), which contains implementations of XKE-PRED and XKE-TRUE described in Interpreting Embedding Models of Knowledge Bases: A Pedagogical Approach. XKE adapts “pedagogical approaches” to interpret embedding models, by extracting weighted Horn rules from them.

  • In mathematical logic and logic programming, a Horn clause is a logical formula of a particular rule-like form which gives it useful properties for use in logic programming, formal specification, and model theory.

  • A pedagogical approach is one where, intuitively speaking, a non-interpretable but accurate model is run, and an interpretable model is learned from the output of the non-interpretable one.

State of the art knowledge in base completion typically relies on embedding models that map entities and relations into low-dimensional vector space. The existence of a relation triple is determined by some pre-defined function over these representations. More importantly, embedding models turn complex space of semantic concepts into a smooth space where gradients can be calculated and followed. One difficulty with embeddings is their poor interpretability viz-a-viz its users. In Interpreting Embedding Models of Knowledge Bases: A Pedagogical Approach  [ICML 2018; code] the authors proposed two models, that respectively explained KGE models with predicted features (XKE-PRED), and with observed features (XKE-TRUE). XKE-PRED treated the embedding model as a black box and assumed no other source of information for building the interpretable model. By changing the original classifier’s inputs and observing its outputs, the pedagogical approach constructed a training set for an interpretable classifier from which explanations were extracted. XKE-TRUE is a variation of XKE-PRED that assumed an external source of knowledge (regarded as ground truth of their relational domain) besides the embedding model, from which XKE-TRUE extracted interpretable features. Skipping over additional detail (provided in the paper), the results in Table 3 in that paper are most the interesting with respect to this REVIEW: the application of those models to a Freebase dataset. Five examples were shown (ID #1-5), along with input triples (head-relation-tail), their explanations (reasons: weighted rules and the bias term) with the interpretable classifier’s score (XKE-PRED; XKE-TRUE), and with the labels predicted by the embedding model (trained as a binary classifier: 1 = True; 0 = False). While the results (discussed in Section 4.4) vary – a consequence of an early-stage research effort – the models performed remarkably well in explaining (through the explanations: weighted paths) the selected, embedded triples.

Knowledge-based Transfer Learning Explanation  [code] likewise sought to boost machine learning applications in decision making by making more human-centric explanations, relevant to transfer learning (utilizing models developed in one domain as the starting point for training in another domain). The authors proposed an ontology-based knowledge representation and reasoning framework for human-centric transfer learning explanation. They first modeled a learning domain in transfer learning [including the dataset and the prediction task, with OWL (Web Ontology Language) ontologies], for which they complemented the prediction task-related common sense knowledge using an individual matching and external knowledge importing algorithm. The framework further used a correlative reasoning algorithm to infer three kinds of explanatory evidence with different granularities (general factors, particular narrators and core contexts) to explain a positive feature or a negative transfer from one learning domain to another.

Learning Heterogeneous Knowledge Base Embeddings for Explainable Recommendation addressed the provision of model-generated explanations in recommender systems in structured KG using a knowledge base representation learning framework to embed heterogeneous entities for recommendation, and (based on the embedded knowledge base) a soft matching algorithm was proposed to generate personalized explanations for recommended items. Most of existing collaborative filtering (CF) based recommendation systems work with unstructured data such as ratings, reviews, or images to profile the users for personalized recommendation. In this paper, the authors designed a novel explainable collaborative filtering framework over knowledge graphs. The main building block was an integration of traditional CF with the learning of knowledge base embeddings. They first defined the concept of a user-item knowledge graph (illustrated in Fig. 1 in their paper; note also Fig. 4), which encoded knowledge about the user behaviors and item properties as a relational graph structure. The user-item knowledge graph focused on how to depict different types of user behaviors and item properties over heterogeneous entities in a unified framework. Then, they extended the design philosophy of CF to learn over the knowledge graph for personalized recommendation. For each recommended item, they further conducted fuzzy reasoning over the paths in the knowledge graph based on soft matching, to construct personalized explanations.

Some of the methods for constructing KG and knowledge discovery over KG can also provide evidence to support the understanding of new facts. For example:

  • MOLIERE: Automatic Biomedical Hypothesis Generation System  [projectcode] is a system that can identify connections within biomedical literature. MOLIERE finds the shortest path between two query keywords in the KG, and extends this path to identify a significant set of related abstracts (which, due to the network construction process, share common topics). Topic modeling, performed on these documents using PLDA+, returns a set of plain text topics representing concepts that likely connect the queried keywords, supporting hypothesis generation (for example, on historical findings MOLIERE showed the implicit link between Venlafaxine and HTR1A, and the involvement of DDX3 on Wnt signaling).

  • Finding Streams in Knowledge Graphs to Support Fact Checking  [code] viewed a knowledge graph as a “flow network” and knowledge as a fluid, abstract commodity. They showed that computational fact checking of a (subject, predicate, object) triple then amounted to finding a “knowledge stream” that emanated from the subject node and flowed toward the object node through paths connecting them. Evaluations revealed that this network-flow model was very effective in discerning true statements from false ones, outperforming existing algorithms on many test cases. Moreover, the model was expressive in its ability to automatically discover useful path patterns and relevant facts that may help human fact checkers corroborate or refute a claim.

  • Statistical relational learning (SRL) based approaches (reviewed by Nickel et al. (Kevin Murphy; Volker Tresp) in A Review of Relational Machine Learning for Knowledge Graphs) can be used in conjunction with machine reading and information extraction methods to automatically build KG. In SRL, the representation of an object may contain its relationships to other objects. The data is in the form of a graph, consisting of nodes (entities) and labelled edges (relationships between entities). The main goals of SRL include prediction of missing edges, prediction of properties of nodes, and clustering nodes based on their connectivity patterns. These tasks arise in many settings such as analysis of social networks and biological pathways. An excellent follow-on article (different authors) to A Review of Relational Machine Learning for Knowledge Graphs is On Embeddings as Alternative Paradigm for Relational Learning, which systematically compared KGE and logic-based SRL methods on standard relational classification and clustering tasks – including discussion of the Path Ranking Algorithm (PRA), an easily-interpretable extension of the idea of using random walks of bounded lengths for predicting links in multi-relational knowledge graphs. Relation paths can be regarded as bodies of weighted rules (more precisely, Horn clauses), where the weight specifies how predictive the body of the rule is for the head.

  • The aim of SRL is to learn statistical models from relational or graph-structured data. Three of the main statistical relational learning paradigms include weighted rule learning, random walks on graphs, and tensor factorization (i.e. graph embedding). These methods were mostly developed and studied in isolation, with attempts at understanding the relationship among them or combining them. For example, in their 2016 survey, (A Review of Relational Machine Learning for Knowledge Graphs), Nickel et al. described weighted rules and graph random walks as two separate classes of models for learning from relational data. [One of their claims, regarding the advantages of models based on weighted rule learning compared to other relational models, was that weighted rule learning could be easily explained to a broad range of people.]

  • A 2018 paper by Kazemi and Poole (Bridging Weighted Rules and Graph Random Walks for Statistical Relational Models) finally addressed the relationship between weighted rules and graph random walks. This paper studied the relationship between the Path Ranking Algorithm (PRA; one of the best known relational learning methods in the graph random walk paradigm) and Relational Logistic Regression (RLR; one of the recent developments in weighted rule learning). Their result improved the explainability of models learned through graph random walk, by providing a weighted rule interpretation for them.
    [An academic project for which code is not provided, non-author code for the Path Ranking Algorithm is nonetheless available on GitHub here and here; a deep neural model for relational learning, by Mehran Kazemi is also available.]

Discussed earlierAn Interpretable Reasoning Network for Multi-Relation Question Answering addressed multi-relation question answering via elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. Their Interpretable Reasoning Network (IRN) model dynamically decided which part of an input question should be analyzed at each hop, for which the reasoning module predicted a knowledge base relation (relation triple) that corresponded to the current parsed result. More interestingly regarding this subsection, IRN offered traceable and observable intermediate predictions (see their Fig. 3), facilitating reasoning analysis and failure diagnosis (thereby also allowing manual manipulation in answer prediction).

Given a neural network, we are interested in knowing what features it has learned for making classification decisions. Network interpretation is also crucial for (computer vision) tasks involving humans, like autonomous driving and medical image analysis (and in the NLP domain, clinical diagnosis and recommendation, …). In an interesting approach, Neural Network Interpretation via Fine Grained Textual Summarization, the authors introduced the novel task of interpreting classification models using fine grained textual summarization: along with (image classification) label prediction, the network generated a sentence explaining its decision. For example, a knowledgeable person looking at a photograph of a bird might say, “I think this is a Anna’s Hummingbird because it has a straight bill, a rose pink throat and crown. It’s not a Broad-tailed Hummingbird because the later lacks the red crown.” This kind of textual description carries rich semantic information and is easily understandable, illustrating the use of natural language as a logical medium in which to ground the interpretation of deep convolutional models. Tasks that combine text generation and visual explanation include image captioning and visual question answering (VQA). Although this paper addresses those tasks, the method described (that leverages attention, summarization and natural language) could also be applied in the NLP domain as well (this is addressed in Sections 3 and 4.3 in their paper).

  • This task may be viewed as the reciprocal of “text grounding,” a language to vision task that tries to locate the object in an image referred to by a given text phrase.

  • Note also my Text Grounding: Mapping Text to Knowledge Graphs section in this REVIEW.

[Table of Contents]


In silico modeling – the computational modeling of biochemical, metabolic, pharmacologic or physiologic processes – is a logical extension of in vitro experimentation (In silico Modelling of Physiologic Systems). A natural result of the explosive increase in computing power available to research scientists at continually decreasing cost, in silico modeling combines the advantages of in vivo and in vitro experimentation, not subject to ethical considerations or the lack of control associated with many in vivo experiments (e.g. human or animal experimentation). In silico models also allow researchers to include a virtually unlimited array of parameters, potentially rendering the results more applicable to the organism as a whole.

Examples of recent work in the in silico domain include In vivo and In silico Dynamics of the Development of Metabolic Syndrome  [code], and Systems Modelling of the EGFR-PYK2-c-Met Interaction Network Predicts and Prioritizes Synergistic Drug Combinations for Triple-Negative Breast Cancer  [media].  The University of Connecticut’s “Virtual Cell” (VCell) [paper: Compartmental and Spatial Rule-Based Modeling with Virtual Cell;  project pages here and herecodecommunitymedia]) provides a comprehensive platform for modeling and simulation of cell biology from biological pathways down to cell biophysics. VCell supports biochemical network and rule-based modeling and electrophysiology in compartmental modeling and within cellular geometry.

One might inquiry whether knowledge graphs, which naturally embed biochemical pathways and networks, are amenable to silico perturbation. The Cellular Potts Model (CPM ) is a computational biological model of the collective behavior of cellular structures. CPM allows modeling of many phenomena such as cell migration, clustering, and growth taking adhesive forces – taking environment sensing as well as volume and surface area constraints into account. In silico Modeling for Tumor Growth Visualization implemented a crude graphical model via a CPM, that can be visualized in Cytoscape via cpm-cytoscape  [project]. Those authors also described their work in Machine Learning for In Silico Modeling of Tumor Growth.

See also:

In the synthetic biology domain, Out-of-Equilibrium Microcompartments for the Bottom-Up Integration of Metabolic Functions  [media] recently described the analysis of self-sustained metabolic pathways in a microfluidic platform, water-in-oil droplet microcompartments. The authors developed an assay based on nicotinamide adenine dinucleotide (NADH) fluorescence to quantify the metabolic state of the microcompartments. The minimal metabolism was constructed from a reaction converting glucose-6-phosphate into 6-phosphogluconolactone. The reaction was catalysed by glucose-6-phosphate dehydrogenase, an enzyme involved in the pentose phosphate pathway. A key feature integrated the ability to function under conditions where the reaction was sustained independently of the cofactor stoichiometry. The full conversion of the metabolic substrate required the regeneration of the cofactor NAD+: a regeneration module made of inverted membrane vesicles extracted from E. coli.

The metabolic pathways exemplified in that microfluidic approach are easily modeled in a knowledge graph, suggesting the possibility of the in silico modeling of pathway reactions in parallel with “wet lab,” systems biology approaches. Other relatively easily-constructed models (e.g.: bioengineered microbes and viruses on microbiological media; transgenic rodents; etc.) could likewise be modeled in a combined systems biology, in silico modeling approach. In this regard, clustered regularly interspaced short palindromic repeats/CRISPR associated protein 9 (CRISPR/Cas9) gene editing technology is especially relevant; for example, note Section 3 in CRISPR/Cas9-Based Genome Editing for Disease Modeling and Therapy: Challenges and Opportunities for Nonviral Delivery, and The Present and Future of Genome Editing in Cancer Research. CRISPR/Cas9 for Cancer Research and Therapy provides an excellent review and perspective on this technology.

Reinforcement learning (RL) can be applied in optimizing chemical/biochemical reactions. Optimizing Chemical Reactions with Deep Reinforcement Learning  [code] showed that their RL model outperformed a state of the art algorithm, and generalized to dissimilar underlying mechanisms. Combined with LSTM to model the policy function, the RL agent optimized the chemical reaction with the Markov decision process (MDP) characterized by {S, A, P, R}, where S was the set of experimental conditions (like temperature, pH, etc), A was the set all possible actions that can change the experimental conditions, P was the transition probability from current experiment condition to the next condition, and R was the reward which is a function of the state. Their Deep Reaction Optimizer model iteratively recorded the results of a chemical reaction and chose new experimental conditions to improve the reaction outcome, outperforming a state of the art black box optimization algorithm by using 71% fewer steps on both simulations and real reactions. Furthermore, they introduced an efficient exploration strategy by drawing the reaction conditions from certain probability distributions, which resulted in an improvement on “regret” from 0.062 to 0.039 compared with a deterministic policy (they used “regret” to evaluate the performance of the model ). For the optimization of real-world reactions, the authors carried out four experiments in microdroplets (Scheme 2 in their paper: synthesis of ribose phosphate, etc.) and recorded the production yield. Combining the efficient exploration policy with accelerated microdroplet reactions, their Deep Reaction Optimizer not only served as an efficient and effective reaction optimizer (optimal reaction conditions were determined in 30 min for the four reactions considered), it also provided a better understanding of the mechanism of chemical reactions than that obtained using traditional approaches.

The examples above hint at the potential advances afforded by machine learning in the biochemical/medical domain, which also include the prediction of biomolecular secondary structure (e.g. rawMSA: proper Deep Learning makes protein sequence profiles and feature extraction obsolete), biodesign (e.g of new anticancer drugs) and inverse molecular design (very well-reviewed in the July 2018 Science paper Inverse Molecular Design using Machine Learning: Generative Models for Matter Engineering), and numerous other applications. These approaches offer excellent opportunities for collaboration in the advancement of our understanding of metabolism, cellular signaling, regulatory mechanisms, and disease (including cancer).

[Note that the Applicant, Dr. Stuart, has ~30 years bench-side molecular genetics basic research experience.]