### Technical Review


# NATURAL LANGUAGE UNDERSTANDING

## Introduction

[Image source (slide 8). Click image to open in new window.]

Natural language understanding (NLU) is a subtopic of natural language processing that deals with machine reading comprehension. Natural language understanding is considered an AI-hard problem. There is considerable interest in NLU because of its applications to information retrieval/extraction, text categorization, summarization, question answering, recommendation, and large-scale content analysis. Advances in NLU offer tremendous promise for the analysis of biomedical and clinical text, which – due to the use of technical, domain-specific jargon – is particularly challenging for traditional NLP approaches. Some of these challenges and difficulties are described in the August 2018 post NLP’s Generalization Problem, and How Researchers are Tackling It  [discussion].

Machine learning is particularly well suited to assisting and even supplanting many standard NLP approaches (for a good review see Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities (Jun 2018)). Language models, for example, provide improved understanding of the semantic content and latent (hidden) relationships in documents. Machine-based NLU is a fundamental requirement for robust, human-level performance in tasks such as information retrieval, text summarization, question answering, textual entailment, sentiment analysis, reading comprehension, commonsense reasoning, recommendation, etc.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Recent developments in NLP and ML that I believe are particularly important to advancing NLU include:

• understanding the susceptibility of QA systems to adversarial challenge;

• the development of deeply-trained/pretrained language models;

• transfer learning and multitask learning;

• reasoning over graphs;

• the development of more advanced memory and attention-based architectures; and,

• incorporating external memory mechanisms; e.g., a differentiable neural computer, which is essentially an updated version of a neural Turing machine (What Is the Difference between Differentiable Neural Computers and Neural Turing Machines?). Relational database management systems (RDBMS), textual knowledge stores (TKS), and knowledge graphs (KG) also represent external knowledge stores that may be leveraged as external memory resources in architectures suitable for NLP and ML.

DeepMind’s recent paper Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies (Aug 2018) addressed preserving and reusing past knowledge (memory) via unsupervised representation learning using a variational autoencoder: VASE (Variational Autoencoder with Shared Embeddings). VASE automatically detected shifts in data distributions and allocated spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting:

"... thanks to learning a generative model of the observed environments, we can prevent **catastrophic forgetting** by periodically "hallucinating" (i.e. generating samples) from past environments using a snapshot of VASE, and making sure that the current version of VASE is still able to model these samples. A similar "dreaming" feedback loop was used in Lifelong Generative Modeling, ..."

[Image source. Click image to open in new window.]
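The replay idea in the quote above can be sketched in a few lines. The toy below is a hedged illustration, not VASE itself: a plain linear autoencoder stands in for the VAE, "hallucination" is just decoding random latents from a frozen snapshot, and all names, dimensions, and learning rates are invented for the example.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

class TinyAutoencoder:
    """Minimal linear autoencoder, an illustrative stand-in for a VAE."""
    def __init__(self, dim, latent, lr=0.02):
        self.enc = rng.normal(0, 0.1, (dim, latent))
        self.dec = rng.normal(0, 0.1, (latent, dim))
        self.lr, self.latent = lr, latent

    def reconstruct(self, x):
        return x @ self.enc @ self.dec

    def train_step(self, x):
        z = x @ self.enc
        err = z @ self.dec - x                        # d(0.5*MSE)/d(x_hat)
        grad_dec = z.T @ err / len(x)
        grad_enc = x.T @ (err @ self.dec.T) / len(x)
        self.dec -= self.lr * grad_dec
        self.enc -= self.lr * grad_enc

    def hallucinate(self, n):
        """'Generate' samples by decoding random latents: the replay source."""
        return rng.normal(size=(n, self.latent)) @ self.dec

# Task A: train, then freeze a snapshot to act as the generator of past data.
task_a = rng.normal(1.0, 0.3, (256, 8))
model = TinyAutoencoder(dim=8, latent=4)
for _ in range(300):
    model.train_step(task_a)
snapshot = copy.deepcopy(model)

# Task B: interleave real new data with replayed ("hallucinated") samples,
# so the current model must keep modelling what the snapshot knew.
task_b = rng.normal(-1.0, 0.3, (256, 8))
for _ in range(300):
    batch = np.vstack([task_b[:64], snapshot.hallucinate(64)])
    model.train_step(batch)
```

The key design point is that the snapshot, not stored data, supplies the "old" half of each batch – which is what lets the approach drop past datasets entirely.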

• For similar, prior work by other authors (cited) that also used a variational autoencoder, see Lifelong Generative Modeling, below.

• As noted, each of the papers cited above addressed the issue of catastrophic forgetting. Interestingly, the Multitask Question Answering Network (MQAN), described in Richard Socher’s “decaNLP/MQAN” paper, attained robust multitask learning, performing nearly as well or better in the multitask setting as in the single task setting for each task, despite being capped at the same number of trainable parameters in both settings. … This suggested that MQAN successfully used trainable parameters more efficiently in the multitask setting, by learning to pack or share parameters in a way that limited catastrophic forgetting.

Lifelong learning is the problem of learning multiple consecutive tasks in a sequential manner, where knowledge gained from previous tasks is retained and used for future learning. It is essential to the development of intelligent machines that can adapt to their surroundings. Lifelong Generative Modeling (Sep 2018), by authors at the University of Geneva and the Geneva School of Business Administration, focused on a lifelong learning approach to generative modeling in which newly observed distributions were continuously incorporated into the learnt model. The authors did so through a student-teacher variational autoencoder architecture, which allowed them to learn and preserve all the distributions seen to that point without the need to retain the past data or the past models. Through the introduction of a novel cross-model regularizer, inspired by a Bayesian update rule, the student model leveraged the information learnt by the teacher, which acted as a summary of everything seen to that point. The regularizer had the additional benefit of reducing the effect of catastrophic interference that appears when sequences of distributions are learned. They demonstrated its efficacy in learning sequentially observed distributions, as well as its ability to learn a common latent representation across a complex transfer learning scenario.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Continual learning is the ability to sequentially learn over time by accommodating new knowledge while retaining previously learned experiences. Neural networks can learn multiple tasks when trained on them jointly, but cannot maintain performance on previously learned tasks when tasks are presented one at a time – a problem called catastrophic forgetting. Continual Classification Learning Using Generative Models (Oct 2018), by authors at the University of Geneva and the Geneva School of Business Administration, proposed a classification model that learns continuously from sequentially observed tasks while preventing catastrophic forgetting. The authors built on the lifelong generative capabilities of their earlier Lifelong Generative Modeling work and extended it to the classification setting by deriving a new variational bound on the joint log likelihood, $\small \log p(x,y)$.
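A bound of this kind can be motivated as follows (a generic ELBO-style derivation under an assumed shared latent $z$ with $x$ and $y$ conditionally independent given $z$; the paper's exact bound may differ):

```latex
\log p(x,y)
  = \log \int p(x \mid z)\, p(y \mid z)\, p(z)\, dz
  \ge \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x \mid z) + \log p(y \mid z) \right]
    - \mathrm{KL}\!\left( q(z \mid x) \,\|\, p(z) \right)
```

The inequality is Jensen's, applied after multiplying and dividing the integrand by the approximate posterior $q(z \mid x)$.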

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Google Brain’s A Simple Method for Commonsense Reasoning (Jun 2018) [code  |  slides  |  local copy  |  discussion  |  discussion] presented a simple method for commonsense reasoning with neural networks, using unsupervised learning. Key to the method was the use of an array of large RNN language models, operating at the word or character level and trained on a massive amount of unlabeled data, to score multiple-choice questions posed by commonsense reasoning tests.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
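The scoring procedure reduces to: substitute each candidate referent into the sentence, and keep whichever substitution the language model assigns higher probability. A hedged sketch, with a toy bigram model over a made-up corpus standing in for the paper's large RNN LMs:

```python
import math
from collections import Counter

# A made-up miniature corpus stands in for the paper's massive unlabeled data.
corpus = ("the big trophy is big . the small suitcase is small . "
          "the trophy is too big .").split()

# Bigram counts with add-one smoothing: a toy stand-in for a large RNN LM.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def log_prob(sentence):
    """Sum of smoothed bigram log-probabilities; higher = more plausible."""
    toks = sentence.lower().split()
    return sum(math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
               for w1, w2 in zip(toks, toks[1:]))

# Winograd-style resolution: substitute each candidate referent for the
# pronoun and keep whichever substitution the LM scores higher.
candidates = ["the trophy is too big", "the suitcase is too big"]
best = max(candidates, key=log_prob)   # -> "the trophy is too big"
```

Here the corpus was deliberately written so that "trophy" co-occurs with "big"; the paper's point is that the same signal emerges, unsupervised, from genuinely large corpora and models.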

• “A unique feature of Winograd Schema questions is the presence of a special word that decides the correct reference choice. In the above example, ‘big’ is this special word. When ‘big’ is replaced by ‘small’, the correct answer switches to ‘the suitcase’. Although detecting this feature is not part of the challenge, further analysis shows that our system successfully discovers this special word to make its decisions in many cases, indicating a good grasp of commonsense knowledge.”

• This paper was subsequently savaged in an October 2018 commentary, A Simple Machine Learning Method for Commonsense Reasoning? A Short Commentary on Trinh & Le (2018):

“A Concluding Remark. The data-driven approach in AI has without a doubt gained considerable notoriety in recent years, and there are a multitude of reasons that led to this fact. While the data-driven approach can provide some useful techniques for practical problems that require some level of natural language processing (text classification and filtering, search, etc.), extrapolating the relative success of this approach into problems related to commonsense reasoning, the kind that is needed in true language understanding, is not only misguided, but may also be harmful, as this might seriously hinder the field, scientifically and technologically.”

A Simple Neural Network Module for Relational Reasoning (Jun 2017) [DeepMind blog;  non-author code here and here;  discussion here, here, and here] by DeepMind described Relation Networks, a simple plug-and-play module for solving problems that fundamentally hinge on relational reasoning, including visual question answering, text-based question answering using the bAbI suite of tasks, and complex reasoning about dynamic physical systems. They showed that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with Relation Networks, allowing them to implicitly discover and learn to reason about entities and their relations.

[Image source. Click image to open in new window.]
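The Relation Network module itself is compact: a pairwise MLP summed over all object pairs, followed by a readout MLP. A hedged numpy sketch, with random weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP applied row-wise."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

# A set of n "objects" of dimension d (e.g. CNN feature columns for visual
# QA, or sentence encodings for bAbI-style text QA).
n, d, h, out = 5, 4, 16, 3
objects = rng.normal(size=(n, d))

g_w1, g_b1 = rng.normal(size=(2 * d, h)), np.zeros(h)   # g_theta: per-pair MLP
g_w2, g_b2 = rng.normal(size=(h, h)), np.zeros(h)
f_w1, f_b1 = rng.normal(size=(h, h)), np.zeros(h)       # f_phi: readout MLP
f_w2, f_b2 = rng.normal(size=(h, out)), np.zeros(out)

# RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ): score every ordered pair,
# sum (which makes the module permutation-invariant), then read out.
pairs = np.array([np.concatenate([objects[i], objects[j]])
                  for i in range(n) for j in range(n)])          # (n*n, 2d)
relations = mlp(pairs, g_w1, g_b1, g_w2, g_b2)                   # (n*n, h)
logits = mlp(relations.sum(axis=0, keepdims=True),
             f_w1, f_b1, f_w2, f_b2)                             # (1, out)
```

Because the pair scores are summed before the readout, the output is invariant to any reordering of the input objects – the property that makes the module a drop-in relational layer.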

While Relation Networks – introduced by Santoro et al. (2017) in DeepMind’s “A Simple Neural Network Module for Relational Reasoning,” above – demonstrated strong relational reasoning capabilities, their rather shallow architecture (a single-layer design) only considered pairs of information objects, making them unsuitable for problems requiring reasoning across a higher number of facts. To overcome this limitation, authors at the University of Lübeck proposed Multi-layer Relation Networks (Nov 2018) [code], a multi-layer relation network architecture that enabled successive refinements of relational information through multiple layers. They showed that the increased depth allowed for more complex relational reasoning, applying it to the bAbI 20 QA dataset, solving all 20 tasks with joint training and surpassing the state of the art results.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Natural Language Understanding - Introduction:

• “The NLP and ML communities have long been interested in developing models capable of common-sense reasoning, and recent works have significantly improved the state of the art on benchmarks like the Winograd Schema Challenge (WSC). Despite these advances, the complexity of tasks designed to test common-sense reasoning remains under-analyzed. In this paper, we make a case study of the Winograd Schema Challenge and, based on two new measures of instance-level complexity, design a protocol that both clarifies and qualifies the results of previous work. Our protocol accounts for the WSC’s limited size and variable instance difficulty, properties common to other common-sense benchmarks. Accounting for these properties when assessing model results may prevent unjustified conclusions.”

## Word Embeddings

For a good overview see Learning Word Embedding  [local copy].

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from a vocabulary are mapped to vectors of real numbers (Sebastian Ruder provides a good overview; see also this excellent post, Introduction to Word Embeddings). Conceptually it involves a mathematical embedding from a sparse, high-dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Mikolov et al. (Sep 2013) Efficient Estimation of Word Representations in Vector Space – the “word2vec” paper].

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]


Word embeddings are widely used in predictive NLP modeling, particularly in deep learning applications (Word Embeddings: A Natural Language Processing Crash Course). Word embeddings enable the identification of similarities between words and phrases, on a large scale, based on their context. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically; e.g.,

$\small v_{Berlin} - v_{Germany} + v_{France} \approx v_{Paris}$
$\small v_{king} - v_{man} + v_{woman} \approx v_{queen}$
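These analogy relations can be checked mechanically with cosine similarity. A toy, hand-crafted example (real embeddings are learned over hundreds of dimensions, not constructed like this):

```python
import numpy as np

# Hypothetical 2-D toy embeddings; axis 0 ~ "royalty", axis 1 ~ gender
# (+1 male, -1 female). Invented purely for illustration.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Solve a - b + c ~ ?, excluding the query words as is standard."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

print(analogy("king", "man", "woman"))   # -> queen
```

Excluding the three query words from the candidate set is the standard evaluation convention; without it, the nearest neighbor of the target vector is frequently one of the inputs.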

The pioneering approach for generating word embeddings was presented by Bengio et al. in 2003 (A Neural Probabilistic Language Model, which builds on his 2001 (NIPS 2000) “feature vectors” paper of the same name), who trained them in a neural language model together with the model’s parameters.

Despite the assertion by Sebastian Ruder in An Overview of Word Embeddings and their Connection to Distributional Semantic Models that Bengio coined the phrase “word embeddings” in his 2003 paper, the term “embedding” does not appear in that paper. The Abstract does state the concept, however: “We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. ”. The correct attribution is likely Bengio’s similarly-named 2006 paper Neural Probabilistic Language Models, which states (bottom of p. 162): “Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes. ” The full reference is: Y. Bengio et al. (2006) Neural Probabilistic Language Models. StudFuzz 194:137-186.

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

Likewise – vis-à-vis my previous comment (I’m being rather critical here) – the 2008 Collobert and Weston paper, above, mentions “embedding” [but not “word embedding”] and cites Bengio’s 2001 (NIPS 2000) paper, while Mikolov’s 2013 paper does not mention “embedding” and cites Bengio’s 2003 paper.

For a theoretical discussion of word vectors, see Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline  [codediscussion], which is a critique/extension of A Latent Variable Model Approach to PMI-based Word Embeddings. In addition to proposing a new generative model – a dynamic version of the log-linear topic model of Mnih and Hinton (2007) [Three New Graphical Models for Statistical Language Modelling] – the paper provided a theoretical justification for nonlinear models like PMI, word2vec, and GloVe. It also helped explain why low dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013)  [see the algebraic examples, above]. Experimental support was provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.
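The weighted-average baselines studied in this line of work can be sketched directly. Below is a hedged illustration in the spirit of the smooth-inverse-frequency weighting $a/(a+p(w))$ associated with the random-walk model, followed by common-component removal; the vectors, probabilities, and constant are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical word vectors and unigram probabilities p(w); `a` is the
# smoothing constant from the random-walk generative model.
vecs = {w: rng.normal(size=8)
        for w in ["the", "a", "protein", "binds", "receptor"]}
p = {"the": 0.05, "a": 0.04, "protein": 0.001,
     "binds": 0.0005, "receptor": 0.0008}
a = 1e-3

def sif_embed(sentences):
    """Smooth-inverse-frequency weighted average of word vectors, then
    removal of the common component (projection onto the top singular
    vector of the sentence-embedding matrix)."""
    X = np.array([
        np.mean([a / (a + p[w]) * vecs[w] for w in s.split()], axis=0)
        for s in sentences
    ])
    u = np.linalg.svd(X, full_matrices=False)[2][0]  # top right-singular vector
    return X - np.outer(X @ u, u)

embs = sif_embed(["the protein binds a receptor",
                  "the receptor binds the protein"])
```

Note how the weighting sharply down-weights frequent function words ("the", "a") relative to content words, which is the mechanism behind the strength of this simple baseline.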

Relatedly, Sebastian Ruder recently provided a summary of ACL 2018 highlights, including a subsection entitled Understanding Representations: “It was very refreshing to see that rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture.”

### Representation Learning

Word embeddings are a particularly striking example of learning a representation, i.e. representation learning (Bengio et al., Representation Learning: A Review and New Perspectives (Apr 2014); see also the excellent blog posts Deep Learning, NLP, and Representations by Chris Olah, and An introduction to representation learning by Michael Alcorn). Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. While traditional unsupervised learning techniques are staples of machine learning, representation learning has emerged as an alternative approach to feature extraction (An Introduction to Representation Learning).

In representation learning, features are extracted from unlabeled data by training a neural network on a secondary, supervised learning task. Word2vec is a good example of representation learning, simultaneously learning several language concepts:

• the meanings of words;
• how words are combined to form concepts (i.e., syntax); and,
• how concepts relate to the task at hand.
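A minimal sketch of the word2vec skip-gram-with-negative-sampling update that drives this kind of representation learning (toy vocabulary, fixed negatives; real implementations sample negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(3)

vocab = ["cat", "sat", "on", "mat", "dog"]
idx = {w: i for i, w in enumerate(vocab)}
dim, lr = 6, 0.1
W_in = rng.normal(0, 0.1, (len(vocab), dim))    # centre-word vectors
W_out = rng.normal(0, 0.1, (len(vocab), dim))   # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(centre, context, negatives):
    """One skip-gram-with-negative-sampling update: raise the score of the
    observed (centre, context) pair, lower it for sampled negatives."""
    v = W_in[idx[centre]]
    grad_v = np.zeros_like(v)
    for o, label in [(idx[context], 1.0)] + [(idx[n], 0.0) for n in negatives]:
        g = sigmoid(v @ W_out[o]) - label        # gradient of the logistic loss
        grad_v += g * W_out[o]
        W_out[o] -= lr * g * v
    W_in[idx[centre]] -= lr * grad_v

# Repeatedly present one positive pair with fixed "sampled" negatives.
for _ in range(200):
    sgns_step("cat", "sat", negatives=["dog", "mat"])
```

After training, the model assigns a high co-occurrence score to the observed pair and a low one to the negatives, which is exactly the contextual signal the bullets above describe.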

Reasoning about implied relationships (e.g. paraphrastic, common sense, encyclopedic) between pairs of words is crucial for many cross-sentence inference problems. pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference (Oct 2018) [discussion], by the Paul G. Allen School of Computer Science & Engineering and Facebook AI Research, proposed new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. Their pairwise embeddings were computed as a compositional function of each word’s representation, which was learned by maximizing the pointwise mutual information (PMI) with the contexts in which the two words co-occurred. They added these representations to the cross-sentence attention layer of existing inference models (e.g. BiDAF for QA, ESIM for NLI), instead of extending or replacing existing word embeddings. Experiments showed a gain of 2.72% on the recently released SQuAD 2.0 and 1.3% on MultiNLI. Their representations also aided in better generalization, with gains of around 6-7% on adversarial SQuAD datasets and 8.8% on the adversarial entailment test set of Glockner et al.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
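pair2vec's training signal is pointwise mutual information. As a simpler, hedged illustration of the PMI statistic itself (computed here over sentence-level co-occurrences in a made-up corpus, not pair2vec's actual pair-context objective):

```python
import math
from collections import Counter
from itertools import combinations

# A made-up corpus; any real PMI estimate needs far more data.
sentences = [
    "ice is cold", "fire is hot", "ice melts into water",
    "fire produces heat", "cold ice", "hot fire",
]

word_counts = Counter()
pair_counts = Counter()
for s in sentences:
    toks = s.split()
    word_counts.update(toks)
    # count unordered co-occurrence of distinct words within a sentence
    pair_counts.update(frozenset(p) for p in combinations(toks, 2)
                       if p[0] != p[1])

total_w = sum(word_counts.values())
total_p = sum(pair_counts.values())

def pmi(w1, w2):
    """PMI(w1, w2) = log[ p(w1, w2) / (p(w1) p(w2)) ]."""
    p_xy = pair_counts[frozenset((w1, w2))] / total_p
    return math.log(p_xy / ((word_counts[w1] / total_w) *
                            (word_counts[w2] / total_w)))
```

High PMI means the pair co-occurs more often than its members' individual frequencies would predict, which is why it serves as a proxy for relational association.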

Representation Learning \| Word Embeddings:

• “A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why – and when – linear operators correspond to non-linear embedding models such as Skip-Gram with Negative Sampling (SGNS). We provide a rigorous explanation of this phenomenon without making the strong assumptions that past work has made about the vector space and word distribution. Our theory has several implications. Past work has often conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel theoretical justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, providing rigorous justification for its use in capturing word dissimilarity.”

• [Section 5] Even though vector algebra is surprisingly effective at solving word analogies, the csPMI Theorem reveals two reasons for why an analogy may be unsolvable in a given embedding space: polysemy and corpus bias. …

• Dynamic Meta-Embeddings for Improved Sentence Representations (Kyunghyun Cho and colleagues at Facebook AI Research; Sep 2018) [projectcodediscussion]

• “While one of the first steps in many NLP systems is selecting what pre-trained word embeddings to use, we argue that such a step is better left for neural networks to figure out by themselves. To that end, we introduce dynamic meta-embeddings, a simple yet effective method for the supervised learning of embedding ensembles, which leads to state-of-the-art performance within the same model class on a variety of tasks. We subsequently show how the technique can be used to shed new light on the usage of word embeddings in NLP systems.”

“We argue that the decision of which word embeddings to use in what setting should be left to the neural network. While people usually pick one type of word embeddings for their NLP systems and then stick with it, we find that dynamically learned meta-embeddings lead to improved results. In addition, we showed that the proposed mechanism leads to better interpretability and insightful linguistic analysis. We showed that the network learns to select different embeddings for different data, different domains and different tasks. We also investigated embedding specialization and examined more closely whether contextualization helps. To our knowledge, this work constitutes the first effort to incorporate multi-modal information on the language side of image-caption retrieval models; and the first attempt at incorporating meta-embeddings into large-scale sentence-level NLP tasks.”

[Image source. Click image to open in new window.]
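A hedged sketch of the gating idea behind dynamic meta-embeddings: project each embedding set into a shared space, then take a per-token softmax-gated weighted sum over the sets. Random weights stand in for learned ones, and the paper's exact parameterization may differ:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Two hypothetical pretrained embedding sets for the same 3-token sentence,
# with different dimensionalities (in the spirit of mixing e.g. GloVe and
# fastText vectors).
emb_a = rng.normal(size=(3, 10))
emb_b = rng.normal(size=(3, 6))
d = 8
P_a = rng.normal(0, 0.1, size=(10, d))   # learned projections to a common space
P_b = rng.normal(0, 0.1, size=(6, d))
attn = rng.normal(0, 0.1, size=d)        # learned gating vector

def meta_embed(sets_and_projections):
    """Project each embedding set into the shared space, then take a
    per-token softmax-gated weighted sum over the sets."""
    projected = [e @ P for e, P in sets_and_projections]   # each (3, d)
    out = np.zeros_like(projected[0])
    for t in range(out.shape[0]):
        scores = np.array([proj[t] @ attn for proj in projected])
        weights = softmax(scores)          # which embedding set to trust here
        out[t] = sum(w * proj[t] for w, proj in zip(weights, projected))
    return out

meta = meta_embed([(emb_a, P_a), (emb_b, P_b)])
```

Because the gate is computed per token, the network can favor different embedding sets for different words, domains, or tasks, which is the interpretability hook the authors highlight.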
• End-to-End Retrieval in Continuous Space (Google AI: Nov 2018)

“Most text-based information retrieval (IR) systems index objects by words or phrases. These discrete systems have been augmented by models that use embeddings to measure similarity in continuous space. But continuous-space models are typically used just to re-rank the top candidates. We consider the problem of end-to-end continuous retrieval, where standard approximate nearest neighbor (ANN) search replaces the usual discrete inverted index, and rely entirely on distances between learned embeddings. By training simple models specifically for retrieval, with an appropriate model architecture, we improve on a discrete baseline by 8% and 26% (MAP) on two similar-question retrieval tasks. We also discuss the problem of evaluation for retrieval systems, and show how to modify existing pairwise similarity datasets for this purpose.”

[Image source. Click image to open in new window.]
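The continuous retrieval setup reduces to nearest-neighbour search over learned embeddings. A sketch with an exact scan standing in for the ANN index (all embeddings here are random and hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical learned document embeddings: 1000 indexed items, 8-dim,
# L2-normalised so that inner product equals cosine similarity.
doc_embs = rng.normal(size=(1000, 8))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def retrieve(query_emb, k=5):
    """Exact nearest-neighbour search by inner product; a production system
    would replace this full scan with an approximate nearest neighbor
    (ANN) index."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = doc_embs @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

hits = retrieve(rng.normal(size=8))
```

The point of the paper is that when the embeddings are trained specifically for retrieval, this continuous index can replace the discrete inverted index end-to-end rather than merely re-rank its candidates.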
• PAC-Bayes Analysis of Sentence Representation (Feb 2019).  “Learning sentence vectors from an unlabeled corpus has attracted attention because such vectors can represent sentences in a lower dimensional and continuous space. Simple heuristics using pre-trained word vectors are widely applied to machine learning tasks. However, they are not well understood from a theoretical perspective. We analyze learning sentence vectors from a transfer learning perspective by using a PAC-Bayes bound that enables us to understand existing heuristics. We show that simple heuristics such as averaging and inverse document frequency weighted averaging are derived by our formulation. Moreover, we propose novel sentence vector learning algorithms on the basis of our PAC-Bayes analysis.” … “Our analysis of sentence vectors is a first step towards understanding of practical sentence vector representation learning.”

### Addressing Hypernymy and Polysemy with Word Embeddings

Word embeddings have many uses in NLP. For example, polysemy – words or phrases with different, but related, meanings [e.g. “Washington” may refer to “Washington, DC” (location) or “George Washington” (person)] – poses one of many challenges to NLP. Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal.

• Wikipedia2Vec: An Optimized Implementation for Learning Embeddings from Wikipedia (Dec 2018) [project/code;  discussion: reddit  |  Hacker News  |  embedding projector], by Studio Ousia + Japanese academic collaborators, presented Wikipedia2Vec, an open source tool for learning embeddings of words and entities from Wikipedia. This tool enabled users to easily obtain high-quality embeddings of words and entities from a Wikipedia dump with a single command. The learned embeddings could be used as features in downstream natural language processing (NLP) models. The tool can be installed via PyPI [Python]. The source code, documentation, and pretrained embeddings for 12 major languages can be obtained at GitHub.

[Image source. Click image to open in new window.]

• Author comment (Ikuya Yamada, Hacker News, in response to “I’m a little confused, its just word2vec pretrained on content of wikipedia?” → “ Unlike word2vec, this tool learns embeddings of entities (i.e., entries in Wikipedia) as well as words. And although the model implemented in this tool is based on Word2vec’s skip-gram model, it is extended using two submodels (Wikipedia link graph model and anchor context model). Please refer to the documentation for details.”

A probabilistic extension of fastText – Probabilistic FastText for Multi-Sense Word Embeddings – can produce accurate representations of rare, misspelt, and unseen words. Probabilistic FastText achieved state of the art performance on benchmarks that measure the ability to discern different meanings. The proposed model was the first to achieve multi-sense representations while having enriched semantics on rare words:

• “Our multimodal word representation can also disentangle meanings, and is able to separate different senses in foreign polysemies. In particular, our models attain state-of-the-art performance on SCWS, a benchmark to measure the ability to separate different word meanings, achieving 1.0% improvement over a recent density embedding model W2GM (Athiwaratkun and Wilson, 2017). To the best of our knowledge, we are the first to develop multi-sense embeddings with high semantic quality for rare words.”

• “… we show that our probabilistic representation with subword mean vectors with the simplified energy function outperforms many word similarity baselines and provides disentangled meanings for polysemies.”

• “We show that our embeddings learn the word semantics well by demonstrating meaningful nearest neighbors. Table 1 shows the nearest neighbors of polysemous words such as ‘rock’, ‘star’, and ‘cell’. We note that subword embeddings prefer words with overlapping characters as nearest neighbors. For instance, ‘rock-y’, ‘rockn’, and ‘rock’ are all close to the word ‘rock’. For the purpose of demonstration, we only show words with meaningful variations and omit words with small character-based variations previously mentioned. However, all words shown are in the top-100 nearest words. We observe the separation in meanings for the multi-component case; for instance, one component of the word ‘bank’ corresponds to a financial bank whereas the other component corresponds to a river bank. The single-component case also has interesting behavior. We observe that the subword embeddings of polysemous words can represent both meanings. For instance, both ‘lava-rock’ and ‘rock-pop’ are among the closest words to ‘rock’.”

Context Mover’s Distance & Barycenters: Optimal Transport of Contexts for Building Representations (Aug 2018) [discussion] proposed a unified framework for building unsupervised representations of individual objects or entities (and their compositions), by associating with each object both a distributional as well as a point estimate (vector embedding). Their method gives a novel perspective for building rich and powerful feature representations that simultaneously capture uncertainty (via a distributional estimate) and interpretability (with the optimal transport map). Among their various applications (e.g. entailment detection; semantic similarity), they proposed to represent sentences as probability distributions to better capture the inherent uncertainty and polysemy, arguing that histograms (or probability distributions) over embeddings allow the capture of more of this information than point-wise embeddings alone. They discussed hypernymy detection in Section 7; for this purpose, they relied on a recently proposed model which explicitly modeled what information is known about a word by interpreting each entry of the embedding as the degree to which a certain feature is present.

[Image source. Click image to open in new window.]

This image is particularly illustrative (click, and click again, to enlarge):

[Image source. Click image to open in new window.]

• “While existing methods represent each entity of interest (e.g., a word) as a single point in space (e.g., its embedding vector), we here propose a fundamentally different approach. We represent each entity based on the histogram of contexts (co-occurring with it), with the contexts themselves being points in a suitable metric space. This allows us to cast the distance between histograms associated with the entities as an instance of the optimal transport problem [see Section 3 for a background on optimal transport]. For example, in the case of words as entities, the resulting framework then intuitively seeks to minimize the cost of moving the set of contexts of a given word to the contexts of another [note their Fig. 1]. Note that the contexts here can be words, phrases, sentences, or general entities co-occurring with our objects to be represented, and these objects further could be any type of events extracted from sequence data …”

• Regarding semantic embedding, or word sense disambiguation (not explicitly discussed in the paper), their Fig.2 [Illustration of three words, each with their distributional estimates (left), as well as the point estimates of the relevant contexts (middle), as well as joint representation (right)] is very interesting: words in vector space, along with a histogram of their probability distributions over those embedded spaces.

• “Software Release. We plan to make all our code (for all these parts) and our pre-computed histograms (for the mentioned datasets) publicly available on GitHub soon.”  [Not available: 2018-10-07]

Early in 2018, pretrained language models such as ELMo offered another approach to solving the polysemy problem.

### Word Sense Disambiguation

Related to polysemy and named entity disambiguation is word sense disambiguation (WSD). Learning Graph Embeddings from WordNet-based Similarity Measures described a new approach, path2vec, for learning graph embeddings that relied on structural measures of node similarities for generation of training data. Evaluations of the proposed model on semantic similarity and WSD tasks showed that path2vec yielded state of the art results.

In January 2018 Ruslan Salakhutdinov and colleagues proposed a probabilistic graphical model that leveraged a topic model to design a WSD system (WSD-TM) that scaled linearly with the number of words in the context. Their logistic normal topic model – a variant of latent Dirichlet allocation in which the topic proportions for a document were replaced by WordNet synsets (sets of synonyms) – incorporated semantic information about synsets as its priors. WSD-TM outperformed state of the art knowledge-based WSD systems.

#### Probing the Role of Attention in Word Sense Disambiguation

Recent work has shown that the encoder-decoder attention mechanisms in neural machine translation (NMT) are different from the word alignment in statistical machine translation. An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in Neural Machine Translation (Oct 2018) focused on analyzing encoder-decoder attention mechanisms, in the case of word sense disambiguation (WSD) in NMT models. They hypothesized that attention mechanisms pay more attention to context tokens when translating ambiguous words, and explored the attention distribution patterns when translating ambiguous nouns. Counterintuitively, they found that attention mechanisms were likely to distribute more attention to the ambiguous noun itself rather than context tokens, in comparison to other nouns. They concluded that the attention mechanism was not the main mechanism used by NMT models to incorporate contextual information for WSD. The experimental results suggested that NMT models learned to encode the contextual information necessary for WSD in the encoder hidden states. For the attention mechanism in Transformer models, they revealed that the first few layers gradually learn to “align” source and target tokens, and the last few layers learn to extract features from the related but unaligned context tokens.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

### Applications of Embeddings in the Biological Sciences

While predicting protein 3D structure from primary amino acid sequences has been a long-standing objective in bioinformatics, definitive solutions remain to be found  [discussion]. The most reliable approaches currently available involve homology modeling, which allows assigning a known protein structure to an unknown protein, provided that there is detectable sequence similarity between the two. When homology modeling is not viable, de novo techniques, based on physics-based potentials or knowledge-based potentials, are needed. Unfortunately proteins are very large molecules, and the huge number of possible conformations, even for relatively small proteins, makes it prohibitive to fold them even on customized computer hardware.

To address this challenge, knowledge-based potentials can be learned from statistics or machine learning methods to infer useful information from known examples of protein structures. This information can be used to constrain the problem, greatly reducing the number of samples that need to be evaluated when dealing exclusively with physics-based potentials. Multiple sequence alignments (MSA) consist of aligned sequences homologous to the target protein, compressed into position-specific scoring matrices (PSSM, also called sequence profiles) using the fraction of occurrences of different amino acids in the alignment for each position in the sequence. More recently, contact map prediction methods have been at the center of renewed interest; however, their impressive performance is correlated with the number of sequences in the MSA, and is not as reliable when few sequences are related to the target.

rawMSA: proper Deep Learning makes protein sequence profiles and feature extraction obsolete introduced a new approach, called rawMSA, for the de novo prediction of structural properties of proteins. The core idea behind rawMSA was to borrow the word2vec word embedding approach of Mikolov et al. (Efficient Estimation of Word Representations in Vector Space), which they used to convert each character (amino acid residue) in the MSA into a floating-point vector of variable size, thus representing the residues by the structural property they were trying to predict. Test results from deep neural networks based on this concept showed that rawMSA matched or outperformed the state of the art on three tasks: predicting secondary structure, relative solvent accessibility, and residue-residue contact maps.
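The core trick – treating residues like word tokens and replacing one-hot or profile inputs with learned dense vectors – can be sketched in a few lines. The alphabet, toy alignment, and random stand-in for trained embeddings below are illustrative assumptions, not the rawMSA implementation:

```python
import numpy as np

# Hypothetical 20-letter amino acid alphabet plus a gap symbol, as used in MSAs.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def embed_msa(msa, embedding, idx=IDX):
    """Map each residue character in an MSA to its dense vector.

    msa       : list of aligned sequences (equal-length strings)
    embedding : (len(alphabet), d) array of learned float vectors
    returns   : (num_sequences, alignment_length, d) array
    """
    rows = [[embedding[idx[ch]] for ch in seq] for seq in msa]
    return np.asarray(rows)

rng = np.random.default_rng(0)
d = 8                                    # embedding size (a free hyperparameter)
E = rng.normal(size=(len(ALPHABET), d))  # stands in for trained embeddings

msa = ["MKV-LL", "MRV-IL", "MKVAIL"]     # toy alignment: 3 sequences x 6 columns
X = embed_msa(msa, E)
print(X.shape)  # (3, 6, 8)
```

In a trained rawMSA-style network the matrix `E` would be learned end-to-end, so that residues end up positioned in the embedding space according to the structural property being predicted.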

### Probing the Effectiveness of Word Embeddings

A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why – and when – linear operators correspond to non-linear embedding models such as Skip-Gram with Negative Sampling (SGNS). Towards Understanding Linear Word Analogies (Oct 2018) provided a rigorous explanation of this phenomenon without making the strong assumptions that past work has made about the vector space and word distribution. “Past work has often conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel theoretical justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, providing rigorous justification for its use in capturing word dissimilarity.”

• “In this paper, we provided a rigorous explanation of why – and when – word analogies can be solved using vector algebra. More specifically, we proved that an analogy holds over a set of word pairs in an SGNS or GloVe embedding space with no reconstruction error iff the co-occurrence shifted PMI is the same for every word pair. Our theory had three implications. … [See comments below.] … Most importantly, our theory did not make the unrealistic assumptions that past theories have made about the word distribution and vector space, making it much more tenable than previous explanations.”

• [discussion: Hacker News]: “Hi, first author here! Feel free to ask any questions. TL;DR: We prove that linear word analogies hold over a set of ordered pairs (e.g., $\small \text{(Paris, France), (Ottawa, Canada)}, \ldots$) in an SGNS or GloVe embedding space with no reconstruction error when $\small \text{PMI}(x,y) + \log p(x,y)$ is the same for every word pair $\small (x,y)$. We call this term the csPMI (co-occurrence shifted PMI). This has a number of interesting implications:

1. It implies that Pennington et al. [Socher; Manning] (authors of GloVe) had the right intuition about why these analogies hold.
2. Adding two word vectors together to compose them makes sense, because you’re implicitly downweighting the more frequent word – like TF-IDF or SIF would do explicitly.
3. Using Euclidean distance to measure word dissimilarity makes sense because the Euclidean distance is a linear function of the negative csPMI.”
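The vector-arithmetic analogy test the paper analyzes is easy to reproduce on toy data. The sketch below constructs embeddings in which the city→country pairs share a (roughly) common offset – a stand-in for the equal-csPMI condition – and solves analogies by nearest neighbor; all vectors and names here are invented for illustration:

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the word w maximizing cos(v_b - v_a + v_c, v_w), excluding a, b, c."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy vectors constructed so the "capital-of" offset is (approximately) shared
# across pairs, mimicking the equal-csPMI condition from the paper.
rng = np.random.default_rng(1)
offset = rng.normal(size=16)
vecs = {}
for city, country in [("paris", "france"), ("ottawa", "canada"), ("tokyo", "japan")]:
    v = rng.normal(size=16)
    vecs[city] = v
    vecs[country] = v + offset + 0.01 * rng.normal(size=16)

print(solve_analogy("paris", "france", "ottawa", vecs))  # → canada
```

With real SGNS or GloVe vectors the offsets are only approximately equal, which is exactly the reconstruction-error caveat in the theorem.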

## Memory Based Architectures

For a more detailed description of how neural networks “learn” see my blog post How do Neural Networks "Remember"?  In essence, the answer is that memory forms during the training of the parameters (i.e., the trained weights); the matrix of trained weights is the memory.

Memory (the ability to recall previous facts and knowledge) is a crucial requirement for natural language understanding, reasoning (the process of forming an answer to a new question by manipulating previously acquired knowledge), and the guidance of decision making. Without memory, agents must act reflexively according only to their immediate percepts and cannot execute plans that occur over extended time intervals (Neural Map: Structured Memory for Deep Reinforcement Learning).

Broadly speaking, computational approaches to memory include:

• internal, volatile “short-term” memories algorithmically generated within RNN, LSTM, and self-attention modules;

• external, volatile memories algorithmically generated by neural Turing machines, memory networks, and differentiable neural computers; and

• external, permanent long-term “memories” embedded within knowledge bases and knowledge graphs (for relevant discussion, see my Entity Linking and Text Grounding subsection).

Neural Architectures with Memory  [local copy] provides an excellent overview of neural memory architectures.

Short term memory architectures are commonly employed in the various models discussed in this REVIEW. For example, RNN, LSTM, dynamic memory network (DMN), and similar modules serve as “working memory” in summarization, question answering and other tasks. Long short-term memory (LSTM) networks are a specialized type of recurrent neural network (RNN) that are capable of learning long term dependencies as well as short term memories of recent transactions.

However, most machine learning models lack an easy way to read and write to part of a (potentially very large) long-term memory component, and to combine this seamlessly with inference. While RNNs can be trained to predict the next word to output after reading a stream of words, their memory (encoded by hidden states and weights) is typically too small and is not compartmentalized enough to accurately remember facts from the past (as the knowledge is compressed into dense vectors, from which those memories are not easily accessed). RNNs are also known to have difficulty performing memorization – for example, the simple copying task of outputting the same input sequence they have just read.

Neural networks that utilize external memories can be classified into two main categories: memories with write operators, and those without (Neural Map: Structured Memory for Deep Reinforcement Learning). Regarding the latter type, memory networks (MemNN, introduced by Jason Weston et al. at Facebook AI Research) are a class of deep networks that jointly learn how to reason with inference components combined with a long-term memory component that can be written to and read from, with the goal of using it for prediction. Instead of using a recurrent matrix to retain information through time, memory networks learn how to operate effectively with the memory component.

Memory networks employ explicit addressable memory, which fixes which memories are stored. For example, at each time step, the memory network would store the past $\small M$ states that have been seen in an environment. Therefore, what is learned by the network is how to access or read from this fixed memory pool, rather than what contents to store within it. In sidestepping the difficulty of learning what information to store in memory, memory networks introduce two main disadvantages: storing a potentially significant amount of redundant information; and, relying on domain experts to choose what to store in the memory (Neural Map: Structured Memory for Deep Reinforcement Learning). The memory network approach has been successful in language modeling and question answering, and was shown to be a successful memory for deep reinforcement learning agents in complex 3D environments (Neural Map: Structured Memory for Deep Reinforcement Learning and references therein).

• Tracking the World State with Recurrent Entity Networks (May 2017) [OpenReview; non-author code here and here], by Jason Weston and Yann LeCun, introduced the Recurrent Entity Network (EntNet). EntNet was equipped with a dynamic long-term memory, which allowed it to maintain and update a representation of the state of the world as it received new data. For language understanding tasks, it could reason on the fly as it read text, not just when it was required to answer a question or respond, as was the case for Jason Weston’s MemN2N memory network (“End-To-End Memory Networks ”). Like a neural Turing machine or differentiable neural computer, EntNet maintained a fixed size memory and could learn to perform location and content-based read and write operations. However, unlike those models, it had a simple parallel architecture in which several memory locations could be updated simultaneously. EntNet set a new state of the art on the bAbI tasks, and was the first method to solve all the tasks in the 10k training examples setting. Weston and LeCun also demonstrated that EntNet could solve a reasoning task which required a large number of supporting facts, which other methods were not able to solve, and could generalize past its training horizon.

In contrast to memory networks, external neural memories having write operations are potentially far more efficient, since they can learn to store salient information for unbounded time steps and ignore any other useless information, without explicitly needing any a priori knowledge of what to store. A prominent research direction on write-based architectures has been recurrent architectures that mimic computer memory systems that explicitly separate memory from computation, analogous to how a CPU (processor/controller) interacts with an external memory (tapes; RAM; GPU) in digital computers. One such model, the Differentiable Neural Computer (DNC) – and its predecessor the Neural Turing Machine (NTM) – structures the architecture to explicitly separate memory from computation. The DNC has a recurrent neural controller that can access an external memory resource by executing differentiable read and write operations. This allows the DNC to act and memorize in a structured manner resembling a computer processor, where read and write operations are sequential and data is stored distinctly from computation. The DNC has been used successfully to solve complicated algorithmic tasks, such as finding shortest paths in a graph or querying a database for entity relations.
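A minimal sketch of the content-based addressing such controllers use might look like the following. This is not the full DNC (which adds temporal link matrices, usage-based allocation, and multiple read heads); the `beta` sharpness parameter and the toy identity-matrix memory are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(memory, key, beta=5.0):
    """Differentiable content-based read: attend to slots similar to `key`.

    memory : (N, d) matrix of memory slots
    key    : (d,) query emitted by the controller
    beta   : sharpness of the addressing distribution
    """
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sims = memory @ key / norms            # cosine similarity per slot
    w = softmax(beta * sims)               # addressing weights, sum to 1
    return w @ memory, w                   # read vector, weights

def blended_write(memory, w, erase, add):
    """NTM-style write: erase then add, weighted by the addressing vector w."""
    memory = memory * (1 - np.outer(w, erase))
    return memory + np.outer(w, add)

M = np.eye(4)                              # 4 slots, 4-dim contents
r, w = content_read(M, np.array([0., 0., 1., 0.]))
print(np.argmax(w))  # 2 — the read focuses on the matching slot
```

Because every step (similarity, softmax, weighted sum, erase/add) is differentiable, the controller can be trained end-to-end with gradient descent to decide what to read and write.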

There has been extensive work in the NLP domain regarding the use of neural Turing machines  (NTM), and to a lesser extent, differentiable neural computers  (DNC). For a slightly dated (current to ~2017) summary listing of NTM and DNC papers, see my web page (this is a huge file: on slow connections, wait for the page to fully load). Notable among those papers are the following items.

• Survey of Reasoning using Neural Networks (Mar 2017) provided an excellent summary – including relevant background – of neural network approaches to reasoning and inference with a focus on the need for memory networks (e.g. the MemN2N end-to-end memory network, and large external memories). Among the algorithms surveyed and compared were an LSTM, an NTM with an LSTM controller, and an NTM with a feedforward controller (demonstrating the superior performance of the NTM over the LSTM).

• The model described in Robust and Scalable Differentiable Neural Computer for Question Answering (Jul 2018) was designed as a general problem solver which could be used in a wide range of tasks. Their GitHub repository contains an implementation of an Advanced Differentiable Neural Computer (ADNC), providing more robust and scalable use in question answering.

[Image source. Click image to open in new window.]

LSTMs were used in Augmenting End-to-End Dialog Systems with Commonsense Knowledge (Feb 2018), which investigated the impact of providing commonsense knowledge about concepts (integrated as external memory) on human-computer conversation. Their method was based on a NIPS 2015 workshop paper, Incorporating Unstructured Textual Knowledge Sources into Neural Dialogue Systems, which described a method to leverage additional information about a topic using a simple combination of hashing and TF-IDF to quickly identify the most relevant portions of text from the external knowledge source, based on the current context of the dialogue. In that work, three recurrent neural networks (RNNs) were trained: one to encode the selected external information, one to encode the context of the conversation, and one to encode a response to the context. Outputs of these modules were combined to produce the probability that the response was the actual next utterance given the context.
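The retrieval step described above – scoring knowledge snippets against the current dialogue context with TF-IDF and keeping the best match – can be sketched with a self-contained toy implementation. The paper's actual pipeline also used hashing and trained RNN encoders; the knowledge snippets and query below are invented examples:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF bag-of-words vectors for a list of text snippets."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_relevant(context, knowledge):
    """Return the knowledge snippet most similar to the dialogue context."""
    vecs, idf = tfidf_vectors(knowledge)
    ctx = Counter(context.lower().split())
    ctx_vec = {t: c * idf.get(t, 0.0) for t, c in ctx.items()}
    scores = [cosine(ctx_vec, v) for v in vecs]
    return knowledge[scores.index(max(scores))]

kb = ["coffee contains caffeine which keeps people awake",
      "tea is brewed from dried leaves",
      "dogs are loyal domestic animals"]
print(most_relevant("why does coffee keep me awake", kb))
```

In the full system the snippet selected this way would be encoded by one RNN and combined with the context and candidate-response encoders, as described above.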

Recent Advances in Neural Program Synthesis (Feb 2018) [discussion] surveys various recurrent, memory, attentional, pointer, neural Turing machine (NTM), and differentiable neural computer (DNC) architectures.

[Image source. Click image to open in new window.]

### Attention and Memory

Abbreviated Review: Attentional Mechanisms

In mid-2017, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The inherently sequential nature of RNNs precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

An attention mechanism  [local copy] takes into account the input from several time steps to make one prediction. It distributes attention over several hidden states, assigning different weights (importance) to those inputs. An attention mechanism allows a method to focus on task-relevant parts of the input {sequence | graph}, helping it to make better decisions.

The attention mechanism helps the model handle long source sentences by creating shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element. Because the context vector has access to the entire input sequence, we need not worry about forgetting (the length of these sequences was further increased in the recent Transformer-XL model). The alignment between the source and target is learned and controlled by the context vector, which processes three pieces of information: the encoder hidden states, the decoder hidden states, and the alignment between the source and the target.

The major component in Transformer is the multi-head self-attention mechanism (self-attention is an attention mechanism relating different positions of a single input sequence). Transformer views the encoded representation of the input as a set of key-value pairs ($\small \mathbf{K},\mathbf{V}$), both of dimension $\small n$ (input sequence length); in the context of neural machine translation, both the keys and values are the encoder hidden states. In the decoder, the previous output is compressed into a query ($\small \mathbf{Q}$ of dimension $\small m$) and the next output is produced by mapping this query against the set of keys and values. Transformer adopts the scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with all the keys.

\small \begin{align} \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{QK}^\top}{\sqrt{n}} \right) \mathbf{V} \end{align}      [Source]
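A minimal numpy rendering of this formula might look like the following. Note that the original paper scales by the key dimension, $\sqrt{d_k}$, rather than by the sequence length; the shapes and random inputs are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (numpy sketch)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (m, n) query-key affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 3))   # 5 values of dimension d_v = 3
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # (2, 3) [1. 1.]
```

Each output row is thus a convex combination of the value vectors, with the mixing weights determined by query-key similarity.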

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Multi-Head Attention consists of several attention layers running in parallel.  [Image source. Click image to open in new window.]

Rather than only computing the attention once, the multi-head mechanism runs through the scaled dot-product attention multiple times in parallel. The independent attention outputs are simply concatenated and linearly transformed into the expected dimensions. “Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.”
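Concatenating several independent attention heads and projecting back to the model dimension can be sketched as follows; the head count, shapes, and random weight matrices are illustrative stand-ins for trained parameters:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head)."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, heads, Wo):
    """Run h scaled dot-product attentions in parallel, concatenate, project.

    heads : list of (Wq, Wk, Wv) projection triples, one per head
    Wo    : (h * d_k, d_model) final linear projection
    """
    outs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo    # (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 6, 8, 2
d_k = d_model // h
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(n, d_model))
Y = multi_head_attention(X, heads, Wo)
print(Y.shape)  # (6, 8)
```

Each head projects the input into its own subspace before attending, which is what lets the heads attend to different kinds of relationships at different positions.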

The Transformer uses multi-head attention in three different ways [from Section 3.2.3 in Attention Is All You Need]:

• First, in “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Oct 2016).

• In Transformer architectures, memory is maintained in the key:value pairs. Transformer imitates the classical attention mechanism [e.g., Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate (Sep 2014, updated May 2016)] where, in encoder-decoder attention layers, queries come from the previous decoder layer, and the (memory) keys and values come from the output of the encoder. Therefore, each position in the decoder can attend over all positions in the input sequence.

[Section 3.1] “… The probability $\small \alpha_{ij}$, or its associated energy $\small e_{ij}$, reflects the importance of the annotation $\small h_j$ with respect to the previous hidden state $\small s_{i-1}$ in deciding the next state $\small s_i$ and generating $\small y_i$. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.”

• Second, the encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

• (Third) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $\small -\infty$) all values in the input of the softmax which correspond to illegal connections.
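The causal mask described in this step is easy to demonstrate in isolation. A hedged numpy sketch, using uniform (zero) scores so the surviving weights are obvious:

```python
import numpy as np

def masked_self_attention_weights(scores):
    """Apply the decoder's causal mask: position i may attend only to j <= i."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # illegal connections -> -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# With uniform scores, row i spreads its attention evenly over positions 0..i.
w = masked_self_attention_weights(np.zeros((4, 4)))
print(np.round(w, 2))  # lower-triangular, each row summing to 1
```

Since `exp(-inf)` is exactly zero, the masked positions receive zero weight after the softmax, so no information flows leftward from future tokens.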


Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. It has been shown to be very useful in machine reading, abstractive summarization, or image description generation.

In June 2017 Vaswani et al. at Google proposed a new simple network architecture, Transformer, that was based solely on attention mechanisms – entirely dispensing with recurrence and convolutions, and allowing significantly more parallelization [Attention Is All You Need (Jun 2017; updated Dec 2017);  code]. Transformer has been shown to perform strongly on machine translation, document generation, syntactic parsing and other tasks. Experiments on two machine translation tasks showed these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Transformer also generalized well to other tasks; for example, it was successfully applied to English constituency parsing, both with large and limited training data.

In July 2018, nearly a year after they introduced their original “Attention Is All You Need” Transformer architecture (Jun 2017; updated Dec 2017), Google Brain/DeepMind released an updated Universal Transformer version, discussed in the Google AI blog post Moving Beyond Translation with the Universal Transformer [Aug 2018;  discussion].

Attention Models in Graphs: A Survey (Jul 2018) provides a recent focused survey of the literature on the emerging field of graph attention models.

Abbreviated Review: Memory Architectures

Jason Weston et al. (Facebook AI Research) introduced Memory Networks (MemNN) in Oct 2014 (updated Nov 2015).

[Image source. Click image to open in new window.]

Although that paper lacked a schematic, the memory network architecture is well described in the paper and in this image:

["memory network (MemNN)" (image source; click image to open in new window)]

• A memory network consists of a memory $\small \mathbf{m}$ (an array of objects (for example an array of vectors or an array of strings) indexed by $\small \mathbf{m}_i$) and four (potentially learned) components $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$ and $\small \mathbf{R}$ as follows:

• $\small \mathbf{I}$ (input feature map): converts the incoming input to the internal feature representation.
• $\small \mathbf{G}$ (generalization): updates old memories given the new input. We call this generalization as there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use.
• $\small \mathbf{O}$ (output feature map): produces a new output (in the feature representation space), given the new input and the current memory state.
• $\small \mathbf{R}$ (response): converts the output into the response format desired. For example, a textual response or an action.
• $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$ and $\small \mathbf{R}$ can all potentially be learned components and make use of any ideas from the existing machine learning literature. In question answering systems, for example, the components may be instantiated as follows:

• $\small \mathbf{I}$ can make use of standard pre-processing such as parsing, coreference, and entity resolution. It could also encode the input into an internal feature representation by converting from text to a sparse or dense feature vector.
• The simplest form of $\small \mathbf{G}$ is to store $\small \mathbf{I}(\mathbf{x})$ in a “slot” in the memory: $\small \mathbf{m}_{\mathbf{H}(\mathbf{x})} = \mathbf{I}(\mathbf{x})$, where $\small \mathbf{H}(\cdot)$ is a function selecting the slot. That is, $\small \mathbf{G}$ updates the index $\small \mathbf{H}(\mathbf{x})$ of $\small \mathbf{m}$, but all other parts of the memory remain untouched.
Restated yet again: the simplest form of $\small \mathbf{G}$ is to introduce a function $\small \mathbf{H}$ which maps the internal feature representation produced by $\small \mathbf{I}$ to an individual memory slot, and just updates the memory at $\small \mathbf{H(I(x))}$.
• $\small \mathbf{O}$ reads from memory and performs inference to deduce the set of relevant memories needed to perform a good response.
• $\small \mathbf{R}$ would produce the actual wording of the question-answer based on the memories found by $\small \mathbf{O}$. For example, $\small \mathbf{R}$ could be an RNN conditioned on the output of $\small \mathbf{O}$.
• Note that the original memory network (MemNN, above) lacked an attention mechanism.

• When the components $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$, & $\small \mathbf{R}$ (above) were neural networks, the authors (Weston et al.) described the resulting system as a memory neural network (MemN2N), which they built for QA (question answering) problems.

The highly cited MemN2N architecture (End-To-End Memory Networks (Nov 2015) [code;  non-author code here, here and here;  discussion here and here]), introduced by Jason Weston and colleagues at Facebook AI Research, is a recurrent attention model over an external memory. The model involved multiple computational steps (termed “hops”) per output symbol. In this RNN architecture, the recurrence read from a possibly large external memory multiple times before outputting a symbol. The architecture was trained end-to-end and hence required significantly less supervision during training; the flexibility of the model allowed them to apply it to tasks as diverse as synthetic question answering and language modeling.

[Image source. Click image to open in new window.]

For question answering MemN2N was competitive with memory networks but with less supervision; for language modeling, MemN2N demonstrated performance comparable to RNN and LSTM on the Penn Treebank and Text8 datasets. In both cases they showed that the key concept of multiple computational hops yielded improved results. Unlike a traditional RNN, the average activation weight of memory positions during the memory hops did not decay exponentially: it had roughly the same average activation across the entire memory (Fig. 3 in the image, above), which may have been the source of the observed improvement in language modeling.

“We also vary the number of hops and memory size of our MemN2N, showing the contribution of both to performance; note in particular that increasing the number of hops helps. In Fig. 3, we show how MemN2N operates on memory with multiple hops. It shows the average weight of the activation of each memory position over the test set. We can see that some hops concentrate only on recent words, while other hops have more broad attention over all memory locations, which is consistent with the idea that successful language models consist of a smoothed n-gram model and a cache. Interestingly, it seems that those two types of hops tend to alternate. Also note that unlike a traditional RNN, the cache does not decay exponentially: it has roughly the same average activation across the entire memory. This may be the source of the observed improvement in language modeling.”

[Image source. Click image to open in new window.]

Here is the MemN2N architecture, from the paper End-To-End Memory Networks:

[Image source. Click image to open in new window]

• “Our model takes a discrete set of inputs $\small x_1, \ldots, x_n$ that are to be stored in the memory, a query $\small q$, and outputs an answer $\small a$. Each of the $\small x_i$, $\small q$, and $\small a$ contains symbols coming from a dictionary with $\small V$ words. The model writes all $\small x$ to the memory up to a fixed buffer size, and then finds a continuous representation for the $\small x$ and $\small q$. The continuous representation is then processed via multiple hops to output $\small a$. This allows backpropagation of the error signal through multiple memory accesses back to the input during training.”

• “… The entire set of $\small \{x_i\}$ are converted into memory vectors $\small \{m_i\}$ of dimension $\small d$ computed by embedding each $\small x_i$ in a continuous space, in the simplest case, using an embedding matrix $\small \mathbf{A}$ (of size $\small d \times V$). …”  ←  i.e., the vectorized input is stored as external memory
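Putting the quoted pieces together, a single memory hop (embedding lookup, softmax match over memories, weighted output, answer projection) might be sketched as follows. The bag-of-words encoding, shapes, and random matrices are illustrative, not the trained model:

```python
import numpy as np

def memn2n_hop(story_ids, q_ids, A, C, B, W):
    """One memory hop of an end-to-end memory network (single-hop sketch).

    story_ids : list of bag-of-words id-lists, one per memory sentence
    q_ids     : id-list for the query
    A, C      : (V, d) input/output memory embedding matrices
    B         : (V, d) query embedding matrix
    W         : (d, V) final answer projection
    """
    m = np.array([A[ids].sum(axis=0) for ids in story_ids])  # memory vectors m_i
    c = np.array([C[ids].sum(axis=0) for ids in story_ids])  # output vectors c_i
    u = B[q_ids].sum(axis=0)                                 # query embedding
    p = np.exp(m @ u); p /= p.sum()                          # match over memories
    o = p @ c                                                # weighted output
    logits = (o + u) @ W                                     # answer distribution
    return logits, p

rng = np.random.default_rng(0)
V, d = 30, 16
A, C, B = (rng.normal(scale=0.1, size=(V, d)) for _ in range(3))
W = rng.normal(scale=0.1, size=(d, V))
logits, p = memn2n_hop([[1, 2, 3], [4, 5, 6]], [2, 3], A, C, B, W)
print(logits.shape, round(float(p.sum()), 6))  # (30,) 1.0
```

Multiple hops simply feed $o + u$ back in as the next query, which is the "recurrent attention over external memory" behavior described above.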

A recent paper from DeepMind (Relational Recurrent Neural Networks (Jun 2018) [code; discussion here and here]) is also of interest in regard to language modeling and reasoning over natural language text. While memory based neural networks model temporal data by leveraging an ability to remember information for long periods, it is unclear whether they also have an ability to perform complex relational reasoning with the information they remember. In this paper the authors first confirmed their intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected (i.e., tasks involving relational reasoning). They then improved upon these deficits by using a new memory module, a Relational Memory Core (RMC; Fig. 1 in that paper), which showed large gains in reinforcement learning domains as well as in language modeling (Sections 4.3 and 5.4).

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Critique. While the DeepMind RMC model combined features of dynamic memory with an attention mechanism similar to Jason Weston’s DMN+ model (cited), they neither discuss nor compare the two models. Disappointingly, the DeepMind paper lacks ablation studies or other work needed to better understand their model: “… we cannot necessarily make any concrete claims as to the causal influence of our design choices on the model’s capacity for relational reasoning, or as to the computations taking place within the model and how they may map to traditional approaches for thinking about relational reasoning. Thus, we consider our results primarily as evidence of improved function – if a model can better solve tasks that require relational reasoning, then it must have an increased capacity for relational reasoning, even if we do not precisely know why it may have this increased capacity. ”

• As shown in the first image, above, the RMC module employs multi-head dot product attention (MHDPA) – Google’s Transformer seq2seq self-attention mechanism.

An aside regarding this DeepMind Relational Recurrent Neural Networks paper: another DeepMind paper, Relational Deep Reinforcement Learning  [discussion] (released at the same time) introduced an approach to deep reinforcement learning that improved upon the efficiency, generalization capacity, and interpretability of conventional approaches through structured perception and relational reasoning. It used the computationally efficient MHDPA self-attention model to iteratively reason about the relations between entities in a scene and to guide a model-free policy. In these models entity-entity relations are explicitly computed when considering the messages passed between connected nodes of the graph (i.e. the relations between entities in a scene). MHDPA computes interactions between those entities (attention weights); an (underlying) graph defines the path to a solution, with the attention weights driving the solution. [Very cool.]

[Image source. Click image to open in new window.]

• This takes a minute to explain, but it’s a very neat game/task.

“Box-World” is a perceptually simple but combinatorially complex environment that requires abstract relational reasoning and planning. It consists of a 12 x 12 pixel room with keys and boxes randomly scattered. The room also contains an agent, represented by a single dark gray pixel, which can move in four directions: up, down, left, right. Keys are represented by a single colored pixel. The agent can pick up a loose key (i.e., one not adjacent to any other colored pixel) by walking over it. Boxes are represented by two adjacent colored pixels – the pixel on the right represents the box’s lock and its color indicates which key can be used to open that lock; the pixel on the left indicates the content of the box which is inaccessible while the box is locked.

To collect the content of a box the agent must first collect the key that opens the box (the one that matches the lock’s color) and walk over the lock, which makes the lock disappear. At this point the content of the box becomes accessible and can be picked up by the agent. Most boxes contain keys that, if made accessible, can be used to open other boxes. One of the boxes contains a gem, represented by a single white pixel. The goal of the agent is to collect the gem by unlocking the box that contains it and picking it up by walking over it. Keys that an agent has in possession are depicted in the input observation as a pixel in the top-left corner.

[Image source. Click image to open in new window.]

In each level there is a unique sequence of boxes that need to be opened in order to reach the gem. Opening one wrong box (a distractor box) leads to a dead-end where the gem cannot be reached and the level becomes unsolvable. There are three user-controlled parameters that contribute to the difficulty of the level: (1) the number of boxes in the path to the goal (solution length); (2) the number of distractor branches; (3) the length of the distractor branches. In general, the task is computationally difficult for a few reasons. First, a key can only be used once, so the agent must be able to reason about whether a particular box is along a distractor branch or along the solution path. Second, keys and boxes appear in random locations in the room, emphasising a capacity to reason about keys and boxes based on their abstract relations, rather than based on their spatial positions.
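The key/lock logic above can be sketched in a few lines; the `can_reach_gem` helper and its colour encoding are hypothetical simplifications of the environment, not DeepMind’s code:

```python
def can_reach_gem(loose_key, boxes):
    """boxes: dict mapping a lock colour to the box's content (another key
    colour, or 'gem'). Keys are single-use, so the agent just follows the
    chain starting from its one loose key."""
    key, opened = loose_key, set()
    while key in boxes and key not in opened:
        opened.add(key)
        content = boxes[key]
        if content == "gem":
            return True
        key = content              # the box's content becomes the next key
    return False

# solution path: pink key -> pink lock (holds blue key) -> blue lock (gem)
solvable = can_reach_gem("pink", {"pink": "blue", "blue": "gem"})
# distractor: the pink box holds a useless red key, so the gem is unreachable
dead_end = can_reach_gem("pink", {"pink": "red", "blue": "gem"})
```

The hard part for the agent, of course, is that it only sees pixels: it must learn this abstract key-lock relation, rather than have it hand-coded as here.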

Figure 4 shows a trial run along with the visualization of the attention weights. For one of the attention heads, each key attends mostly to the locks that can be unlocked with that key. In other words, the attention weights reflect the options available to the agent once a key is collected. For another attention head, each key attends mostly to the agent icon. This suggests that it is relevant to relate each object with the agent, which may, for example, provide a measure of relative position and thus influence the agent’s navigation.

[Image source. Click image to open in new window.]

An Interpretable Reasoning Network for Multi-Relation Question Answering (Jun 2018) [code] is another very interesting paper, which addressed multi-relation question answering via detailed analysis of questions and reasoning over multiple fact triples in a knowledge base. They presented a novel Interpretable Reasoning Network (IRN) model that employed an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decided which part of an input question should be analyzed at each hop; the reasoning module predicted a knowledge base relation (relation triple) that corresponded to the current parsed result. The predicted relation was used to update the question representation as well as the state of the reasoning module, helping the model make the next-hop reasoning step. At each hop, an entity was predicted based on the current state of the reasoning module.
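A toy sketch of this hop-by-hop loop, assuming a two-relation knowledge base and a subtraction-style question update (the relation embeddings, KB contents, and update rule are all illustrative, not the paper’s exact parameterization):

```python
import numpy as np

# Hypothetical toy knowledge base and relation embeddings.
rel_emb = {"plays_position": np.array([1.0, 0.0]),
           "plays_for_club": np.array([0.0, 1.0])}
kb = {("X", "plays_position"): "forward",
      ("X", "plays_for_club"): "Borussia Dortmund"}

def reason(question_vec, start_entity, hops):
    entity, q = start_entity, question_vec.copy()
    for _ in range(hops):
        # predict the relation whose embedding best matches the residual question
        rel = max(rel_emb, key=lambda r: q @ rel_emb[r])
        entity = kb[(entity, rel)]    # follow the predicted relation in the KB
        q = q - rel_emb[rel]          # remove the answered part of the question
    return entity

q = np.array([0.2, 0.9])              # "question" leaning toward the club relation
answer = reason(q, "X", hops=1)
```

Because each hop exposes a predicted relation and an intermediate entity, the reasoning trace is inspectable, which is the interpretability property the paper emphasizes.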

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• IRN yielded state of the art results on two datasets. More interestingly, and unlike previous models, IRN offered traceable and observable intermediate predictions (see their Fig. 3), facilitating reasoning analysis and failure diagnosis (thereby also allowing manual manipulation in answer prediction). Whereas single-relation questions such as “How old is Obama?” can be answered by finding one fact triple in a knowledge base/graph (a task that has been widely studied), this work addressed multi-relation QA. Reasoning over multiple fact triples was required to answer multi-relation questions such as “Name a soccer player who plays at forward position at the club Borussia Dortmund.”, where more than one entity and relation are mentioned.

On the datasets evaluated, IRN outperformed other baseline models such as Weston’s MemN2N model (compared in the IRN paper). Through its vector-space representations, IRN could also establish reasonable mappings between knowledge base relations and natural language, such as linking “profession” to words like “working”, “profession”, and “occupation”, which addresses the issue of out-of-vocabulary (OOV) words.

Working memory is an essential component of reasoning – the process of forming an answer to a new question by manipulating previously acquired knowledge. Memory modules are often implemented as a set of memory slots without explicit relational exchange of content, which does not naturally match multi-relational domains in which data is structured. Relational Dynamic Memory Networks (Aug 2018) designed a new model, the Relational Dynamic Memory Network (RDMN), to fill this gap. The memory could have single or multiple components, each of which realized a multi-relational graph of memory slots. The memory, dynamically updated during the reasoning process, was controlled by a central controller. The architecture is shown in their Fig. 1 (RDMN with single-component memory): at the first step, the controller reads the query, and the memory is initialized from the input graph, one node embedding per memory cell. During the reasoning process, the controller then iteratively reads from and writes to the memory. Finally, the controller emits the output. RDMN performed well on several domains, including molecular bioactivity and chemical reactions.
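The controller’s read/write loop can be sketched as follows; the update rules, shapes, and the `rdmn` function are illustrative stand-ins for the paper’s equations, not the authors’ code:

```python
import numpy as np

def rdmn(query, node_embs, adj, steps):
    """query: (d,); node_embs: (N, d) initialize the memory, one cell per
    graph node; adj: (N, N) adjacency matrix of the relational graph."""
    h = query.copy()                       # controller state (reads the query)
    m = node_embs.copy()                   # memory: one cell per graph node
    for _ in range(steps):
        att = np.exp(m @ h); att /= att.sum()
        read = att @ m                     # attention-weighted read from memory
        h = np.tanh(h + read)              # controller update
        m = np.tanh(m + adj @ m + h)       # relational write along graph edges
    return h                               # controller emits the output

rng = np.random.default_rng(1)
n, d = 4, 3
out = rdmn(rng.standard_normal(d), rng.standard_normal((n, d)),
           (rng.random((n, n)) < 0.5).astype(float), steps=3)
```

The `adj @ m` term is the key difference from slot-based memories like MemN2N: memory cells exchange content along the graph’s edges at every step.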

• Their Discussion provides an excellent summary (paraphrased here) that is relevant to this REVIEW:

“The problem studied in this paper belongs to a broader program known as machine reasoning: unlike the classical focus on symbolic reasoning, here we aim for a learnable neural reasoning capability. We wish to emphasize that RDMN is a general model for answering any query about graph data. While the evaluation in this paper is limited to function call graphs, molecular bioactivity and chemical reactions, RDMN has a wide range of potential applications. For example, a drug (query) may act on a network of proteins as a whole (relational memory). In recommender systems, a user can be modeled as a multi-relational graph (e.g., a network of purchased items, and a network of personal contacts), and a query can be anything about them (e.g., preferred attributes or products). Similarly in healthcare, a patient’s medical record can be modeled as multi-relational graphs over diseases, treatments, and familial and social contexts, and a query can be anything about the presence and future of health conditions and treatments.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

This work builds on preliminary work, described in Graph Memory Networks for Molecular Activity Prediction (Jan 2018) [non-author code].

Collectively, the works discussed above suggest that:

Memory-augmented neural networks such as MemN2N solve a compartmentalization problem with a slot-based memory matrix, but may have a harder time allowing memories to interact/relate with one another once they are encoded, whereas LSTMs pack all information into a common hidden memory vector, potentially making compartmentalization and relational reasoning more difficult (Relational Recurrent Neural Networks).

Denny Britz provided an excellent discussion of attention vs. memory in Attention and Memory in Deep Learning and NLP (Jan 2016). Also, Attention in Long Short-Term Memory Recurrent Neural Networks (Jun 2017) discussed a limitation of LSTM-based encoder-decoder architectures (i.e., fixed-length internal representations of the input sequence – note, e.g., ELMo) that attention mechanisms overcome: allowing the network to learn where to pay attention in the input sequence for each item in the output sequence.

Particularly relevant to this REVIEW are the examples of attention in textual entailment (drawn from the DeepMind paper Reasoning about Entailment with Neural Attention (Mar 2016) [non-author code here and here]) and text summarization (drawn from Jason Weston’s A Neural Attention Model for Abstractive Sentence Summarization) – the benefits of which are immediately obvious upon reviewing that work.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Also relevant to this discussion, in Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum (May 2018) researchers at the Paul G. Allen School (University of Washington) discussed LSTMs vs. self-attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTMs: LSTMs are a hybrid of S-RNNs (simple RNNs) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Thus, the LSTM gates themselves are powerful recurrent models that provide more representational power than previously realized. They noted that:

1. The LSTM weights are vectors, while attention typically computes scalar weights; i.e., a separate weighted sum is computed for every dimension of the LSTM’s memory cell;

2. The weighted sum is accumulated with a dynamic program. This enables a linear rather than quadratic complexity in comparison to self-attention, but reduces the amount of parallel computation. This accumulation also creates an inductive bias of attending to nearby words, since the weights can only decrease over time.

3. Attention has a probabilistic interpretation due to the softmax normalization, while the sum of weights in LSTM can grow up to the sequence length. In variants of the LSTM that tie the input and forget gate, such as coupled-gate LSTM and GRU, the memory cell instead computes a weighted average with a probabilistic interpretation. These variants compute locally normalized distributions via a product of sigmoids rather than globally normalized distributions via a single softmax.

• “Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicate that LSTMs suffer little to no performance loss when removing the S-RNN. This provides evidence that the gating mechanism is doing the heavy lifting in modeling context. We further ablate the recurrence in each gate and find that this incurs only a modest drop in performance, indicating that the real modeling power of LSTMs stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allows us to mathematically relate LSTMs and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enables the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.”
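The element-wise weighted-sum view is easiest to see in a coupled-gate cell (point 3 above), where the forget gate and its complement weight the old state and the new input per dimension. The gate values here are illustrative constants rather than learned functions of the input:

```python
import numpy as np

def coupled_gate_cell(xs, forget_gates):
    """c_t = f_t * c_{t-1} + (1 - f_t) * x_t, with vector-valued gates:
    a separate weighted sum is computed for every dimension of the cell."""
    c = np.zeros_like(xs[0])
    for x, f in zip(xs, forget_gates):
        c = f * c + (1.0 - f) * x
    return c

xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
fs = [np.array([0.5, 0.5]), np.array([0.5, 0.9])]
c = coupled_gate_cell(xs, fs)
# unrolling: c = f2*(f1*0 + (1-f1)*x1) + (1-f2)*x2, i.e. an element-wise
# weighted average over the inputs, with per-dimension weights
```

Unrolling the recurrence shows each dimension of `c` is a convex combination of that dimension of the inputs, which is exactly the “dynamically computed element-wise weighted sum” of the paper’s title.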

In the recent language modeling domain, whereas ELMo employs stacked Bi-LSTMs and ULMFiT employs stacked LSTMs [with no attention, shortcut connections (i.e., residual layers) or other sophisticated additions], OpenAI’s GPT: Generative Pre-Trained Transformer is based on Google’s Transformer architecture: a simple network architecture based solely on attention mechanisms that entirely dispenses with recurrence and convolutions, yet attains state of the art results. The Transformer surpassed the state of the art on neural machine translation tasks, and GPT generalized well to other tasks. The Transformer model, based entirely on attention, replaced RNNs with multi-head attention consisting of multiple attention layers.

In July 2018, nearly a year after they introduced their original “Attention Is All You Need” Transformer architecture (Jun 2017; updated Dec 2017), Google Brain/DeepMind released an updated Universal Transformer version, discussed in the Google AI blog post Moving Beyond Translation with the Universal Transformer [Aug 2018;  discussion]:

[Image source. Click image to open in new window.]

[Image source (there is a more detailed schematic in Appendix A in that paper). Click image to open in new window.]

• “In Universal Transformer [code, described in Tensor2Tensor for Neural Machine Translation] we extend the standard Transformer to be computationally universal (Turing complete) using a novel, efficient flavor of parallel-in-time recurrence which yields stronger results across a wider range of tasks. We built on the parallel structure of Transformer to retain its fast training speed, but we replaced Transformer’s fixed stack of different transformation functions with several applications of a single, parallel-in-time recurrent transformation function (i.e. the same learned transformation function is applied to all symbols in parallel over multiple processing steps, where the output of each step feeds into the next).

“Crucially, where an RNN processes a sequence symbol-by-symbol (left to right), Universal Transformer processes all symbols at the same time (like the Transformer), but then refines its interpretation of every symbol in parallel over a variable number of recurrent processing steps using self-attention. This parallel-in-time recurrence mechanism is both faster than the serial recurrence used in RNN, and also makes the Universal Transformer more powerful than the standard feedforward Transformer. …”
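A minimal sketch of this parallel-in-time recurrence, with a toy transition function and random weights standing in for the learned ones (the point being that the *same* transformation is applied to all positions, repeatedly):

```python
import numpy as np

def universal_transformer(x, Wq, Wk, Wv, Wf, steps):
    """x: (T, d) sequence; the SAME self-attention + transition weights
    are applied to all positions in parallel at every refinement step."""
    h = x
    for _ in range(steps):                       # shared weights every step
        q, k, v = h @ Wq, h @ Wk, h @ Wv
        s = q @ k.T / np.sqrt(h.shape[1])
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)
        h = h + a @ v                            # self-attention refinement
        h = h + np.tanh(h @ Wf)                  # position-wise transition
    return h

rng = np.random.default_rng(2)
d = 4
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
out = universal_transformer(rng.standard_normal((6, d)), *W, steps=3)
```

In the standard Transformer the loop body would instead be a fixed stack of *different* layers; reusing one transition function over a variable number of steps is what makes the recurrence “parallel in time.”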

The performance benchmarks for Universal Transformer on the bAbI dataset (especially the more difficult “10k examples” setting) are particularly impressive (see the results tables in their paper; note also the MemN2N comparison). Appendix C shows the bAbI attention visualizations, of which the last example is particularly impressive (requiring three supporting facts to solve).

In August 2018 Google AI followed their Universal Transformers paper with Character-Level Language Modeling with Deeper Self-Attention [discussion], which showed that a deep (64-layer) Transformer model with fixed context outperformed RNN variants by a large margin, achieving state of the art on two popular benchmarks.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• LSTM and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts.

• While code is not yet released (2018-08-16), it will likely appear in Google’s TensorFlow tensor2tensor GitHub repository, “home” of their Transformer code.

• For reference, in Learning to Generate Reviews and Discovering Sentiment (Apr 2017) OpenAI also trained a byte (character)-level RNN-based language model (a single-layer multiplicative LSTM with 4096 units, trained for a single epoch on an Amazon product review dataset) for which, even with data-parallelism across 4 Pascal Titan X GPUs, training took approximately one month.

• However, RNNs handle input sequences sequentially, word by word, which is an obstacle to parallelization. I am unsure how long it takes to train Google’s Transformer algorithm, which achieves parallelization by replacing recurrence with attention and encoding each symbol’s position in the sequence, leading to significantly shorter training times (The Transformer – Attention is All You Need).

This GitHub Issue discusses parallelization over GPUs and training times, etc., indicating that results depend on the number of GPUs and the batch size. The Annotated Transformer also discusses this; under their setup (8 NVIDIA P100 GPUs; parametrization; …), they trained the base models for a total of 100,000 steps (12 hrs); big models were trained for 300,000 steps (3.5 days).

Like Google, Facebook AI Research has also developed a seq2seq-based self-attention mechanism to model long-range context (Hierarchical Neural Story Generation (May 2018) [code/pretrained models;  discussion]), demonstrated via story generation. They found that standard seq2seq models applied to hierarchical story generation were prone to degenerating into language models that paid little attention to the writing prompt (a problem noted in other domains, such as dialogue response generation).

• They tackled the challenges of story-telling with a hierarchical model, which first generated a sentence called “the prompt” (describing the topic for the story), and then “conditioned” on this prompt when generating the story. Conditioning on the prompt or premise made it easier to generate consistent stories, because they provided grounding for the overall plot. It also reduced the tendency of standard sequence models to drift off topic.

• To improve the relevance of the generated story to its prompt, they adopted the fusion mechanism from Cold Fusion: Training Seq2Seq Models Together with Language Models:

The cold fusion mechanism of Sriram et al. (2017) pretrains a language model and subsequently trains a seq2seq model with a gating mechanism that learns to leverage the final hidden layer of the language model during seq2seq training [their language model contained three layers of gated recurrent units (GRUs)]. The model showed, for the first time, that fusion mechanisms could help seq2seq models build dependencies between their input and output.
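The gating idea can be sketched as follows; the projection matrices, shapes, and the exact fusion layout are illustrative assumptions, not Sriram et al.’s precise architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cold_fusion_step(dec_h, lm_h, Wg, Wo):
    """dec_h: seq2seq decoder state; lm_h: final hidden layer of a
    pretrained (frozen) language model; both (d,)."""
    g = sigmoid(np.concatenate([dec_h, lm_h]) @ Wg)   # gate from both states
    fused = np.concatenate([dec_h, g * lm_h])         # gated LM features
    return np.tanh(fused @ Wo)                        # fused output state

rng = np.random.default_rng(3)
d = 4
out = cold_fusion_step(rng.standard_normal(d), rng.standard_normal(d),
                       rng.standard_normal((2 * d, d)),
                       rng.standard_normal((2 * d, d)))
```

Because the gate is computed from both states, the decoder can learn to lean on the language model only where its own context is weak.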

• To improve over the pretrained model, the second model had to focus on the link between the prompt and the story. Since existing convolutional architectures only encode a bounded amount of context, they introduced a novel gated self-attention mechanism that allowed the model to condition on its previous outputs at different time-scales (i.e., to model long-range context).

• Similar to Google’s Transformer, Facebook AI Research used multi-head attention to allow each head to attend to information at different positions. However, the queries, keys and values in their model were not given by linear projections (see Section 3.2.2 in the Transformer paper), but by more expressive gated deep neural nets with gated linear unit activations: gating lent the self-attention mechanism crucial capacity to make fine-grained selections.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding (Aug 2018) proposed a new self-attention mechanism for sentence embedding, Dynamic Self-Attention (DSA). They designed DSA by modifying the dynamic routing of capsule networks for use in NLP. DSA attended to informative words with a dynamic weight vector, achieving new state of the art results among sentence encoding methods on the Stanford Natural Language Inference (SNLI) dataset – with the fewest parameters – while showing comparable results on the Stanford Sentiment Treebank (SST) dataset. The dynamic weight vector lends the self-attention mechanism flexibility, making it more effective for sentence embedding.
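A toy sketch of attention with a dynamically refined weight vector, in the spirit of DSA’s routing-style iterations (the initialization and update rule are illustrative, not the paper’s exact formulation):

```python
import numpy as np

def dynamic_self_attention(H, iters):
    """H: (N, d) word vectors. The query/weight vector z is refined
    iteratively (routing-style) instead of being a fixed learned parameter."""
    z = H.mean(axis=0)                       # initial dynamic weight vector
    for _ in range(iters):
        s = H @ z
        a = np.exp(s - s.max()); a /= a.sum()
        z = a @ H                            # re-estimate z from attended words
    return z                                 # sentence embedding

rng = np.random.default_rng(4)
emb = dynamic_self_attention(rng.standard_normal((7, 5)), iters=3)
```

Each iteration sharpens the attention toward words that agree with the current estimate, which is the flexibility a fixed attention vector lacks.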

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Learning to Compose Neural Networks for Question Answering (Jun 2016) [code;  author discussion] presented a compositional, attentional model for answering questions about a variety of world representations, including images and structured knowledge bases. The model used natural language strings to automatically assemble neural networks from a collection of composable modules. Parameters for these modules were learned jointly with network-assembly parameters via reinforcement learning, with only (world, question, answer) triples as supervision: the approach required no supervision of the network layouts. This approach, termed a dynamic neural module network, “translates” questions into dynamically assembled neural networks, then applies those networks to world representations (images or knowledge bases) to produce answers. The model has two components, trained jointly: a collection of neural “modules” that can be freely composed, and a network layout predictor that assembles modules into complete deep networks tailored to each question (see their Figure 1). They achieved state of the art performance on two markedly different question answering tasks: questions about natural images, and more compositional questions about United States geography.

[Image source. Click image to open in new window.]

Relevant to the following paragraph: in NLP parts of speech (POS), content words are words that name objects of reality and their qualities. They signify actual living things (dog, cat, etc.), family members (mother, father, sister, etc.), natural phenomena (snow, sun, etc.), common actions (do, make, come, eat, etc.), characteristics (young, cold, dark, etc.), and so on. Content words consist mostly of nouns, lexical verbs and adjectives, though certain adverbs can also be content words. Content words contrast with function words, which have very little substantive meaning and primarily denote grammatical relationships between content words, such as prepositions (in, out, under, etc.), pronouns (I, you, he, who, etc.), and conjunctions (and, but, till, as, etc.).

Most models based on the seq2seq encoder-decoder framework are equipped with an attention mechanism, like Google’s Transformer. However, conventional attention mechanisms treat the decoding at each time step equally, with the same matrix, which is problematic since the softness of the attention for different types of words (e.g., content words and function words) should differ. Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation (Aug 2018) [code: not yet available, 2018-10-10] addressed this issue, proposing a new model with a mechanism called Self-Adaptive Control of Temperature (SACT) to control the softness of attention by means of an attention temperature. They set a temperature parameter which could be learned by the model based on the attentions in the previous decoding time steps, as well as the output of the decoder at the current time step. With the temperature parameter, the model was able to automatically tune the degree of softness of the distribution of the attention scores. Specifically, the model could learn a soft distribution of attention weights, more uniform, for generating function words, and a hard distribution, sparser, for generating content words. In a neural machine translation task, they showed that SACT could attend to the most relevant elements in the source-side contexts, generating translations of high quality.
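The effect of an attention temperature is easy to demonstrate; in this sketch the temperature is passed in rather than predicted from the decoder state, which is the key simplification relative to SACT itself:

```python
import numpy as np

def attention_with_temperature(scores, tau):
    """Softmax over attention scores with temperature tau: large tau gives
    a near-uniform (soft) distribution, small tau a sparse (hard) one."""
    z = scores / tau
    w = np.exp(z - z.max())
    return w / w.sum()

s = np.array([2.0, 1.0, 0.5])
soft = attention_with_temperature(s, tau=5.0)    # near-uniform: function word
hard = attention_with_temperature(s, tau=0.2)    # sparse: content word
```

SACT’s contribution is making `tau` a learned function of the decoding history, so the model itself decides when to concentrate and when to divert attention.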

Attention and Memory:

• “We presented lightweight convolutions which perform competitively to the best reported results in the literature despite their simplicity. They have a very small parameter footprint and the kernel does not change over time-steps. This demonstrates that self-attention is not critical to achieve good accuracy on the language tasks we considered. Dynamic convolutions build on lightweight convolutions by predicting a different kernel at every time-step, similar to the attention weights computed by self-attention. The dynamic weights are a function of the current time-step only rather than the entire context. Our experiments show that lightweight convolutions can outperform a strong self-attention baseline on WMT’17 Chinese-English translation, IWSLT’14 German-English translation and CNN-DailyMail summarization. Dynamic convolutions improve further and achieve a new state of the art on the test set of WMT’14 English-German. Both lightweight convolution and dynamic convolution are 20% faster at runtime than self-attention. On Billion Word language modeling we achieve comparable results to self-attention.”
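A toy sketch of a lightweight convolution: a small softmax-normalized kernel shared across all time-steps, in contrast to self-attention’s per-position weights (the single kernel shared across all channels is a simplification of the paper’s grouped depthwise formulation):

```python
import numpy as np

def lightweight_conv(x, kernel):
    """x: (T, d) sequence; kernel: (k,) raw weights, softmax-normalized and
    shared across every time-step and (here) every channel."""
    w = np.exp(kernel - kernel.max()); w /= w.sum()  # normalized kernel
    k = len(w); pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))             # zero-pad the sequence
    return np.stack([(w[:, None] * xp[t:t + k]).sum(axis=0)
                     for t in range(x.shape[0])])    # fixed local mixing

rng = np.random.default_rng(5)
y = lightweight_conv(rng.standard_normal((6, 3)), np.array([0.1, 0.8, 0.1]))
```

A dynamic convolution would instead predict a fresh `kernel` from the input at each time-step, recovering some of self-attention’s flexibility while keeping the fixed-width, linear-cost window.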

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

“In order to learn effective features from temporal sequences, the long short-term memory (LSTM) network is widely applied. A critical component of LSTM is the memory cell, which is able to extract, process and store temporal information. Nevertheless, in LSTM, the memory cell is not directly enforced to pay attention to a part of the sequence. Alternatively, the attention mechanism can help to pay attention to specific information of data. In this paper, we present a novel neural model, called long short-term attention (LSTA), which seamlessly merges the attention mechanism into LSTM. More than processing long short term sequences, it can distill effective and valuable information from the sequences with the attention mechanism. Experiments show that LSTA achieves promising learning performance in various deep learning tasks.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• The same authors contemporaneously published Recurrent Attention Unit. [see also]

“Recurrent Neural Network (RNN) has been successfully applied in many sequence learning problems. Such as handwriting recognition, image description, natural language processing and video motion analysis. After years of development, researchers have improved the internal structure of the RNN and introduced many variants. Among others, Gated Recurrent Unit (GRU) is one of the most widely used RNN model. However, GRU lacks the capability of adaptively paying attention to certain regions or locations, so that it may cause information redundancy or loss during leaning. In this paper, we propose a RNN model, called Recurrent Attention Unit (RAU), which seamlessly integrates the attention mechanism into the interior of GRU by adding an attention gate. The attention gate can enhance GRU’s ability to remember long-term memory and help memory cells quickly discard unimportant content. RAU is capable of extracting information from the sequential data by adaptively selecting a sequence of regions or locations and pay more attention to the selected regions during learning. Extensive experiments on image classification, sentiment classification and language modeling show that RAU consistently outperforms GRU and other baseline methods.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• You May Not Need Attention (Oct 2018) [code;   author’s discussion]

“In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decoder. Our eager translation model is low-latency, writing target tokens as soon as it reads the first source token, and uses constant memory during decoding. It performs on par with the standard attention-based model of Bahdanau et al. (2014), and better on long sentences.”

[Image source. Click image to open in new window.]

• Convolutional Self-Attention Network (Oct 2018)

“Self-attention network (SAN) has recently attracted increasing interest due to its fully parallelized computation and flexibility in modeling dependencies. It can be further enhanced with multi-headed attention mechanism by allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017). In this work, we propose a novel convolutional self-attention network (CSAN), which offers SAN the abilities to (1) capture neighboring dependencies, and (2) model the interaction between multiple attention heads. Experimental results on WMT14 English-to-German translation task demonstrate that the proposed approach outperforms both the strong Transformer baseline and other existing works on enhancing the locality of SAN. Comparing with previous work, our model does not introduce any new parameters.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• “First derived from human intuition, later adapted to machine translation for automatic token alignment, attention mechanism, a simple method that can be used for encoding sequence data based on the importance score each element is assigned, has been widely applied to and attained significant improvement in various tasks in natural language processing, including sentiment classification, text summarization, question answering, dependency parsing, etc. In this paper, we survey through recent works and conduct an introductory summary of the attention mechanism in different NLP problems, aiming to provide our readers with basic knowledge on this widely used method, discuss its different variants for different tasks, explore its association with other techniques in machine learning, and examine methods for evaluating its performance.”

#### Attention: Miscellaneous Applications

Although the following content is more NLP task related, I wanted to group this content close to the discussions of language models and attentional mechanisms in my “Attention and Memory” subsection. Recent applications of Google’s “Transformer” and other attentional architectures relevant to this REVIEW include their use in NLP-oriented tasks such as “slot filling” (relation extraction), question answering, and document summarization.

Position-aware Self-attention with Relative Positional Encodings for Slot Filling (Bilan and Roth, July 2018) applied self-attention with relative positional encodings to the task of relation extraction; their model relied solely on attention: no recurrent or convolutional layers were used. The authors employed Google’s Transformer seq2seq model, also known as multi-head dot product attention (MHDPA) or “self-attention.”

• Despite citing the 2017 paper Position-aware Attention and Supervised Data Improve Slot Filling by Zhang et al. (Stanford University; coauthored by Christopher Manning), and using the TACRED relation extraction dataset introduced in that paper, Bilan and Roth claim:

“To the best of our knowledge, the transformer model has not yet been applied to relation classification as defined above (as selecting a relation for two given entities in context).”

Furthermore, they provide no code, whereas Zhang et al. released their code and included ablation studies in their work. The attention mechanism used by Zhang et al. differed significantly from the Google Transformer model in its use of a summary vector and position embeddings, and in the way the attention weights were computed. While Zhang et al.’s $\small F_1$ scores (their Table 4) were slightly lower than Bilan and Roth’s on the TACRED dataset (see Bilan and Roth’s Table 1), the ensemble model used by Zhang et al. had the best scores. Sample relations extracted from a sentence are shown in Fig. 1 and Table 1 in Zhang et al.

[Click image to open in new window.]

Bidirectional Attention Flow for Machine Comprehension (Nov 2016; updated Jun 2018) introduced the BiDAF framework, a multi-stage hierarchical process that used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization. BiDAF was subsequently used in QA4IE: A Question Answering based Framework for Information Extraction (Apr 2018) (discussed alongside the BiDAF paper), a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, also using a knowledge base (Wikipedia Ontology) for entity recognition.

Related to attention-based relation extraction, Neural Architectures for Open-Type Relation Argument Extraction (Sep 2018) redefined the problem of slot filling to the task of Open-type Relation Argument Extraction (ORAE): given a corpus, a query entity $\small Q$ and a knowledge base relation (e.g.,”$\small Q$ authored notable work with title $\small X$”), the model had to extract an argument of “non-standard entity type” (entities that cannot be extracted by a standard named entity tagger) from the corpus – hence, “open-type argument extraction.” This work also employed the Transformer architecture, used as a multi-headed self-attention mechanism in their encoders for computing sentence representations suitable for argument extraction.

The approach for ORAE had two conceptual advantages. First, it was more general than slot filling, as it was also applicable to non-standard named entity types that could not be dealt with previously. Second, while the problem they defined was more difficult than standard slot filling, they eliminated an important source of errors: tagging errors that propagate throughout the pipeline and that are notoriously hard to correct downstream. A wide range of neural network architectures to solve ORAE were examined, each consisting of a sentence encoder, which computed a vector representation for every sentence position, and an argument extractor, which extracted the relation argument from that representation. The combination of an RNN encoder with a CRF extractor gave the best results, +7% absolute $\small \text{F-measure}$ better than a previously proposed adaptation of a state of the art question answering model (BiDAF). [“The dataset and code will be released upon publication.” – not available, 2018-10-10.]

[Image source. Click image to open in new window.]

Generating Wikipedia by Summarizing Long Sequences (Jan 2018), by Google Brain, employed Wikipedia in a supervised machine learning task for multi-document summarization, using extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. They modified their Transformer architecture to only consist of a decoder, which performed better in the case of longer input sequences compared to RNN and Transformer encoder-decoder models. These improvements allowed them to generate entire Wikipedia articles.

Because the amount of text in input reference documents can be very large (see their Table 2), it was infeasible to train an end-to-end abstractive model, given the memory constraints of current hardware. Hence, they first coarsely selected a subset of the input using extractive summarization. The second stage involved training an abstractive model that generated the Wikipedia text while conditioning on this extraction. This two-stage process was inspired by how humans might summarize multiple long documents: first highlighting pertinent information, then conditionally generating the summary based on the highlights.

Hierarchical Bi-Directional Attention-based RNNs for Supporting Document Classification on Protein-Protein Interactions Affected by Genetic Mutations (Jan 2018) [code] leveraged word embeddings trained on PubMed abstracts. The authors argued that the title of a paper usually contains important information that is more salient than a typical sentence in the abstract; they therefore proposed a shortcut connection (i.e., residual layer) that integrated the title vector representation directly to the final feature representation of the document. They concatenated the sentence vector that represented the title and the vectors of the abstract, to the document feature vector used as input to the task classifier. This system ranked first among the Document Triage Task of the BioCreative VI Precision Medicine Track.
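The title shortcut itself amounts to a concatenation of the title vector onto the pooled document representation, so the classifier sees the title directly, bypassing the abstract encoder. A toy sketch (dimensions and pooling are placeholder choices, not the paper’s exact architecture):

```python
import numpy as np

d = 4
title_vec = np.ones(d)                                       # sentence encoding of the title
abstract_vecs = np.arange(3 * d, dtype=float).reshape(3, d)  # encodings of abstract sentences

doc_vec = abstract_vecs.mean(axis=0)             # pooled abstract representation
features = np.concatenate([title_vec, doc_vec])  # shortcut: title || abstract features
```

The concatenated `features` vector would then feed the task classifier, so gradients reach the title encoder without passing through the (deeper) abstract hierarchy.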

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Critique:

• The “spirit” of the BioCreative VI Track 4: Mining protein interactions and mutations for precision medicine (PM) is (bolded emphasis mine, below):

“The precision medicine initiative (PMI) promises to identify individualized treatment depending on a patients’ genetic profile and their related responses. In order to help health professionals and researchers in the precision medicine endeavor, one goal is to leverage the knowledge available in the scientific published literature and extract clinically useful information that links genes, mutations, and diseases to specialized treatments. … Understanding how allelic variation and genetic background influence the functionality of these pathways is crucial for predicting disease phenotypes and personalized therapeutical approaches. A crucial step is the mapping of gene products functional regions through the identification and study of mutations (naturally occurring or synthetically induced) affecting the stability and affinity of molecular interactions.”

• Against those criteria and despite the title of this paper and this excerpt from the paper,

“In order to incorporate domain knowledge in our system, we annotate all biomedical named entities namely genes, species, chemical, mutations and diseases. Each entity mention is surrounded by its corresponding tags as in the following example: Mutations in <species>human</species> <gene>EYA1</gene> cause <disease>branchio-oto-renal (BOR) syndrome</disease> …”

… there is no evidence that mutations (i.e. genomic variants) were actually tagged. Mutations/variants are not discussed, nor is there any mention of “mutant” or “mutation” in their GitHub repository/code nor the parent repo.

• Richard Socher and colleagues [SalesForce: A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks (Jul 2017)] also used shortcut connections (i.e., residual layers) from higher layers to lower layers (lower-level task predictions), reflecting linguistic hierarchies.

Identifying interactions between proteins is important to understanding underlying biological processes, but extracting protein-protein interactions (PPI) from raw text is often very difficult; previous supervised learning methods have relied on handcrafted features over human-annotated data sets. Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention (Jul 2018) proposed a novel tree recurrent neural network with a structured attention architecture for PPI extraction. Their architecture achieved state of the art results on the benchmark AIMed and BioInfer data sets; moreover, their models achieved significant improvement over previous best models without any explicit feature extraction. Experimental results showed that traditional recurrent networks had inferior performance compared to tree recurrent networks for the supervised PPI task.

“… we propose a novel neural net architecture for identifying protein-protein interactions from biomedical text using a Tree LSTM with structured attention. We provide an in depth analysis of traversing the dependency tree of a sentence through a child sum tree LSTM and at the same time learn this structural information through a parent selection mechanism by modeling non-projective dependency trees.”

“… The attention mechanism has been a breakthrough in neural machine translation (NMT) in recent years. This mechanism calculates how much attention the network should give to each source word to generate a specific translated word. The context vector calculated by the attention mechanism mimics the syntactic skeleton of the input sentence precisely given a sufficient number of examples. Recent work suggests that incorporating explicit syntax alleviates the burden of modeling grammatical understanding and semantic knowledge from the model.”

[Image source (Kai Sheng Tai, Richard Socher, Christopher D. Manning). Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

### Pointer (Pointer-Generator) Mechanisms

Pointer networks, introduced by Oriol Vinyals et al. in Pointer Networks (Jun 2015; updated Jan 2017), learned the conditional probability of an output sequence with elements that were discrete tokens corresponding to positions in an input sequence.  “Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net …”

[Image source. Click image to open in new window.]

While pointers (pointer networks; pointer mechanisms) were an important advance – providing, for example, one potential solution for rare and out of vocabulary (OOV) words – they suffered from two limitations. First, they were unable to select words that did not exist in the input sequence. Second, there was no option to choose whether to point or not: the model always pointed.
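A single decoding step of a pointer network can be sketched as follows, using additive attention scores directly as the output distribution, in the spirit of Vinyals et al. (all parameters here are randomly initialized, purely for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_step(dec_state, enc_states, W1, W2, v):
    """One decoding step of a pointer network: additive attention scores
    over the input positions are themselves the output distribution, so
    the decoder can only ever select an element of the input sequence."""
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state) for e in enc_states])
    p = softmax(scores)          # distribution over input positions
    return p, int(p.argmax())    # index of the selected ("pointed-to") input

rng = np.random.default_rng(1)
d = 6
enc = rng.standard_normal((4, d))           # 4 encoded input elements
p, idx = pointer_step(rng.standard_normal(d), enc,
                      rng.standard_normal((d, d)),
                      rng.standard_normal((d, d)),
                      rng.standard_normal(d))
```

Note the two limitations fall out of the code: `p` is defined only over input positions, and the step always returns an index – there is no "don't point" branch.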

Extending pointer networks, pointer-generator architectures can copy words from source texts via a pointer, and generate novel words from a vocabulary via a generator. With the pointing/copying mechanism, factual information can be reproduced accurately, and out of vocabulary words can be handled in the summaries.

Pointer / pointer-generator mechanisms have been used in the following works.

Nallapati et al. [Caglar Gulcehre] Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (Aug 2016) modeled abstractive text summarization using attentional encoder-decoder recurrent neural networks, achieving state of the art performance on two different corpora. They proposed several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling key-words, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time. …

• “Often-times in summarization, the keywords or named-entities in a test document that are central to the summary may actually be unseen or rare with respect to training data. Since the vocabulary of the decoder is fixed at training time, it cannot emit these unseen words. Instead, a most common way of handling these out-of-vocabulary (OOV) words is to emit an ‘UNK’ token as a placeholder. However this does not result in legible summaries. In summarization, an intuitive way to handle such OOV words is to simply point to their location in the source document instead. We model this notion using our novel switching decoder/pointer architecture which is graphically represented in Figure 2. In this model, the decoder is equipped with a ‘switch’ [modeled as a (softmax) sigmoid activation function over a linear layer based on the entire available context at each time step] that decides between using the generator or a pointer at every time-step. If the switch is turned on, the decoder produces a word from its target vocabulary in the normal fashion. However, if the switch is turned off, the decoder instead generates a pointer to one of the word-positions in the source. The word at the pointer-location is then copied into the summary. The switch is modeled as a sigmoid activation function over a linear layer based on the entire available context at each time-step as shown below. …”

• “The pointer mechanism may be more robust in handling rare words because it uses the encoder’s hidden-state representation of rare words to decide which word from the document to point to. Since the hidden state depends on the entire context of the word, the model is able to accurately point to unseen words although they do not appear in the target vocabulary. [Even when the word does not exist in the source vocabulary, the pointer model may still be able to identify the correct position of the word in the source since it takes into account the contextual representation of the corresponding ‘UNK’ token encoded by the RNN. Once the position is known, the corresponding token from the source document can be displayed in the summary even when it is not part of the training vocabulary either on the source side or the target side.] …”
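The switch described above can be sketched as a hard-threshold toy version (parameter names are illustrative; in the real model the switch is a learned sigmoid layer trained jointly with the rest of the network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def switch_decode_step(context, w_switch, b_switch, p_vocab, p_pointer):
    """Sketch of the switching decoder/pointer: a sigmoid over the full
    decoding context decides, at each time step, whether to emit a word
    from the target vocabulary or to point into the source."""
    use_generator = sigmoid(w_switch @ context + b_switch) >= 0.5
    if use_generator:
        return "generate", int(np.argmax(p_vocab))   # target-vocabulary word id
    return "point", int(np.argmax(p_pointer))        # source position to copy

mode, choice = switch_decode_step(
    context=np.array([1.0, -2.0]),
    w_switch=np.array([0.5, 0.5]),
    b_switch=0.0,
    p_vocab=np.array([0.1, 0.7, 0.2]),
    p_pointer=np.array([0.6, 0.4]),
)
# Here the switch activation is sigmoid(-0.5) < 0.5, so the step points
# to source position 0 rather than generating from the vocabulary.
```

This is a hard either/or decision per time step, which is precisely what later mixture models (See et al.; Merity et al.) relax.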

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Gulcehre et al. [Yoshua Bengio], Pointing the Unknown Words (Aug 2016) proposed a novel way to deal with rare and unseen words in neural network models using attention. Their model used two softmax layers to predict the next word in conditional language models: one predicted the location of a word in the source sentence, and the other predicted a word in the shortlist vocabulary. At each time-step, the decision of which softmax layer to use was made adaptively by an MLP conditioned on the context. …

• “The attention-based pointing mechanism is introduced first in the pointer networks (Vinyals et al., 2015). In the pointer networks, the output space of the target sequence is constrained to be the observations in the input sequence (not the input space). Instead of having a fixed dimension softmax output layer, softmax outputs of varying dimension is dynamically computed for each input sequence in such a way to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, we can see the pointer networks as a special case of our model where we always choose to point a context word.”

Merity et al. [Richard Socher | MetaMind/Salesforce] Pointer Sentinel Mixture Models (Sep 2016) introduced the pointer sentinel mixture architecture for neural sequence models, which had the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. …

• “Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM.”

• “Pointer networks (Vinyals et al., 2015) provide one potential solution for rare and out of vocabulary (OOV) words as a pointer network uses attention to select an element from the input as output. This allows it to produce previously unseen input tokens. While pointer networks improve performance on rare words and long-term dependencies they are unable to select words that do not exist in the input.

“We introduce a mixture model, illustrated in Fig. 1, that combines the advantages of standard softmax classifiers with those of a pointer component for effective and efficient language modeling. Rather than relying on the RNN hidden state to decide when to use the pointer, as in the recent work of Gulcehre et al. (2016), we allow the pointer component itself to decide when to use the softmax vocabulary through a sentinel. The model improves the state of the art perplexity on the Penn Treebank. Since this commonly used dataset is small and no other freely available alternative exists that allows for learning long range dependencies, we also introduce a new benchmark dataset for language modeling called WikiText  [WikiText-103].”
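The sentinel idea can be sketched in a few lines: the sentinel competes with the context words inside the pointer softmax, and the probability mass it receives acts as the gate on the standard softmax vocabulary. Names and scores below are illustrative (the real model computes pointer scores from RNN hidden states):

```python
import numpy as np

def sentinel_mixture(ptr_scores, sentinel_score, p_vocab, src_ids):
    """Pointer sentinel mixture (sketch): the sentinel is appended to the
    pointer softmax; its probability g gates the vocabulary softmax, and
    the remaining pointer mass is copied onto the context words."""
    z = np.append(ptr_scores, sentinel_score)
    z = np.exp(z - z.max()); z /= z.sum()   # joint softmax over positions + sentinel
    g = z[-1]                               # gate: mass assigned to the sentinel
    p = g * np.asarray(p_vocab, dtype=float)
    for i, w in enumerate(src_ids):
        p[w] += z[i]                        # pointer mass flows onto context words
    return p

p = sentinel_mixture(ptr_scores=np.array([0.0, 0.0]),
                     sentinel_score=0.0,
                     p_vocab=np.array([0.25, 0.25, 0.5]),
                     src_ids=[0, 2])
```

The design point: the pointer component itself decides (via the sentinel) when to fall back on the vocabulary, rather than a separate RNN-state switch as in Gulcehre et al.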

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. See et al. [Christopher Manning] Get To The Point: Summarization with Pointer-Generator Networks (Apr 2017) proposed a novel architecture that augmented the standard sequence-to-sequence attentional model in two orthogonal ways. First, they used a hybrid pointer-generator network that could copy words from the source text via pointing, which aided accurate reproduction of information while retaining the ability to produce novel words through the generator. Second, they used coverage to keep track of what had been summarized, which discouraged repetition.

• “The pointer network (Vinyals et al., 2015) is a sequence-to-sequence model that uses the soft attention distribution of Bahdanau et al. (2015) to produce an output sequence consisting of elements from the input sequence. … Our approach is considerably different from that of Gulcehre et al. (2016) and Nallapati et al. (2016). Those works train their pointer components to activate only for out-of-vocabulary words or named entities (whereas we allow our model to freely learn when to use the pointer), and they do not mix the probabilities from the copy distribution and the vocabulary distribution. We believe the mixture approach described here is better for abstractive summarization – in section 6 we show that the copy mechanism is vital for accurately reproducing rare but in-vocabulary words, and in section 7.2 we observe that the mixture model enables the language model and copy mechanism to work together to perform abstractive copying.”

• “Our hybrid pointer-generator network facilitates copying words from the source text via pointing (Vinyals et al., 2015), which improves accuracy and handling of OOV words, while retaining the ability to generate new words. The network, which can be viewed as a balance between extractive and abstractive approaches, is similar to those of Gu et al. (2016) and Miao and Blunsom (2016), that were applied to short-text summarization. We propose a novel variant of the coverage vector (Tu et al., 2016) from Neural Machine Translation, which we use to track and control coverage of the source document. We show that coverage is remarkably effective for eliminating repetition.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences; for longer documents and summaries, however, these models often include repetitive and incoherent phrases. Paulus et al. [Richard Socher] A Deep Reinforced Model for Abstractive Summarization (Nov 2017) introduced a neural network model with a novel intra-attention that attended over the input and continuously generated output separately, and a new training method that combined standard supervised word prediction and reinforcement learning. Models trained only with supervised learning often exhibited “exposure bias”: ground truth was provided at each step during training, leaving the models poorly prepared to recover from their own errors at test time. However, when standard word prediction was combined with the global sequence prediction training of reinforcement learning, the resulting summaries became more readable.

• “To generate a token, our decoder uses either a token-generation softmax layer or a pointer mechanism to copy rare or unseen tokens from the input sequence. We use a switch function that decides at each decoding step whether to use the token generation or the pointer (Gulcehre et al., 2016; Nallapati et al., 2016).”

• “Neural Encoder-Decoder Sequence Models. Neural encoder-decoder models are widely used in NLP applications such as machine translation, summarization, and question answering. These models use recurrent neural networks (RNN), such as long-short term memory network (LSTM) to encode an input sentence into a fixed vector, and create a new output sequence from that vector using another RNN. To apply this sequence-to-sequence approach to natural language, word embeddings are used to convert language tokens to vectors that can be used as inputs for these networks.

Attention mechanisms make these models more performant and scalable, allowing them to look back at parts of the encoded input sequence while the output is generated. These models often use a fixed input and output vocabulary, which prevents them from learning representations for new words. One way to fix this is to allow the decoder network to point back to some specific words or sub-sequences of the input and copy them onto the output sequence. Gulcehre et al. (2016) and Merity et al. (2016) combine this pointer mechanism with the original word generation layer in the decoder to allow the model to use either method at each decoding step.”

Follow-on work (2018) by Socher and colleagues, Improving Abstraction in Text Summarization (Aug 2018), proposed two techniques to improve the level of abstraction of generated summaries. First, they decomposed the decoder into a contextual network that retrieved relevant parts of the source document, and a pretrained language model that incorporated prior knowledge about language generation. The decoder generated tokens by interpolating between selecting words from the source document via a pointer network as well as selecting words from a fixed output vocabulary. The contextual network had the sole responsibility of extracting and compacting the source document, whereas the language model was responsible for the generation of concise paraphrases. Second, they proposed a novelty metric that was optimized directly through policy learning (a reinforcement learning reward) to encourage the generation of novel phrases (summary abstraction).

[Image source. Click image to open in new window.]

Neural Abstractive Text Summarization with Sequence-to-Sequence Models (Dec 2018) provides an excellent review that includes pointer-generator mechanisms.

### Attention vs. Pointer Mechanisms

From Attention Is All You Need (Jun 2017; updated Dec 2017):

• “An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”
• “Rather than using attention to blend hidden units of an encoder into a context vector (see Fig. 8), the Pointer Net  [i.e., pointer] applies attention over the input elements to pick one as the output at each decoder step. … The attention mechanism is simplified, as Ptr-Net does not blend the encoder states into the output with attention weights. In this way, the output only responds to the positions but not the input content.”

[Image source. Click image to open in new window.]

From See et al. [Christopher Manning] Get To The Point: Summarization with Pointer-Generator Networks (Apr 2017):

• “Our hybrid pointer-generator network facilitates copying words from the source text via pointing (Vinyals et al., 2015), which improves accuracy and handling of out-of-vocabulary (OOV) words, while retaining the ability to generate new words.”

“Our baseline model is similar to that of Nallapati et al. (2016), and is depicted in Figure 2. The tokens of the article $\small w_i$ are fed one-by-one into the encoder (a single-layer bidirectional LSTM), producing a sequence of encoder hidden states $\small h_i$. On each step $\small t$, the decoder (a single-layer unidirectional LSTM) receives the word embedding of the previous word (while training, this is the previous word of the reference summary; at test time it is the previous word emitted by the decoder), and has decoder state $\small s_t$. The attention distribution $\small a^t$ is calculated as in Bahdanau et al. (2015):

$\small e^t_i = v^T \text{tanh} (W_h h_i + W_s s_t + b_{attn})$
$\small a^t = \text{softmax} (e^t)$

where $\small v,W_h,W_s$ and $\small b_{attn}$ are learnable parameters.

[Image source. Click image to open in new window.]

“The attention distribution can be viewed as a probability distribution over the source words, that tells the decoder where to look to produce the next word. Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $\small h^∗_t$:

$\small h^*_t = \sum_i a^t_i h_i$.

The context vector, which can be seen as a fixed-size representation of what has been read from the source for this step, is concatenated with the decoder state $\small s_t$ and fed through two linear layers to produce the vocabulary distribution $\small P_{vocab}$:

$\small P_{vocab} = \text{softmax}(V'(V[s_t,h^∗_t] + b) + b')$

where $\small V, V', b$ and $\small b'$ are learnable parameters. $\small P_{vocab}$ is a probability distribution over all words in the vocabulary, and provides us with our final distribution from which to predict words $\small w$:

$\small P(w) = P_{vocab}(w)$.

During training, the loss for timestep $\small t$ is the negative log likelihood of the target word $\small w^∗_t$ for that timestep:

$\small loss_t = -\text{log}P(w^∗_t)$

and the overall loss for the whole sequence is:

$\small \text{loss} = \frac{1}{T} \sum^T_{t=0} \text{loss}_t$."
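The attention and context-vector equations above transcribe almost directly into code; the toy values and shapes below are purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(h, s_t, W_h, W_s, b_attn, v):
    """a^t = softmax(e^t), where e^t_i = v^T tanh(W_h h_i + W_s s_t + b_attn)."""
    e = np.array([v @ np.tanh(W_h @ h_i + W_s @ s_t + b_attn) for h_i in h])
    return softmax(e)

def context_vector(a, h):
    """h*_t = sum_i a^t_i h_i  -- weighted sum of encoder hidden states."""
    return a @ h

rng = np.random.default_rng(2)
d = 5
h = rng.standard_normal((3, d))   # encoder hidden states h_i
s_t = rng.standard_normal(d)      # decoder state at step t
a = attention(h, s_t, rng.standard_normal((d, d)),
              rng.standard_normal((d, d)),
              rng.standard_normal(d), rng.standard_normal(d))
h_star = context_vector(a, h)     # fixed-size summary of what has been read
```

In the full model, `h_star` would be concatenated with `s_t` and passed through the two linear layers to produce $\small P_{vocab}$.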

• Pointer-generator networks.  Our pointer-generator network is a hybrid between our baseline and a pointer network (Vinyals et al., 2015), as it allows both copying words via pointing, and generating words from a fixed vocabulary. In the pointer-generator model (depicted in Figure 3) the attention distribution $\small a^t$ and context vector $\small h^∗_t$ are calculated as in Section 2.1. In addition, the generation probability $\small p_{gen} \in [0,1]$ for timestep $\small t$ is calculated from the context vector $\small h^∗_t$, the decoder state $\small s_t$ and the decoder input $\small x_t$:

$\small p_{gen} = \sigma (w^T_{h^*} h^*_t + w^T_s s_t + w^T_x x_t + b_{ptr})$

where vectors $\small w_{h^*}$, $\small w_s$, $\small w_x$ and scalar $\small b_{ptr}$ are learnable parameters and $\small \sigma$ is the sigmoid function. Next, $\small p_{gen}$ is used as a soft switch to choose between generating a word from the vocabulary by sampling from $\small P_{vocab}$, or copying a word from the input sequence by sampling from the attention distribution $\small a^t$. For each document let the extended vocabulary denote the union of the vocabulary, and all words appearing in the source document. We obtain the following probability distribution over the extended vocabulary:

$\small P(w) = p_{gen} P_{vocab}(w) + (1-p_{gen}) \sum_{i:w_i=w} a^t_i \qquad (9)$

Note that if $\small w$ is an out-of-vocabulary (OOV) word, then $\small P_{vocab}(w)$ is zero; similarly, if $\small w$ does not appear in the source document, then $\small \sum_{i:w_i=w} a^t_i$ is zero. The ability to produce OOV words is one of the primary advantages of pointer-generator models; by contrast, models such as our baseline are restricted to their preset vocabulary. The loss function is as described in equations (6) and (7), but with respect to our modified probability distribution $\small P(w)$ given in equation (9).
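Equation (9) is easy to transcribe; this toy sketch (illustrative vocabulary sizes, attention values, and ids) shows how copy mass reaches an OOV word over the extended vocabulary:

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attn, src_ids, ext_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a^t_i,
    over the extended vocabulary: fixed-vocabulary ids come first, then
    source-only OOV words. `src_ids` maps each source position to its
    extended-vocabulary id."""
    p = np.zeros(ext_size)
    p[:len(p_vocab)] = p_gen * p_vocab     # generator part; P_vocab(w) = 0 for OOV ids
    for i, w in enumerate(src_ids):
        p[w] += (1.0 - p_gen) * attn[i]    # copy part, summed over repeated source words
    return p

p_vocab = np.array([0.5, 0.3, 0.2])        # fixed vocabulary, |V| = 3
attn = np.array([0.6, 0.4])                # attention over a 2-token source
src_ids = [1, 3]                           # second source token is OOV (extended id 3)
P = final_distribution(p_gen=0.8, p_vocab=p_vocab, attn=attn,
                       src_ids=src_ids, ext_size=4)
# The OOV word (id 3) gets only copy mass: (1 - 0.8) * 0.4 = 0.08.
```

The result is a single distribution the decoder can sample from: in-vocabulary words receive both generator and copy mass, while source-only words are reachable purely through the pointer.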

[Image source. Click image to open in new window.]

Coverage mechanism.  Repetition is a common problem for sequence-to-sequence models (Tu et al., 2016;  Mi et al., 2016;  Sankaran et al., 2016;  Suzuki and Nagata, 2016), and is especially pronounced when generating multi-sentence text (see Figure 1). We adapt the coverage model of Tu et al. (2016) to solve the problem.”

• “The pointer network (Vinyals et al., 2015) is a sequence-to-sequence model that uses the soft attention distribution of Bahdanau et al. (2015) to produce an output sequence consisting of elements from the input sequence. The pointer network has been used to create hybrid approaches for NMT (neural machine translation; Gulcehre et al., 2016), language modeling (Merity et al., 2016), and summarization (Gu et al., 2016; Gulcehre et al., 2016; Miao and Blunsom, 2016; Nallapati et al., 2016; Zeng et al., 2016).”

• “Our approach is considerably different from that of Gulcehre et al. (2016) and Nallapati et al. (2016). Those works train their pointer components to activate only for out-of-vocabulary words or named entities (whereas we allow our model to freely learn when to use the pointer), and they do not mix the probabilities from the copy distribution and the vocabulary distribution. We believe the mixture approach described here is better for abstractive summarization – in Section 6 we show that the copy mechanism is vital for accurately reproducing rare but in-vocabulary words, and in Section 7.2 we observe that the mixture model enables the language model and copy mechanism to work together to perform abstractive copying.”

## Language Models

A recent and particularly exciting advance in NLP is the development of pretrained language models such as

• ELMo  (released in February 2018, by Allen NLP),
• ULMFiT  (May 2018, by fast.ai and Aylien Ltd.),
• GPT  (June 2018, by OpenAI),
• GPT-2  (February 2019, by OpenAI),
• BERT  (October 2018, by Google AI Language), and
• BioBERT  (Jan 2019, by Korea University).

Prior to 2018 almost all state of the art NLP solutions were highly specialized task-specific architectures. In contrast, in 2018 several task-agnostic architectures were published that achieved state of the art results across a wide range of competitive tasks, for the first time suggesting generalizable language understanding. All of these systems built on the language modeling objective: training a model to predict a word given its surrounding context. Because training examples were built from unlabeled corpora, much training data was available. ELMo trained representations with stacked bidirectional LSTMs, but still employed task-specific architectures on top of them. OpenAI GPT and BERT did away with this and instead trained task-agnostic Transformer (attentional) stacks that were fine-tuned together with a single dense layer for each downstream task. The latter mainly improved upon the former by joint conditioning on both preceding and following contexts. Critically, all systems allowed for contextualized word representations: they mapped each word occurrence to a vector, specifically considering the surrounding context. Much of their success was attributed to the ability to better disambiguate polysemous words in a given sentence. This representation approach is easily applicable for many NLP tasks, where inputs are usually sentences and context information is thus available. [Source: Section 2.2 in Learning Taxonomies of Concepts and not Words using Contextualized Word Representations: A Position Paper.]

Briefly summarizing the architectures associated with those models:

• ELMo uses a shallow concatenation of independently trained left-to-right and right-to-left multilayer LSTMs – two (stacked) Bi-LSTM layers.

• ULMFiT  is a transfer learning method that can be applied to text classification tasks. ULMFiT  employed an AWD-LSTM (ASGD Weight-Dropped LSTM): a weight-dropped LSTM trained with a variant of averaged stochastic gradient descent (ASGD).

• OpenAI GPT employed Google’s Transformer architecture (a seq2seq-based self-attention mechanism) in place of RNNs (e.g., the stacked Bi-LSTMs used in ELMo).

• The model architectures are different: ELMo uses a shallow concatenation of independently trained left-to-right and right-to-left multi-layer LSTMs (Bi-LSTMs), while GPT is a multi-layer transformer decoder.

• The use of contextualized embeddings in downstream tasks are different: ELMo feeds embeddings into models customized for specific tasks as additional features, while GPT fine-tunes the same base model for all end tasks.

• Generative pre-trained LM + task-specific fine-tuning was shown to work in ULMFiT, where fine-tuning proceeds gradually across all layers. ULMFiT focuses on training techniques for stabilizing the fine-tuning process.

• BERT employed a deeply bidirectional, unsupervised language representation, pretrained using only a plain text corpus: Wikipedia.

• The architecture employed by BERT is a bidirectional Transformer encoder, which demonstrates training efficiency and superior performance in capturing long-distance dependencies compared to an RNN architecture.

• The bidirectional encoder is a standout feature that differentiates BERT from:

• OpenAI GPT, a left-to-right Transformer; and,

• ELMo, a concatenation of independently trained left-to-right and right-to-left LSTMs.

[Image source. Click image to open in new window.]

Those language models demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks. [However, note (Kaiming He et al.) Rethinking ImageNet Pre-Training (Nov 2018).]

Character-based LM:

[Image source. Click image to open in new window.]

ImageNet [dataset, community challenges, discussion  (local copy)] and related image classification, segmentation, and captioning challenges have had an enormous impact on the advancement of computer vision and deep learning: deep learning architectures, the use of pretrained models, transfer learning, attentional mechanisms, etc. Studies arising from the ImageNet dataset have also identified gaps in our understanding of that success, leading to work demystifying how those deep neural networks classify images (explained very well in the excellent video below), and other issues including the vulnerability of deep learning to adversarial attacks. It is anticipated that pretrained language models will have a parallel impact in the NLP domain.

The availability of pretrained models is an important and practical advance in machine learning, since many current tasks in image processing and NLP language modeling are extremely computationally intensive; the models described in the following sections illustrate this.

### ELMo

ELMo (“Embeddings from Language Models”) was introduced in Deep Contextualized Word Representations (Feb 2018; updated Mar 2018) [project;  tutorials here and here;  discussion here, here and here] by authors at the Allen Institute for Artificial Intelligence and the Paul G. Allen School of Computer Science & Engineering at the University of Washington. ELMo modeled both the complex characteristics of word use (e.g., syntax and semantics), and how these characteristics varied across linguistic contexts (e.g., to model polysemy: words or phrases with different, but related, meanings). These word vectors were learned functions of the internal states of a deep bidirectional language model (two Bi-LSTM layers), which was pretrained on a large text corpus.

Unlike most widely used word embeddings, ELMo word representations were deep, in that they were a function of the internal, hidden layers of the bi-directional Language Model (biLM), providing a very rich representation. More specifically, ELMo learned a linear combination of the vectors stacked above each input word for each end task, which markedly improved performance over using just the top LSTM layer. These word vectors could be easily added to existing models, significantly improving the state of the art across a broad range of challenging NLP problems including question answering, recognizing textual entailment (i.e., natural language inference), semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis. The addition of ELMo representations alone significantly improved the state of the art in every case, including up to 20% relative error reductions.
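That learned linear combination is softmax-normalized scalar weights over the biLM's layers, times a task-specific scale $\small \gamma$ (Eq. 1 of arXiv:1802.05365). A minimal numpy sketch with toy layer vectors and untrained (uniform) weights:

```python
import numpy as np

def elmo_combine(layer_reps, scalar_weights, gamma):
    """Collapse the biLM's per-layer vectors for one token into a
    single task-specific vector: softmax-normalized weights over
    the layers, scaled by a learned scalar gamma."""
    s = np.exp(scalar_weights - scalar_weights.max())
    s = s / s.sum()                      # softmax over layers
    return gamma * sum(w * h for w, h in zip(s, layer_reps))

# Three layers (token embedding + two biLSTM layers), dimension 4.
layers = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
weights = np.zeros(3)                    # uniform before training
vec = elmo_combine(layers, weights, gamma=1.0)
print(vec)                               # each entry = (1+2+3)/3 = 2.0
```

During end-task training the scalar weights and $\small \gamma$ are learned per task, which is how different tasks emphasize different biLM layers.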

[Image source (based on data from Table 1 in arXiv:1802.05365). Click image to open in new window.]
Tasks: SQuAD: question answering; SNLI: textual entailment; SRL: semantic role labeling; Coref: coreference resolution; NER: named entity recognition; SST-5: sentiment analysis. SOTA: state of the art.

### ULMFiT

Jeremy Howard (fast.ai and the University of San Francisco) and Sebastian Ruder (Insight Centre for Data Analytics, NUI Galway, and Aylien Ltd.) described their Universal Language Model Fine-tuning (ULMFiT) model in Universal Language Model Fine-tuning for Text Classification (Jan 2018; updated May 2018) [project/code;  code here, here, here and here;  discussion here, here, here, here, here, here and here].

ULMFiT  employed the AWD-LSTM (ASGD Weight-Dropped LSTM) model described by Merity et al. (Richard Socher) in Regularizing and Optimizing LSTM Language Models (Aug 2017;  code). AWD-LSTM is a weight-dropped LSTM (a regular LSTM with no attention, short-cut connections, or other sophisticated additions), which uses DropConnect (dropped network connections; local copy) on hidden-to-hidden weights as a form of recurrent regularization. AWD-LSTM also uses NT-ASGD (Merity et al., 2017), a variant of averaged stochastic gradient descent (ASGD) in which the averaging trigger is determined by a non-monotonic condition, rather than being tuned by the user. [ASGD carries out iterations similar to SGD but, instead of returning the last iterate as the solution, returns an average of the iterates past a certain user-tuned threshold $\small T$. NT-ASGD is a non-monotonically triggered variant of ASGD that obviates the need for tuning $\small T$.]
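DropConnect on the hidden-to-hidden weights can be sketched in a few lines. This is a toy numpy version with standard dropout rescaling, not Merity et al.'s implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_connect(weight, p):
    """DropConnect: randomly zero individual *weights* (here the
    hidden-to-hidden matrix) rather than activations, rescaling
    survivors by 1/(1-p) as in standard dropout."""
    mask = rng.random(weight.shape) >= p
    return weight * mask / (1.0 - p)

W_hh = np.ones((4, 4))                  # toy hidden-to-hidden weights
W_dropped = drop_connect(W_hh, p=0.5)
print(W_dropped)                        # entries are 0.0 or 2.0
```

The key design point is that the same dropped weight matrix is used for every timestep of a sequence, so the recurrence itself is regularized without disrupting the LSTM's internal dynamics mid-sequence.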

ULMFiT  is a transfer learning method that can be applied to any task in NLP [though as of July 2018 the authors had only studied its use in classification tasks], together with key techniques for fine-tuning a language model. They also provided the fastai.text and fastai.lm_rnn modules necessary to train and use ULMFiT models.

ULMFiT significantly outperformed the state of the art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, ULMFiT matched the performance of training from scratch on 100x more data.
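One of ULMFiT's fine-tuning techniques, the slanted triangular learning-rate schedule, is a short linear warm-up followed by a long linear decay. A sketch using the paper's default constants (treat the exact values as assumptions):

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate (ULMFiT): rise linearly to
    lr_max over the first cut_frac of the T training steps, then
    decay linearly toward lr_max / ratio."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                               # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 1000
print(slanted_triangular_lr(0, T))     # start: lr_max / ratio
print(slanted_triangular_lr(100, T))   # peak: lr_max at the cut
```

The intuition is that the fine-tuned model should quickly converge toward a region suitable for the target task, then refine its parameters slowly so the pretrained knowledge is not catastrophically forgotten.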

[Image source. Click image to open in new window.]

### OpenAI GPT: Generative Pre-Trained Transformer

OpenAI’s GPT: Generative Pre-Trained Transformer  (Radford et al., Improving Language Understanding by Generative Pre-Training) (Jun 2018) [projectcode;  discussion: here and here] was introduced by Ilya Sutskever and colleagues at OpenAI. They demonstrated that large gains on diverse natural language understanding (NLU) tasks – such as textual entailment, question answering, semantic similarity assessment, and document classification – could be realized by a two stage training procedure: generative pretraining of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, they made use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

The OpenAI GPT (Generative Pre-Trained Transformer)  [aka: Finetuned Transformer LM ] provided a convincing example that pairing supervised learning methods with unsupervised pretraining works very well, demonstrating the effectiveness of their approach on a wide range of NLU benchmarks. Their general task-agnostic model outperformed discriminatively trained models that used architectures specifically crafted for each task, significantly improving upon the state of the art in 9 of the 12 tasks studied (see Improving Language Understanding with Unsupervised Learning). For instance, they achieved absolute improvements of 8.9% on commonsense reasoning, 5.7% on question answering, and 1.5% on textual entailment (natural language inference).

[Image source. Click image to open in new window.]

• The architecture employed in Improving Language Understanding by Generative Pre-Training, OpenAI GPT: Generative Pre-Trained Transformer, was Google’s Transformer, a seq2seq based self-attention mechanism. This model choice provided OpenAI with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, they utilized task-specific input adaptations derived from traversal-style approaches, which processed structured text input as a single contiguous sequence of tokens. As they demonstrated in their experiments, these adaptations enabled them to fine-tune effectively with minimal changes to the architecture of the pretrained model. OpenAI’s GPT model largely followed the original (Google’s Attention Is All You Need) Transformer work: OpenAI trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads) …

• The Transformer architecture does not use RNN (LSTM), relying instead on the use of the self-attention mechanism. In Improving Language Understanding by Generative Pre-Training, the OpenAI authors asserted that the use of LSTM models employed in ELMo and ULMFiT restricted the prediction ability of those language models to a short range. In contrast, OpenAI’s choice of Transformer networks allowed them to capture longer-range linguistic structure. Regarding better understanding of why the pretraining of language models by Transformer architectures was effective, they hypothesized that the underlying generative model learned to perform many of the evaluated tasks in order to improve its language modeling capability, and that the more structured attentional memory of the Transformer assisted in transfer, compared to LSTM.
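GPT's "masked self-attention heads" amount to a causal mask over the attention scores: position $\small i$ may attend only to positions $\small \le i$, so the model can be trained to predict the next token without seeing it. A toy numpy sketch, not OpenAI's implementation:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend only to
    positions <= i (the decoder-only Transformer's constraint)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(scores, mask):
    """Set disallowed positions to -inf before the softmax so they
    receive exactly zero attention weight."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # uniform raw scores
attn = masked_attention(scores, causal_mask(4))
print(attn[1])                            # [0.5, 0.5, 0.0, 0.0]
```

Because each row only depends on earlier positions, every prediction in a training sequence can be computed in parallel, unlike an RNN's sequential recurrence.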

Language Models are Unsupervised Multitask Learners (Radford et al.  |  OpenAI: Feb 2019) released GPT-2, a successor to GPT, trained (simply) on 40GB of Internet text to predict the next word.

• Abstract.  “Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 $\small F_1$ on the CoQA dataset – matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2 , is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.”

• code
• OpenAI blog post [local copy]
• Discussion: reddit:MachineLearning  |  reddit:LanguageTechnology  |  Hacker News  |  twitter;  |  OpenAI Guards Its ML Model Code & Data to Thwart Malicious Usage  |  Hacker News  |  Some thoughts on zero-day threats in AI, and OpenAI’s GP2  |  yadda yadda (yawn; mostly related to OpenAI's hype)

• Takeaways:

• OpenAI trained models of four different sizes (small, medium, large, and extra large). GPT-2, the largest, has 1.542 billion parameters: 12 times larger than GPT, and trained on ~10 times the amount of data. In comparison, Google’s BERT language model has 340 million parameters.

• GPT-2  is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains.

• The new model was trained on a new (unreleased) dataset called “WebText” (details in the paper) which contains eight million documents (40 GB of web page text) extracted from the social media platform Reddit. The model was designed for general purpose (not domain specific) synthesis, so it might underperform on domain specific texts.  [In this regard, note the improvement on biomedical NLP tasks by BioBERT (Jan 2019) – a language representation model for biomedical text mining pre-trained on PubMed/PubMed Central – over BERT and other state of the art baselines.]

• “We created a new dataset which emphasizes diversity of content, by scraping content from the Internet. In order to preserve document quality, we used only pages which have been curated/filtered by humans - specifically, we used outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl.” [Source]

• The new model follows the same architecture as the GPT model  –  a left-to-right Transformer  –  with a few modifications. [Transformer employs a self-attention mechanism, proven well suited for language understanding tasks.] OpenAI Researcher Alec Radford told Synced that novel techniques include pre-activation, zero domain transfer, and zero task transfer.

• OpenAI trained a large scale unsupervised language model which:

• generated coherent paragraphs of text: GPT-2  generates synthetic text samples in response to the model being primed with an arbitrary input.

• GPT-2  displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2  outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2  begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

• achieved state of the art performance on seven of eight traditional language modeling benchmarks without any modifications or fine-tuning; and,

• performed rudimentary reading comprehension, machine translation, question answering, and summarization – all without task specific training (direct supervised learning).
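OpenAI's released samples were generated with truncated (top-k) sampling: the next token is drawn only from the k highest-probability candidates. A minimal sketch of the idea (the logits and the value of k below are illustrative, not GPT-2's):

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample the next token id from only the k most probable
    candidates: renormalize a softmax over the top-k logits and
    draw from it, suppressing the long low-probability tail."""
    top = np.argsort(logits)[-k:]            # indices of the k best logits
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()                             # softmax restricted to top-k
    return int(rng.choice(top, p=p))

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, -1.0, 1.5, 0.0])  # toy vocabulary of 5 tokens
token = top_k_sample(logits, k=2, rng=rng)
print(token)                                  # 1 or 3, never the others
```

Truncating the distribution trades a little diversity for coherence: rare tokens that would occasionally derail a generated paragraph can never be sampled.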

### BERT

In October 2018 Google Language AI (Devlin et al.) presented BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [code: Google Research | author’s discussion]. Unlike recent language representation models, BERT – which stands for Bidirectional Encoder Representations from Transformers – is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, pre-trained BERT representations can be fine-tuned with just one additional output layer to create state of the art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT obtained new state of the art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering test $\small F_1$ to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

• [Google AI Blog: Nov 2018 – short, very descriptive summary (local copy)] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

“… What Makes BERT  Different? BERT builds upon recent work in pre-training contextual representations - including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia). …”

“…The Strength of Bidirectionality. If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model. To solve this problem, we use the straightforward technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words. For example:

[Image source. Click image to open in new window.]
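That masking procedure can be sketched as follows. (BERT's full recipe also replaces some selections with random or unchanged tokens; this toy version uses [MASK] only, and raises the masking rate so the tiny example actually masks something.)

```python
import random

random.seed(0)

def mask_tokens(tokens, mask_rate=0.15):
    """BERT-style masked-LM input: hide a fraction of tokens and
    record their originals as prediction targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # target the model must recover
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

sent = "the man went to the store to buy milk".split()
masked, targets = mask_tokens(sent, mask_rate=0.3)  # raised for the demo
print(masked, targets)
```

The model is then trained to predict each entry of `targets` from the full masked sequence, which is what lets every prediction condition on both left and right context.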

“While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network. BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: Given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:

[Image source. Click image to open in new window.]
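Generating such sentence pairs from a raw corpus takes only a few lines; a toy sketch of the 50/50 IsNext/NotNext construction (a real implementation would exclude the true next sentence from the random draw):

```python
import random

def make_nsp_example(sentences, i, rng):
    """Next-sentence-prediction pair from any corpus: half the time
    B is the sentence that really follows A (IsNext), half the time
    it is a randomly drawn sentence (NotNext)."""
    a = sentences[i]
    if rng.random() < 0.5:
        return a, sentences[i + 1], "IsNext"
    j = rng.randrange(len(sentences))
    return a, sentences[j], "NotNext"

rng = random.Random(0)
corpus = ["the man went to the store",
          "he bought a gallon of milk",
          "penguins are flightless birds"]
a, b, label = make_nsp_example(corpus, 0, rng)
print(a, "->", b, ":", label)
```

Because the labels come for free from sentence order, this auxiliary objective needs no annotation at all, only a monolingual corpus.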

“… On SQuAD v1.1, BERT achieves 93.2% $\small F_1$ score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%. … BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them. …”

• Community discussion here, here, here, here, and here: Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks

Applications:

• BioBERT: Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining (Jaewoo Kang and colleagues at Korea University: Jan 2019) [code; pretrained model;  discussion]  |  “… BioBERT effectively transfers the knowledge of large amount of biomedical texts into biomedical text mining models. While BERT also shows competitive performances with previous state-of-the-art models, BioBERT significantly outperforms them on three representative biomedical text mining tasks including biomedical named entity recognition (1.86% absolute improvement), biomedical relation extraction (3.33% absolute improvement), and biomedical question answering (9.61% absolute improvement) with minimal task-specific architecture modifications. …”

• Multi-Task Deep Neural Networks for Natural Language Understanding (Microsoft Research, Jan 2019) [“code and pretrained models will be made publicly available;” their implementation of MT-DNN is based on this PyTorch implementation of BERT, parametrized as described in Section 4.2] “… we present a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. … MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.2% (1.8% absolute improvement). We also demonstrate using the SNLI and SciTail datasets that the representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
• End-to-End Open-Domain Question Answering with BERTserini (Feb 2019).  “We demonstrate an end-to-end question answering system that integrates BERT with the open-source Anserini information retrieval toolkit.”

[Image source. Click image to open in new window.]
• Anserini is an open-source information retrieval toolkit built on Lucene that aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications. This effort grew out of a reproducibility study of various open-source retrieval engines in 2016 (Lin et al., 2016). Additional details can be found in a short paper (Yang et al., 2017) and a journal article (Yang et al., 2018).

• BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning (Feb 2019) explored the multi-task learning setting for BERT model on the GLUE benchmark, and how to best add task-specific parameters to a pre-trained BERT network, with a high degree of parameter sharing between tasks. They introduced new adaptation modules, PALs or projected attention layers, which used a low-dimensional multi-head attention mechanism – based on the idea that it was important to include layers with inductive biases useful for the input domain. By using PALs in parallel with BERT layers, they matched the performance of fine-tuned BERT on the GLUE benchmark with roughly 7 times fewer parameters, and obtained state of the art results on the Recognizing Textual Entailment dataset.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model (Kyunghyun Cho; Feb 2019) [code; notebook]. “We show that BERT (Devlin et al., 2018) is a Markov random field language model. Formulating BERT in this way gives way to a natural procedure to sample sentences from BERT. We sample sentences from BERT and find that it can produce high-quality, fluent generations. Compared to the generations of a traditional left-to-right language model, BERT generates sentences that are more diverse but of slightly worse quality.”

Non-author code:

[Image source. BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

In the following figure, note that the results in Table 2 were on the less challenging (vis-à-vis SQuAD2.0) SQuAD1.1 QA dataset:

[Image source. Click image to open in new window.]

Some highlights, excerpted from Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks:

• NLP researchers are exploiting today’s large amount of available language data and maturing transfer learning techniques to develop novel pre-training approaches. They first train a model architecture on one language modeling objective, and then fine-tune it for a supervised downstream task. Aylien Research Scientist Sebastian Ruder suggests in his blog that pre-trained models may have “the same wide-ranging impact on NLP as pretrained ImageNet models had on computer vision.”

• The BERT model architecture is a bidirectional Transformer encoder. The use of [Google’s] Transformer comes as no surprise – this is a recent trend due to Transformers’ training efficiency and superior performance in capturing long-distance dependencies compared to a recurrent neural network architecture. The bidirectional encoder meanwhile is a standout feature that differentiates BERT from OpenAI GPT: Generative Pre-Trained Transformer [i.e. OpenAI Transformer – a left-to-right Transformer] and ELMo (a concatenation of independently trained left-to-right and right-to-left LSTMs).

• BERT is a huge model: 24 Transformer blocks, a hidden size of 1024, and 340M parameters.

• The model was pre-trained for 40 epochs over a 3.3 billion word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words), on 16 Cloud TPUs.

• In the pre-training process, researchers took an approach which involved randomly masking a percentage of the input tokens (15 percent) to train a deep bidirectional representation. They refer to this method as a Masked Language Model (MLM).

• A language model pre-trained only on word prediction cannot capture relationships between sentences, which are vital to tasks such as question answering and natural language inference. Researchers therefore also pre-trained BERT on a binarized next-sentence-prediction task that can be trivially generated from any monolingual corpus.

• The fine-tuned model for different datasets improves the GLUE benchmark to 80.4 percent (7.6 percent absolute improvement), MultiNLI accuracy to 86.7 percent (5.6 percent absolute improvement), the SQuAD1.1 question answering test $\small F_1$ to 93.2 (1.5 absolute improvement), and so on over a total of 11 language tasks.

Passage Re-ranking with BERT (Jan 2019) [code], by Kyunghyun Cho and colleague, described a simple reimplementation of BERT for query-based passage reranking. Their system was the state of the art on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% (relative) in MRR@10.

• “A simple question-answering pipeline consists of three main stages. First, a large number (for example, a thousand) of possibly relevant documents to a given question are retrieved from a corpus by a standard mechanism, such as BM25. In the second stage, passage re-ranking, each of these documents is scored and re-ranked by a more computationally-intensive method. Finally, the top ten or fifty of these documents will be the source for the candidate answers by an answer generation module. In this paper, we describe how we implemented the second stage of this pipeline, passage re-ranking.”
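That two-stage structure reduces to "retrieve cheaply, then re-score the survivors with the expensive model." A toy sketch with simple word-overlap scorers standing in for BM25 and BERT (the scorers and names here are illustrative):

```python
def rerank(question, candidates, retrieve_score, rerank_score, k=50):
    """Two-stage pipeline: a cheap retrieval scorer (stand-in for
    BM25) narrows the candidate set to k documents, then an
    expensive scorer (stand-in for BERT) re-orders only those."""
    first = sorted(candidates,
                   key=lambda d: retrieve_score(question, d),
                   reverse=True)[:k]
    return sorted(first,
                  key=lambda d: rerank_score(question, d),
                  reverse=True)

def overlap(q, d):
    """Toy retrieval score: number of shared words."""
    return len(set(q.split()) & set(d.split()))

docs = ["cats sleep a lot", "dogs bark", "cats and dogs"]
ranked = rerank("do cats sleep", docs, overlap,
                lambda q, d: overlap(q, d) / len(d.split()), k=2)
print(ranked)  # ['cats sleep a lot', 'cats and dogs']
```

The design point is cost: the expensive scorer is applied to k documents rather than the whole corpus, which is what makes running BERT per (query, passage) pair feasible.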

[Image source. Click image to open in new window.]

Assessing BERT’s Syntactic Abilities (Jan 2019) [code] assessed the extent to which BERT captured English syntactic phenomena, using (1) naturally-occurring subject-verb agreement stimuli; (2) “colorless green ideas” subject-verb agreement stimuli, in which content words in natural sentences are randomly replaced with words sharing the same part-of-speech and inflection; and (3) manually crafted stimuli for subject-verb agreement and reflexive anaphora phenomena. In each case BERT performed remarkably well: the out-of-the-box models (without any task-specific fine-tuning) performed very well on all the syntactic tasks.

[Image source. Click image to open in new window.]

### Transformer

In mid-2017, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The inherently sequential nature of RNNs precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. In June 2017 Vaswani et al. at Google proposed a new simple network architecture, Transformer, that was based solely on attention mechanisms – entirely dispensing with recurrence and convolutions, and allowing significantly more parallelization (Attention Is All You Need (Jun 2017; updated Dec 2017) [code]). Transformer has been shown to perform strongly on machine translation, document generation, syntactic parsing and other tasks. Experiments on two machine translation tasks showed these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Transformer also generalized well to other tasks; for example, it was successfully applied to English constituency parsing, both with large and limited training data.
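The core operation, scaled dot-product attention, is only a few lines: $\small \text{Attention}(Q,K,V)=\text{softmax}(QK^\top/\sqrt{d_k})\,V$, with every output row computed in parallel rather than sequentially. A minimal single-head numpy sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each output row is a
    weighted average of the value rows, with weights given by
    query-key similarity. No recurrence anywhere."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((5, 8))   # 5 tokens, d_k = 8 (self-attention)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                           # (5, 8)
```

Because the whole computation is a pair of matrix products, all positions in a sequence are processed at once, which is exactly the parallelism that recurrent encoders cannot offer. The full Transformer runs several such heads in parallel ("multi-head") over learned projections of the inputs.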

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Transformer is discussed in Google AI’s August 2017 blog post Transformer: A Novel Neural Network Architecture for Language Understanding:

• “… The animation below illustrates how we apply the Transformer to machine translation. Neural networks for machine translation typically contain an encoder reading the input sentence and generating a representation of it. A decoder then generates the output sentence word by word while consulting the representation generated by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled balls. This step is then repeated multiple times in parallel for all words, successively generating new representations. The decoder operates similarly, but generates one word at a time, from left to right. It attends not only to the other previously generated words, but also to the final representations generated by the encoder.”

[Image source. Click image to open in new window.]

• In the literature, the attention mechanism at the core of Google’s Transformer is also referred to as multi-head dot product attention (MHDPA), or simply “self-attention.”

• Due to the absence of recurrent layers in the model, the Transformer model trained significantly faster and outperformed all previously reported ensembles.

• Alexander Rush at HarvardNLP provides an excellent web page, The Annotated Transformer, complete with discussion and code (an “annotated” version of the paper in the form of a line-by-line implementation) [papercode]!

• Attention Is All You Need coauthor Łukasz Kaiser posted slides describing this work (Tensor2Tensor Transformers: New Deep Models for NLP) [local copy; discussion].

Later in 2018, Li et al. [Lukasz Kaiser; Samy Bengio | Google Research/Brain] described “Area Attention” (Oct 2018).

“Existing attention mechanisms are mostly item-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. These improvements are obtainable with a basic form of area attention that is parameter free. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.”
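The summed-area-table trick is easiest to see in one dimension: a single cumulative sum lets the aggregate key of any contiguous span be read off in constant time, instead of re-summing per span. A minimal sketch (not the authors' implementation):

```python
import numpy as np

def span_sums(keys, max_width):
    """Enumerate the sum over every contiguous span up to max_width
    using one prefix (cumulative) sum: sum(keys[i:j]) is just
    prefix[j] - prefix[i], an O(1) lookup per span."""
    prefix = np.concatenate([[0.0], np.cumsum(keys)])
    spans = {}
    n = len(keys)
    for w in range(1, max_width + 1):
        for i in range(n - w + 1):
            spans[(i, i + w)] = prefix[i + w] - prefix[i]
    return spans

keys = np.array([1.0, 2.0, 3.0, 4.0])   # toy 1-D keys, one per item
s = span_sums(keys, max_width=2)
print(s[(1, 3)])                         # 2.0 + 3.0 = 5.0
```

In 2-D (e.g., image memories) the same idea uses a 2-D summed area table, so the aggregate over any rectangle is four lookups; the per-area keys then feed a standard attention softmax over areas instead of single items.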

[Image source. Click image to open in new window.]

Linguistically-Informed Self-Attention for Semantic Role Labeling (2018) [code, discussion (item 10)], by Andrew McCallum and colleagues, presented Linguistically-Informed Self-Attention  (LISA): a neural network model that combined multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection and semantic role labeling (SRL). Unlike previous models which required significant pre-processing to prepare linguistic features, LISA could incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection and role labeling for all predicates. Syntax was incorporated by training one attention head to attend to the syntactic parents of each token. Moreover, if a high-quality syntactic parse was already available, it could be beneficially injected at test time without re-training the SRL model. In experiments on CoNLL-2005 SRL, LISA achieved new state of the art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 $\small F_1$ absolute higher than the previous state of the art on newswire and more than 3.5 $\small F_1$ on out-of-domain data, nearly a 10% reduction in error. On CoNLL-2012 English SRL they also showed an improvement of more than 2.5 $\small F_1$. LISA also outperformed the state of the art with contextually-encoded (ELMo) word representations, by nearly 1.0 $\small F_1$ on news and more than 2.0 $\small F_1$ on out-of-domain text.

• “This paper has a lot to like: a Transformer trained jointly on both syntactic and semantic tasks; the ability to inject high-quality parses at test time; and out-of-domain evaluation. It also regularizes the Transformer’s multi-head attention to be more sensitive to syntax by training one attention head to attend to the syntactic parents of each token. We will likely see more examples of Transformer attention heads used as auxiliary predictors focusing on particular aspects of the input.” [Source]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Transformer networks have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the language modeling setting. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Jan 2019  |  ICLR 2019) [code; OpenReview; Google AI blog;  discussion here and here;  mentioned here] – by William Cohen, Quoc Le, Ruslan Salakhutdinov and colleagues – proposed a novel neural architecture, Transformer-XL, that enabled the Transformer to learn dependency beyond a fixed length without disrupting temporal coherence. Transformer-XL consists of a segment-level recurrence mechanism and a novel positional encoding scheme that not only enabled capturing longer-term dependency, but also resolved the problem of context fragmentation. As a result, Transformer-XL learned dependencies that were about 80% longer than RNNs and 450% longer than vanilla Transformers, achieving better performance on both short and long sequences, and was up to 1,800+ times faster than the vanilla Transformer. They additionally improved the state of the art results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.

• “We propose a novel architecture, Transformer-XL, for language modeling with self-attention architectures beyond a fixed-length context. Our main technical contributions include introducing the notion of recurrence in a purely self-attentive model and deriving a novel positional encoding scheme. These two techniques form a complete set of solutions, as any one of them alone does not address the issue of fixed-length contexts. Transformer-XL is the first self-attention model that achieves substantially better results than RNNs on both character-level and word-level language modeling. Transformer-XL is also able to model longer-term dependency than RNNs and Transformer, and achieves substantial speedup during evaluation compared to vanilla Transformers.”
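The segment-level recurrence is easy to caricature: hidden states computed for one segment are cached and reused, gradient-free, as extra context when processing the next segment, so the effective context grows beyond the segment length. A toy sketch (the names and the trivial "layer" are mine, not the paper's code):

```python
# Illustrative sketch of segment-level recurrence: cache each segment's
# hidden states and expose them as extra, constant (no-gradient) context
# for the next segment, keeping only the most recent mem_len states.

def process_segment(segment, memory, layer_fn, mem_len=4):
    context = memory + segment                 # cached states + current input
    hidden = [layer_fn(tok, context) for tok in segment]
    new_memory = (memory + hidden)[-mem_len:]  # bounded cache of states
    return hidden, new_memory

# Toy "layer": each output is the token plus the mean of its context.
layer = lambda tok, ctx: tok + sum(ctx) / len(ctx)

memory = []
for seg in [[1.0, 2.0], [3.0, 4.0]]:
    hidden, memory = process_segment(seg, memory, layer)
print(memory)  # the second segment's outputs were conditioned on the first's states
```

The real model pairs this cache with relative positional encodings so that reused states remain position-consistent; this sketch only shows the caching logic.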

[Image source. Click image to open in new window.]

The Evolved Transformer (Feb 2019) by Google Brain presented the Evolved Transformer, demonstrating consistent improvement over Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At the big model size, Evolved Transformer was twice as efficient as the Transformer in FLOPS without loss in quality. At a much smaller (mobile-friendly) model size of ~7M parameters, Evolved Transformer outperformed Transformer by 0.7 BLEU on WMT’14 English-German.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

### Trellis Networks

Trellis Networks for Sequence Modeling (Oct 2018; note also the Appendices) [code, discussion] by authors at Carnegie Mellon University and Intel Labs presented trellis networks, a new architecture for sequence modeling. A trellis network is a temporal convolutional network with special structure, characterized by weight tying across depth and direct injection of the input into deep layers. The authors showed that truncated recurrent networks are equivalent to trellis networks with special sparsity structure in their weight matrices; thus, trellis networks with general weight matrices generalize truncated recurrent networks. They leveraged those connections to design high-performing trellis networks that absorb structural and algorithmic elements from both recurrent and convolutional models. Experiments demonstrated that trellis networks outperform the current state of the art on a variety of challenging benchmarks, including word-level language modeling on Penn Treebank and WikiText-103, character-level language modeling on Penn Treebank, and stress tests designed to evaluate long-term memory retention.
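The two structural ingredients, weight tying across depth and direct input injection, can be sketched with a toy scalar "convolution" (illustrative only; the real model uses gated activations and full weight matrices):

```python
# Toy sketch of the two structural ingredients of a trellis network:
# the SAME convolution weights are reused at every depth, and the raw
# input is injected into every layer, not just the first one.

import math

def conv_step(left, right, w):            # width-2 "convolution" on a pair
    return math.tanh(w[0] * left + w[1] * right)

def trellis_forward(x, w_hidden, w_input, depth):
    h = x[:]                               # layer-0 activations = input
    for _ in range(depth):                 # weights are tied across all depths
        h = [conv_step(h[i], h[i + 1], w_hidden)
             + w_input * x[i + 1]          # direct input injection per layer
             for i in range(len(h) - 1)]
    return h

out = trellis_forward([0.1, 0.2, 0.3, 0.4], (0.5, 0.5), 0.1, depth=2)
print(len(out))  # this toy, unpadded kernel shrinks the sequence by one per layer
```

The weight tying is what makes the equivalence to truncated RNNs possible: unrolling an RNN over time also reuses one weight set at every step.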

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

“We presented trellis networks, a new architecture for sequence modeling. Trellis networks form a structural bridge between convolutional and recurrent models. …”

“There are many exciting opportunities for future work. First, we have not conducted thorough performance optimizations on trellis networks. … Future work can also explore acceleration schemes that speed up training and inference. Another significant opportunity is to establish connections between trellis networks and self-attention-based architectures (Transformers), thus unifying all three major contemporary approaches to sequence modeling. Finally, we look forward to seeing applications of trellis networks to industrial-scale challenges such as machine translation.”

Neural language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. In The Importance of Generation Order in Language Modeling, Google Brain studied the influence of token generation order on model quality via a novel two-pass language model that produced partially-filled sentence “templates” and then filled in missing tokens. The most effective strategy generated function words in the first pass, followed by content words in the second.
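The two-pass scheme can be sketched as follows (a hypothetical illustration: the function-word list and the fill model here are stand-ins, whereas the actual paper learns both passes as neural language models):

```python
# Hypothetical sketch of two-pass generation: pass one emits frequent
# "function" words and leaves blanks in a template; pass two fills the
# blanks with content words, conditioned on what is already generated.

FUNCTION_WORDS = {"the", "a", "of", "to", "is"}

def first_pass(tokens):
    """Keep function words, blank out content words."""
    return [t if t in FUNCTION_WORDS else "__" for t in tokens]

def second_pass(template, content_model):
    """Fill each blank left-to-right using a (toy) content-word model."""
    out = []
    for t in template:
        out.append(content_model(out) if t == "__" else t)
    return out

template = first_pass(["the", "cat", "is", "a", "pet"])
print(template)                                   # ['the', '__', 'is', 'a', '__']
print(second_pass(template, lambda prefix: "cat"))
```

The point of the experiment was precisely which partition of the vocabulary goes into the first pass; "function words first" was the winner.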

The Fine Tuning Language Models for Multilabel Prediction GitHub repository lists recent, leading language models, for which the authors examine the ability to use generative pretraining with language modeling objectives across a variety of languages to improve language understanding. Particular attention is paid to transfer learning to low-resource languages, where labeled data is scarce.

Adaptive Input Representations for Neural Language Modeling (Facebook AI Research; Oct 2018) [mentioned] introduced adaptive input embeddings, which extended the adaptive softmax of Grave et al. (2017) to input word representations of variable capacity. This factorization assigned more capacity to frequent words and reduced the capacity for less frequent words, with the benefit of reducing overfitting to rare words. There were several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units; the authors performed a systematic comparison of popular choices for a self-attentional architecture. Their experiments showed that models equipped with adaptive embeddings were more than twice as fast to train as the popular character-input CNN while having fewer parameters. They achieved a new state of the art on the WikiText-103 benchmark of 20.51 perplexity, improving the next best known result by 8.7 perplexity, and a state of the art of 24.14 perplexity on the Billion Word benchmark.
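The adaptive-input idea can be sketched as frequency-banded embedding tables of decreasing dimension, each followed by a linear projection up to the shared model dimension (all sizes and weights here are invented for illustration):

```python
# Sketch of adaptive input embeddings: frequent words (low frequency rank)
# get high-dimensional embeddings, rare words get smaller ones, and a
# per-band projection maps every band to the same model dimension.

import random
random.seed(0)

MODEL_DIM = 8
BANDS = [(0, 2, 8), (2, 4, 4), (4, 6, 2)]   # (start_rank, end_rank, embed_dim)

# Per-band embedding tables and projection matrices (random toy weights).
tables = {d: {r: [random.random() for _ in range(d)]
              for r in range(s, e)} for s, e, d in BANDS}
projections = {d: [[random.random() for _ in range(d)] for _ in range(MODEL_DIM)]
               for _, _, d in BANDS}

def embed(rank):
    for s, e, d in BANDS:
        if s <= rank < e:
            v = tables[d][rank]                       # small band embedding
            return [sum(w * x for w, x in zip(row, v))  # project to MODEL_DIM
                    for row in projections[d]]
    raise KeyError(rank)

print(len(embed(0)), len(embed(5)))  # both are MODEL_DIM after projection
```

Most parameters thus live in the small head band of frequent words, which is where they pay off; the long tail of rare words is kept cheap.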

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Grave et al. (Facebook AI Research) Efficient softmax approximation for GPUs (Sep 2016; updated Jun 2017) [code]

“We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computation time. Our approach further reduces the computational time by exploiting the specificities of modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax.”

[Image source. Note similarity to Fig. 1 / use in Adaptive Input Representations for Neural Language Modeling. Click image to open in new window.]

### Probing the Effectiveness of Pretrained Language Models

Contextual word representations derived from pretrained bidirectional language models (biLM) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks, including question answering, entailment and sentiment classification, constituency parsing, named entity recognition, and text classification. However, many questions remain as to how and why these models are so effective.

Deep RNNs Encode Soft Hierarchical Syntax (May 2018), by Terra Blevins, Omer Levy, and Luke Zettlemoyer at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, evaluated how well a simple feedforward classifier could detect syntax features (part of speech tags as well as various levels of constituent labels) from the word representations produced by the RNN layers of deep NLP models trained on the tasks of dependency parsing, semantic role labeling, machine translation, and language modeling. They demonstrated that deep RNNs trained on NLP tasks learned internal representations that captured soft hierarchical notions of syntax across different layers of the model (i.e., the representations taken from deeper layers of the RNNs perform better on higher-level syntax tasks than those from shallower layers), without explicit supervision. These results provided some insight as to why deep RNNs are able to model NLP tasks without annotated linguistic features. ELMo, for example, represents each word using a task-specific weighted sum of the language model's hidden layers; i.e., rather than using only the top layer, ELMo selects which of the language model's internal layers contain the most relevant information for the task at hand.
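The probing methodology, fitting only a simple classifier on frozen layer activations, can be sketched as follows (toy data and a perceptron-style probe; the paper uses a feedforward classifier on real RNN states):

```python
# Minimal "diagnostic probe": freeze some layer's representations and fit
# only a simple linear classifier on top of them, to measure how much
# (e.g. POS) information that layer encodes. Toy data throughout.

def train_probe(reps, labels, epochs=50, lr=0.1):
    """Perceptron-style linear probe; the representations never change."""
    dim = len(reps[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(reps, labels):          # y is 0 or 1 (e.g. NOUN vs VERB)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:                       # only the probe's weights move
                w = [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
                b += lr * (y - pred)
    return w, b

def accuracy(w, b, reps, labels):
    hits = sum((1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
               for x, y in zip(reps, labels))
    return hits / len(labels)

# Toy "layer activations": dimension 0 happens to encode the POS distinction.
reps   = [[1.0, 0.2], [0.9, 0.8], [-1.0, 0.3], [-0.8, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_probe(reps, labels)
print(accuracy(w, b, reps, labels))
```

High probe accuracy at a given depth is then read as evidence that the corresponding syntactic information is (linearly) present in that layer.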

An extremely interesting follow-on paper from Luke Zettlemoyer at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, with colleagues at the Allen Institute for Artificial Intelligence – Dissecting Contextual Word Embeddings: Architecture and Representation (August 2018) [note also the Appendices in that paper] – presented a detailed empirical study of how the choice of neural architecture (e.g., LSTM, CNN, or self-attention) influences both end task accuracy and qualitative properties of the representations that are learned. They showed there is a tradeoff between speed and accuracy, but all architectures learned high quality contextual representations that outperformed word embeddings for four challenging NLP tasks (natural language inference/textual entailment; semantic role labeling; constituency parsing; named entity recognition).

• That study also showed that deep biLMs learned a rich hierarchy of contextual information, both at the word and span level, that was captured in three disparate types of network architectures (LSTM, CNN, or self-attention). In every case, the learned representations captured a rich hierarchy of contextual information throughout the layers of the network, in an analogous manner to how deep CNNs trained for image classification learn a hierarchy of image features (Zeiler and Fergus, 2014). For example, they showed that in contrast to traditional word vectors which encode some semantic information, the word embedding layer of deep biLMs focused exclusively on word morphology. Moving upward in the network, the lowest contextual layers of the biLM focused on local syntax, while the upper layers could be used to induce more semantic content such as within-sentence pronominal coreferent clusters. They also showed that the biLM activations could be used to form phrase representations useful for syntactic tasks.

Together, these results suggest that large-scale biLMs, independent of architecture, learn much more about the structure of language than previously appreciated.

Regarding the following figure, note the more-or-less similar behavior of the three models on various tasks that differ in difficulty/complexity; in particular, note the changes in accuracy throughout the depth (layers) of those models. Layer-wise quantitative data are provided in the Appendix of that paper.

[Image source. Click image to open in new window.]

Evaluation of Sentence Embeddings in Downstream and Linguistic Probing Tasks (Jun 2018) surveyed recent unsupervised word embedding models, including fastText, ELMo, InferSent, and other models (discussed elsewhere in this REVIEW). They noted that two main challenges exist when learning high-quality representations: they should capture semantics and syntax, and the different meanings a word can represent in different contexts (polysemy).

ELMo addressed both of those issues. As in fastText, ELMo breaks the tradition of word embeddings by incorporating sub-word units, but ELMo also has some fundamental differences from previous shallow representations such as fastText or Word2Vec. ELMo uses a deep representation by incorporating internal representations of the LSTM network, therefore capturing the meaning and syntactical aspects of words. Since ELMo is based on a language model, each token representation is a function of the entire input sentence, which can overcome the limitations of previous word embeddings where each word is usually modeled as an average of their multiple contexts. ELMo embeddings provide a better understanding of the contextual meaning of a word, as opposed to traditional word embeddings that are not only context-independent but have a very limited definition of context.
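ELMo's task-specific weighted sum of layers (its "scalar mix") reduces to a softmax over per-layer weights plus a global scale; a minimal sketch with illustrative, unlearned weights:

```python
# Sketch of ELMo-style layer mixing: a softmax-normalized weight per biLM
# layer and a global scalar gamma combine the hidden layers into one
# representation per token. In ELMo, s and gamma are learned per task.

import math

def scalar_mix(layer_reps, s, gamma):
    """layer_reps: one vector per biLM layer, for a single token."""
    exp = [math.exp(si) for si in s]
    w = [e / sum(exp) for e in exp]           # softmax over layer weights
    dim = len(layer_reps[0])
    return [gamma * sum(w[j] * layer_reps[j][i] for j in range(len(w)))
            for i in range(dim)]

layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # embedding, LSTM-1, LSTM-2
print(scalar_mix(layers, s=[0.0, 0.0, 0.0], gamma=1.0))  # uniform mix
```

A syntax-heavy task can thus learn to up-weight lower layers while a semantic task up-weights higher ones, which is exactly the behavior the probing studies above observe.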

In that paper, it was also interesting to see how the different models performed on different tasks. For example:

• As discussed in Section 5.1, ELMo (a language model that employs two Bi-LSTM layers), the Transformer (attention-based) version of USE (Universal Sentence Encoder), and InferSent (a Bi-LSTM trained on the SNLI dataset) generally performed well on downstream classification tasks (Table 6).

“As seen in Table 6, although no method had a consistent performance among all tasks, ELMo achieved best results in 5 out of 9 tasks. Even though ELMo was trained on a language model objective, it is important to note that in this experiment a bag-of-words approach was employed. Therefore, these results are quite impressive, which lead us to believe that excellent results can be obtained by integrating ELMo and [it’s] trainable task-specific weighting scheme into InferSent. InferSent achieved very good results in the paraphrase detection as well as in the SICK-E (entailment). We hypothesize that these results were due to the similarity of these tasks to the tasks were InferSent was trained on (SNLI and MultiNLI). … The Universal Sentence Encoder (USE) model with the Transformer encoder also achieved good results on the product review (CR) and on the question-type (TREC) tasks. Given that the USE model was trained on SNLI as well as on web question-answer pages, it is possible that these results were also due to the similarity of these tasks to the training data employed by the USE model.”

• As discussed in Section 5.2, USE-Transformer and InferSent performed the best on semantic relatedness and textual similarity tasks.

• As discussed in Section 5.3, ELMo generally outperformed the other models on linguistic probing tasks.

• As discussed in Section 5.4, InferSent outperformed the other models in information retrieval tasks.

Neural language models (LMs) are more capable of detecting long-distance dependencies than traditional n-gram models, making them stronger models of natural language. However, it is unclear what properties of language these models encode, which prevents their use as explanatory models and makes it difficult to relate them to formal linguistic knowledge of natural language. There is increasing interest in investigating the kinds of linguistic information that LMs represent, with a strong focus on their syntactic abilities as well as semantic understanding, such as negative polarity items (NPIs). NPIs are a class of words that bear the special feature that they need to be licensed by a specific licensing context (LC). A common example of an NPI and LC in English is any and not, respectively: the sentence “He didn’t buy any books.” is correct, whereas “He did buy any books.” is not.

Do Language Models Understand Anything? On the Ability of LSTMs to Understand Negative Polarity Items (Aug 2018) discussed language models and negative polarity items, showing that the model found a relation between the licensing context and the negative polarity item and appeared to be aware of the scope of this context, which the authors extracted from a parse tree of the sentence. This research paves the way for other studies linking formal linguistics to deep learning.

[Image source. Click image to open in new window.]

Character language models have access to surface morphological patterns, but it is not clear whether or how they learn abstract morphological regularities. Indicatements that Character Language Models Learn English Morpho-syntactic Units and Regularities (Aug 2018) studied a “wordless” character language model with several probes, finding that it could develop a specific unit to identify word boundaries and, by extension, morpheme boundaries, which allowed it to capture linguistic properties and regularities of these units. Their language model proved surprisingly good at identifying the selectional restrictions of English derivational morphemes, a task that required both morphological and syntactic awareness. They concluded that, when morphemes overlap extensively with the words of a language, a character language model can perform morphological abstraction.

A morpheme is a meaningful morphological unit of a language that cannot be further divided; e.g., “incoming” consists of the morphemes “in”, “come” and “-ing”. Another example: “dogs” consists of two morphemes and one syllable: “dog”, and “-s”. A morpheme may or may not stand alone, whereas a word, by definition, is freestanding.

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation (Sep 2018) [code] showed that each embedding model captured more information than is directly apparent, yet their potential performance is limited by the impossibility of optimally surfacing divergent linguistic information at the same time. For example, in word analogy experiments they were able to achieve significant improvements over the original embeddings, yet every improvement in semantic analogies came at the cost of a degradation in syntactic analogies and vice versa. At the same time, their work showed that the effect of this phenomenon was different for unsupervised systems that directly used embedding similarities and supervised systems that use pretrained embeddings as features, as the latter had enough expressive power to learn the optimal balance themselves.
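The word-analogy experiments referred to here rest on simple vector arithmetic plus cosine similarity; a toy illustration with hand-made vectors (not real embeddings):

```python
# Toy illustration of analogy evaluation: answer "a is to b as a2 is to ?"
# by taking argmax over cosine(b - a + a2, x) among candidate words.

import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

def analogy(emb, a, b, a2):
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[a2])]
    candidates = [w for w in emb if w not in (a, b, a2)]  # query words excluded
    return max(candidates, key=lambda w: cosine(emb[w], target))

emb = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [2.0, 0.0],
    "queen": [2.0, 1.0],
    "apple": [0.0, -1.0],
}
print(analogy(emb, "man", "woman", "king"))  # -> "queen"
```

The semantic-vs-syntactic tradeoff the paper reports shows up in exactly this kind of benchmark: post-processing that improves the semantic analogies degrades the syntactic ones, and vice versa.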

Relevant to the language models domain (if not directly employed), Firearms and Tigers are Dangerous, Kitchen Knives and Zebras are Not: Testing whether Word Embeddings Can Tell (Sep 2018) presented an approach to investigating the nature of semantic information captured by word embeddings. They tested the ability of supervised classifiers (a logistic regression classifier, and a basic neural network) to identify semantic features in word embedding vectors and compared this to a feature identification method based on full vector cosine similarity. The idea behind this method was that properties identified by classifiers (but not through full vector comparison) are captured by embeddings; properties that cannot be identified by either method are not captured by embeddings. Their results provided an initial indication that semantic properties relevant to the way entities interact (e.g. dangerous) were captured, while perceptual information (e.g. colors) was not represented.

Generative adversarial networks (GANs) are a promising approach for text generation that, unlike traditional language models (LMs), does not suffer from the problem of “exposure bias”. However, a major hurdle for understanding the potential of GANs for text generation is the lack of a clear evaluation metric. Evaluating Text GANs as Language Models (Oct 2018) proposed approximating the distribution of text generated by a GAN, which permitted evaluating them with traditional probability-based LM metrics. They applied their approximation procedure on several GAN-based models, showing that they performed substantially worse than state of the art LMs. Their evaluation procedure promoted better understanding of the relation between GANs and LMs, and could accelerate progress in GAN-based text generation.
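The "traditional probability-based LM metrics" in question largely reduce to perplexity; a minimal version, given the per-token probabilities a model assigns to held-out text:

```python
# Perplexity = exp of the average negative log-probability per token.
# Lower is better; a uniform model over a 4-word vocabulary scores 4.0.

import math

def perplexity(token_probs):
    """token_probs: probability the model assigned to each held-out token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 tokens -> 4.0
```

GANs do not expose token probabilities directly, which is exactly why the paper needs an approximation of the generator's distribution before metrics like this can be applied.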

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

In a comparative review of pretrained language models, Looking for ELMo’s friends: Sentence-Level Pretraining Beyond Language Modeling (Dec 2018) [code] – by authors at New York University, Brown University, Google AI Language, Facebook, Johns Hopkins University, IBM, University of Minnesota Duluth, and Swarthmore College – conducted a large-scale systematic study comparing different language modeling pretraining tasks.

• While the primary results of the study supported the use of language modeling as a pretraining task and set a new state of the art among comparable models using multitask learning with language models, a closer look revealed worryingly strong baselines and strikingly varied results across target tasks, suggesting that the widely-used paradigm of pretraining and freezing sentence encoders may not be an ideal platform for further work.

• “We implement our models using the AllenNLP toolkit (Gardner et al., 2017), aiming to build the simplest architecture that could be reasonably expected to perform well on the target tasks under study [code]. The design of the models roughly follows that used in the GLUE baselines and ELMo.”

• Conclusions (paraphrased).

• This paper presents a systematic comparison of tasks and task-combinations for the pretraining of sentence-level BiLSTM encoders like those seen in ELMo and CoVe.

• Language modeling works well as a pretraining task, and no other single task is consistently better. Multitask pretraining can produce results better than any single task can, and sets a new state of the art among comparable models.

• However, a closer look at our results suggests that the pretrain-and-freeze paradigm that underlies ELMo and CoVe might not be a sound platform for future work: some trivial baselines do strikingly well, the margins between pretraining tasks are small, and some pretraining configurations (such as MNLI) yield better performance with less data. This suggests that we may be nearing an upper bound on the performance that can be reached with methods like these.

• In addition, different tasks benefit from different forms of pretraining to a striking degree – with correlations between target tasks often low or negative – and multitask pretraining tasks fail to reliably produce models better than their best individual components. This suggests that if truly general purpose sentence encoders are possible, our current methods cannot produce them.

Language Models:

• Improving Sentence Representations with Multi-view Frameworks

“… we present two multi-view frameworks for learning sentence representations in an unsupervised fashion. One framework uses a generative objective and the other a discriminative one. In both frameworks, the final representation is an ensemble of two views, in which one view encodes the input sentence with a recurrent neural network (RNN) and the other view encodes it with a simple linear model. We show that, after learning, the vectors produced by our multi-view frameworks provide improved representations over their single-view learned counterparts, and the combination of different views gives representational improvement over each view and demonstrates solid transferability on standard downstream tasks.”

• “… Although pre-trained sentence encoders are available in the general domain, none exists for biomedical texts to date. In this work, we introduce BioSentVec: the first open set of sentence embeddings trained with over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. We evaluate BioSentVec embeddings in two sentence pair similarity tasks in different text genres. Our benchmarking results demonstrate that the BioSentVec embeddings can better capture sentence semantics compared to the other competitive alternatives and achieve state-of-the-art performance in both tasks. We expect BioSentVec to facilitate the research and development in biomedical text mining and to complement the existing resources in biomedical word embeddings.”

• BioWordVec: biomedical word embeddings with fastText. We applied fastText to compute 200-dimensional word embeddings. We set the window size to be 20, learning rate 0.05, sampling threshold 1e-4, and negative examples 10. Both the word vectors and the model with hyperparameters are available for download below. The model file can be used to compute word vectors that are not in the dictionary (i.e. out-of-vocabulary terms).

BioSentVec [1]: biomedical sentence embeddings with sent2vec. We applied sent2vec to compute the 700-dimensional sentence embeddings. We used the bigram model and set window size to be 20 and negative examples 10.

• “RNN language models have achieved state-of-the-art results on various tasks, but what exactly they are representing about syntax is as yet unclear. Here we investigate whether RNN language models learn humanlike word order preferences in syntactic alternations. We collect language model surprisal [← sic] scores for controlled sentence stimuli exhibiting major syntactic alternations in English: heavy NP shift, particle shift, the dative alternation, and the genitive alternation. We show that RNN language models reproduce human preferences in these alternations based on NP length, animacy, and definiteness. We collect human acceptability ratings for our stimuli, in the first acceptability judgment experiment directly manipulating the predictors of syntactic alternations. We show that the RNNs’ performance is similar to the human acceptability ratings and is not matched by an n-gram baseline model. Our results show that RNNs learn the abstract features of weight, animacy, and definiteness which underlie soft constraints on syntactic alternations.”

• “In the biomedical domain, the lack of sharable datasets often limit the possibility of developing natural language processing systems, especially dialogue applications and natural language understanding models. To overcome this issue, we explore data generation using templates and terminologies and data augmentation approaches. Namely, we report our experiments using paraphrasing and word representations learned on a large EHR corpus with fastText and ELMo, to learn a NLU model without any available dataset. We evaluate on a NLU task of natural language queries in EHRs divided in slot-filling and intent classification sub-tasks. On the slot-filling task, we obtain a F-score of 0.76 with the ELMo representation; and on the classification task, a mean F-score of 0.71. Our results show that this method could be used to develop a baseline system.”
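Several of the resources above (BioWordVec, and the EHR representation experiments) rely on fastText, whose robustness to out-of-vocabulary terms comes from composing a word's vector out of its character n-grams. A minimal sketch of the n-gram extraction step (boundary markers as in fastText; the vector lookup and summation are omitted):

```python
# Why fastText-style embeddings handle out-of-vocabulary terms: a word
# vector is the sum of vectors for its character n-grams, so an unseen
# word still decomposes into subunits seen during training.

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                     # boundary markers, as in fastText
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("dog"))  # n-grams of "<dog>", plus the full token
```

For technical biomedical vocabulary, full of rare morphologically regular terms, this subword decomposition is a large part of why these embeddings transfer well.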

### Probing Hierarchical Syntax Embedded in Pretrained Language Model Layers

Here I collate and summarize/paraphrase discussion that relates to the soft hierarchical syntax captured in various layers (embeddings) in pretrained language models. Very exciting and very powerful.

• Richard Socher and colleagues [SalesForce: A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks (Jul 2017)] introduced a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. Higher layers included shortcut connections (i.e., residual layers) to lower-level task predictions to reflect linguistic hierarchies. They used a simple regularization term to allow for optimizing all model weights to improve the loss of a single task, without exhibiting catastrophic interference of the other tasks.

• “… We presented a joint many-task model to handle multiple NLP tasks with growing depth in a single end-to-end model. Our model is successively trained by considering linguistic hierarchies, directly feeding word representations into all layers, explicitly using low-level predictions, and applying successive regularization. In experiments on five NLP tasks, our single model achieves the state-of-the-art or competitive results on chunking, dependency parsing, semantic relatedness, and textual entailment.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Deep RNNs Encode Soft Hierarchical Syntax (May 2018), by Terra Blevins, Omer Levy, and Luke Zettlemoyer at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, evaluated how well a simple feedforward classifier could detect syntax features (part of speech tags as well as various levels of constituent labels) from the word representations produced by the RNN layers of deep NLP models trained on the tasks of dependency parsing, semantic role labeling, machine translation, and language modeling. They demonstrated that deep RNNs trained on NLP tasks learned internal representations that captured soft hierarchical notions of syntax across different layers of the model (i.e., the representations taken from deeper layers of the RNNs perform better on higher-level syntax tasks than those from shallower layers), without explicit supervision. These results provided some insight as to why deep RNNs are able to model NLP tasks without annotated linguistic features. ELMo, for example, represents each word using a task-specific weighted sum of the language model's hidden layers; i.e., rather than using only the top layer, ELMo selects which of the language model's internal layers contain the most relevant information for the task at hand.

• Specifically, they trained the models to predict POS tags as well as constituent labels at different depths of a parse tree. They found that all models indeed encoded a significant amount of syntax and – in particular – that language models learned some syntax.
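The ELMo-style layer mixing mentioned above – a task-specific weighted sum of the language model's layers – can be sketched in NumPy. The layer vectors, weights `w`, and scale `gamma` below are illustrative; in ELMo, `w` and `gamma` are learned per downstream task:

```python
import numpy as np

def elmo_combine(layer_reps, w, gamma):
    """Combine L biLM layer representations (each shape (dim,)) into one
    task-specific vector: gamma * sum_j softmax(w)_j * h_j."""
    w = np.asarray(w, dtype=float)
    s = np.exp(w - w.max())
    s /= s.sum()                      # softmax-normalized layer weights
    layers = np.stack(layer_reps)     # (L, dim)
    return gamma * (s[:, None] * layers).sum(axis=0)

# three hypothetical layer representations for one token
h = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
rep = elmo_combine(h, w=[0.0, 0.0, 0.0], gamma=1.0)  # equal weights
```

With equal weights each layer contributes one third, so a task that needs mostly syntactic (lower-layer) information would simply learn a larger `w` entry for those layers.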

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Retrieve-and-Read: Multi-task Learning of Information Retrieval and Reading Comprehension (Aug 2018) presented an interesting approach, “machine reading at scale” (MRS) wherein, given a question, a system retrieves passages relevant to the question from a corpus (IR: information retrieval) and then extracts the answer span from the retrieved passages (RC: reading comprehension). …

[Image source. Click image to open in new window.]

• “Our Retrieve-and-Read model is based on the bi-directional attention flow (BiDAF ) model, which is a standard RC model. As shown in Figure 2 [above] it consists of six layers: … We note that the RC component trained with single-task learning is essentially equivalent to BiDAF, except for the word embedding layer that has been modified to improve accuracy. … Note that the original BiDAF uses a pre-trained GloVe and also trains character-level embeddings by using a CNN in order to handle out-of-vocabulary (OOV) or rare words. Instead of using GloVe and CNN, our model uses fastText for the fixed pre-trained word vectors and removes character-level embeddings. The fastText model takes into account subword information and can obtain valid representations even for OOV words.”

• Much effort has been devoted to evaluating whether multitask learning can be leveraged to learn rich representations that can be used in various NLP downstream applications. However, there is still a lack of understanding of the settings in which multitask learning has a significant effect. A Hierarchical Multitask Approach for Learning Embeddings from Semantic Tasks (Nov 2018) [code | demo | media], by Sanh et al. [Sebastian Ruder], introduced a hierarchical model trained in a multitask learning setup on a set of carefully selected semantic tasks. The model was trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieved state of the art results on a number of tasks – named entity recognition, entity mention detection and relation extraction – without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induced a set of shared semantic representations at lower layers of the model. They showed that as they moved from the bottom to the top layers of the model, the hidden states of the layers tended to represent more complex semantic information.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Demo (note errors!):

[Image source. Click image to open in new window.]

[Probing Hierarchical Syntax Embedded in Pretrained Language Model Layers:]

• Gated Self-Matching Networks  (R-Net) (2017) – multilayer, end-to-end neural networks whose novelty lay in the use of a gated attention mechanism to provide different levels of importance to different parts of passages. It also used a self-matching attention for the context to aggregate evidence from the entire passage to refine the query-aware context representation obtained. The architecture contained character and word embedding layers, followed by question-passage encoding and matching layers, a passage self-matching layer and an output layer.

[Image source. Click image to open in new window.]

• Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. However, we still do not have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn.

Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis compared four objectives – language modeling, translation, skip-thought, and autoencoding – on their ability to induce syntactic and part of speech information. They made a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. They found that representations from language models consistently performed best on their syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggested that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. They also found that the representations from randomly-initialized, frozen LSTMs performed strikingly well on their syntactic auxiliary tasks, but that effect disappeared when the amount of training data for the auxiliary tasks was reduced.
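The auxiliary-prediction ("probing") setup used in these studies can be sketched as: freeze the encoder, then fit a simple classifier from its per-token representations to a syntactic label. Below is a minimal NumPy sketch in which a toy frozen embedding table stands in for an LSTM layer and a nearest-centroid classifier stands in for the usual logistic-regression probe; all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "encoder": fixed embeddings standing in for an LSTM layer's states.
# Token ids 0-3 act as "nouns" (tag 0), ids 4-7 as "verbs" (tag 1); their
# representations cluster by class, which is what the probe should detect.
base = rng.normal(size=(2, 16))
E = base[[0, 0, 0, 0, 1, 1, 1, 1]] + 0.1 * rng.normal(size=(8, 16))

def encode(token_ids):
    return E[token_ids]  # frozen: never updated by the probe

tokens = rng.integers(0, 8, size=200)       # a toy "corpus"
tags = (tokens >= 4).astype(int)            # gold POS-like labels

# Probe: a minimal classifier (nearest class centroid) on the frozen reps.
reps = encode(tokens)
centroids = np.stack([reps[tags == c].mean(axis=0) for c in (0, 1)])

def probe(token_ids):
    d = ((encode(token_ids)[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

probe_accuracy = (probe(tokens) == tags).mean()
```

High probe accuracy is then read as evidence that the frozen representations encode the syntactic distinction; comparing probes across pretraining objectives (holding the probe fixed) is exactly the comparison the paper makes.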

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Hierarchical Multitask Learning for CTC-based Speech Recognition (Jul 2018) [Summary]: “Previous work has shown that neural encoder-decoder speech recognition can be improved with hierarchical multitask learning, where auxiliary tasks are added at intermediate layers of a deep encoder. We explore the effect of hierarchical multitask learning in the context of connectionist temporal classification (CTC)-based speech recognition, and investigate several aspects of this approach. Consistent with previous work, we observe performance improvements on telephone conversational speech recognition (specifically the Eval2000 test sets) when training a subword-level CTC model with an auxiliary phone loss at an intermediate layer. We analyze the effects of a number of experimental variables (like interpolation constant and position of the auxiliary loss function), performance in lower-resource settings, and the relationship between pretraining and multitask learning. We observe that the hierarchical multitask approach improves over standard multitask training in our higher-data experiments, while in the low-resource settings standard multitask training works well. The best results are obtained by combining hierarchical multitask learning and pretraining, which improves word error rates by 3.4% absolute on the Eval2000 test sets.”

## RNN, CNN, or Self-Attention?

In the course of writing this REVIEW and in my other readings I often encountered discussions of RNN vs. CNN vs. self-attention architectures in regard to NLP and language models. Here, I collate and summarize/paraphrase some of those observations; green-colored URLs are internal hyperlinks to discussions of those items elsewhere in this REVIEW.

• Dissecting Contextual Word Embeddings: Architecture and Representation (Aug 2018) discussed contextual word representations derived from pretrained bidirectional language models (biLM), comparing three disparate network architectures: LSTM, CNN, and self-attention. In every case, the learned representations captured a rich hierarchy of contextual information throughout the layers of the network.

• Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers (Jun 2018) studied the role of linguistic context in predicting quantifiers (“few”, “all”). Overall, LSTM were the best-performing architectures, with CNN showing some potential in the handling of longer sequences.

[Image source. Click image to open in new window.]

• Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum (May 2018) discussed LSTM vs. self attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTM: LSTM are a hybrid of a simple RNN (S-RNN) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicated that LSTM suffer little to no performance loss when removing the S-RNN. This provided evidence that the gating mechanism was doing the heavy lifting in modeling context. They further ablated the recurrence in each gate and found that this incurred only a modest drop in performance, indicating that the real modeling power of LSTM stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allowed them to mathematically relate LSTM and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enabled the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.
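The "element-wise weighted sum" view can be verified numerically: unrolling the LSTM cell-state recurrence $\small c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (with $\small c_0 = 0$) expresses $\small c_T$ as a sum of the candidate values, each weighted by its input gate and all subsequent forget gates. A sketch with random stand-ins for the gate activations:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 4
i = rng.uniform(0, 1, size=(T, d))   # input gates
f = rng.uniform(0, 1, size=(T, d))   # forget gates
c_tilde = rng.normal(size=(T, d))    # candidate cell values

# Recurrent form: c_t = f_t * c_{t-1} + i_t * c_tilde_t
c = np.zeros(d)
for t in range(T):
    c = f[t] * c + i[t] * c_tilde[t]

# Equivalent closed form: a weighted sum of the candidates, where each
# weight is the input gate damped by all later forget gates.
weights = np.stack([i[t] * np.prod(f[t + 1:], axis=0) for t in range(T)])
c_sum = (weights * c_tilde).sum(axis=0)
```

The two computations agree exactly, which is the sense in which the gates "dynamically compute a weighted sum" and why visualizing the weights reveals how context is used at each timestep.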

• While RNN are a cornerstone in learning latent representations from long text sequences, a purely convolutional and deconvolutional autoencoding framework may be employed, as described in Deconvolutional Paragraph Representation Learning (Sep 2018). That paper addressed the issue that the quality of sentences during RNN-based decoding (reconstruction) decreased with the length of the text. Compared to RNN, their framework was better at reconstructing and correcting long paragraphs. Note Table 1 in their paper, which compares paragraph reconstructions by LSTM and CNN and shows the vastly superior BLEU/ROUGE scores of the convolutional framework; there is also additional NLP-related LSTM vs. CNN discussion in this Hacker News thread.

• Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition (Aug 2018) compared the use of LSTM-based and CNN-based character-level word embeddings in BiLSTM-CRF models for chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus showed that either type of character-level word embedding, used in conjunction with the BiLSTM-CRF models, led to comparable state of the art performance. However, the models using CNN-based character-level word embeddings had a computational advantage: they increased training time over word-based models by only 25%, whereas the LSTM-based character-level word embeddings more than doubled the required training time.
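A CNN-based character-level word embedding of the kind compared in that work can be sketched as a convolution over character embeddings followed by max-over-time pooling. The dimensions and random weights here are illustrative, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
char_emb = rng.normal(size=(128, 8))      # one 8-d embedding per ASCII char
filters = rng.normal(size=(16, 3, 8))     # 16 filters of character-width 3

def char_cnn_embed(word):
    """CNN-based character-level word embedding: slide width-3 filters over
    the word's character embeddings, then max-pool over positions."""
    x = char_emb[[ord(c) for c in word]]                         # (len, 8)
    windows = np.stack([x[j:j + 3] for j in range(len(x) - 2)])  # (len-2, 3, 8)
    feats = np.einsum('wce,fce->wf', windows, filters)           # (len-2, 16)
    return feats.max(axis=0)               # max over time -> (16,)

v = char_cnn_embed("naloxone")   # composes an embedding even for OOV words
```

Because the embedding is composed from character n-grams rather than looked up, rare chemical and disease names outside the word vocabulary still receive useful representations, which is the motivation for both character-level variants.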

• Recently, non-recurrent architectures (convolutional; self-attentional) have outperformed RNN in neural machine translation. CNN and self-attentional networks can connect distant words via shorter network paths than RNN, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument had not been tested empirically, nor had alternative explanations for their strong performance been explored in-depth. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (Aug 2018) hypothesized that the strong performance of CNN and self-attentional networks could be due to their ability to extract semantic features from the source text. They evaluated RNN, CNN and self-attention networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Experimental results showed that self-attentional networks and CNN did not outperform RNN in modeling subject-verb agreement over long distances, and that self-attentional networks performed distinctly better than RNN and CNN on word sense disambiguation.

[Image source. Click image to open in new window.]

• Recent advances in network architectures for neural machine translation (NMT) have effectively replaced recurrent models with either convolutional or self-attentional approaches, such as the Transformer architecture. While the main innovation of Transformer was its use of self-attentional layers, there are several other aspects – such as attention with multiple heads, and the use of many attention layers – that distinguished the model from previous baselines. How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures (2018) [code] took a fine-grained look at the different architectures for NMT. They introduced an Architecture Definition Language (ADL) allowing for a flexible combination of common building blocks. Making use of that language, they showed that one can bring recurrent and convolutional models very close to the Transformer performance by borrowing concepts from the Transformer architecture, but not using self-attention. Additionally, they found that self-attention was much more important for the encoder than for the decoder, where in most settings it could be replaced by a RNN or CNN without a loss in performance. Surprisingly, even a model without any target side self-attention performed well.

“We found that RNN based models benefit from multiple source attention mechanisms and residual feed-forward blocks. CNN based models on the other hand can be improved through layer normalization and also feed-forward blocks. These variations bring the RNN and CNN based models close to the Transformer. Furthermore, we showed that one can successfully combine architectures. We found that self-attention is much more important on the encoder side than it is on the decoder side, where even a model without self-attention performed surprisingly well. For the data sets we evaluated on, models with self-attention on the encoder side and either an RNN or CNN on the decoder side performed competitively to the Transformer model in most cases.”

• Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (Aug 2018) proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral difference between convolutional and recurrent neural networks, they generated adversarial examples to confuse the model and compared to human performance. They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation functions and formed majority-vote ensembles of the nine models with the highest validation accuracy. All the hierarchical single and ensemble models outperformed the previous state of the art on both the validation and test sets. With a test accuracy of 85.12%, the RNN-LSTM ensemble achieved a new state of the art that is more than five percentage points above the previous best result. Furthermore, the RNN-LSTM aggregation function is superior to aggregation via CNNs, improving the validation accuracy by 1.5 percentage points.

The hierarchical structure was crucial for the model’s success. Adding it to the CNN that operates only at word level caused a pronounced improvement on the validation set. It seems to be the case that the hierarchical structure helps the model to gain confidence, causing more models to make the correct prediction. In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

• The architecture proposed in *QANet* : Combining Local Convolution with Global Self-Attention for Reading Comprehension (Apr 2018) did not require RNN: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD1.1 dataset their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data.

• Likewise, a later paper, A Fully Attention-Based Information Retriever (Oct 2018), which also relied entirely on a (convolutional and/or) self-attentional model, achieved competitive results on SQuAD1.1 while having fewer parameters and being faster at both learning and inference than rival (largely RNN-based) methods. Their FABIR model was nonetheless significantly outperformed by the highly similar – and non-cited – competing QANet model.

• Another model, Reinforced Mnemonic Reader for Machine Reading Comprehension (Jun 2018), performed as well as QANet. Based on a Bi-LSTM, Reinforced Mnemonic Reader is an enhanced attention reader – suggesting perhaps that the improvements in QANet, Reinforced Mnemonic Reader, and the work described in Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (second preceding paragraph) were due to the attention mechanisms, rather than the RNN or CNN architectures.

• Likewise, Constituency Parsing with a Self-Attentive Encoder (May 2018) [code] demonstrated that replacing a LSTM encoder with a self-attentive architecture could lead to improvements to a state of the art discriminative constituency parser. The use of attention made explicit the manner in which information was propagated between different locations in the sentence; for example, separating positional and content information in the encoder led to improved parsing accuracy. They evaluated a version of their model that used ELMo as the sole lexical representation, using publicly available ELMo weights. Trained on the Penn Treebank, their parser achieved 93.55 $\small F_1$ without the use of any external data, and 95.13 $\small F_1$ when using pre-trained word representations. The gains came not only from incorporating more information (such as subword features or externally trained word representations), but also from structuring the architecture to separate different kinds of information from each other.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Transformer networks have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Jan 2019) [code] – by Quoc Le, Ruslan Salakhutdinov and colleagues – proposed a novel neural architecture, Transformer-XL, that enabled the Transformer to learn dependencies beyond a fixed length without disrupting temporal coherence. Transformer-XL learned dependencies that were about 80% longer than RNNs and 450% longer than vanilla Transformers, achieved better performance on both short and long sequences, and was up to 1,800+ times faster than the vanilla Transformer during evaluation. Transformer-XL is the first self-attention model to achieve substantially better results than RNNs on both character-level and word-level language modeling.

## LSTM, Attention and Gated (Recurrent) Units

Here I collate and summarize/paraphrase gated unit mechanism-related discussion from elsewhere in this REVIEW.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Sep 2014) by Kyunghyun Cho et al. [Yoshua Bengio] described an RNN encoder-decoder for statistical machine translation that introduced a new type of hidden unit ($\small f$ in the equation, below) – the gated recurrent unit (GRU) – which was motivated by the LSTM unit but was much simpler to compute and implement.

Recurrent Neural Networks. A recurrent neural network (RNN) is a neural network that consists of a hidden state $\small \mathbf{h}$ and an optional output $\small \mathbf{y}$ which operates on a variable-length sequence $\small \mathbf{x} = (x_1, \ldots, x_T)$. At each time step $\small t$, the hidden state $\small \mathbf{h_{\langle t \rangle}}$ of the RNN is updated by

$\small \mathbf{h_{\langle t \rangle}} = f (\mathbf{h_{\langle t-1 \rangle}}, x_t)$,
where $\small f$ is a non-linear activation function. $\small f$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).
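In code, the update above with the simplest choice of $\small f$ – an element-wise logistic sigmoid of a linear map – looks like this (a sketch; the weight shapes and dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One update h_t = f(h_{t-1}, x_t), with f an element-wise sigmoid
    of a linear map -- the simplest choice mentioned in the text."""
    return sigmoid(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W_h = rng.normal(size=(d_h, d_h))
W_x = rng.normal(size=(d_h, d_x))
b = np.zeros(d_h)

h = np.zeros(d_h)                         # h_0
for x_t in rng.normal(size=(5, d_x)):     # a length-5 input sequence
    h = rnn_step(h, x_t, W_h, W_x, b)     # h_1, ..., h_5
```

Replacing `rnn_step` with an LSTM or GRU cell changes only the form of $\small f$; the outer recurrence over the sequence is identical.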

Hidden Unit that Adaptively Remembers and Forgets. ... we also propose a new type of hidden unit ($\small f$ in the equation, above) that has been motivated by the LSTM unit but is much simpler to compute and implement. [The LSTM unit has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit.]

This figure shows the graphical depiction of the proposed hidden unit:

[Image source. Click image to open in new window.]

"In this formulation [see Section 2.3 in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation for details], when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.

"On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013). As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active. ..."

In their very highly cited paper Neural Machine Translation by Jointly Learning to Align and Translate (Sep 2014; updated May 2016), Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio employed their “gated hidden unit” (a GRU) – introduced by Cho et al. (2014) (above) – for neural machine translation. Their model consisted of a forward and backward pair of RNN (BiRNN) for the encoder, and a decoder that emulated searching through a source sentence while decoding a translation.

From Appendix A in that paper:

“For the activation function $\small f$ of an RNN, we use the gated hidden unit recently proposed by Cho et al. (2014a). The gated hidden unit is an alternative to the conventional simple units such as an element-wise $\small \text{tanh}$. This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies. This is made possible by having computation paths in the unfolded RNN for which the product of derivatives is close to 1. These paths allow gradients to flow backward easily without suffering too much from the vanishing effect. It is therefore possible to use LSTM units instead of the gated hidden unit described here, as was done in a similar context by Sutskever et al. (2014).”

Discussed in Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention (Jul 2018):

“The attention mechanism has been a breakthrough in neural machine translation (NMT) in recent years. This mechanism calculates how much attention the network should give to each source word to generate a specific translated word. The context vector calculated by the attention mechanism mimics the syntactic skeleton of the input sentence precisely given a sufficient number of examples. Recent work suggests that incorporating explicit syntax alleviates the burden of modeling grammatical understanding and semantic knowledge from the model.”

GRUs have fewer parameters than LSTMs, as they lack an output gate. A GRU has two gates (an update gate and a reset gate), while an LSTM has three gates (input, forget and output gates). The GRU update gate decides how much information from the past should be let through, while the reset gate decides how much information from the past should be discarded. What motivates this? Although RNNs can theoretically capture long-term dependencies, they are actually very hard to train to do so [see this discussion]. GRUs are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Even though a GRU is computationally more efficient than an LSTM network due to its reduced number of gates, it still comes second to the LSTM network in terms of performance. GRUs are therefore often used when we need to train faster and don’t have much computational power.
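The parameter saving is easy to quantify: an LSTM cell has four input/hidden transforms (three gates plus the candidate cell), while a GRU has only three (two gates plus the candidate). A back-of-envelope sketch with arbitrary, illustrative dimensions:

```python
def gated_unit_params(d_x, d_h, n_mats):
    """Parameter count for n_mats gate/candidate transforms, each mapping
    [x_t; h_{t-1}] to d_h units, plus a bias per transform."""
    return n_mats * (d_h * (d_x + d_h) + d_h)

lstm = gated_unit_params(d_x=300, d_h=512, n_mats=4)  # i, f, o gates + candidate
gru = gated_unit_params(d_x=300, d_h=512, n_mats=3)   # r, z gates + candidate
saving = 1 - gru / lstm   # GRU uses one fewer transform -> 25% fewer parameters
```

The 3:4 ratio holds regardless of the input and hidden sizes, which is why the GRU's speed advantage is consistent across model scales.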

Counting in Language with RNNs (Oct 2018) examined a possible reason for the LSTM outperforming the GRU on language modeling and, more specifically, machine translation. They hypothesized that this had to do with counting – a consistent theme across the literature on long-term dependence, counting, and language modeling for RNNs. Using simplified forms of language – context-free and context-sensitive languages – they showed how the LSTM performs counting via its cell states during inference, and why the GRU cannot perform as well.

“As argued in the Introduction, we believe there is a lot of evidence supporting the claim that success at language modeling requires an ability to count. Since there is empirical support for the fact that the LSTM outperforms the GRU in language related tasks, we believe that our results showing how fundamental this inability to count is for the GRU, we believe we make a contribution to the study of both RNNs and their success on language related tasks. Our experiments along with the other recent paper by Weiss et al. [2017], show almost beyond reasonable doubt that the GRU is not able to count as well as the LSTM, furthering our hypothesis that there is a correlation between success at performance on language related tasks and the ability to count.”

Germane to this subsection (“LSTM, Attention and Gated (Recurrent) Units”) is the excellent companion blog post to When Recurrent Models Don't Need To Be Recurrent, in which coauthor John Miller discusses a very interesting paper by Dauphin et al., Language Modeling with Gated Convolutional Networks (Sep 2017). Some highlights from that paper:

• “Gating has been shown to be essential for recurrent neural networks to reach state-of-the-art performance. Our gated linear units reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities (Section 5.2). We show that gated convolutional networks outperform other recently published language models such as LSTMs trained in a similar setting on the Google Billion Word Benchmark (Chelba et al., 2013). …

• “Gating mechanisms control the path through which information flows in the network and have proven to be useful for recurrent neural networks. LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep. In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers. We show this mechanism to be useful for language modeling as it allows the model to select which words or features are relevant for predicting the next word. …

• “Gated linear units are a simplified gating mechanism based on the work of Dauphin & Grangier [Predicting distributions with Linearizing Belief Networks (Nov 2015; updated May 2016)] for non-deterministic gates that reduce the vanishing gradient problem by having linear units coupled to the gates. This retains the non-linear capabilities of the layer while allowing the gradient to propagate through the linear unit without scaling. … We compare the different gating schemes experimentally in Section 5.2 and we find gated linear units allow for faster convergence to better perplexities.”

• “The unlimited context offered by recurrent models is not strictly necessary for language modeling.”

“In other words, it’s possible you don’t need a large amount of context to do well on the prediction task on average. Recent theoretical work offers some evidence in favor of this view (Prediction with a Short Memory). Another explanation is given by Bai et al. (An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling):

• “The ‘infinite memory’ advantage of RNNs is largely absent in practice.”

As Bai et al. report, even in experiments explicitly requiring long-term context, RNN variants were unable to learn long sequences. On the Billion Word Benchmark, an intriguing Google Technical Report suggests an LSTM $\small n$-gram model with $\small n=13$ words of memory is as good as an LSTM with arbitrary context (N-gram Language Modeling using Recurrent Neural Network Estimation). This evidence leads us to conjecture:

• “Recurrent models trained in practice are effectively feedforward.”

This could happen either because truncated backpropagation through time cannot learn patterns significantly longer than $\small k$ steps, or, more provocatively, because models trainable by gradient descent cannot have long-term memory.
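The gated linear unit from Dauphin et al., quoted above, gates a linear transform with a sigmoid "output gate": $\small h(X) = (XW + b) \otimes \sigma(XV + c)$. A minimal per-position sketch follows; note that in the paper the transforms are convolutions over the sequence, whereas here a position-wise linear map with illustrative dimensions stands in:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, b, V, c):
    """Gated linear unit: a linear path (x @ W + b) gated element-wise by
    sigmoid(x @ V + c). The linear half lets gradients pass without
    scaling; only the gate is non-linear."""
    return (x @ W + b) * sigmoid(x @ V + c)

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
W, V = rng.normal(size=(d_in, d_out)), rng.normal(size=(d_in, d_out))
b, c = np.zeros(d_out), np.zeros(d_out)

y = glu(rng.normal(size=(10, d_in)), W, b, V, c)   # 10 positions at once
```

Because the gradient through the `(x @ W + b)` term is not squashed by any saturating non-linearity, stacking many such layers avoids the vanishing-gradient problem that motivates forget gates in recurrent networks.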

Gated-Attention Readers for Text Comprehension (Jun 2016; updated Apr 2017) [Theano code | updated (TensorFlow) code | updated (PyTorch) code], by Ruslan Salakhutdinov and colleagues, employed the attention mechanism introduced by Yoshua Bengio and colleagues (Neural Machine Translation by Jointly Learning to Align and Translate) in their model, the Gated-Attention Reader (GA Reader). The GA Reader integrated a multi-hop architecture with a novel attention mechanism, which was based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enabled the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtained state of the art results on three benchmarks for this task. The effectiveness of multiplicative interaction was demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention.

“Deep learning models have been shown to outperform traditional shallow approaches on text comprehension tasks. The success of many recent models can be attributed primarily to two factors:

1. Multi-hop architectures allow a model to scan the document and the question iteratively for multiple passes.
2. Attention mechanisms, borrowed from the machine translation literature, allow the model to focus on appropriate subparts of the context document.

Intuitively, the multi-hop architecture allows the reader to incrementally refine token representations, and the attention mechanism re-weights different parts in the document according to their relevance to the query.

… In this paper, we focus on combining both in a complementary manner, by designing a novel attention mechanism which gates the evolving token representations across hops. … More specifically, unlike existing models where the query attention is applied either token-wise or sentence-wise to allow weighted aggregation, the Gated-Attention module proposed in this work allows the query to directly interact with each dimension of the token embeddings at the semantic-level, and is applied layer-wise as information filters during the multi-hop representation learning process. Such a fine-grained attention enables our model to learn conditional token representations with respect to the given question, leading to accurate answer selections.”
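The multiplicative interaction described in that passage can be sketched as follows: each document token vector attends over the query token vectors, and the resulting query summary gates the token representation element-wise. This is an illustrative NumPy sketch of the idea for a single hop, not the authors' exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gated_attention(D, Q):
    """For each document token d_i: attend over the query tokens Q, form a
    token-specific query summary, then gate d_i by element-wise product,
    so the query interacts with every dimension of the token embedding."""
    X = np.empty_like(D)
    for i, d_i in enumerate(D):
        alpha = softmax(Q @ d_i)    # attention weights over query tokens
        q_tilde = Q.T @ alpha       # query summary tailored to this token
        X[i] = d_i * q_tilde        # multiplicative gating
    return X

rng = np.random.default_rng(0)
D = rng.normal(size=(7, 16))   # 7 document token vectors
Q = rng.normal(size=(3, 16))   # 3 query token vectors
X = gated_attention(D, Q)      # query-aware token representations
```

In the multi-hop setting, `X` would be fed through another recurrent layer and the gating applied again, progressively filtering the document representation through the lens of the query.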

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

A recent review, Comparative Analysis of Neural QA Models on SQuAD (Jun 2018), reported that models based on a gated attention mechanism (R-Net) or a GRU (DocQA) performed well across a variety of tasks.

Gated Self-Matching Networks (R-Net) – proposed by Wang et al. (2017) [code] – were multilayer, end-to-end neural networks whose novelty lay in the use of a gated attention mechanism to assign different levels of importance to different parts of passages. R-Net also used self-matching attention over the context to aggregate evidence from the entire passage and refine the query-aware context representation. The architecture contained character and word embedding layers, followed by question-passage encoding and matching layers, a passage self-matching layer, and an output layer.

[Image source. Click image to open in new window.]

“… we present the gated self-matching networks for reading comprehension style question answering, which aims to answer questions from a given passage. We first match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation. Then we propose a self-matching attention mechanism to refine the representation by matching the passage against itself, which effectively encodes information from the whole passage. We finally employ the pointer networks to locate the positions of answers from the passages. We conduct extensive experiments on the SQuAD dataset. The single model achieves 71.3% on the evaluation metrics of exact match on the hidden test set, while the ensemble model further boosts the results to 75.9%. At the time of submission of the paper, our model holds the first place on the SQuAD Leaderboard for both single and ensemble model.”

• “We choose to use Gated Recurrent Unit (GRU) (Cho et al., 2014) in our experiment since it performs similarly to LSTM (Hochreiter and Schmidhuber, 1997) but is computationally cheaper. … We propose a gated attention-based recurrent network to incorporate question information into passage representation. It is a variant of attention-based recurrent networks, with an additional gate to determine the importance of information in the passage regarding a question.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
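The GRU cell that R-Net chose over the LSTM (per the quote above) can be sketched as a minimal single-step NumPy implementation of Cho et al.'s formulation; weight shapes are toy values and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step (Cho et al., 2014), biases omitted."""
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde          # interpolate old/new state

rng = np.random.default_rng(0)
dx, dh = 4, 6                                 # toy input/hidden sizes
params = [rng.standard_normal(s) * 0.1
          for s in [(dx, dh), (dh, dh)] * 3]  # Wz,Uz, Wr,Ur, Wh,Uh
h = np.zeros(dh)
for t in range(3):                            # run a short sequence
    h = gru_cell(rng.standard_normal(dx), h, *params)
print(h.shape)  # (6,)
```

With two gates instead of the LSTM's three (and no separate cell state), the GRU has fewer parameters, which is the "computationally cheaper" trade-off the authors cite.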

Facebook AI Research recently (May 2018) developed a seq2seq-based self-attention mechanism to model long-range context (Hierarchical Neural Story Generation), demonstrated via story generation. They found that standard seq2seq models applied to hierarchical story generation were prone to degenerating into language models that paid little attention to the writing prompt (a problem noted in other domains, such as dialogue response generation). They tackled the challenges of story-telling with a hierarchical model, which first generated a sentence called “the prompt” (describing the topic for the story), and then conditioned on this prompt when generating the story. Conditioning on the prompt or premise made it easier to generate consistent stories, because it provided grounding for the overall plot; it also reduced the tendency of standard sequence models to drift off topic. To improve the relevance of the generated story to its prompt, they adopted a GRU-based fusion mechanism, which pretrains a language model and subsequently trains a seq2seq model with a gating mechanism that learns to leverage the final hidden layer of the language model during seq2seq training. The model showed, for the first time, that fusion mechanisms could help seq2seq models build dependencies between their input and output.

• The gated self-attention mechanism allowed the model to condition on its previous outputs at different time-scales (i.e., to model long-range context).

• Similar to Google’s Transformer, Facebook AI Research used multi-head attention to allow each head to attend to information at different positions. However, the queries, keys and values in their model were not given by linear projections (see Section 3.2.2 in the Transformer paper), but by more expressive gated deep neural nets with gated linear unit activations: gating lent the self-attention mechanism crucial capacity to make fine-grained selections.
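The gated linear unit (GLU) activation mentioned in the bullet above can be sketched as follows (a minimal NumPy version with hypothetical shapes; in the actual model GLUs sit inside the deep nets that produce queries, keys and values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W, V, b, c):
    """Gated linear unit (Dauphin et al.): one linear projection
    carries the data, a second, sigmoid-squashed projection gates
    it element-wise."""
    return (x @ W + b) * sigmoid(x @ V + c)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))        # (batch, features)
W, V = rng.standard_normal((2, 8, 4))  # two (8, 4) projections
b, c = np.zeros(4), np.zeros(4)
y = glu(x, W, V, b, c)
print(y.shape)  # (2, 4)
```

Because the gate is bounded in (0, 1), each output dimension can be selectively attenuated, which is the "fine-grained selection" capacity described above.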

Researchers at Peking University (Junyang Lin et al.) recently developed a model that employed a Bi-LSTM decoder in a text summarization task [Global Encoding for Abstractive Summarization (Jun 2018)]. Their approach differed from a similar approach [not cited] by Richard Socher and colleagues at Salesforce in that Lin et al. fed their encoder output at each time step into a convolutional gated unit, which, with a self-attention mechanism, allowed the encoder output at each time step to become a new representation vector with further connection to the global source-side information. Self-attention encouraged the model to learn long-term dependencies without adding much computational complexity. The gate (based on the output of the CNN and self-attention module over the source representations from the RNN encoder) could perform global encoding on the encoder outputs. Based on the output of the CNN and self-attention, the logistic sigmoid function output a vector of values between 0 and 1 at each dimension: if a value was close to 0, the gate removed most of the information at the corresponding dimension of the source representation; if it was close to 1, it retained most of the information. The model thus performed neural abstractive summarization through a global encoding framework, which controlled the information flow from the encoder to the decoder based on the global information of the source context, generating summaries of higher quality while reducing repetition.
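A hedged NumPy sketch of this global-encoding gate: the real model uses a trained multi-filter CNN and scaled dot-product self-attention, whereas here a single toy width-3 convolution with a hypothetical weight matrix `Wc` stands in for both stages feeding the sigmoid gate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_encoding_gate(enc, Wc):
    """Toy global encoding: width-3 convolution over encoder states,
    self-attention over the conv output, then a sigmoid gate that
    filters each dimension of the original encoder output."""
    T, h = enc.shape
    padded = np.pad(enc, ((1, 1), (0, 0)))           # same-length conv
    conv = np.stack([padded[t:t + 3].reshape(-1) @ Wc
                     for t in range(T)])             # (T, h)
    attn = softmax(conv @ conv.T, axis=1) @ conv     # self-attention
    g = sigmoid(attn)                                # gate values in (0, 1)
    return g * enc                                   # filtered encoder output

rng = np.random.default_rng(0)
enc = rng.standard_normal((7, 16))                   # (time, hidden)
Wc = rng.standard_normal((3 * 16, 16)) * 0.1
out = global_encoding_gate(enc, Wc)
print(out.shape)  # (7, 16)
```

Dimensions whose gate value is near 0 are suppressed and those near 1 pass through, matching the filtering behavior described in the paragraph above.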

In October 2018 Myeongjun Jang and Pilsung Kang at Korea University presented Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition, which introduced their P-thought  model. P-thought employed a seq2seq structure with a gated recurrent unit (GRU) cell. The encoder transformed the sequence of words from an input sentence into a fixed-sized representation vector, whereas the decoder generated the target sentence based on the given sentence representation vector. The P-thought model had two decoders: when the input sentence was given, the first decoder, named “auto-decoder,” generated the input sentence as-is. The second decoder, named “paraphrase-decoder,” generated the paraphrase sentence of the input sentence.

Biomedical event extraction is a crucial task in biomedical text mining. As the primary forum for international evaluation of different biomedical event extraction technologies, the BioNLP Shared Task represents a trend in biomedical text mining toward fine-grained information extraction. The 2016 BioNLP Shared Task (BioNLP-ST 2016) proposed three tasks, in which the “Bacteria Biotope” (BB) event extraction task was added to the previous BioNLP-ST. Biomedical event extraction based on GRU integrating attention mechanism (Aug 2018) proposed a novel gated recurrent unit network framework (integrating an attention mechanism) for extracting biomedical events between biotopes and bacteria from the biomedical literature, utilizing the corpus from the BioNLP-ST 2016 Bacteria Biotope task. The experimental results showed that the presented approach could achieve an $\small F$-score of 57.42% in the test set, outperforming previous state of the art official submissions to BioNLP-ST 2016.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

LSTM, Attention and Gated (Recurrent) Units:

Question answering (QA), the identification of short, accurate answers to users' questions posed in natural language, has numerous applications in the biomedical and clinical sciences including directed search, interactive learning and discovery, clinical decision support, and recommendation. Due to the large size of the biomedical literature and a lack of efficient search strategies, researchers and medical practitioners often struggle to obtain the information necessary for their needs. Moreover, even the most sophisticated search engines are not intelligent enough to interpret clinicians' questions. Thus, there is an urgent need for information retrieval systems that accept queries in natural language and return accurate answers quickly and efficiently.

Carnegie Mellon University/Google Brain’s QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (Apr 2018) [OpenReview; discussion/TensorFlow implementation; code] proposed a method (QANet) that did not require RNNs: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD dataset (SQuAD1.1: see the leaderboard), their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing the authors to train their model with much more data.

• Note that A Fully Attention-Based Information Retriever (Oct 2018) – which failed to cite the earlier, more performant QANet work, which scores much higher on the SQuAD1.1 Leaderboard – also employed an entirely convolutional and/or self-attention architecture, which performed satisfactorily on the SQuAD1.1 dataset and was faster to train than RNN-based approaches.

[Image source. Click image to open in new window.]

[Image sources: Table 6; Table 3. Click image to open in new window.]
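The depthwise separable convolutions QANet uses for local interactions can be sketched as follows (a minimal NumPy version with illustrative shapes; the real encoder blocks wrap such layers with layer normalization, residual connections, self-attention and feed-forward sublayers):

```python
import numpy as np

def depthwise_separable_conv1d(x, depth_w, point_w):
    """Depthwise separable convolution, as used in QANet's encoder
    blocks: a per-channel filter followed by a 1x1 pointwise mix.
    x: (T, C); depth_w: (k, C); point_w: (C, C_out)."""
    k, _ = depth_w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))     # keep sequence length
    T = x.shape[0]
    depth = np.stack([(xp[t:t + k] * depth_w).sum(axis=0)
                      for t in range(T)])    # per-channel filtering
    return depth @ point_w                   # pointwise channel mixing

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))             # 10 positions, 8 channels
dw = rng.standard_normal((3, 8)) * 0.2       # kernel width 3
pw = rng.standard_normal((8, 8)) * 0.2
y = depthwise_separable_conv1d(x, dw, pw)
print(y.shape)  # (10, 8)
```

Splitting the convolution this way uses k·C + C·C_out parameters instead of k·C·C_out, part of what makes the architecture cheap enough to replace recurrence.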

Another model, Reinforced Mnemonic Reader for Machine Reading Comprehension (May 2017; updated Jun 2018) [non-author implementations: MnemonicReader | MRC | MRC-models] performed as well as QANet, outperforming previous systems by over 6% in terms of both Exact Match (EM) and $\small F_1$ metrics on two adversarial SQuAD datasets. Reinforced Mnemonic Reader, based on Bi-LSTM, is an enhanced attention reader with two main contributions: (i) a reattention mechanism, introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures, and (ii) a dynamic-critical reinforcement learning approach, to address the convergence suppression problem that exists in traditional reinforcement learning methods.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

In April 2018 IBM Research introduced a new dataset for reading comprehension (DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension) [project; data].  DuoRC is a large scale reading comprehension (RC) dataset of 186K human-generated QA pairs created from 7,680 pairs of parallel movie plots taken from Wikipedia and IMDb. By design, DuoRC ensures very little or no lexical overlap between the questions created from one version and the segments containing answers in the other version. Essentially, this is a paraphrase dataset, which should be very useful for training reading comprehension models. For example, the authors observed that state of the art neural reading comprehension models that achieved near-human performance on the SQuAD dataset exhibited very poor performance on the DuoRC dataset ($\small F_1$ scores of 37.42% on DuoRC vs. 86% on SQuAD), opening research avenues in which DuoRC could complement other RC datasets and encourage the exploration of novel neural approaches to language understanding.

DuoRC might be a useful dataset for training sentence embedding approaches to natural language tasks such as machine translation, document classification, sentiment analysis, etc. In this regard, note that the Conclusions section in Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition stated: “The main limitation of the current work is that there are insufficient paraphrase sentences for training the models. ”

[Image source. Click image to open in new window.]

In a very thorough and thoughtful analysis, Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (Aug 2018) [code] proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral differences between convolutional and recurrent neural networks, the authors generated adversarial examples to confuse the model and compared its performance to that of humans.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Highlights from this work are [substantially] paraphrased here:

• They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation function and formed majority-vote ensembles of the nine models with the highest validation accuracy.

• All the hierarchical single and ensemble models outperformed the previous state of the art on both the validation and test sets. With a test accuracy of 85.12, the RNN-LSTM ensemble achieved a new state of the art that is more than five percentage points above the previous best result. Furthermore, the RNN-LSTM aggregation function is superior to aggregation via CNNs, improving the validation accuracy by 1.5 percentage points.

• The hierarchical structure was crucial for the model’s success. Adding it to the CNN that operates only at word level caused a pronounced improvement on the validation set. It seems to be the case that the hierarchical structure helps the model to gain confidence, causing more models to make the correct prediction.

• The sentence attention allowed them to get more insight into the models’ inner state. For example, it allowed them to check whether the model actually focused on relevant sentences in order to answer the questions. Both model variants [CNN; RNN-LSTM] paid most attention to the relevant plot sentences for 70% of the cases. Identifying the relevant sentences was an important success factor: relevant sentences were ranked highest only in 35% of the incorrectly solved questions.

• Textual entailment was required to solve 60% of the questions …

• The process of elimination and heuristics proved essential to solve 44% of the questions …

• Referential knowledge was presumed in 36% of the questions …

• Furthermore, it was apparent that many questions expected a combination of various reasoning skills.

• In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

• Finally, their intensive analysis of the differences between model and human inference suggested that both models learn matching patterns to select the right answer rather than performing plausible inferences as humans do. The results of these studies also imply that other human-like processing mechanisms – such as referential relations, implicit real-world knowledge (i.e., entailment), and answer-by-elimination via ranking plausibility (Hummel and Holyoak, 2005) – should be integrated into the system to further advance machine reading comprehension.

Collectively, those publications indicate the difficulty in achieving robust reading comprehension, and the need to develop new models that understand language more precisely. Addressing this challenge will require employing more difficult datasets (like SQuAD2.0) for various tasks, evaluation metrics that can distinguish real intelligent behavior from shallow pattern matching, a better understanding of the response to adversarial attack, and the development of more sophisticated models that understand language at a deeper level.

• The need for more challenging datasets was echoed in the “Creating harder datasets” subsection in Sebastian Ruder’s ACL 2018 Highlights summary.

In order to evaluate under such settings, more challenging datasets need to be created. Yejin Choi argued during the RepL4NLP panel discussion (a summary can be found here) that the community pays a lot of attention to easier tasks such as SQuAD or bAbI, which are close to solved. Yoav Goldberg even went so far as to say that “SQuAD is the MNIST of NLP ”.

Instead, we should focus on solving harder tasks and develop more datasets with increasing levels of difficulty. If a dataset is too hard, people don’t work on it. In particular, the community should not work on datasets for too long as datasets are getting solved very fast these days; creating novel and more challenging datasets is thus even more important. Two datasets that seek to go beyond SQuAD for reading comprehension were presented at the conference:

Richard Socher also stressed the importance of training and evaluating a model across multiple tasks during his talk during the Machine Reading for Question Answering workshop. In particular, he argues that NLP requires many types of reasoning, e.g. logical, linguistic, emotional, etc., which cannot all be satisfied by a single task.

• Read + Verify: Machine Reading Comprehension with Unanswerable Questions (Sep 2018) proposed a novel read-then-verify system that combined a base neural reader with a sentence-level answer verifier trained to (further) validate whether the predicted answer was entailed by the input snippets. They also augmented their base reader with two auxiliary losses to better handle answer extraction and no-answer detection, respectively, and investigated three different architectures for the answer verifier. On the SQuAD2.0 dataset their system achieved an $\small F_1$ score of 74.8 on the development set (ca. August 2018), outperforming the previous best published model by more than 7 points, and the best reported model by ~3.5 points (2018-08-20: SQuAD2.0 Leaderboard).

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Facebook AI Research’s bAbI, a set of 20 tasks for testing text understanding and reasoning described in detail in the paper by Jason Weston et al., Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks (Dec 2015).

• University of Pennsylvania’s MultiRC: Reading Comprehension over Multiple Sentences [project (2018); code], a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. The goal of this dataset is to encourage the research community to explore approaches that can do more than sophisticated lexical-level matching.

• Allen Institute for Artificial Intelligence (AI2)’s Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (Mar 2018) [project; code], which presented a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI (Stanford Natural Language Inference Corpus). As noted in their Conclusions:

“Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods. To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, and whose Challenge partition is hard for retrieval and co-occurence methods. We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD. Progress on ARC would thus be an impressive achievement, given its design, and be significant step forward for the community.”

• ARC was recently used in Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Scientific Question Answering (Oct 2018) by authors at UC San Diego and Microsoft AI Research. Existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. In this paper, the authors proposed a retriever-reader model that learned (via self-attention layers) to attend on essential terms during the question answering process: an essential-term-aware “retriever” first identified the most important words in a question, then reformulated the query and searched for related evidence, while an enhanced “reader” distinguished between essential terms and distracting words to predict the answer. On the ARC dataset their model outperformed the existing state of the art [e.g., BiDAF] by 8.1%.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Among the many approaches to QA applied to textual sources, the attentional long short-term memory (LSTM)-based and Bi-LSTM, memory-based implementations of Richard Socher (Salesforce) are particularly impressive:

• Ask Me Anything: Dynamic Memory Networks for Natural Language Processing (Jun 2015; updated Mar 2016) by Richard Socher (MetaMind) introduced the Dynamic Memory Network (DMN), a neural network architecture that processed input sequences and questions, formed episodic memories, and generated relevant answers. Questions triggered an iterative attention process that allowed the model to condition its attention on the inputs and the result of previous iterations. These results were then reasoned over in a hierarchical recurrent sequence model to generate answers. [For a good overview of the DMN approach, see slides 39-47 in Neural Architectures with Memory.]

[Image source. Click image to open in new window.]

• Based on analysis of the DMN (above), in 2016 Richard Socher/MetaMind (later acquired by Salesforce) proposed several improvements to the DMN memory and input modules. Their DMN+ model (Dynamic Memory Networks for Visual and Textual Question Answering (Mar 2016) [discussion]) improved the state of the art on visual and text question answering datasets, without supporting fact supervision. Non-author DMN+ code available on GitHub includes Theano (Improved-Dynamic-Memory-Networks-DMN-plus) and TensorFlow (Dynamic-Memory-Networks-in-TensorFlow) implementations.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Later in 2016, Dynamic Coattention Networks for Question Answering (Nov 2016; updated Mar 2018) [non-author code, on SQuAD2.0] by Richard Socher and colleagues at SalesForce introduced the Dynamic Coattention Network (DCN) for QA. The DCN first fused co-dependent representations of the question and the document in order to focus on relevant parts of both; a dynamic pointing decoder then iterated over potential answer spans. This iterative procedure enabled the model to recover from initial local maxima that corresponded to incorrect answers. On the Stanford question answering dataset, a single DCN model improved the previous state of the art from 71.0% $\small F_1$ to 75.9%, while a DCN ensemble obtained an 80.4% $\small F_1$ score.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• Efficient and Robust Question Answering from Minimal Context Over Documents (May 2018) studied the minimal context required to answer a question and found that most questions in existing datasets could be answered with a small set of sentences. The authors (Socher and colleagues) proposed a simple sentence selector to select the minimal set of sentences to feed into the QA model, which allowed the system to achieve significant reductions in training (up to 15 times) and inference times (up to 13 times), with accuracy comparable to or better than the state of the art on SQuAD, NewsQA, TriviaQA and SQuAD-Open. Furthermore, the approach was more robust to adversarial inputs.

Note the sentence selector in Fig. 2(a):

“For each QA model, we experiment with three types of inputs. First, we use the full document (FULL). Next, we give the model the oracle sentence containing the groundtruth answer span (ORACLE). Finally, we select sentences using our sentence selector (MINIMAL), using both $\small \text{Top k}$ and $\small \text{Dyn}$. We also compare this last method with TF-IDF method for sentence selection, which selects sentences using n-gram TF-IDF distance between each sentence and the question.”
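The n-gram TF-IDF baseline mentioned in the quote can be sketched in plain Python. This is a unigram-only, illustrative version; `select_sentences` is a hypothetical helper (not from the paper's code) that ranks sentences by TF-IDF cosine similarity to the question:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Unigram TF-IDF vectors for a small corpus of token lists."""
    n = len(texts)
    df = Counter(w for t in texts for w in set(t))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: c * idf[w] for w, c in Counter(t).items()} for t in texts]

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_sentences(question, sentences, k=1):
    """Return indices of the k sentences most similar to the question."""
    vecs = tfidf_vectors([question] + sentences)
    q, sents = vecs[0], vecs[1:]
    ranked = sorted(range(len(sents)), key=lambda i: -cosine(q, sents[i]))
    return ranked[:k]

doc = [["the", "cat", "sat", "on", "the", "mat"],
       ["paris", "is", "the", "capital", "of", "france"],
       ["dogs", "chase", "cats"]]
print(select_sentences(["what", "is", "the", "capital", "of", "france"], doc))
# → [1]
```

The paper's learned sentence selector replaces this fixed similarity with a trained scoring model, which is what allows it to beat the TF-IDF baseline while feeding far fewer sentences to the QA model.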

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

In a significant body of work, The Natural Language Decathlon: Multitask Learning as Question Answering (Jun 2018) [codeproject], Richard Socher and colleagues at SalesForce presented a NLP challenge spanning 10 tasks,

• question answering
• machine translation
• summarization
• natural language inference
• sentiment analysis
• semantic role labeling
• zero-shot relation extraction
• goal-oriented dialogue
• semantic parsing
• commonsense pronoun resolution

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• While multilayer, pretrained language models (e.g. OpenAI GPT, ELMo, BERT, …) now exist, the decaNLP MQAN model used shallow-trained, 300-dimensional GloVe embeddings trained on Common Crawl (words without corresponding GloVe embeddings were assigned zero vectors). They concatenated 100-dimensional character $\small n$-gram embeddings to the GloVe embeddings, giving 400 dimensions.
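This input representation can be sketched as follows (toy dimensions and vocabularies; the real model uses 300-d GloVe plus 100-d character n-gram embeddings, for 400 dimensions total):

```python
import numpy as np

def embed(tokens, glove, char_ngram, d_word=6, d_char=2):
    """Concatenate word vectors with character n-gram vectors.
    Words missing from the (toy) GloVe table get zero vectors,
    mirroring decaNLP's handling of OOV words."""
    rows = []
    for tok in tokens:
        w = glove.get(tok, np.zeros(d_word))       # zero vector if OOV
        c = char_ngram.get(tok, np.zeros(d_char))
        rows.append(np.concatenate([w, c]))
    return np.stack(rows)

# hypothetical toy lookup tables
glove = {"the": np.ones(6), "cat": np.full(6, 2.0)}
chars = {"the": np.ones(2), "cat": np.ones(2), "zyzzyva": np.full(2, 3.0)}
X = embed(["the", "cat", "zyzzyva"], glove, chars)
print(X.shape)          # (3, 8)
print(X[2][:6].sum())   # 0.0  (OOV word -> zero GloVe part)
```

Even for an OOV word, the character n-gram half of the vector can still carry signal, which is the point of the concatenation.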

Understandably, Socher’s work has generated much interest among NLP and ML practitioners, leading to the acquisition of his startup, MetaMind, by Salesforce for $32.8 million in 2016 (Salesforce Reveals It Spent $75 Million on the Three Startups It Bought Last Quarter | Salesforce just bought a machine learning startup that was backed by its CEO Marc Benioff). While those authors will not release the code (per a comment by Richard Socher on reddit), using the search term “1506.07285” there appear to be four repositories on GitHub that attempt to implement his Ask Me Anything: Dynamic Memory Networks for Natural Language Processing model, while a GitHub search for “dynamic memory networks” or “DMN+” returns numerous repositories.

The MemN2N architecture was introduced by Jason Weston (Facebook AI Research) in his highly-cited End-To-End Memory Networks paper [code; non-author code here, here and here; discussion here and here]. MemN2N, a recurrent attention model over a possibly large external memory, was trained end-to-end and hence required significantly less supervision during training, making it more generally applicable in realistic settings. The flexibility of the MemN2N model allowed the authors to apply it to tasks as diverse as synthetic question answering (QA) and language modeling (LM). For QA the approach was competitive with memory networks but with less supervision; for LM the approach demonstrated performance comparable to RNNs and LSTMs on the Penn Treebank and Text8 datasets.

While Weston’s MemN2N model was surpassed (accuracy and tasks completed) on the bAbI English 10k dataset by Socher’s DMN+ – see the “E2E” (End to End) and DMN+ columns in the DMN+ paper – code is available (links above) for the MemN2N model.

A concern with high-performing but heavily engineered systems is domain specificity: how well do those models transfer to other applications? I encountered this issue in my preliminary work (not shown), where I carefully examined the Turku Event Extraction System [TEES 2.2: Biomedical Event Extraction for Diverse Corpora (2015)]. TEES performed well, but was heavily engineered to perform well in the various BioNLP Challenge tasks in which it participated. Likewise, a June 2018 comment in the AllenNLP GitHub repository, regarding end-to-end memory networks, is of interest:

• “Why are you guys not using Dynamic Memory Networks in any of your QA solutions?”

I’m not a huge fan of the models called “memory networks” – in general they are too tuned to a completely artificial task, and they don’t work well on real data. I implemented the end-to-end memory network, for instance, and it has three separate embedding layers (which is absolutely absurd if you want to apply it to real data).

@DeNeutoy implemented the DMN+. It’s not as egregious as the E2EMN [end-to-end memory network], but still, I’d look at actual papers, not blogs, when deciding what methods actually work. E.g., are there any memory networks on the SQuAD Leaderboard (https://rajpurkar.github.io/SQuAD-explorer/)? On the TriviaQA leaderboard? On the leaderboard of any recent, popular dataset?

To be fair, more recent “memory networks” have modified their architectures so they’re a lot more similar to things like the gated attention reader, which has actually performed well on real data. But, it sure seems like no one is using them to accomplish state of the art QA on real data these days.”

I believe that the “gated attention reader” mentioned in that comment (above) refers to Gated-Attention Readers for Text Comprehension (Jun 2016; updated Apr 2017) by Ruslan Salakhutdinov and colleagues.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• “Our Retrieve-and-Read model is based on the bi-directional attention flow (BiDAF) model, which is a standard RC model. As shown in Figure 2 [above] it consists of six layers: … We note that the RC component trained with single-task learning is essentially equivalent to BiDAF, except for the word embedding layer that has been modified to improve accuracy. … Note that the original BiDAF uses a pre-trained GloVe and also trains character-level embeddings by using a CNN in order to handle out-of-vocabulary (OOV) or rare words. Instead of using GloVe and CNN, our model uses fastText for the fixed pre-trained word vectors and removes character-level embeddings. The fastText model takes into account subword information and can obtain valid representations even for OOV words.”

In 2016 the Allen Institute for Artificial Intelligence introduced the Bi-Directional Attention Flow (BiDAF) framework (Bidirectional Attention Flow for Machine Comprehension (Nov 2016; updated Jun 2018) [project; code; demo]). BiDAF was a multi-stage hierarchical process that represented context at different levels of granularity and used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization.
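The bidirectional attention flow at the heart of BiDAF can be sketched as follows (a simplified NumPy version that substitutes a plain dot product for BiDAF's trilinear similarity function, so shapes and scoring are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U):
    """H: (T, d) context states; U: (J, d) query states.
    Computes context-to-query (C2Q) and query-to-context (Q2C)
    attention from a shared similarity matrix."""
    S = H @ U.T                            # (T, J) similarity matrix
    c2q = softmax(S, axis=1) @ U           # C2Q: per-token query summary
    b = softmax(S.max(axis=1))             # Q2C weights over context: (T,)
    q2c = np.tile(b @ H, (H.shape[0], 1))  # (T, d), tiled across time
    # query-aware context representation: [H; c2q; H*c2q; H*q2c]
    return np.concatenate([H, c2q, H * c2q, H * q2c], axis=1)

rng = np.random.default_rng(0)
G = bidaf_attention(rng.standard_normal((6, 4)), rng.standard_normal((3, 4)))
print(G.shape)  # (6, 16)
```

Because attention flows in both directions from one similarity matrix, every context position keeps its full (unsummarized) representation alongside its query-conditioned views, which is the "without early summarization" property noted above.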

BiDAF was subsequently used in QA4IE: A Question Answering based Framework for Information Extraction (Apr 2018) [note; code], a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, along with a knowledge base (Wikipedia Ontology) for entity recognition. QA4IE processed entire documents as a whole, rather than separately processing individual sentences. Because QA4IE was designed to produce sequence answers in IE settings, QA4IE was outperformed by BiDAF on the SQuAD dataset ( in QA4IE). Conversely, QA4IE outperformed QA systems – including BiDAF – across 6 datasets in IE settings ( in QA4IE).

BiDAF:

[BiDAF. Image source. Click image to open in new window.]

QA4IE:

[QA4IE. Image source. Click image to open in new window.]

[QA4IE. Image source. Click image to open in new window.]

[QA4IE. Image source. Click image to open in new window.]

• A major difference between question answering (QA) settings and information extraction settings is that in QA settings each query corresponds to an answer, while in the QA4IE framework the QA model takes a candidate entity-relation (or entity-property) pair as the query and it needs to tell whether an answer to the query can be found in the input text.

In other work relating to Bi-LSTM-based question answering, IBM Research and IBM Watson published Improved Neural Relation Detection for Knowledge Base Question Answering (May 2017), which focused on relation detection via deep residual Bi-LSTM networks to compare questions and relation names. The approach broke the relation names into word sequences for question-relation matching, built both relation-level and word-level relation representations, used deep BiLSTMs to learn different levels of question representations in order to match the different levels of relation information, and finally used a residual learning method for sequence matching. This made the model easier to train and resulted in more abstract (deeper) question representations, improving hierarchical matching. Several unofficial implementations are available on GitHub (machine-comprehension; machine-reading-comprehension; and most recently, MSMARCO).

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Making Neural QA as Simple as Possible but Not Simpler (Mar 2017; updated Jun 2017) introduced FastQA, a simple, context/type matching heuristic for extractive question answering. The paper posited that two simple ingredients are necessary for building a competitive QA system: (i) awareness of the question words while processing the context, and (ii) a composition function (such as recurrent neural networks) which goes beyond simple bag-of-words modeling. In follow-on work, these authors applied FastQA to the biomedical domain (Neural Domain Adaptation for Biomedical Question Answering [code]). Their system – which did not rely on domain-specific ontologies, parsers or entity taggers – achieved state of the art results on factoid questions, and competitive results on list questions.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
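FastQA's first ingredient – awareness of the question words – boils down to simple word-in-question features attached to each context token before the recurrent composition function runs. A minimal sketch of the binary variant (the paper also uses a weighted variant, omitted here):

```python
def word_in_question_features(context_tokens, question_tokens):
    """Binary word-in-question feature: 1.0 if the (lowercased) context
    token also appears in the question, else 0.0. FastQA appends this
    feature to each context token's embedding."""
    qset = {t.lower() for t in question_tokens}
    return [1.0 if t.lower() in qset else 0.0 for t in context_tokens]
```

For the question "What is the capital of France?" over the context "The capital of France is Paris", every token except "Paris" fires the feature, which is exactly the signal the answer extractor needs.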

A recent review, Comparative Analysis of Neural QA Models on SQuAD (Jun 2018), reported that models based on a gated attention mechanism (R-Net), or a GRU (DocQA), performed well across a variety of tasks.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Likewise (regarding evidence-based answering), textual entailment with neural attention methods could also be applied; for example, as described in DeepMind’s Reasoning about Entailment with Neural Attention.

[Image source. Click image to open in new window.]

In March 2018 Studio Ousia published a question answering model, Studio Ousia’s Quiz Bowl Question Answering System [slides; media]. The embedding approach described in that paper was very impressive, with the ability to “reason” over passages such as the one shown in Table 1 [presented in the summary images, below]. Trained on their Wikipedia2Vec (Wikipedia) pretrained word embeddings, this model very convincingly won the Human-Computer Question Answering Competition (HCQA) at NIPS 2017, scoring more than double the combined human team score (465 to 200 points). As Studio Ousia is a commercial entity, no code was released.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Final: Human: 200 : Computer: 465 points. Click image to open in new window.]

In June 2018 Studio Ousia and colleagues at the Nara Institute of Science and Technology, RIKEN AIP, and Keio University published Representation Learning of Entities and Documents from Knowledge Base Descriptions [code], which described TextEnt, a neural network model that learned distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, they trained their model to predict the entity that the document described, and to map the document and its target entity close to each other in a continuous vector space. Their model, trained on a large number of documents extracted from Wikipedia, was evaluated on two tasks: (i) fine-grained entity typing, and (ii) multiclass text classification. The results demonstrated that their model achieved state-of-the-art performance on both tasks.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Based on the model architectures (above/below), it appears that Studio Ousia’s Quiz Bowl system builds on their TextEnt work.

[Image source 1; Image source 2. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Interpreting phrases such as “Who did what to whom?” is a major focus in natural language understanding, specifically semantic role labeling. I Know What You Want: Semantic Learning for Text Comprehension (Sep 2018) employed semantic role labeling to enhance text comprehension and natural language inference by specifying verbal arguments and their corresponding semantic roles. Embeddings were enriched with semantic role labels, giving more fine-grained semantics: the salient labels could be conveniently added to existing models, significantly improving deep learning models on challenging text comprehension tasks. The paper proposed an easy, feasible scheme for integrating semantic role labeling information into neural models; experiments on benchmark machine reading comprehension and inference datasets verified that the proposed semantic learning yielded significant improvements over state-of-the-art baseline models. [“We will make our code and source publicly available soon.” Not available, 2018-10-16.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

ELMo word embeddings were employed. The dimension of the semantic role label embedding was a critical hyperparameter: too high a dimension caused severe overfitting, while too low a dimension caused underfitting. A 5-dimensional semantic role label embedding gave the best performance on both the SNLI and SQuAD datasets.
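The enrichment itself is just a concatenation of each word vector with a small label embedding. The label inventory and random initialization below are illustrative assumptions (the paper uses PropBank-style labels and trains the embeddings jointly with the model):

```python
import numpy as np

# Hypothetical role inventory for illustration only.
SRL_LABELS = ["ARG0", "ARG1", "V", "O"]
SRL_DIM = 5   # the 5-dimensional label embedding reported to work best

rng = np.random.default_rng(0)
srl_table = {lab: rng.normal(size=SRL_DIM) for lab in SRL_LABELS}

def enrich_with_srl(word_vectors, srl_tags):
    """Concatenate each (ELMo-style) word vector with the embedding of
    its semantic role label, yielding d + SRL_DIM features per token."""
    return np.stack([np.concatenate([v, srl_table[t]])
                     for v, t in zip(word_vectors, srl_tags)])
```

With 1024-dimensional ELMo vectors this adds only 5 extra dimensions per token, which is why the labels can be "conveniently added to existing models".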

As in I Know What You Want: Semantic Learning for Text Comprehension, ELMo embeddings were also used by Ouchi et al. in A Span Selection Model for Semantic Role Labeling (Oct 2018) [code] in a Bi-LSTM, span-based model that employed an IOB/BIO tagging approach. Typically, in this approach, models first identify candidate argument spans (argument identification) and then classify each span into one of the semantic role labels (argument classification). In related recent work (Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling), He et al. (2018) also proposed a span-based SRL model similar to that of Ouchi et al. While He et al. also used a Bi-LSTM to induce span representations in an end-to-end fashion, a main difference was that He et al. modeled $\small P(r | i,j)$ whereas Ouchi et al. modeled $\small P(i,j | r)$. In other words, He et al.’s model selected an appropriate label for each span (label selection), while Ouchi et al.’s model selected appropriate spans for each label (span selection).
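The distinction between $\small P(r | i,j)$ and $\small P(i,j | r)$ is just a change in the axis over which the span-label score matrix is normalized. A toy sketch (the scores are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def label_selection(scores):
    """He et al. style, P(r | i,j): normalise over labels for each span.
    scores: (n_spans, n_labels) compatibility matrix."""
    return np.stack([softmax(row) for row in scores])

def span_selection(scores):
    """Ouchi et al. style, P(i,j | r): normalise over spans for each label."""
    return np.stack([softmax(col) for col in scores.T]).T
```

Same scores, two different competitions: in the first, labels compete for each span; in the second, spans compete for each label.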

Ouchi et al.:

[Image source. Click image to open in new window.]

He et al.:

[Image source. Click image to open in new window.]

Model/results – Ouchi et al.:

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Model/results – He et al.:

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Question Answering by Reasoning Across Documents with Graph Convolutional Networks (Aug 2018) introduced a method (Entity-GCN) which reasoned over information spread within and across documents, framing question answering as an inference problem on a graph representing the document collection. Their approach differed from BiDAF and FastQA, which merely concatenate all documents into a single long text and train a standard reading comprehension model.

Machine reading comprehension with unanswerable questions is a new challenging task for natural language processing. A key subtask is to reliably predict whether the question is unanswerable. U-Net: Machine Reading Comprehension with Unanswerable Questions (Oct 2018) proposed a unified model (U-Net) with three important components: answer pointer, no-answer pointer, and answer verifier. They introduced a universal node and thus processed the question and its context passage as a single contiguous sequence of tokens. The universal node encoded the fused information from both the question and passage, and played an important role in predicting whether the question was answerable. Different from other state of the art pipeline models, U-Net could be learned in an end-to-end fashion. Experimental results on the SQuAD2.0 dataset showed that U-Net could effectively predict the unanswerability of questions, achieving an $\small F_1$ score of 71.7 on SQuAD2.0.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

“Our model achieves an $\small F_1$ score of 74.0 and an EM score of 70.3 on the development set, and an $\small F_1$ score of 72.6 and an EM score of 69.2 on Test set 1, as shown in Table 2. Our model outperforms most of the previous approaches. Comparing to the best-performing systems, our model has a simple architecture and is an end-to-end model. In fact, among all the end-to-end models, we achieve the best $\small F_1$ scores. We believe that the performance of the U-Net can be boosted with an additional post-processing step to verify answers using approaches such as (Hu et al. 2018).”
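U-Net's input construction can be sketched as follows. The exact placement of the universal node and the use of segment ids are my reading of the paper, so treat both as assumptions for illustration:

```python
def unet_sequence(question, passage, universal="<U>"):
    """Build the single contiguous token sequence U-Net encodes.
    The universal node sits between question and passage and later fuses
    information from both sides for the answerability decision.
    Returns tokens plus segment ids (0 = question, 1 = universal, 2 = passage)."""
    tokens = question + [universal] + passage
    segments = [0] * len(question) + [1] + [2] * len(passage)
    return tokens, segments
```

Because the universal node is encoded jointly with both segments, its final hidden state is a natural feature for the no-answer pointer and answer verifier.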

Text embeddings representing natural language documents in a semantic vector space can be used for document retrieval via nearest neighbor lookup. Text Embeddings for Retrieval From a Large Knowledge Base (Oct 2018; Christian Szegedy at Google Inc. and authors at the University of Arkansas) studied the feasibility of neural models specialized for retrieval in a semantically meaningful way, suggesting the use of SQuAD in an open-domain question answering context where the first task was to find paragraphs useful for answering a given question. They first compared the quality of various text-embedding methods for retrieval, giving empirical comparisons of various non-augmented base embeddings with and without IDF weighting. Training deep residual neural models specifically for retrieval purposes yielded significant gains when used to augment existing embeddings, and deeper models proved superior for this task. The best baseline embeddings augmented by their learned neural approach improved the top-1 paragraph recall of the system by 14%.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
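The IDF-weighted baseline embedding and top-1 nearest-neighbour lookup that the paper augments can be sketched in pure Python (toy word vectors; the learned residual re-embedding network is not reproduced here):

```python
import math
from collections import Counter

def idf_weights(corpus_tokens):
    """IDF over a list of tokenized documents: log(N / df)."""
    n = len(corpus_tokens)
    df = Counter(t for doc in corpus_tokens for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def embed(doc, vectors, idf):
    """IDF-weighted average of word vectors for one document."""
    dim = len(next(iter(vectors.values())))
    acc, total = [0.0] * dim, 0.0
    for t in doc:
        if t in vectors:
            w = idf.get(t, 0.0)
            total += w
            for i, x in enumerate(vectors[t]):
                acc[i] += w * x
    return [x / total for x in acc] if total else acc

def top1(query_vec, doc_vecs):
    """Nearest-neighbour (cosine) paragraph lookup - the retrieval step."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return num / (na * nb) if na and nb else 0.0
    return max(range(len(doc_vecs)), key=lambda i: cos(query_vec, doc_vecs[i]))
```

The paper's contribution is, roughly, a learned transformation applied on top of such base embeddings so that the nearest neighbour of a question embedding is more often the paragraph containing its answer.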

Improving Machine Reading Comprehension with General Reading Strategies (Oct 2018) proposed three simple domain-independent strategies aimed to improve non-extractive machine reading comprehension (MRC):

• BACK AND FORTH READING, which considers both the original and reverse order of an input sequence,
• HIGHLIGHTING, which adds a trainable embedding to the text embedding of tokens that are relevant to the question and candidate answers, and
• SELF-ASSESSMENT, which generates practice questions and candidate answers directly from the text in an unsupervised manner.
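The first two strategies can be illustrated mechanically; how each output feeds into the pre-trained language model is model-specific and omitted here:

```python
def back_and_forth_inputs(tokens):
    """BACK AND FORTH READING: present both the original and the reversed
    token order; the model's predictions over the two passes are combined
    (the combination rule is model-specific and not shown)."""
    return tokens, list(reversed(tokens))

def highlight_ids(context_tokens, question_and_options):
    """HIGHLIGHTING: flag context tokens that also occur in the question or
    candidate answers; each flagged token receives an extra trainable
    embedding (represented here only by the 0/1 flag)."""
    relevant = {t.lower() for t in question_and_options}
    return [1 if t.lower() in relevant else 0 for t in context_tokens]
```

SELF-ASSESSMENT is a data-generation procedure rather than an input transformation, so it does not reduce to a comparable one-liner.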

“By fine-tuning a pre-trained language model (Radford et al., 2018) [OpenAI’s GPT: Generative Pre-Trained Transformer] with our proposed strategies on the largest existing general domain multiple-choice MRC dataset RACE, we obtain a 5.8% absolute increase in accuracy over the previous best result achieved by the same pre-trained model fine-tuned on RACE without the use of strategies. We further fine-tune the resulting model on a target task, leading to new state-of-the-art results on six representative non-extractive MRC datasets from different domains (i.e., ARC, OpenBookQA, MCTest, MultiRC, SemEval-2018, and ROCStories). These results indicate the effectiveness of the proposed strategies and the versatility and general applicability of our fine-tuned models that incorporate these strategies.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Machine reading comprehension (MRC) with multiple-choice questions requires the machine to read a given passage and select the correct answer among several candidates. Convolutional Spatial Attention Model for Reading Comprehension with Multiple-Choice Questions (Nov 2018) proposed a novel approach, Convolutional Spatial Attention (CSA), to better handle MRC with multiple-choice questions. The proposed model fully extracted the mutual information among the passage, the question, and the candidates to form enriched representations. Furthermore, to merge the various attention results, they used convolutional operations to dynamically summarize the attention values within different sizes of regions. Experimental results showed substantial improvements over various state of the art systems on both the RACE and SemEval-2018 Task 11 datasets.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

A Deep Cascade Model for Multi-Document Reading Comprehension (Nov 2018) developed a novel deep cascade learning model, which progressively evolved from document-level and paragraph-level ranking of candidate texts to more precise answer extraction with machine reading comprehension. Irrelevant documents and paragraphs were first filtered out with simple functions for efficiency. Three modules were then jointly trained on the remaining texts to better track the answer: document extraction, paragraph extraction, and answer extraction. Experimental results showed that the proposed method outperformed previous state of the art methods on two large-scale multi-document benchmark datasets, TriviaQA and DuReader; their online system could stably serve millions of daily requests in less than 50 ms.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
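The cascade idea – cheap filters first, expensive reader last – can be sketched generically. The scoring functions below are hypothetical placeholders, not the paper's ranking features:

```python
def cascade(docs, question, doc_filter, para_filter, reader,
            k_docs=3, k_paras=5):
    """Cascade sketch: inexpensive ranking functions prune documents and
    then paragraphs before the costly reading-comprehension module runs.
    docs: list of documents, each a list of paragraph strings.
    doc_filter/para_filter/reader: caller-supplied scoring/answering
    functions (hypothetical signatures for this sketch)."""
    # Stage 1: keep only the top-scoring documents.
    docs = sorted(docs, key=lambda d: doc_filter(d, question), reverse=True)[:k_docs]
    # Stage 2: keep only the top-scoring paragraphs across those documents.
    paras = [p for d in docs for p in d]
    paras = sorted(paras, key=lambda p: para_filter(p, question), reverse=True)[:k_paras]
    # Stage 3: run the expensive answer-extraction model on what remains.
    return reader(paras, question)
```

The latency claim in the paper (millions of requests under 50 ms) rests on stages 1 and 2 being cheap relative to stage 3.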

Commonsense for Generative Multi-Hop Question Answering Tasks (Sep 2018, updated Jan 2019) [code] focused on a challenging multi-hop generative task (NarrativeQA), which required the model to reason, gather, and synthesize disjoint pieces of information within the context to generate an answer. This type of multi-step (multi-hop) reasoning also often requires understanding implicit relations, which humans resolve via external, background commonsense knowledge. They first presented a strong generative baseline that used a multi-attention mechanism to perform multiple hops of reasoning and a pointer-generator decoder to synthesize the answer. This model substantially outperformed previous generative models, and was competitive with current state of the art span prediction models. They next introduced a novel system for selecting grounded multi-hop relational commonsense information from ConceptNet via a pointwise mutual information and term-frequency based scoring function. Finally, they effectively used this extracted commonsense information to fill in gaps of reasoning between context hops, using a selectively-gated attention mechanism. This boosted the model’s performance significantly (verified by human evaluation), establishing a new state of the art for the task. They also showed promising initial results of the generalizability of their background knowledge enhancements by demonstrating some improvement on QAngaroo-WikiHop, another multi-hop reasoning dataset.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]
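The PMI part of the commonsense-selection scoring function follows the standard definition; how the paper combines it with term frequency to rank ConceptNet paths is not reproduced here:

```python
import math

def pmi(count_x, count_y, count_xy, total):
    """Pointwise mutual information from co-occurrence counts:
    PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ].
    Positive when x and y co-occur more often than chance."""
    if count_xy == 0:
        return float("-inf")
    return math.log((count_xy / total) / ((count_x / total) * (count_y / total)))
```

A candidate ConceptNet relation whose endpoints co-occur with the context words more often than independence predicts scores above zero, which is the basic signal used to ground the selected commonsense facts.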

• “In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task which describes more realistic QA problems compared to other recent tasks. We mainly focus on two related issues with analysis of TQA dataset. First, it requires to comprehend long lessons to extract knowledge. To tackle this issue of extracting knowledge features from long lessons, we establish knowledge graph from texts and incorporate graph convolutional network (GCN). Second, scientific terms are not spread over the chapters and data splits in TQA dataset. To overcome this so called ‘out-of-domain’ issue, we add novel unsupervised text learning process without any annotations before learning QA problems. The experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both methods of incorporating GCN for extracting knowledge from long lessons and our newly proposed unsupervised learning process are meaningful to solve this problem.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• “Despite the great success of word embedding, sentence embedding remains a not-well-solved problem. In this paper, we present a supervised learning framework to exploit sentence embedding for the medical question answering task. The learning framework consists of two main parts: (1) a sentence embedding producing module, and (2) a scoring module. The former is developed with contextual self-attention and multi-scale techniques to encode a sentence into an embedding tensor. This module is shortly called Contextual self-Attention Multi-scale Sentence Embedding (CAMSE). The latter employs two scoring strategies: Semantic Matching Scoring (SMS) and Semantic Association Scoring (SAS). SMS measures similarity while SAS captures association between sentence pairs: a medical question concatenated with a candidate choice, and a piece of corresponding supportive evidence. The proposed framework is examined by two Medical Question Answering (MedicalQA) datasets which are collected from real-world applications: medical exam and clinical diagnosis based on electronic medical records (EMR). The comparison results show that our proposed framework achieved significant improvements compared to competitive baseline approaches. Additionally, a series of controlled experiments are also conducted to illustrate that the multi-scale strategy and the contextual self-attention layer play important roles for producing effective sentence embedding, and the two kinds of scoring strategies are highly complementary to each other for question answering problems.”

• Coarse-grain Fine-grain Coattention Network for Multi-evidence Question Answering (Jan 2019), by Richard Socher et al.

“Any system which performs goal-directed continual learning must not only learn incrementally but process and absorb information incrementally. Such a system also has to understand when its goals have been achieved. In this paper, we consider these issues in the context of question answering. Current state-of-the-art question answering models reason over an entire passage, not incrementally. As we will show, naive approaches to incremental reading, such as restriction to unidirectional language models in the model, perform poorly. We present extensions to the DocQA model to allow incremental reading without loss of accuracy. The model also jointly learns to provide the best answer given the text that is seen so far and predict whether this best-so-far answer is sufficient.”

“… we propose a model that reads and comprehends text incrementally. As a testbed for our approach, we have chosen the question answering task. We aim to build a model that can learn incrementally from text, where the learning goal is to answer a given question. In standard question answering, we do not care how the context is presented to the model, and for the models that achieve state of the art results, e.g. [11, 2], they process the full context before making any decisions. We show that it is possible to modify these models to be incremental while achieving similar performance. Having an incremental model, allows us to employ an early stopping strategy where the model avoids reading the rest of the text as soon as it reaches a state where it thinks it has the answer.”

“We will open source the code for reproducing the experiments.”

### Multi-Hop Reasoning (Multi-Step Inference)

Here I collate material relating to multi-hop mechanisms (multi-hop/multi-step reasoning and inference).

Most reading comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but at the time there existed no resources to train and test this capability. Constructing Datasets for Multi-hop Reading Comprehension Across Documents (Oct 2017; updated Jun 2018) proposed a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In their task, a model learned to seek and combine evidence – effectively performing multi-hop (alias multi-step) inference. They devised a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains were induced, and they identified potential pitfalls and devised circumvention strategies. They evaluated two previously proposed competitive models and found that one could integrate information across documents. However, both models struggled to select relevant information, as providing documents guaranteed to be relevant greatly improved their performance. While the models outperformed several strong baselines, their best accuracy reached 42.9% compared to human performance at 74.0% – leaving ample room for improvement.

[Image source. Click image to open in new window.]

• Project: “We have created two new Reading Comprehension datasets focussing on multi-hop (alias multi-step) inference. Several pieces of information often jointly imply another fact. In multi-hop inference, a new fact is derived by combining facts via a chain of multiple steps. Our aim is to build Reading Comprehension methods that perform multi-hop inference on text, where individual facts are spread out across different documents. The two QAngaroo datasets provide a training and evaluation resource for such methods.

QAngaroo focuses on reading comprehension that requires gathering several pieces of information via multiple steps of inference. “We define a novel RC [reading comprehension] task in which a model should learn to answer queries by combining evidence stated across documents. We introduce a methodology to induce datasets for this task and derive two datasets.”

[Image source. Click image to open in new window.]

• Datasets:

• WikiHop. The first of the two datasets is open-domain and based on Wikipedia articles; the goal is to recover Wikidata information by hopping through documents. WikiHop uses sets of Wikipedia articles where answers to queries about specific properties of an entity cannot be located in the entity’s article. The example on the right shows the relevant documents leading to the correct answer for the query shown at the bottom.

[Image source. Click image to open in new window.]

• MedHop. With the same format as WikiHop, this dataset is based on research paper abstracts from PubMed, and the queries are about interactions between pairs of drugs. The correct answer has to be inferred by combining information from a chain of reactions of drugs and proteins. In MedHop the goal is to establish drug-drug interactions based on scientific findings about drugs and proteins and their interactions, found across multiple Medline abstracts.

[Image source. Click image to open in new window.]

Gated-Attention Readers for Text Comprehension (Jun 2016; updated Apr 2017) [Theano code; updated (TensorFlow) code; updated (PyTorch) code], by Ruslan Salakhutdinov and colleagues, presented their Gated-Attention (GA) Reader model. The GA Reader integrated a multi-hop architecture with a novel attention mechanism based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enabled the reader to build query-specific representations of tokens in the document for accurate answer selection.

“Deep learning models have been shown to outperform traditional shallow approaches on text comprehension tasks. The success of many recent models can be attributed primarily to two factors:

1. Multi-hop architectures allow a model to scan the document and the question iteratively for multiple passes.
2. Attention mechanisms, borrowed from the machine translation literature, allow the model to focus on appropriate subparts of the context document.

Intuitively, the multi-hop architecture allows the reader to incrementally refine token representations, and the attention mechanism re-weights different parts in the document according to their relevance to the query.”

[Image source. Click image to open in new window.]
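The multiplicative interaction at the heart of one Gated-Attention hop is compact enough to sketch directly (token and query states are assumed to come from BiGRU encoders, which are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_layer(D, Q):
    """One Gated-Attention hop: each document token state is gated by an
    element-wise product with its own attention-weighted summary of the
    query. D: (T, d) document token states; Q: (J, d) query token states."""
    alpha = softmax(D @ Q.T, axis=1)   # per-token attention over query words
    return D * (alpha @ Q)             # multiplicative gating
```

Stacking several such layers, each followed by a fresh BiGRU pass over the gated states, gives the multi-hop architecture described above.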

Variational Reasoning for Question Answering with Knowledge Graph (Nov 2017) provided a unified deep learning architecture and an end-to-end variational learning algorithm that could simultaneously handle noise in questions, while learning multi-hop reasoning. …

DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning (Jul 2018) studied the problem of learning to reason in large scale knowledge graphs. They described a novel reinforcement learning framework for learning multi-hop relational paths, using a policy-based agent with continuous states based on knowledge graph embeddings, which reasoned in a knowledge graph vector space by sampling the most promising relation to extend its path. … Their method outperformed a path-ranking based algorithm and knowledge graph embedding methods on Freebase and Never-Ending Language Learning datasets.

[Image source. Click image to open in new window.]

Richard Socher and colleagues at Salesforce employed a multi-hop reasoning approach to question answering over incomplete knowledge graphs (Multi-Hop Knowledge Graph Reasoning with Reward Shaping (Sep 2018) [code]). …

[Image source. Click image to open in new window.]

• While embedding-based approaches (upper portion of that table, above; e.g., ConvE) performed better, ConvE hasn’t been used (to my knowledge) for question answering. In their paper, Socher and colleagues note the following.

“We find embedding based models perform strongly on several datasets, achieving overall best evaluation metrics on UMLS, Kinship, FB15k-237 and NELL-995 despite their simplicity. While previous path based approaches achieve comparable performance on some of the datasets (WN18RR, NELL-995, and UMLS), the performance gaps to the embedding based models on the other datasets (Kinship and FB15k-237) are considerable (9.1 and 14.2 absolute points respectively).

“A possible reason for this is that embedding based methods map every link in the KG into the same embedding space, which implicitly encodes the connectivity of the whole graph. In contrast, path based models use the discrete representation of a KG as input, and therefore have to leave out a significant proportion of the combinatorial path space by selection. For some path based approaches, computation cost is a bottleneck. In particular, Neural LP and NTP-λ failed to scale to the larger datasets and their results are omitted from the table, as Das et al. (2018) reported.

“Ours is the first multi-hop reasoning approach which is consistently comparable or better than embedding based approaches on all five datasets.”

Comment: their multi-hop model performed reasonably well, though not quite as well as the embedding-based approaches. Note also that Socher et al. employed a reinforcement learning approach, which may be more difficult to train.

Most research in reading comprehension has focused on answering questions based on individual documents or even single paragraphs. Question Answering by Reasoning Across Documents with Graph Convolutional Networks (Aug 2018) by Cao et al. introduced a method (Entity-GCN) which integrated and reasoned over information spread within and across multiple documents. They framed this task as an inference problem on a graph, with mentions of entities as nodes and edges encoding relations between different mentions. Graph convolutional networks (GCN) were applied to these graphs and trained to perform multi-step reasoning: each step of the algorithm (also referred to as a hop) updates all node representations in parallel. Multi-hop reading comprehension focuses on one type of factoid question, where a system needs to properly integrate multiple pieces of evidence to correctly answer a question. …

[Image source. Click image to open in new window.]
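A single parallel node update (one "hop") of a vanilla GCN can be sketched as follows; Entity-GCN's relation-specific transforms and gating are omitted, so treat this as the simplest instance of the idea:

```python
import numpy as np

def gcn_hop(H, A, W):
    """One GCN hop: every node updates in parallel from its neighbours.
    H: (N, d) node states; A: (N, N) adjacency matrix (row-normalised
    below for simplicity); W: (d, d) weight matrix."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                  # isolated nodes receive no messages
    msg = (A / deg) @ H @ W              # aggregate transformed neighbour states
    return np.maximum(H + msg, 0.0)      # residual update + ReLU
```

Because the update is a single matrix expression over all nodes, every mention representation is refreshed simultaneously at each hop, which is what "updates all node representations in parallel" means operationally.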

Related to the work described in Question Answering by Reasoning Across Documents with Graph Convolutional Networks (above), Exploring Graph-structured Passage Representation for Multi-hop Reading Comprehension with Graph Neural Networks (Sep 2018) also employed a multi-hop reading comprehension approach on the WikiHop dataset. … In this work, the authors introduced a new method for better connecting global evidence, which formed more complex graphs than DAGs. To perform evidence integration on these graphs, they investigated two recent graph neural networks: graph convolutional networks (GCN) and graph recurrent networks (GRN). After obtaining representation vectors for the question and for entity mentions in passages, an additive attention model (Bahdanau et al., 2015) was adopted, treating the entity mention representations as the memory and the question representation as the query. Word embeddings were initialized from the 300-dimensional pretrained GloVe word embeddings.
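The additive attention step they adopt follows Bahdanau et al.'s standard form, with mention representations as the memory and the question as the query (parameter shapes below are illustrative):

```python
import numpy as np

def additive_attention(query, memory, W_q, W_m, v):
    """Bahdanau-style additive attention:
    score_i = v . tanh(W_q q + W_m m_i), softmax over the memory.
    query: (d_q,); memory: (N, d_m); W_q: (h, d_q); W_m: (h, d_m); v: (h,).
    Returns attention weights (N,) and the attended context vector (d_m,)."""
    scores = np.tanh(query @ W_q.T + memory @ W_m.T) @ v   # (N,)
    e = np.exp(scores - scores.max())
    w = e / e.sum()
    return w, w @ memory
```

The attention weights over entity mentions double as the model's answer distribution in this setting, since each candidate answer corresponds to a set of mentions.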

Graph convolutional networks (GCN) have been able to achieve state of the art results in the task of node classification; however, since GCN relies on the localized first-order approximations of spectral graph convolutions, it is unable to capture higher-order interactions between nodes in the graph. Higher-order Graph Convolutional Networks (Sep 2018) proposed a graph attention model called Motif Convolutional Networks (MCN), which generalized past approaches by using weighted multi-hop motif adjacency matrices to capture higher-order neighborhoods. A novel attention mechanism was used, allowing each individual node to select the most relevant neighborhood to apply its filter. …

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Most existing knowledge graph completion methods either focus on the positional relationship between entity pair and single relation (1-hop path) in semantic space, or concentrate on the joint probability of random walks on multi-hop paths among entities. However, they do not fully consider the intrinsic relationships of all the links among entities. By observing that the single relation and multi-hop paths between the same entity pair generally contain shared/similar semantic information, Hierarchical Attention Networks for Knowledge Base Completion via Joint Adversarial Training (Oct 2018) proposed a novel method for KB completion, which captured the features shared by different data sources utilizing hierarchical attention networks (HAN) and adversarial training (AT). … The AT mechanism encouraged their model to extract features that were both discriminative for missing relation prediction, and shareable between single relation and multi-hop paths.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Exploiting Explicit Paths for Multi-hop Reading Comprehension (Nov 2018) focused on the task of multi-hop reading comprehension where a system was required to reason over a chain of multiple facts, distributed across multiple passages, to answer a question. Inspired by graph-based reasoning, they presented a path-based reasoning approach for textual reading comprehension, which operated by generating potential paths across multiple passages, extracting implicit relations along this path, and composing them to encode each path. The proposed model achieved a 2.3% gain on the WikiHop Dev set as compared to previous state of the art, and was also able to explain its reasoning through explicit paths of sentences.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Commonsense for Generative Multi-Hop Question Answering Tasks (Sep 2018, updated Jan 2019) [code] focused on a challenging multi-hop generative task (NarrativeQA), which required the model to reason, gather, and synthesize disjoint pieces of information within the context to generate an answer. This type of multi-step (multi-hop) reasoning also often requires understanding implicit relations, which humans resolve via external, background commonsense knowledge. …

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Coarse-grain Fine-grain Coattention Network for Multi-evidence Question Answering (Jan 2019) [OpenReview], by Richard Socher and colleagues at Salesforce, proposed the Coarse-grain Fine-grain Coattention Network (CFC), a new question answering model that combined information from evidence across multiple documents. CFC consisted of a coarse-grain module that interpreted documents with respect to the query then found a relevant answer, and a fine-grain module which scored each candidate answer by comparing its occurrences across all of the documents with the query. They designed those modules using hierarchies of coattention and self-attention, which learned to emphasize different parts of the input. On the Qangaroo WikiHop multi-evidence question answering task, CFC obtained [Sep 2018; surpassed Nov 2018] a new state of the art result of 70.6% on the blind test set, outperforming the previous best by 3% accuracy despite not using pretrained contextual encoders.

[Image source. Click image to open in new window.]

Recently, progress has been made towards improving relational reasoning in machine learning. Among existing models, graph neural networks (GNNs) are one of the most effective approaches for multi-hop relational reasoning, which is indispensable in many natural language processing tasks such as relation extraction. Graph Neural Networks with Generated Parameters for Relation Extraction (Feb 2019) [OpenReview] proposed to generate the parameters of graph neural networks (GP-GNNs) according to natural language sentences, which enabled GNNs to process relational reasoning on unstructured text inputs. … A qualitative analysis demonstrated that their model could discover more accurate relations by multi-hop relational reasoning.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

### Probing the Nature (Transparency) of Reasoning Architectures

Compositional Attention Networks for Machine Reasoning (Apr 2018) [code] by Drew Hudson and Christopher Manning presented the MAC network, a novel fully differentiable neural network architecture designed to facilitate explicit and expressive reasoning. MAC moved away from monolithic black-box neural architectures toward a design that encouraged both transparency and versatility. The model approached problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintained a separation between control and memory. By stringing the cells together and imposing structural constraints that regulated their interaction, MAC effectively learned to perform iterative reasoning processes that were directly inferred from the data in an end-to-end approach. They demonstrated the model’s strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, they showed that the model was computationally efficient and data efficient, in particular requiring 5x less data than existing models to achieve strong results.
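The separation between control and memory can be sketched as follows (a heavily simplified, hypothetical MAC-style step; the real cell uses learned projections and gating throughout, and the shapes and update rules here are illustrative only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mac_step(control, memory, question_words, kb_items, W):
    """One (heavily simplified) MAC reasoning step.

    control: (d,)  'what to look for' at this step
    memory:  (d,)  accumulated intermediate result
    The control unit attends over question words; the read unit attends
    over knowledge-base items guided by control and memory; the write
    unit integrates the retrieved information into the new memory.
    """
    # Control unit: re-attend to the question to decide the next sub-task.
    new_control = softmax(question_words @ (W @ control)) @ question_words
    # Read unit: retrieve from the KB, conditioned on control and memory.
    retrieved = softmax(kb_items @ (new_control * memory)) @ kb_items
    # Write unit: integrate the retrieved information (simple average here).
    new_memory = 0.5 * memory + 0.5 * retrieved
    return new_control, new_memory

rng = np.random.default_rng(0)
d = 4
control, memory = rng.standard_normal(d), rng.standard_normal(d)
question_words = rng.standard_normal((5, d))   # 5 question-word states
kb_items = rng.standard_normal((6, d))         # 6 knowledge-base items
W = rng.standard_normal((d, d))
for _ in range(3):                             # three chained reasoning steps
    control, memory = mac_step(control, memory, question_words, kb_items, W)
```

Stringing several such cells together is what yields the iterative, step-by-step reasoning the paper describes.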

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Manning’s MAC Net is a compositional attention network designed for visual question answering (VQA). In a very similar approach [Compositional Attention Networks for Interpretability in Natural Language Question Answering (Oct 2018)], Saama AI Research (India) proposed a modified MAC Net architecture for natural language question answering. Question Answering typically requires language understanding and multistep reasoning. MAC Net’s unique architecture – the separation between memory and control – facilitated data-driven iterative reasoning, making it an ideal candidate for solving tasks that involve logical reasoning. Experiments with the 20 bAbI tasks demonstrated the value of MAC Net as a data efficient and interpretable architecture for natural language question answering. The transparent nature of MAC Net provided a highly granular view of the reasoning steps taken by the network in answering a query.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

### Probing the Shortcomings of Shallow Trained Language Models

While on the surface LSTM based approaches generally appear to perform well for memory and recall, upon deeper inspection they can also display significant limitations. For example, around mid-2018 I conducted a cursory examination of the BiDAF/SQuAD question answering model online demo [alternate site], in which I found that their BiDAF model performed well on some queries but failed on other semantically and syntactically identical questions (e.g. with changes in character case, or punctuation), as well as queries on entities not present in the text. While BiDAF employed a hierarchical multi-stage process consisting of six layers (character embedding, word embedding, contextual embedding, attention flow, modeling and output layers), it used GloVe pretrained word vectors for the word embedding layer to map each word to a high-dimensional vector space (a fixed embedding of each word). This led me to suspect that the shallow embeddings encoded in the GloVe pretrained word vectors failed to capture the nuances of the processed text.

At first glance, the issues I identified in my BiDAF/SQuAD tests were suggestive of differences in one or more of the various embedding, attention flow, modeling, and output layers in the BiDAF model (see, e.g., Section 2 in the BiDAF paper) that do not transfer well to the biomedical abstract and questions posed. As well, it is noted (Appendix B in The Natural Language Decathlon: Multitask Learning as Question Answering paper) that data in SQuAD is lowercased. Note also that the “decaNLP” paper shows which component of their MQAN system chooses to output answers. Appendix A in the decaNLP paper also discusses aspects of SQuAD (paraphrased here) that may be relevant to my observations:

Excerpted/paraphrased from NLP’s ImageNet moment has arrived (Jul 2018 blog post by Sebastian Ruder):

• “Pretrained word vectors have brought NLP a long way. Proposed in 2013 as an approximation to language modeling, word2vec found adoption through its efficiency and ease of use … word embeddings pretrained on large amounts of unlabeled data via algorithms such as word2vec and GloVe are used to initialize the first layer of a neural network, the rest of which is then trained on data of a particular task. … Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.

“Word2vec and related methods are shallow approaches that trade expressivity for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words. This is the core aspect of language understanding, and it requires modeling complex language phenomena such as compositionality, polysemy, anaphora, long-term dependencies, agreement, negation, and many more. It should thus come as no surprise that NLP models initialized with these shallow representations still require a huge number of examples to achieve good performance.

“At the core of the recent advances of ULMFiT, ELMo, and the OpenAI GPT: Generative Pre-Trained Transformer is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.”

Recent discussions by Stanford University researchers in Adversarial Examples for Evaluating Reading Comprehension Systems (Jul 2017) [code | worksheets | discussion] also provide some insight.

"... we sought to understand the factors that influence whether the model will be robust to adversarial perturbations on a particular example. First, we found that models do well when the question has an exact $\small n$-gram match with the original paragraph. [A figure in the paper] plots the fraction of examples for which an $\small n$-gram in the question appears verbatim in the original passage; this is much higher for model successes. For example, 41.5% of *BiDAF Ensemble* successes had a 4-gram in common with the original paragraph, compared to only 21.0% of model failures."
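The overlap statistic discussed in that quote is straightforward to compute; a minimal sketch (helper names and the toy passage are mine):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def has_ngram_match(question, passage, n=4):
    """True if any n-gram of the question appears verbatim in the passage."""
    q = question.lower().split()
    p = passage.lower().split()
    return bool(ngrams(q, n) & ngrams(p, n))

passage = "Tesla was born on 10 July 1856 in the village of Smiljan ."
assert has_ngram_match("When was Tesla born on 10 July ?", passage, n=4)
assert not has_ngram_match("What caused the storm ?", passage, n=4)
```

The Stanford analysis suggests that when `has_ngram_match` is True, the model is far more likely to answer correctly, i.e. much of the apparent comprehension is surface matching.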

Stanford (2017):

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Unquestionably, LSTM-based language models have been important drivers of progress in NLP.

LSTMs are commonly employed for textual summarization, question answering, natural language understanding, natural language inference, and commonsense reasoning tasks. Increasingly, however, NLP researchers and practitioners have been questioning both the relevance and performance of RNNs/LSTMs as models for learning natural language. In this regard, Sebastian Ruder included these comments in his recent post, ACL 2018 highlights:

Another way to gain a better understanding of a [NLP] model is to analyze its inductive bias. The “Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP” (RELNLP) sought to explore how useful it is to incorporate linguistic structure into our models. One of the key points of Chris Dyer’s talk during the workshop was whether RNNs have a useful inductive bias for NLP. In particular, he argued that there are several pieces of evidence indicating that RNNs prefer sequential recency, namely:

• Gradients become attenuated across time. LSTMs or GRUs may help with this, but they also forget.
• People have used training regimes like reversing the input sequence for machine translation.
• People have used enhancements like attention to have direct connections back in time.
• For modeling subject-verb agreement, the error rate increases with the number of attractors [Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies (Nov 2016)]

According to Chomsky, sequential recency is not the right bias for learning human language. RNNs thus don’t seem to have the right bias for modeling language, which in practice can lead to statistical inefficiency and poor generalization behaviour. Recurrent neural network grammars, a class of models that generates both a tree and a sequence sequentially by compressing a sentence into its constituents, instead have a bias for syntactic (rather than sequential) recency [Recurrent Neural Network Grammars (Oct 2016)]. However, it can often be hard to identify whether a model has a useful inductive bias. For identifying subject-verb agreement, Chris hypothesizes that LSTM language models learn a non-structural “first noun” heuristic that relies on matching the verb to the first noun in the sentence. In general, perplexity (and other aggregate metrics) are correlated with syntactic/structural competence, but are not particularly sensitive at distinguishing structurally sensitive models from models that use a simpler heuristic.

Understanding the failure modes of LSTMs

Better understanding representations was also a theme at the Representation Learning for NLP workshop. During his talk, Yoav Goldberg detailed some of the efforts of his group to better understand representations of RNNs. In particular, he discussed his recent work on extracting a finite state automaton from an RNN in order to better understand what the model has learned [Weiss, Goldberg & Yahav: Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples (Jun 2018) … “In this work, however, we will focus on GRUs (Cho et al., 2014; Chung et al., 2014) and LSTMs (Hochreiter & Schmidhuber, 1997), as they are more widely used in practice.”]. He also reminded the audience that LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data. Even when a model has been trained using a domain-adversarial loss to produce representations that are invariant of a certain aspect, the representations will be still slightly predictive of said attribute. It can thus be a challenge to completely remove unwanted information from encoded language data and even seemingly perfect LSTM models may have hidden failure modes. On the topic of failure modes of LSTMs, a statement that also fits well in this theme was uttered by this year’s recipient of the ACL lifetime achievement award, Mark Steedman. He asked ‘LSTMs work in practice, but can they work in theory?’

A UC-Berkeley paper by John Miller and Moritz Hardt, When Recurrent Models Don’t Need To Be Recurrent (May 2018) [author’s discussion | discussion], studied the gap between recurrent and feedforward models trained using gradient descent. They proved that stable RNNs are well approximated by feedforward networks for the purpose of both inference and training by gradient descent. If the recurrent model is stable (meaning the gradients cannot explode), then the model can be well-approximated by a feedforward network for the purposes of both inference and training. In other words, they showed that feedforward and stable recurrent models trained by gradient descent are equivalent in the sense of making identical predictions at test time. [Of course, not all models trained in practice are stable: they also gave empirical evidence that the stability condition could be imposed on certain recurrent models without loss in performance.]

Autoregressive, feed-forward model: Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts $\small y_t$ using only the $\small k$ most recent inputs, $\small x_{t-k+1}, \ldots, x_t$. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the $\small k$ most recent inputs. Google’s WaveNet nicely illustrates this general principle. [Source: When Recurrent Models Don’t Need to be Recurrent]
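The conditional independence assumption can be sketched directly (the "model" here is a stand-in function of a fixed-length window, not a trained network):

```python
def feedforward_autoregressive(xs, k, f):
    """Predict y_t from only the k most recent inputs x_{t-k+1}, ..., x_t.

    f: any function of a fixed-length window (a stand-in for a trained
    feedforward network). Early steps are left-padded with zeros.
    """
    preds = []
    for t in range(len(xs)):
        window = xs[max(0, t - k + 1): t + 1]
        window = [0.0] * (k - len(window)) + window   # left-pad to length k
        preds.append(f(window))
    return preds

# Toy "model": the mean of the window, with truncation length k = 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = feedforward_autoregressive(xs, 3, lambda w: sum(w) / len(w))
```

No prediction ever looks further back than `k` steps, which is exactly the limitation (and the parallelization opportunity) the paper analyzes.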

[Image source. Click image to open in new window.]

Recurrent models feature flexibility and expressivity that come at a cost. Empirical experience shows that RNNs are often more delicate to tune and more brittle to train than standard feedforward architectures. Recurrent architectures can also introduce significant computational burden compared with feedforward implementations. In response to these shortcomings, a growing line of empirical research demonstrates that replacing recurrent models by feedforward models is effective in important applications including translation, speech synthesis, and language modeling (When Recurrent Models Don't Need To Be Recurrent). In contrast to an RNN, the limited context of a feedforward model means that it cannot capture patterns that extend more than $\small k$ steps. Although it appears that the trainability and parallelization for feedforward models comes at the price of reduced accuracy, there have been several recent examples showing that feedforward networks can actually achieve the same accuracies as their recurrent counterparts on benchmark tasks, including language modeling, machine translation, and speech synthesis.

In regard to language modeling – in which the goal is to predict the next word in a document given all of the previous words – feedforward models make predictions using only the $\small k$ most recent words, whereas recurrent models can potentially use the entire document. The gated-convolutional language model is a feedforward autoregressive model that is competitive with large LSTM baseline models. Despite using a truncation length of $\small k = 25$, the model outperforms a large LSTM on the WikiText-103 benchmark, which is designed to reward models that capture long-term dependencies. On the Billion Word Benchmark, the model is slightly worse than the largest LSTM, but is faster to train and uses fewer resources. This is perplexing, since recurrent models seem to be more powerful a priori.

When Recurrent Models Don't Need To Be Recurrent coauthor John Miller continues this discussion in his excellent blog post:

• One explanation for this phenomenon is given by Dauphin et al. in Language Modeling with Gated Convolutional Networks (Sep 2017):

Attention and Gated (Recurrent) Units

[Image source. Click image to open in new window.]

• From that paper:

“Gating has been shown to be essential for recurrent neural networks to reach state-of-the-art performance. Our gated linear units reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities (Section 5.2). We show that gated convolutional networks outperform other recently published language models such as LSTMs trained in a similar setting on the Google Billion Word Benchmark (Chelba et al., 2013). …

“Gating mechanisms control the path through which information flows in the network and have proven to be useful for recurrent neural networks. LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep. In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers. We show this mechanism to be useful for language modeling as it allows the model to select which words or features are relevant for predicting the next word. …

“Gated linear units are a simplified gating mechanism based on the work of Dauphin & Grangier [Predicting distributions with Linearizing Belief Networks (Nov 2015; updated May 2016)] for non-deterministic gates that reduce the vanishing gradient problem by having linear units coupled to the gates. This retains the non-linear capabilities of the layer while allowing the gradient to propagate through the linear unit without scaling. … We compare the different gating schemes experimentally in Section 5.2 and we find gated linear units allow for faster convergence to better perplexities.”
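The gated linear unit itself is compact: the output is the elementwise product of a linear half and a sigmoid gate, so gradients can flow unscaled through the linear half. A minimal NumPy sketch (weights are random stand-ins for learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(X, W, b, V, c):
    """Gated linear unit: (XW + b) * sigmoid(XV + c).

    The left (linear) half gives gradients an unscaled path; the sigmoid
    gate decides how much of each feature to let through (output gating).
    """
    return (X @ W + b) * sigmoid(X @ V + c)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 4))                    # batch of 2, 4 features
W, V = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
b, c = np.zeros(3), np.zeros(3)
out = glu(X, W, b, V, c)
# Since the gate lies in (0, 1), |out| never exceeds |XW + b| elementwise.
```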

• Another explanation is given by Bai et al. in An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (Apr 2018): “The unlimited context offered by recurrent models is not strictly necessary for language modeling.”

In other words, it’s possible you don’t need a large amount of context to do well on the prediction task on average. Recent theoretical work offers some evidence in favor of this view (Prediction with a Short Memory). Bai et al. further observe:

• “The ‘infinite memory’ advantage of RNNs is largely absent in practice.”

As Bai et al. report, even in experiments explicitly requiring long-term context, RNN variants were unable to learn long sequences. On the Billion Word Benchmark, an intriguing Google Technical Report suggests an LSTM $\small n$-gram model with $\small n=13$ words of memory is as good as an LSTM with arbitrary context (N-gram Language Modeling using Recurrent Neural Network Estimation). This evidence leads us to conjecture:

• “Recurrent models trained in practice are effectively feedforward.”

This could happen either because truncated backpropagation through time cannot learn patterns significantly longer than $\small k$ steps, or, more provocatively, because models trainable by gradient descent cannot have long-term memory.

We know very little about how neural language models (LM) use prior linguistic context. A recent paper by Dan Jurafsky and colleagues at Stanford University, Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context (May 2018) investigated the role of context in an LSTM-based LM, through ablation studies. On two standard datasets (Penn Treebank and WikiText-2) they found that the model was capable of using about 200 tokens of context on average, but sharply distinguished nearby context (recent 50 tokens) from the distant history. The model was highly sensitive to the order of words within the most recent sentence, but ignored word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. They further found that the neural caching model (Improving Neural Language Models with a Continuous Cache) especially helped the LSTM copy words from within this distant context. Paraphrased from that paper:

• “In this analytic study, we have empirically shown that a standard LSTM language model can effectively use about 200 tokens of context on two benchmark datasets, regardless of hyperparameter settings such as model size. It is sensitive to word order in the nearby context, but less so in the long-range context. In addition, the model is able to regenerate words from nearby context, but heavily relies on caches to copy words from far away.”

• The neural cache model [Improving Neural Language Models with a Continuous Cache (Dec 2016)] augments neural language models with a longer-term memory that dynamically updates the word probabilities based on the long-term context. The neural cache stores the previous hidden states in memory cells for use as keys to retrieve their corresponding (next) word. A neural cache can be added on top of a pretrained language model at negligible cost.
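The cache mechanism can be sketched as follows (a simplified NumPy illustration with invented shapes; a real continuous cache uses the LM's own hidden states as keys and a tuned interpolation weight):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cache_distribution(h_t, cache_keys, cache_words, vocab_size, theta=1.0):
    """Cache distribution over the vocabulary.

    cache_keys:  (m, d) stored hidden states (one per past position)
    cache_words: (m,)   the word that followed each stored state
    Similarity of the current state h_t to each key scores that key's word.
    """
    weights = softmax(theta * (cache_keys @ h_t))
    p = np.zeros(vocab_size)
    for weight, word in zip(weights, cache_words):
        p[word] += weight
    return p

def mix(p_lm, p_cache, lam=0.2):
    """Interpolate the base LM distribution with the cache distribution."""
    return (1.0 - lam) * p_lm + lam * p_cache

rng = np.random.default_rng(0)
vocab, d, m = 10, 4, 6
h_t = rng.standard_normal(d)                  # current hidden state
keys = rng.standard_normal((m, d))            # past hidden states
words = rng.integers(0, vocab, size=m)        # their observed next words
p_lm = softmax(rng.standard_normal(vocab))    # base LM distribution
p = mix(p_lm, cache_distribution(h_t, keys, words, vocab))
```

Because the cache simply re-weights words seen in the recent history, it is cheap to bolt onto a pretrained model, which is why it helps so much with long-range copying.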

While LSTM has been successfully used to model sequential data of variable length, LSTM can experience difficulty in capturing long-term dependencies. Long Short-Term Memory with Dynamic Skip Connections (Nov 2018) tried to alleviate this problem by introducing a dynamic skip connection, which could learn to directly connect two dependent words. Since there was no dependency information in the training data, they proposed a novel reinforcement learning-based method to model the dependency relationship and connect dependent words. The proposed model computed the recurrent transition functions based on the skip connections, which provided a dynamic skipping advantage over RNNs that always tackle entire sentences sequentially. Experimental results on three NLP tasks demonstrated that the proposed method could achieve better performance than existing methods, and in a number prediction experiment the proposed model outperformed LSTM with respect to accuracy by nearly 20%.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• “Across all of the datasets, there exists at least one other dataset that significantly improves performance on a target dataset. These experiments do not support that direct transfer is possible, but that pretraining is at least somewhat effective. QuAC appears to transfer the least to any of other datasets, likely because questioners were not allowed to see underlying context documents while formulating questions. Since transfer is effective between these related tasks, we recommend that future work indicate any pretraining.”

• “Conversational question answering (CQA) is a novel QA task that requires understanding of dialogue context. Different from traditional single-turn machine reading comprehension (MRC) tasks, CQA includes passage comprehension, coreference resolution, and contextual understanding. In this paper, we propose an innovated contextualized attention-based deep neural network, SDNet, to fuse context into traditional MRC models. Our model leverages both inter-attention and self-attention to comprehend conversation context and extract relevant information from passage. Furthermore, we demonstrated a novel method to integrate the latest BERT contextual model. Empirical results show the effectiveness of our model, which sets the new state of the art result in CoQA leaderboard, outperforming the previous best model by 1.6% $\small F_1$. Our ensemble model further improves the result by 2.7% $\small F_1$.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

“To sum up, we proposed a simple yet efficient model based on SAN [Stochastic Answer Network]. It showed that the joint learning algorithm boosted the performance on SQuAD2.0. We also would like to incorporate ELMo into our model in future.”

• AUEB at BioASQ 6: Document and Snippet Retrieval (Sep 2018) [code]

[Image source. Click image to open in new window.]

“We presented the models, experimental set-up, and results of AUEB’s submissions to the document and snippet retrieval tasks of the sixth year of the BioASQ challenge. Our results show that deep learning models are not only competitive in both tasks, but in aggregate were the top scoring systems. This is in contrast to previous years where traditional IR systems tended to dominate. In future years, as deep ranking models improve and training data sets get larger, we expect to see bigger gains from deep learning models.”

• A Knowledge Hunting Framework for Common Sense Reasoning (Oct 2018) [MILA/McGill University; Microsoft Research Montreal] [code]

“We developed a knowledge-hunting framework to tackle the Winograd Schema Challenge (WSC), a task that requires common-sense knowledge and reasoning. Our system involves a semantic representation schema and an antecedent selection process that acts on web-search results. We evaluated the performance of our framework on the original set of WSC instances, achieving F1-performance that significantly exceeded the previous state of the art. A simple port of our approach to COPA [Choice of Plausible Alternatives] suggests that it has the potential to generalize. In the future we will study how this commonsense reasoning technique can contribute to solving ‘edge cases’ and difficult examples in more general coreference tasks.”

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• A Fully Attention-Based Information Retriever (Oct 2018) [code]

• “Recurrent neural networks are now the state-of-the-art in natural language processing because they can build rich contextual representations and process texts of arbitrary length. However, recent developments on attention mechanisms have equipped feedforward networks with similar capabilities, hence enabling faster computations due to the increase in the number of operations that can be parallelized. We explore this new type of architecture in the domain of question-answering and propose a novel approach that we call Fully Attention Based Information Retriever (FABIR). We show that FABIR achieves competitive results in the Stanford Question Answering Dataset (SQuAD) while having fewer parameters and being faster at both learning and inference than rival methods.”

“The experiments validate that attention mechanisms alone are enough to power an effective question-answering model. Above all, FABIR proved roughly five times faster at both training and inference than BiDAF, a competing RNN-based model with similar performance. … Although FABIR is still far from surpassing the models at the top of the SQuAD leaderboard (Table III), we believe that its faster and lighter architecture already make it an attractive alternative to RNN-based models, especially for applications with limited processing power or that require low-latency.”

• Critique.

Like FABIR (which was also evaluated with the attention module only, minus convolution, giving satisfactory results), QANet (Apr 2018) is a QA architecture that consists entirely of convolution and self-attention; on the SQuAD dataset it is 3x to 13x faster in training and 4x to 9x faster in inference than the state of the art at that time, and places highly on the SQuAD1.1 Leaderboard (2018-10-23). However, the FABIR paper [A Fully Attention-Based Information Retriever (Oct 2018)] fails to cite the earlier, more performant QANet work [QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension].

Carnegie Mellon University/Google Brain’s *QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension* (Apr 2018) begins to address the issue of adversarial challenges to QA. QANet does not require RNNs; its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD1.1 dataset [indicated as such on the Leaderboard] their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data. More significantly for this discussion, on the adversarial SQuAD test set QANet achieved significantly improved F1 scores compared to BiDAF and other models, demonstrating the robustness of QANet to adversarial examples.
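A minimal NumPy sketch of a QANet-style encoder block may help make the local-vs-global division of labor concrete: a width-k convolution captures local interactions, scaled dot-product self-attention captures global ones. The dimensions, weight shapes, and residual wiring below are illustrative only, not the paper’s exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1d(X, W):
    """Width-k convolution over the sequence: local interactions."""
    k, d_in, d_out = W.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    return np.stack([
        np.tensordot(Xp[t:t + k], W, axes=([0, 1], [0, 1]))
        for t in range(X.shape[0])
    ])

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: global interactions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

def qanet_style_block(X, Wc, Wq, Wk, Wv):
    # convolution first, then self-attention, each with a residual
    # connection, mirroring the conv -> self-attention ordering of
    # QANet's encoder blocks (layer norm and feedforward omitted)
    X = X + conv1d(X, Wc)
    return X + self_attention(X, Wq, Wk, Wv)
```

Because nothing here is recurrent, every sequence position can be processed in parallel – the source of the training and inference speedups discussed above.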

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

## Natural Language Inference

Natural language inference (NLI), also known as “recognizing textual entailment” (RTE), is the task of identifying the relationship (entailment, contradiction, or neutral) that holds between a premise $\small p$ (e.g. a piece of text) and a hypothesis $\small h$. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI) Corpus, contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. A newer corpus, the Multi-Genre Natural Language Inference (MultiNLI) corpus, is also available: a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The MultiNLI corpus is modeled on the SNLI corpus but differs in that it covers a range of genres of spoken and written text, supporting a distinctive cross-genre generalization evaluation.
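The task setup can be illustrated with a few toy premise/hypothesis pairs (the sentences below are invented for illustration; the three-way label set is the standard SNLI one):

```python
# toy SNLI-style examples: one premise, three hypotheses, one per label
nli_examples = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A man is performing music.",
     "label": "entailment"},      # hypothesis must be true given premise
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is asleep in bed.",
     "label": "contradiction"},   # hypothesis cannot be true given premise
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is a professional musician.",
     "label": "neutral"},         # hypothesis may or may not be true
]
LABELS = ("entailment", "contradiction", "neutral")
```

An NLI classifier receives the (premise, hypothesis) pair and must predict one of the three labels.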

NLI was one of the 10 tasks proposed in The Natural Language Decathlon: Multitask Learning as Question Answering, an NLP challenge introduced by Richard Socher and colleagues at Salesforce.

Google’s A Decomposable Attention Model for Natural Language Inference (Parikh et al., Sep 2016) likewise proposed a simple neural architecture for natural language inference that used attention to decompose the problem into subproblems that could be solved separately, making it trivially parallelizable. Their use of attention was based purely on word embeddings, essentially consisting of feedforward networks that operated largely independently of word order. On the Stanford Natural Language Inference (SNLI) dataset they obtained state of the art results with almost an order of magnitude fewer parameters than previous work, without relying on any word-order information. The approach outperformed considerably more complex neural methods aimed at text understanding, suggesting that – at least for that task – pairwise comparisons are relatively more important than global sentence-level representations.
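The model’s attend-compare-aggregate pipeline can be sketched in NumPy; the feedforward “networks” F, G, H below are arbitrary random-weight stand-ins, and the dimensions are illustrative rather than the paper’s configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decomposable_attention(A, B, F, G, H):
    """A: (len_a, d) premise embeddings; B: (len_b, d) hypothesis embeddings.
    F, G, H are small feedforward networks (here: plain callables)."""
    # Attend: soft-align each word of one sentence to the other
    E = F(A) @ F(B).T                      # alignment scores (len_a, len_b)
    beta  = softmax(E, axis=1) @ B         # B soft-aligned to each word of A
    alpha = softmax(E.T, axis=1) @ A       # A soft-aligned to each word of B
    # Compare: per-word comparisons, independent hence parallelizable
    v1 = G(np.concatenate([A, beta],  axis=1))
    v2 = G(np.concatenate([B, alpha], axis=1))
    # Aggregate: order-insensitive sums, then classify
    return H(np.concatenate([v1.sum(axis=0), v2.sum(axis=0)]))

# toy feedforward "networks" with random weights
rng = np.random.default_rng(0)
d = 4
Wf = rng.normal(size=(d, d))
Wg = rng.normal(size=(2 * d, d))
Wh = rng.normal(size=(2 * d, 3))   # 3 logits: entailment/contradiction/neutral
F = lambda X: np.maximum(X @ Wf, 0)
G = lambda X: np.maximum(X @ Wg, 0)
H = lambda x: x @ Wh

logits = decomposable_attention(rng.normal(size=(5, d)),
                                rng.normal(size=(3, d)), F, G, H)
```

Note that summing the comparison vectors discards word order entirely, which is exactly the property highlighted above.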

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

However,

• that same model (Parikh et al. 2016; see Table 3 in the image below),
• and also one based on a Bi-LSTM-based single sentence-encoding model without attention (ibid.),
• and a hybrid TreeLSTM-based and Bi-LSTM-based model with an inter-sentence attention mechanism to align words across sentences (ibid.)

… all performed poorly on the newer “Breaking NLI” NLI test set, indicating the difficulty of the task (and reiterating the need for ever more challenging datasets). The new examples were simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set was substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences. That finding recalls my earlier discussion on adversarial challenges to BiDAF/SQuAD-based QA.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. However, efforts to obtain embeddings for larger chunks of text, such as sentences, have been less successful: several attempts at learning unsupervised sentence representations have not reached performance satisfactory enough to be widely adopted. For a long time supervised learning of sentence embeddings was thought to give lower-quality embeddings than unsupervised approaches, but this assumption has recently been overturned, in part following the publication of the InferSent model by Facebook AI Research (Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (May 2017; updated Jul 2018) [code;  discussion: A Walkthrough of InferSent – Supervised Learning of Sentence Embeddings, and reddit]). The authors showed that universal sentence representations trained on the supervised data of the Stanford Natural Language Inference (SNLI) dataset consistently outperformed unsupervised methods, like SkipThought vectors, on a wide range of transfer tasks. Much as computer vision used ImageNet to obtain features that could then be transferred to other tasks, their work indicated the suitability of natural language inference for transfer learning to other NLP tasks.

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

• InferSent was an interesting approach by virtue of the simplicity of its architecture: a bi-directional LSTM with a max-pooling operator as the sentence encoder. InferSent used the SNLI dataset (a set of 570k sentence pairs labeled with 3 categories: neutral, contradiction and entailment) to train the classifier on top of the sentence encoder. Both sentences were encoded using the same encoder, while the classifier was trained on a pair representation constructed from the two sentence embeddings.

• Investigating the Effects of Word Substitution Errors on Sentence Embeddings (Nov 2018) investigated the effects of word substitution errors, such as those arising from automatic speech recognition (ASR), on several state of the art sentence embedding methods. Their results showed that pre-trained encoders such as InferSent were robust to ASR errors and performed well on textual similarity tasks after errors were introduced.
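InferSent’s pair representation – the concatenation of the two sentence embeddings with their element-wise absolute difference and product – can be sketched as follows. For brevity, the BiLSTM encoder is replaced here by bare max pooling over token vectors; only the pooling and pair-feature steps reflect the actual InferSent design:

```python
import numpy as np

def encode(token_vectors):
    # stand-in sentence encoder: element-wise max pooling over tokens
    # (InferSent max-pools over BiLSTM hidden states; the recurrent
    #  encoder itself is omitted in this sketch)
    return token_vectors.max(axis=0)

def pair_features(u, v):
    # InferSent's pair representation: [u; v; |u - v|; u * v],
    # fed to a classifier that predicts the three NLI labels
    return np.concatenate([u, v, np.abs(u - v), u * v])
```

Both sentences pass through the same encoder, so only the pair-feature vector – four times the embedding dimension – distinguishes premise from hypothesis at the classifier.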

In a very similar architecture to InferSent (compare the images above/below), Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture (Aug 2018) [code] from the University of Helsinki yielded state of the art results among sentence encoding-based models on SNLI and on the SciTail dataset, and provided strong results on the MultiNLI dataset. [The SciTail dataset is an NLI dataset created from multiple-choice science exams, consisting of 27k sentence pairs. Each question and the correct answer choice have been converted into an assertive statement to form the hypothesis.] The sentence embeddings could be utilized in a wide variety of transfer learning tasks, outperforming InferSent on 7/10 and SkipThought on 8/9 SentEval sentence embedding evaluation tasks. Furthermore, their model beat InferSent in 8/10 recently published SentEval probing tasks designed to evaluate the ability of sentence embeddings to capture important linguistic properties of sentences.
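The hierarchical pooling step can be sketched minimally, assuming (as the architecture’s name suggests) that the max-pooled hidden states of each stacked encoder layer are concatenated into a single sentence embedding; the BiLSTM layers themselves are omitted:

```python
import numpy as np

def hierarchical_max_pool(layer_states):
    """Max-pool each encoder layer's hidden states over time and
    concatenate the pooled vectors into one sentence embedding.

    layer_states: list of (seq_len, d) arrays, one per stacked layer.
    """
    return np.concatenate([H.max(axis=0) for H in layer_states])
```

Pooling each layer separately lets the final embedding mix representations from different depths, rather than relying only on the top layer as InferSent does.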

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

“The success of the proposed hierarchical architecture raises a number of additional interesting questions. First, it would be important to understand what kind of semantic information the different layers are able to capture. Second, a detailed and systematic comparison of different hierarchical architecture configurations, combining Bi-LSTM and max pooling in different ways, could lead to even stronger results, as indicated by the results we obtained on the SciTail dataset with the modified 4-layered model. Also, as the sentence embedding approaches for NLI focus mostly on the sentence encoder, we think that more should be done to study the classifier part of the overall NLI architecture. There is not enough research on classifiers for NLI and we hypothesize that further improvements can be achieved by a systematic study of different classifier architectures, starting from the way the two sentence embeddings are combined before passing on to the classifier.”

Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. Bridging Knowledge Gaps in Neural Entailment via Symbolic Models (Sep 2018) focused on filling these knowledge gaps in the Science Entailment task by leveraging an external structured knowledge base (KB) of science facts. Their architecture (NSnet) combined standard neural entailment models with a knowledge lookup module. To facilitate this lookup, they proposed a fact-level decomposition of the hypothesis, verifying the resulting sub-facts against both the textual premise and the structured KB. NSnet learned to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperformed a simpler combination of the two predictions by 3% and the base entailment model by 5%.
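The aggregation idea – mixing a neural entailment score with per-sub-fact KB verification scores – might be sketched as below. The fixed mixture weight is a hypothetical stand-in for NSnet’s learned aggregation, included only to make the data flow concrete:

```python
def aggregate_entailment(neural_prob, subfact_probs, w_neural=0.5):
    """Combine a neural entailment probability with per-sub-fact
    verification probabilities (mean-pooled) via a fixed mixture.

    neural_prob:   probability from the neural entailment model
    subfact_probs: one verification probability per hypothesis sub-fact,
                   obtained by checking it against the premise and the KB
    """
    kb_prob = sum(subfact_probs) / len(subfact_probs)
    return w_neural * neural_prob + (1.0 - w_neural) * kb_prob
```

In NSnet the combination weights are learned jointly with the rest of the model, which is what lets it outperform this kind of naive fixed-weight mixture.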

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Image source. Click image to open in new window.]

[Natural Language Inference:]