Technical Review

Natural Language Understanding

Last modified: 2018-12-11


Copyright notice, citation: © 2018-present, Victoria A. Stuart




[Table of Contents]

NATURAL LANGUAGE UNDERSTANDING

Machine learning is particularly well suited to assisting and even supplanting many standard NLP approaches (for a good review see Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities (Jun 2018)). Language models, for example, provide improved understanding of the semantic content and latent (hidden) relationships in documents. Machine-based natural language understanding (NLU) is a fundamental requirement for robust, human-level performance in tasks such as information retrieval, text summarization, question answering, textual entailment, sentiment analysis, reading comprehension, commonsense reasoning, recommendation, etc.

arxiv-1807.00123a.png

[Image source. Click image to open in new window.]


arxiv-1807.00123b.png

[Image source. Click image to open in new window.]


arxiv-1807.00123c.png

[Image source. Click image to open in new window.]


Advances in NLU offer tremendous promise for the analysis of biomedical and clinical text, which, due to the use of technical, domain-specific jargon, is particularly challenging for traditional NLP approaches. Some of these challenges and difficulties are described in the August 2018 post NLP’s Generalization Problem, and How Researchers are Tackling It  [discussion].

Recent developments in NLP and ML that I believe are particularly important to advancing NLU include:

  • understanding the susceptibility of QA systems to adversarial challenge;

  • the development of deeply-trained/pretrained language models;

  • transfer learning and multitask learning;

  • reasoning over graphs;

  • the development of more advanced memory and attention-based architectures; and,

  • incorporating external memory mechanisms; e.g., a differentiable neural computer, which is essentially an updated version of a neural Turing machine (What Is the Difference between Differentiable Neural Computers and Neural Turing Machines?). Relational database management systems (RDBMS), textual knowledge stores (TKS) and knowledge graphs (KG) also represent external knowledge stores that could potentially be leveraged as external memory resources in memory-augmented architectures for NLP and ML.

DeepMind’s recent paper Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies (Aug 2018) addressed preserving and reusing past knowledge (memory) via unsupervised representation learning using a variational autoencoder: VASE (Variational Autoencoder with Shared Embeddings). VASE automatically detected shifts in data distributions and allocated spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting:

    "... thanks to learning a generative model of the observed environments, we can prevent **catastrophic forgetting** by periodically "hallucinating" (i.e. generating samples) from past environments using a snapshot of VASE, and making sure that the current version of VASE is still able to model these samples. A similar "dreaming" feedback loop was used in Lifelong Generative Modeling, ..."

arxiv-1808.06508.png

[Image source. Click image to open in new window.]


  • For similar, prior work by other authors (cited) that also used a variational autoencoder, see Lifelong Generative Modeling, below.

  • As noted, each of the papers cited above addressed the issue of catastrophic forgetting. Interestingly, the Multitask Question Answering Network (MQAN), described in Richard Socher’s “decaNLP/MQAN” paper, attained robust multitask learning, performing nearly as well or better in the multitask setting as in the single task setting for each task despite being capped at the same number of trainable parameters in both. … This suggested that MQAN successfully used trainable parameters more efficiently in the multitask setting by learning to pack or share parameters in a way that limited catastrophic forgetting.

Lifelong learning is the problem of learning multiple consecutive tasks in a sequential manner where knowledge gained from previous tasks is retained and used for future learning. It is essential towards the development of intelligent machines that can adapt to their surroundings. Lifelong Generative Modeling (Sep 2018), by authors at the University of Geneva and the Geneva School of Business Administration, focused on a lifelong learning approach to generative modeling in which newly observed distributions are continuously incorporated into the learnt model. They did so through a student-teacher variational autoencoder architecture which allowed them to learn and preserve all the distributions seen to that point, without needing to retain either the past data or the past models. Through the introduction of a novel cross-model regularizer, inspired by a Bayesian update rule, the student model leveraged the information learnt by the teacher, which acted as a summary of everything seen to that point. The regularizer had the additional benefit of reducing the effect of catastrophic interference that appears when sequences of distributions are learned. They demonstrated its efficacy in learning sequentially observed distributions, as well as its ability to learn a common latent representation across a complex transfer learning scenario.

arxiv1705.09847-f1.png

[Image source. Click image to open in new window.]


arxiv1705.09847-f2.png

[Image source. Click image to open in new window.]


Continual learning is the ability to sequentially learn over time by accommodating knowledge while retaining previously learned experiences. Neural networks can learn multiple tasks when trained on them jointly, but cannot maintain performance on previously learned tasks when tasks are presented one at a time. This problem is called catastrophic forgetting. Continual Classification Learning Using Generative Models (Oct 2018), by authors at the University of Geneva and the Geneva School of Business Administration, proposed a classification model that learns continuously from sequentially observed tasks while preventing catastrophic forgetting. They built on the lifelong generative capabilities of their earlier Lifelong Generative Modeling work and extended it to the classification setting by deriving a new variational bound on the joint log likelihood, $\small \log p(x,y)$.
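
For orientation, one generic way to obtain such a bound – a sketch that assumes a single shared latent variable $\small z$ with approximate posterior $\small q(z \vert x)$ and conditional independence of $\small x$ and $\small y$ given $\small z$, not necessarily the exact bound derived in the paper – is the joint evidence lower bound:

    $\small \log p(x,y) \geq \mathbb{E}_{q(z \vert x)} \left[ \log p(x \vert z) + \log p(y \vert z) \right] - KL \left( q(z \vert x) \, \Vert \, p(z) \right)$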

arxiv1810.10612-f1.png

[Image source. Click image to open in new window.]


arxiv1810.10612-f2.png

[Image source. Click image to open in new window.]


Google Brain’s A Simple Method for Commonsense Reasoning (Jun 2018) [code;  slides;  discussion here and here] presented a simple method for commonsense reasoning with neural networks, using unsupervised learning. Key to the method was the use of an array of large RNN language models that operated at word or character level, trained on a massive amount of unlabeled data, to score multiple choice questions posed by commonsense reasoning tests.
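
The scoring idea is simple: feed each candidate resolution of a Winograd-style sentence to a language model and pick the candidate assigned the higher probability. Below is a minimal PyTorch sketch of that scoring step; the tiny LSTM language model and vocabulary are untrained placeholders (purely illustrative), whereas the paper used very large LMs trained on massive unlabeled corpora.

    # Rank candidate sentences by total log-probability under a language model.
    # `ToyLM` is an untrained toy LSTM LM, so the scores are meaningless until
    # it is trained on a large corpus, as in the paper.
    import torch, torch.nn as nn

    class ToyLM(nn.Module):
        def __init__(self, vocab_size, dim=64):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)
            self.rnn = nn.LSTM(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)
        def forward(self, ids):                       # ids: (1, T)
            h, _ = self.rnn(self.emb(ids))
            return self.out(h)                        # logits: (1, T, vocab_size)

    def sentence_logprob(lm, ids):
        """Sum of log P(token_t | tokens_<t) over the sentence."""
        logits = lm(ids[:, :-1])
        logp = torch.log_softmax(logits, dim=-1)
        return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()

    vocab = {"<s>": 0, "the": 1, "trophy": 2, "suitcase": 3, "is": 4, "big": 5}
    lm = ToyLM(len(vocab))
    cand_a = torch.tensor([[0, 1, 2, 4, 5]])          # "<s> the trophy is big"
    cand_b = torch.tensor([[0, 1, 3, 4, 5]])          # "<s> the suitcase is big"
    best = max([cand_a, cand_b], key=lambda s: sentence_logprob(lm, s))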

arxiv-1806.02847.png

[Image source. Click image to open in new window.]


  • This paper was subsequently savaged in an October 2018 commentary, A Simple Machine Learning Method for Commonsense Reasoning? A Short Commentary on Trinh & Le (2018):

    “A Concluding Remark. The data-driven approach in AI has without a doubt gained considerable notoriety in recent years, and there are a multitude of reasons that led to this fact. While the data-driven approach can provide some useful techniques for practical problems that require some level of natural language processing (text classification and filtering, search, etc.), extrapolating the relative success of this approach into problems related to commonsense reasoning, the kind that is needed in true language understanding, is not only misguided, but may also be harmful, as this might seriously hinder the field, scientifically and technologically.”

A Simple Neural Network Module for Relational Reasoning (Jun 2017) [DeepMind blog;  non-author code here and here;  discussion here, here and here] by DeepMind described Relation Networks, a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning, including visual question answering, text-based question answering using the bAbI suite of tasks, and complex reasoning about dynamic physical systems. They showed that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with relational networks, to implicitly discover and learn to reason about entities and their relations.
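
The Relation Network module itself is compact: a shared MLP $\small g_\theta$ is applied to every ordered pair of “objects” and the summed result is passed through a second MLP $\small f_\phi$. The following is a hedged PyTorch sketch of that composition; object extraction (e.g. CNN feature columns or LSTM states) and question conditioning are omitted, and the layer sizes are illustrative.

    # Minimal Relation Network: g over all object pairs, sum, then f.
    import torch, torch.nn as nn

    class RelationNetwork(nn.Module):
        def __init__(self, obj_dim, hidden=128, out_dim=10):
            super().__init__()
            self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
            self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, out_dim))

        def forward(self, objects):                   # objects: (B, N, obj_dim)
            B, N, D = objects.shape
            oi = objects.unsqueeze(2).expand(B, N, N, D)
            oj = objects.unsqueeze(1).expand(B, N, N, D)
            pairs = torch.cat([oi, oj], dim=-1)       # all N*N ordered pairs
            rel = self.g(pairs).sum(dim=(1, 2))       # sum over the pairs
            return self.f(rel)

    rn = RelationNetwork(obj_dim=32)
    out = rn(torch.randn(4, 8, 32))                   # 4 scenes, 8 objects each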

arxiv-1706.01427.png

[Image source. Click image to open in new window.]


While Relation Networks – introduced by Santoro et al. (2017) [DeepMind’s “A Simple Neural Network Module for Relational Reasoning,” above] – demonstrated strong relational reasoning capabilities, their rather shallow architecture (a single-layer design) only considered pairs of information objects, making them unsuitable for problems requiring reasoning across a higher number of facts. To overcome this limitation, authors at the University of Lübeck proposed Multi-layer Relation Networks (Nov 2018) [code], a multi-layer relation network architecture which enabled successive refinements of relational information through multiple layers. They showed that the increased depth allowed for more complex relational reasoning, applying it to the bAbI 20 QA dataset, solving all 20 tasks with joint training and surpassing state of the art results.

arxiv1811.01838-f1+f2.png

[Image source. Click image to open in new window.]


arxiv1811.01838-t1.png

[Image source. Click image to open in new window.]


arxiv1811.01838-tA1.png

[Image source. Click image to open in new window.]


Natural Language Understanding:

Additional Reading

  • On the Evaluation of Common-Sense Reasoning in Natural Language Understanding (Nov 2018) [datasets]

    “The NLP and ML communities have long been interested in developing models capable of common-sense reasoning, and recent works have significantly improved the state of the art on benchmarks like the Winograd Schema Challenge (WSC). Despite these advances, the complexity of tasks designed to test common-sense reasoning remains under-analyzed. In this paper, we make a case study of the Winograd Schema Challenge and, based on two new measures of instance-level complexity, design a protocol that both clarifies and qualifies the results of previous work. Our protocol accounts for the WSC’s limited size and variable instance difficulty, properties common to other common-sense benchmarks. Accounting for these properties when assessing model results may prevent unjustified conclusions.”


[Table of Contents]

Word Embeddings

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from a vocabulary are mapped to vectors of real numbers (Sebastian Ruder provides a good overview; see also this excellent post, Introduction to Word Embeddings). Conceptually it involves a mathematical embedding from a sparse, high-dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Mikolov et al. (Sep 2013) Efficient Estimation of Word Representations in Vector Space – the “word2vec” paper].
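
As a concrete illustration, the sketch below trains a tiny skip-gram word2vec model with gensim on a toy corpus and retrieves the dense vector for a single word. The corpus is purely illustrative, and parameter names vary across gensim versions (e.g. vector_size in gensim 4.x vs. size in 3.x).

    # Minimal word2vec training sketch (gensim 4.x-style parameters assumed).
    from gensim.models import Word2Vec

    corpus = [
        ["the", "patient", "was", "treated", "with", "aspirin"],
        ["the", "protein", "binds", "the", "receptor"],
        ["aspirin", "inhibits", "the", "enzyme"],
    ]

    model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                     min_count=1, sg=1)    # sg=1: skip-gram; sg=0: CBOW
    vec = model.wv["aspirin"]              # dense 100-dimensional word vector
    print(vec.shape)                       # (100,)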

arxiv-1301.3781.png

[Image source. Click image to open in new window.]


cbo_vs_skipgram.png

[Image source. Click image to open in new window.]


Word embeddings are widely used in predictive NLP modeling, particularly in deep learning applications (Word Embeddings: A Natural Language Processing Crash Course). Word embeddings enable the identification of similarities between words and phrases, on a large scale, based on their context. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically; e.g.,

    $\small v_{Berlin} - v_{Germany} + v_{France} \approx v_{Paris}$
    $\small v_{king} - v_{man} + v_{woman} \approx v_{queen}$
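
With pretrained vectors loaded into gensim, such analogies reduce to vector arithmetic plus a nearest-neighbour lookup. A hedged sketch (the filename below is a placeholder for whatever word2vec-format embeddings are available locally):

    # Solve "king - man + woman ≈ ?" with pretrained vectors.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # e.g. [('queen', 0.72)] -- exact neighbours/scores depend on the embeddings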

The original work for generating word embeddings was presented by Bengio et al. in 2003 (A Neural Probabilistic Language Model, which builds on his 2001 (NIPS 2000) “feature vectors” paper A Neural Probabilistic Language Model), who trained them in a neural language model together with the model’s parameters.

Despite the assertion by Sebastian Ruder in An Overview of Word Embeddings and their Connection to Distributional Semantic Models that Bengio coined the phrase “word embeddings” in his 2003 paper, the term “embedding” does not appear in that paper. The Abstract does state the concept, however: “We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.” The correct attribution is likely Bengio’s similarly-named 2006 paper Neural Probabilistic Language Models, which states (bottom of p. 162): “Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes.” The full reference is: Y. Bengio et al. (2006) Neural Probabilistic Language Models. StudFuzz 194:137-186.

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

Likewise – vis-à-vis my previous comment (I’m being rather critical here) – the 2008 Collobert and Weston paper, above, mentions “embedding” [but not “word embedding”, and cites Bengio’s 2001 (NIPS 2000) paper], while Mikolov’s 2013 paper does not mention “embedding” and cites Bengio’s 2003 paper.

For a theoretical discussion of word vectors, see Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline  [codediscussion], which is a critique/extension of A Latent Variable Model Approach to PMI-based Word Embeddings. In addition to proposing a new generative model – a dynamic version of the log-linear topic model of Mnih and Hinton (2007) [Three New Graphical Models for Statistical Language Modelling] – the paper provided a theoretical justification for nonlinear models like PMI, word2vec, and GloVe. It also helped explain why low dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013)  [see the algebraic examples, above]. Experimental support was provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.

Relatedly, Sebastian Ruder recently provided a summary of ACL 2018 highlights, including a subsection entitled Understanding Representations: “It was very refreshing to see that rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture.”

Word embeddings are a particularly striking example of learning a representation, i.e. representation learning (Bengio et al., Representation Learning: A Review and New Perspectives (April 2014); see also the excellent blog posts Deep Learning, NLP, and Representations by Chris Olah, and An introduction to representation learning by Michael Alcorn). Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. While traditional unsupervised learning techniques are staples of machine learning, representation learning has emerged as an alternative approach to feature extraction (An Introduction to Representation Learning).

In representation learning, features are extracted from unlabeled data by training a neural network on a secondary, supervised learning task. Word2vec is a good example of representation learning, simultaneously learning several language concepts:

  • the meanings of words;
  • how words are combined to form concepts (i.e., syntax); and,
  • how concepts relate to the task at hand.

Reasoning about implied relationships (e.g. paraphrastic, common sense, encyclopedic) between pairs of words is crucial for many cross-sentence inference problems. pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference (Oct 2018) [discussion], by the Paul G. Allen School of Computer Science and Engineering and Facebook AI Research, proposed new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. Their pairwise embeddings were computed as a compositional function of each word’s representation, which was learned by maximizing the pointwise mutual information (PMI) with the contexts in which the two words co-occurred. They added these representations to the cross-sentence attention layer of existing inference models (e.g. BiDAF for QA, ESIM for NLI), instead of extending or replacing existing word embeddings. Experiments showed a gain of 2.72% on the recently released SQuAD 2.0 and 1.3% on MultiNLI. Their representations also aided generalization, with gains of around 6-7% on adversarial SQuAD datasets and 8.8% on the adversarial entailment test set of Glockner et al.
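
The training signal behind such pair embeddings is essentially pointwise mutual information between (word pair, context) co-occurrences. The toy NumPy sketch below computes PMI from a small count table; it illustrates the objective only – pair2vec itself learns compositional pair representations rather than tabulating PMI directly.

    # PMI from a toy (word-pair x context) co-occurrence count matrix.
    import numpy as np

    counts = np.array([[10., 1.],      # rows: word pairs, cols: contexts
                       [ 2., 8.]])
    total = counts.sum()
    p_joint = counts / total
    p_pair = p_joint.sum(axis=1, keepdims=True)
    p_ctx  = p_joint.sum(axis=0, keepdims=True)
    pmi = np.log(p_joint / (p_pair * p_ctx))
    print(pmi)                          # positive entries mark pair/context affinity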

arxiv1810.08854-t1.png

[Image source. Click image to open in new window.]


arxiv1810.08854-f1.png

[Image source. Click image to open in new window.]


arxiv1810.08854-t2+t3+t4.png

[Image source. Click image to open in new window.]


Word Embeddings:

Additional Reading

  • Towards Understanding Linear Word Analogies (Oct 2018)

    • “A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why – and when – linear operators correspond to non-linear embedding models such as Skip-Gram with Negative Sampling (SGNS). We provide a rigorous explanation of this phenomenon without making the strong assumptions that past work has made about the vector space and word distribution. Our theory has several implications. Past work has often conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel theoretical justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, providing rigorous justification for its use in capturing word dissimilarity.”

    • [Section 5] Even though vector algebra is surprisingly effective at solving word analogies, the csPMI Theorem reveals two reasons for why an analogy may be unsolvable in a given embedding space: polysemy and corpus bias. …

  • Dynamic Meta-Embeddings for Improved Sentence Representations (Kyunghyun Cho and colleagues at Facebook AI Research; Sep 2018) [project;  code;  discussion]

    • “While one of the first steps in many NLP systems is selecting what pre-trained word embeddings to use, we argue that such a step is better left for neural networks to figure out by themselves. To that end, we introduce dynamic meta-embeddings, a simple yet effective method for the supervised learning of embedding ensembles, which leads to state-of-the-art performance within the same model class on a variety of tasks. We subsequently show how the technique can be used to shed new light on the usage of word embeddings in NLP systems.”

      “We argue that the decision of which word embeddings to use in what setting should be left to the neural network. While people usually pick one type of word embeddings for their NLP systems and then stick with it, we find that dynamically learned meta-embeddings lead to improved results. In addition, we showed that the proposed mechanism leads to better interpretability and insightful linguistic analysis. We showed that the network learns to select different embeddings for different data, different domains and different tasks. We also investigated embedding specialization and examined more closely whether contextualization helps. To our knowledge, this work constitutes the first effort to incorporate multi-modal information on the language side of image-caption retrieval models; and the first attempt at incorporating meta-embeddings into large-scale sentence-level NLP tasks.”

    arxiv1804.07983-t1.png

    [Image source. Click image to open in new window.]
  • End-to-End Retrieval in Continuous Space (Google AI: Nov 2018)

    “Most text-based information retrieval (IR) systems index objects by words or phrases. These discrete systems have been augmented by models that use embeddings to measure similarity in continuous space. But continuous-space models are typically used just to re-rank the top candidates. We consider the problem of end-to-end continuous retrieval, where standard approximate nearest neighbor (ANN) search replaces the usual discrete inverted index, and rely entirely on distances between learned embeddings. By training simple models specifically for retrieval, with an appropriate model architecture, we improve on a discrete baseline by 8% and 26% (MAP) on two similar-question retrieval tasks. We also discuss the problem of evaluation for retrieval systems, and show how to modify existing pairwise similarity datasets for this purpose.”

    arxiv1811.08008-t1+t2.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Addressing Hypernymy and Polysemy with Word Embeddings

Word embeddings have many uses in NLP. For example, polysemy – words or phrases with different, but related, meanings [e.g. “Washington” may refer to “Washington, DC” (location) or “George Washington” (person)] – poses one of many challenges to NLP. Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within those of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., a cat is an animal.

In Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation (Jun 2016) [code] the authors offered a solution to the polysemy problem. They proposed a novel embedding method specifically designed for named entity disambiguation that jointly mapped words and entities into the same continuous vector space. Since similar words and entities were placed close to one another in vector space in this model, the similarity between any pair of items (e.g. words, entities, and a word and an entity) could be measured by simply computing their cosine similarity.

Though not cited in that paper, the code for that work (by coauthor and Studio Ousia employee Ikuya Yamada) was made available on GitHub in the Wikipedia2Vec repository; the Wikipedia2Vec project page contains the pretrained embeddings (models).
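
In that joint space, the similarity between any two items – word/word, word/entity or entity/entity – is simply their cosine similarity. A minimal NumPy sketch (the random vectors below are placeholders for the pretrained Wikipedia2Vec word and entity embeddings):

    # Cosine similarity between a word vector and an entity vector.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    v_word   = np.random.randn(100)    # e.g. the vector for the word "washington"
    v_entity = np.random.randn(100)    # e.g. the vector for the entity "George Washington"
    print(cosine(v_word, v_entity))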

A probabilistic extension of fastText – Probabilistic FastText for Multi-Sense Word Embeddings – can produce accurate representations of rare, misspelt, and unseen words. Probabilistic FastText achieved state of the art performance on benchmarks that measure the ability to discern different meanings. The proposed model was the first to achieve multi-sense representations while having enriched semantics on rare words:

  • “Our multimodal word representation can also disentangle meanings, and is able to separate different senses in foreign polysemies. In particular, our models attain state-of-the-art performance on SCWS, a benchmark to measure the ability to separate different word meanings, achieving 1.0% improvement over a recent density embedding model W2GM (Athiwaratkun and Wilson, 2017). To the best of our knowledge, we are the first to develop multi-sense embeddings with high semantic quality for rare words.”

  • “… we show that our probabilistic representation with subword mean vectors with the simplified energy function outperforms many word similarity baselines and provides disentangled meanings for polysemies.”

  • “We show that our embeddings learn the word semantics well by demonstrating meaningful nearest neighbors. Table 1 shows the nearest neighbors of polysemous words such as ‘rock’, ‘star’, and ‘cell’. We note that subword embeddings prefer words with overlapping characters as nearest neighbors. For instance, ‘rock-y’, ‘rockn’, and ‘rock’ are all close to the word ‘rock’. For the purpose of demonstration, we only show words with meaningful variations and omit words with small character-based variations previously mentioned. However, all words shown are in the top-100 nearest words. We observe the separation in meanings for the multi-component case; for instance, one component of the word ‘bank’ corresponds to a financial bank whereas the other component corresponds to a river bank. The single-component case also has interesting behavior. We observe that the subword embeddings of polysemous words can represent both meanings. For instance, both ‘lava-rock’ and ‘rock-pop’ are among the closest words to ‘rock’.”

Wasserstein is All you Need (Aug 2018) [discussion] proposed a unified framework for building unsupervised representations of individual objects or entities (and their compositions), by associating with each object both a distributional as well as a point estimate (vector embedding). Their method gives a novel perspective for building rich and powerful feature representations that simultaneously capture uncertainty (via a distributional estimate) and interpretability (with the optimal transport map). Among their various applications (e.g. entailment detection; semantic similarity), they proposed to represent sentences as probability distributions to better capture the inherent uncertainty and polysemy, arguing that histograms (or probability distributions) over embeddings allow the capture of more of this information than point-wise embeddings alone. They discuss hypernymy detection in Section 7; for this purpose, they relied on a recently proposed model which explicitly modeled what information is known about a word by interpreting each entry of the embedding as the degree to which a certain feature is present.
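
At the core of this framework is an optimal transport distance between the context histograms associated with two entities. The sketch below computes such a distance with the POT (Python Optimal Transport) package on random toy data; the context embeddings, histogram weights and dimensions are invented for illustration.

    # Wasserstein (optimal transport) distance between two context histograms.
    import numpy as np
    import ot                                # POT: Python Optimal Transport

    ctx_a = np.random.randn(5, 50)           # contexts of entity A (5 points in R^50)
    ctx_b = np.random.randn(7, 50)           # contexts of entity B
    hist_a = np.full(5, 1.0 / 5)             # histogram weights over A's contexts
    hist_b = np.full(7, 1.0 / 7)
    M = ot.dist(ctx_a, ctx_b)                # pairwise ground-cost matrix
    distance = ot.emd2(hist_a, hist_b, M)    # exact optimal transport cost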

arxiv-1808.09663a.png

[Image source. Click image to open in new window.]


This image is particularly illustrative (click, and click again, to enlarge):

arxiv-1808.09663b.png

[Image source. Click image to open in new window.]


  • “While existing methods represent each entity of interest (e.g., a word) as a single point in space (e.g., its embedding vector), we here propose a fundamentally different approach. We represent each entity based on the histogram of contexts (co-occurring with it), with the contexts themselves being points in a suitable metric space. This allows us to cast the distance between histograms associated with the entities as an instance of the optimal transport problem [see Section 3 for a background on optimal transport]. For example, in the case of words as entities, the resulting framework then intuitively seeks to minimize the cost of moving the set of contexts of a given word to the contexts of another [note their Fig. 1]. Note that the contexts here can be words, phrases, sentences, or general entities co-occurring with our objects to be represented, and these objects further could be any type of events extracted from sequence data …”

  • Regarding semantic embedding, or word sense disambiguation (not explicitly discussed in the paper), their Fig. 2 [illustration of three words, each with their distributional estimates (left), the point estimates of the relevant contexts (middle), and the joint representation (right)] is very interesting: words in vector space, along with a histogram of their probability distributions over those embedded spaces.

  • “Software Release. We plan to make all our code (for all these parts) and our pre-computed histograms (for the mentioned datasets) publicly available on GitHub soon.”  [Not available: 2018-10-07]

Early in 2018, pretrained language models such as ELMo offered another approach to the polysemy problem.

[Table of Contents]

Word Sense Disambiguation

Related to polysemy and named entity disambiguation is word sense disambiguation (WSD). Learning Graph Embeddings from WordNet-based Similarity Measures described a new approach, path2vec, for learning graph embeddings that relied on structural measures of node similarities for generation of training data. Evaluations of the proposed model on semantic similarity and WSD tasks showed that path2vec yielded state of the art results.

In January 2018 Ruslan Salakhutdinov and colleagues proposed a probabilistic graphical model that leveraged a topic model to design a WSD system (WSD-TM) that scaled linearly with the number of words in the context. Their logistic normal topic model – a variant of latent Dirichlet allocation in which the topic proportions for a document were replaced by WordNet synsets (sets of synonyms) – incorporated semantic information about synsets as its priors. WSD-TM outperformed state of the art knowledge-based WSD systems.

[Table of Contents]

Probing the Role of Attention in Word Sense Disambiguation

Recent work has shown that the encoder-decoder attention mechanisms in neural machine translation (NMT) are different from the word alignment in statistical machine translation. An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in Neural Machine Translation (Oct 2018) focused on analyzing encoder-decoder attention mechanisms, in the case of word sense disambiguation (WSD) in NMT models. They hypothesized that attention mechanisms pay more attention to context tokens when translating ambiguous words, and explored the attention distribution patterns when translating ambiguous nouns. Counterintuitively, they found that attention mechanisms were likely to distribute more attention to the ambiguous noun itself rather than context tokens, in comparison to other nouns. They concluded that the attention mechanism was not the main mechanism used by NMT models to incorporate contextual information for WSD. The experimental results suggested that NMT models learned to encode the contextual information necessary for WSD in the encoder hidden states. For the attention mechanism in Transformer models, they revealed that the first few layers gradually learn to “align” source and target tokens, and the last few layers learn to extract features from the related but unaligned context tokens.

arxiv1810.07595-f1.png

[Image source. Click image to open in new window.]


arxiv1810.07595-f2.png

[Image source. Click image to open in new window.]


arxiv1810.07595-t3.png

[Image source. Click image to open in new window.]


[Table of Contents]

Applications of Embeddings in the Biological Sciences

While predicting protein 3D structure from primary amino acid sequences has been a long-standing objective in bioinformatics, definite solutions remain to be found  [discussion]. The most reliable approaches currently available involve homology modeling, which allows assigning a known protein structure to an unknown protein, provided that there is detectable sequence similarity between the two. When homology modeling is not viable, de novo techniques, based on physics-based potentials or knowledge-based potentials, are needed. Unfortunately proteins are very large molecules, and the huge number of available conformations, even for relatively small proteins, makes it prohibitive to fold them even on customized computer hardware.

To address this challenge, knowledge-based potentials can be learned from statistics or machine learning methods to infer useful information from known examples of protein structures. This information can be used to constrain the problem, greatly reducing the number of samples that need to be evaluated compared with dealing exclusively with physics-based potentials. Multiple sequence alignments (MSA) consist of aligned sequences homologous to the target protein, compressed into position-specific scoring matrices (PSSM, also called sequence profiles) using the fraction of occurrences of different amino acids in the alignment for each position in the sequence. More recently, contact map prediction methods have been at the center of renewed interest; however, their impressive performance is correlated with the number of sequences in the MSA, and is not as reliable when few sequences are related to the target.
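
As a concrete illustration of the PSSM construction described above, the NumPy sketch below converts a toy three-sequence MSA into per-position amino acid frequencies (gap handling and pseudocounts, used in practice, are omitted):

    # Position-specific scoring matrix (residue frequencies) from a toy MSA.
    import numpy as np

    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    msa = ["ACDA",
           "ACDG",
           "AVDA"]

    pssm = np.zeros((len(msa[0]), len(amino_acids)))
    for seq in msa:
        for pos, res in enumerate(seq):
            pssm[pos, amino_acids.index(res)] += 1
    pssm /= len(msa)        # rows: alignment positions, cols: residue frequencies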

rawMSA: proper Deep Learning makes protein sequence profiles and feature extraction obsolete introduced a new approach, called rawMSA, for the de novo prediction of structural properties of proteins. The core idea behind rawMSA was to borrow the word2vec word embedding approach of Mikolov et al. (Efficient Estimation of Word Representations in Vector Space), which they used to convert each character (amino acid residue) in the MSA into a floating point vector of variable size, thus representing the residues by the structural property they were trying to predict. Test results from deep neural networks based on this concept showed that rawMSA matched or outperformed the state of the art on three tasks: predicting secondary structure, relative solvent accessibility, and residue-residue contact maps.

[Table of Contents]

Probing the Effectiveness of Word Embeddings

A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why – and when – linear operators correspond to non-linear embedding models such as Skip-Gram with Negative Sampling (SGNS). Towards Understanding Linear Word Analogies (Oct 2018) provided a rigorous explanation of this phenomenon without making the strong assumptions that past work has made about the vector space and word distribution. “Past work has often conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel theoretical justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, providing rigorous justification for its use in capturing word dissimilarity.”

  • “In this paper, we provided a rigorous explanation of why - and when - word analogies can be solved using vector algebra. More specifically, we proved that an analogy holds over a set of word pairs in an SGNS or GloVe embedding space with no reconstruction error i.f.f.  the co-occurrence shifted PMI is the same for every word pair. Our theory had three implications. … [See comments below.] … Most importantly, our theory did not make the unrealistic assumptions that past theories have made about the word distribution and vector space, making it much more tenable than previous explanations.”

  • [discussion: Hacker News] (a toy numerical sketch of the csPMI quantity follows this quoted comment, below): “Hi, first author here! Feel free to ask any questions. TL;DR: We prove that linear word analogies hold over a set of ordered pairs (e.g., $\small \{(\text{Paris}, \text{France}), (\text{Ottawa}, \text{Canada}), \ldots\}$) in an SGNS or GloVe embedding space with no reconstruction error when $\small \text{PMI}(x,y) + \log p(x,y)$ is the same for every word pair $\small (x,y)$. We call this term the csPMI (co-occurrence shifted PMI). This has a number of interesting implications:

    1. It implies that Pennington et al. [Socher; Manning] (authors of GloVe) had the right intuition about why these analogies hold.
    2. Adding two word vectors together to compose them makes sense, because you’re implicitly downweighting the more frequent word – like TF-IDF or SIF would do explicitly.
    3. Using Euclidean distance to measure word dissimilarity makes sense because the Euclidean distance is a linear function of the negative csPMI.”
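
A toy numerical sketch of the csPMI quantity itself (not the paper's experiments): compute $\small \text{PMI}(x,y) + \log p(x,y)$ from a small co-occurrence count matrix; the theorem says an analogy over a set of word pairs holds exactly when this quantity is the same for every pair.

    # csPMI(x, y) = PMI(x, y) + log p(x, y), from toy co-occurrence counts.
    import numpy as np

    counts = np.array([[20.,  5.],            # rows: words x, cols: words y
                       [ 4., 12.]])
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    pmi = np.log(p_xy / (p_x * p_y))
    cspmi = pmi + np.log(p_xy)
    print(cspmi)    # analogies over pairs hold (exactly) iff these values match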

[Table of Contents]

Memory Based Architectures

For a more detailed description of how neural networks “learn” see my blog post How do Neural Networks "Remember"?  In essence, the answer is that memory forms during the training of the parameters (i.e., the trained weights); the matrix of trained weights is the memory.



Memory (the ability to recall previous facts and knowledge) is a crucial requirement for natural language understanding, reasoning (the process of forming an answer to a new question by manipulating previously acquired knowledge), and the guidance of decision making. Without memory, agents must act reflexively according only to their immediate percepts and cannot execute plans that occur over extended time intervals (Neural Map: Structured Memory for Deep Reinforcement Learning).

Broadly speaking, computational approaches to memory include:

  • internal, volatile “short-term” memories algorithmically generated within RNN, LSTM, and self-attention modules;

  • external, volatile memories algorithmically generated by neural Turing machines, memory networks, and differentiable neural computers; and

  • external, permanent long-term “memories” embedded within knowledge bases and knowledge graphs (for relevant discussion, see my Text Grounding: Mapping Text to Knowledge Graphs; External Knowledge Lookup subsection).

Neural Architectures with Memory  [local copy] provides an excellent overview of neural memory architectures.

Short term memory architectures are commonly employed in the various models discussed in this REVIEW. For example, RNN, LSTM, dynamic memory networks (DMN), etc. serve as “working memory” in summarization, question answering and other tasks. Long short-term memory (LSTM) networks are a specialized type of recurrent neural network (RNN) that are capable of learning long term dependencies as well as short term memories of recent transactions.

However, most machine learning models lack an easy way to read and write to part of a (potentially very large) long-term memory component, and to combine this seamlessly with inference. While RNN can be trained to predict the next word to output after reading a stream of words, their memory (encoded by hidden states and weights) is typically too small and is not compartmentalized enough to accurately remember facts from the past (as the knowledge is compressed into dense vectors, from which those memories are not easily accessed). RNNs are also known to have difficulty in performing memorization, for example the simple copying task of outputting the same input sequence they have just read.

Neural networks that utilize external memories can be classified into two main categories: memories with write operators, and those without (Neural Map: Structured Memory for Deep Reinforcement Learning). Regarding the latter type, memory networks (MemNN, introduced by Jason Weston et al. at Facebook AI Research) are a class of deep networks that jointly learn how to reason with inference components combined with a long-term memory component that can be written to and read from, with the goal of using it for prediction. Instead of using a recurrent matrix to retain information through time, memory networks learn how to operate effectively with the memory component.

Memory networks employ an explicit, addressable memory that fixes which memories are stored. For example, at each time step, the memory network would store the past $\small M$ states that have been seen in an environment. Therefore, what is learned by the network is how to access or read from this fixed memory pool, rather than what contents to store within it. In sidestepping the difficulty of learning what information to store in memory, memory networks introduce two main disadvantages: storing a potentially significant amount of redundant information; and, relying on domain experts to choose what to store in the memory (Neural Map: Structured Memory for Deep Reinforcement Learning). The memory network approach has been successful in language modeling and question answering, and was shown to be a successful memory for deep reinforcement learning agents in complex 3D environments (Neural Map: Structured Memory for Deep Reinforcement Learning and references therein).

  • Tracking the World State with Recurrent Entity Networks (May 2017) [OpenReview; non-author code here and here], by Jason Weston and Yann LeCun, introduced the Recurrent Entity Network (EntNet). EntNet was equipped with a dynamic long-term memory, which allowed it to maintain and update a representation of the state of the world as it received new data. For language understanding tasks, it could reason on the fly as it read text, not just when it was required to answer a question or respond, as was the case for Jason Weston’s MemN2N memory network (“End-To-End Memory Networks ”). Like a neural Turing machine or differentiable neural computer, EntNet maintained a fixed size memory and could learn to perform location and content-based read and write operations. However, unlike those models, it had a simple parallel architecture in which several memory locations could be updated simultaneously. EntNet set a new state of the art on the bAbI tasks, and was the first method to solve all the tasks in the 10k training examples setting. Weston and LeCun also demonstrated that EntNet could solve a reasoning task which required a large number of supporting facts, which other methods were not able to solve, and could generalize past its training horizon.

In contrast to memory networks, external neural memories having write operations are potentially far more efficient, since they can learn to store salient information for unbounded time steps and ignore any other useless information, without explicitly needing any knowledge a priori on what to store. A prominent research direction on write-based architectures has been recurrent architectures that mimic computer memory systems that explicitly separate memory from computation, analogous to how a CPU (processor/controller) interacts with an external memory (tape; RAM) in digital computers. One such model, the Differentiable Neural Computer (DNC) – and its predecessor the Neural Turing Machine (NTM) – structure the architecture to explicitly separate memory from computation. The DNC has a recurrent neural controller that can access an external memory resource by executing differentiable read and write operations. This allows the DNC to act and memorize in a structured manner resembling a computer processor, where read and write operations are sequential and data is stored separately from computation. The DNC has been used successfully to solve complicated algorithmic tasks, such as finding shortest paths in a graph or querying a database for entity relations.
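
The differentiable read at the heart of the NTM/DNC is content-based addressing: compare a read key against every memory slot, turn the similarities into a softmax address distribution, and return the weighted sum of slots. A minimal PyTorch sketch (interfaces greatly simplified; real DNCs add write heads, usage vectors and temporal link matrices):

    # Content-based read from an external memory matrix.
    import torch
    import torch.nn.functional as F

    def content_read(memory, key, beta):
        """memory: (N, W) slots; key: (W,) read key; beta: sharpness scalar."""
        scores = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)  # (N,)
        weights = torch.softmax(beta * scores, dim=0)    # soft address distribution
        return weights @ memory                          # (W,) read vector

    memory = torch.randn(128, 64)                        # 128 slots of width 64
    read_vec = content_read(memory, key=torch.randn(64), beta=torch.tensor(5.0))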

NTM-DNC.png

There has been extensive work in the NLP domain regarding the use of neural Turing machines  (NTM) and, to a lesser extent, differentiable neural computers  (DNC). For a slightly dated (current to ~2017) summary listing of NTM and DNC, see my web page (this is a huge file: on slow connections, wait for the page to fully load). Notable among those papers are the following items.

  • Survey of Reasoning using Neural Networks (Mar 2017) provided an excellent summary – including relevant background – of neural network approaches to reasoning and inference, with a focus on the need for memory networks (e.g. the MemN2N end-to-end memory network, and large external memories). Among the algorithms surveyed and compared were an LSTM, an NTM with an LSTM controller, and an NTM with a feedforward controller (demonstrating the superior performance of the NTM over the LSTM).

  • The model described in Robust and Scalable Differentiable Neural Computer for Question Answering (Jul 2018) was designed as a general problem solver which could be used in a wide range of tasks. Their GitHub repository contains an implementation of an Advanced Differentiable Neural Computer (ADNC), providing more robust and scalable use in question answering.

    arxiv-1807.02658c.png

    [Image source. Click image to open in new window.]


LSTMs were used in Augmenting End-to-End Dialog Systems with Commonsense Knowledge (Feb 2018), which investigated the impact of providing commonsense knowledge about concepts (integrated as external memory) on human-computer conversation. Their method was based on a NIPS 2015 workshop paper, Incorporating Unstructured Textual Knowledge Sources into Neural Dialogue Systems, which described a method to leverage additional information about a topic using a simple combination of hashing and TF-IDF to quickly identify the most relevant portions of text from the external knowledge source, based on the current context of the dialogue. In that work, three recurrent neural networks (RNNs) were trained: one to encode the selected external information, one to encode the context of the conversation, and one to encode a response to the context. Outputs of these modules were combined to produce the probability that the response was the actual next utterance given the context.
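
The retrieval step described above is easy to sketch: build a TF-IDF index over the external knowledge snippets and rank them by cosine similarity against the current dialogue context. A hedged scikit-learn sketch (the knowledge snippets and query strings are invented for illustration):

    # TF-IDF retrieval of the most relevant external knowledge snippet.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    knowledge = ["Aspirin is used to reduce pain and fever.",
                 "Borussia Dortmund is a football club in Germany.",
                 "The mitochondrion is the powerhouse of the cell."]
    context = ["which club does the forward play for"]

    vectorizer = TfidfVectorizer()
    K = vectorizer.fit_transform(knowledge)
    q = vectorizer.transform(context)
    scores = cosine_similarity(q, K)[0]
    best = knowledge[scores.argmax()]        # most relevant snippet for the dialogue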

[Table of Contents]

Attention and Memory

Jason Weston et al. (Facebook AI Research) introduced Memory Networks (MemNN) in Oct 2014 (updated Nov 2015).

arxiv-1410.3916.png

[Image source. Click image to open in new window.]


Although that paper lacked a schematic, the memory network architecture is well described in the paper and in the following image:

memory_network

["memory network (MemNN)" (image source; click image to open in new window)]


  • A memory network consists of a memory $\small \mathbf{m}$ (an array of objects (for example an array of vectors or an array of strings) indexed by $\small \mathbf{m}_i$) and four (potentially learned) components $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$ and $\small \mathbf{R}$ as follows:

    • $\small \mathbf{I}$ (input feature map): converts the incoming input to the internal feature representation.
    • $\small \mathbf{G}$ (generalization): updates old memories given the new input. We call this generalization as there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use.
    • $\small \mathbf{O}$ (output feature map): produces a new output (in the feature representation space), given the new input and the current memory state.
    • $\small \mathbf{R}$ (response): converts the output into the response format desired. For example, a textual response or an action.
  • $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$ and $\small \mathbf{R}$ can all potentially be learned components and make use of any ideas from the existing machine learning literature. In question answering systems, for example, the components may be instantiated as follows:

    • $\small \mathbf{I}$ can make use of standard pre-processing such as parsing, coreference, and entity resolution. It could also encode the input into an internal feature representation by converting from text to a sparse or dense feature vector.
    • The simplest form of $\small \mathbf{G}$ is to store $\small \mathbf{I}(\mathbf{x})$ in a “slot” in the memory: $\small \mathbf{m}_{\mathbf{H}(\mathbf{x})} = \mathbf{I}(\mathbf{x})$, where $\small \mathbf{H}(\cdot)$ is a function selecting the slot. That is, $\small \mathbf{G}$ updates the index $\small \mathbf{H}(\mathbf{x})$ of $\small \mathbf{m}$, but all other parts of the memory remain untouched. Restated: the simplest form of $\small \mathbf{G}$ is to introduce a function $\small \mathbf{H}$ which maps the internal feature representation produced by $\small \mathbf{I}$ to an individual memory slot, and just updates the memory at $\small \mathbf{H(I(x))}$.
    • $\small \mathbf{O}$ reads from memory and performs inference to deduce the set of relevant memories needed to perform a good response.
    • $\small \mathbf{R}$ would produce the actual wording of the question-answer based on the memories found by $\small \mathbf{O}$. For example, $\small \mathbf{R}$ could be an RNN conditioned on the output of $\small \mathbf{O}$.
  • Note that the original memory network (MemNN, above) lacked an attention mechanism.

  • When the components $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$, & $\small \mathbf{R}$ (above) are neural networks, the authors (Weston et al.) described the resulting system as a memory neural network; the end-to-end variant, MemN2N (below), was built for QA (question answering) problems.
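
The $\small \mathbf{I}/\mathbf{G}/\mathbf{O}/\mathbf{R}$ decomposition can be made concrete with a deliberately naive, non-neural sketch of the interface: $\small \mathbf{I}$ featurizes the input, $\small \mathbf{G}$ appends it to the next memory slot, $\small \mathbf{O}$ returns the best-matching memory, and $\small \mathbf{R}$ surfaces it as the response. The hash-seeded random featurizer below is a placeholder (so the retrieval is not semantically meaningful); it illustrates the plumbing, not a trained model.

    # Toy, non-neural skeleton of the MemNN I/G/O/R interface.
    import numpy as np

    class ToyMemNN:
        def __init__(self, dim=64):
            self.memory = []                          # m: list of (text, vector) slots
            self.dim = dim
        def I(self, text):                            # input feature map (placeholder)
            rng = np.random.default_rng(abs(hash(text)) % 2**32)
            return text, rng.standard_normal(self.dim)
        def G(self, item):                            # generalization: write to a new slot
            self.memory.append(item)
        def O(self, query_vec):                       # output: best-matching memory
            scores = [vec @ query_vec for _, vec in self.memory]
            return self.memory[int(np.argmax(scores))]
        def R(self, memory_item):                     # response: surface form
            return memory_item[0]

    net = ToyMemNN()
    net.G(net.I("Mary went to the kitchen."))
    net.G(net.I("John picked up the ball."))
    print(net.R(net.O(net.I("Where is Mary?")[1])))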

The highly cited MemN2N architecture (End-To-End Memory Networks (Nov 2015) [code;  non-author code here, here and here;  discussion here and here]), introduced by Jason Weston and colleagues at Facebook AI Research, is a recurrent attention model over an external memory. The model involved multiple computational steps (termed “hops”) per output symbol. In this RNN architecture, the recurrence read from a possibly large external memory multiple times before outputting a symbol. The architecture was trained end-to-end and hence required significantly less supervision during training; the flexibility of the model allowed them to apply it to tasks as diverse as synthetic question answering and language modeling.

arxiv-1503.08895d.png

[Image source. Click image to open in new window.]


For question answering MemN2N was competitive with memory networks but with less supervision; for language modeling, MemN2N demonstrated performance comparable to RNN and LSTM on the Penn Treebank and Text8 datasets. In both cases they showed that the key concept of multiple computational hops yielded improved results. Unlike a traditional RNN, the average activation weight of memory positions during the memory hops did not decay exponentially: it had roughly the same average activation across the entire memory (Fig. 3 in the image, above), which may have been the source of the observed improvement in language modeling.

“We also vary the number of hops and memory size of our MemN2N, showing the contribution of both to performance; note in particular that increasing the number of hops helps. In Fig. 3, we show how MemN2N operates on memory with multiple hops. It shows the average weight of the activation of each memory position over the test set. We can see that some hops concentrate only on recent words, while other hops have more broad attention over all memory locations, which is consistent with the idea that successful language models consist of a smoothed n-gram model and a cache. Interestingly, it seems that those two types of hops tend to alternate. Also note that unlike a traditional RNN, the cache does not decay exponentially: it has roughly the same average activation across the entire memory. This may be the source of the observed improvement in language modeling.”

MemN2N.png

[Image source. Click image to open in new window.]


Here is the MemN2N architecture, from the paper (End-To-End Memory Networks):

MemN2N-arxiv-1503.08895.png

[Image source. Click image to open in new window]


  • “Our model takes a discrete set of inputs $\small x_1, \ldots, x_n$ that are to be stored in the memory, a query $\small q$, and outputs an answer $\small a$. Each of the $\small x_i$, $\small q$, and $\small a$ contains symbols coming from a dictionary with $\small V$ words. The model writes all $\small x$ to the memory up to a fixed buffer size, and then finds a continuous representation for the $\small x$ and $\small q$. The continuous representation is then processed via multiple hops to output $\small a$. This allows backpropagation of the error signal through multiple memory accesses back to the input during training.”

  • “… The entire set of $\small \{x_i\}$ are converted into memory vectors $\small \{m_i\}$ of dimension $\small d$ computed by embedding each $\small x_i$ in a continuous space, in the simplest case, using an embedding matrix $\small A$ (of size $\small d \times V$). …”  ←  i.e., the vectorized input is stored as external memory
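
A single MemN2N memory hop, as described in the quoted passages, can be sketched in a few lines of PyTorch: embed the stories with matrices $\small A$ (memory) and $\small C$ (output), embed the query with $\small B$, attend over the memories with a softmax, and add the read vector to the query state. The dimensions and bag-of-words encoding below are toy choices; the full model stacks several such hops and adds positional/temporal encodings.

    # One MemN2N memory hop with untrained toy embeddings.
    import torch, torch.nn as nn

    V, d, n = 50, 32, 10                       # vocab size, embedding dim, #memories
    A = nn.Embedding(V, d)                     # memory (input) embedding
    C = nn.Embedding(V, d)                     # memory (output) embedding
    B = nn.Embedding(V, d)                     # query embedding
    W = nn.Linear(d, V)                        # final answer projection

    story = torch.randint(0, V, (n, 6))        # n sentences of 6 word ids each
    query = torch.randint(0, V, (1, 3))

    m = A(story).sum(dim=1)                    # (n, d) memory vectors (bag of words)
    c = C(story).sum(dim=1)                    # (n, d) output vectors
    u = B(query).sum(dim=1)                    # (1, d) query state
    p = torch.softmax(u @ m.t(), dim=-1)       # (1, n) attention over memories
    o = p @ c                                  # (1, d) read vector
    answer_logits = W(o + u)                   # repeated for k hops in the full model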

A recent paper from DeepMind, Relational Recurrent Neural Networks (Jun 2018) [code; discussion here and here], is also of interest with regard to language modeling and reasoning over natural language text. While memory based neural networks model temporal data by leveraging an ability to remember information for long periods, it is unclear whether they also have an ability to perform complex relational reasoning with the information they remember. In this paper the authors first confirmed their intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected (i.e., tasks involving relational reasoning). They then improved upon these deficits by using a new memory module, a Relational Memory Core (RMC; Fig. 1 in that paper), which showed large gains in reinforcement learning domains as well as in language modeling (Sections 4.3 and 5.4).

arxiv-1806.01822a.png

[Image source. Click image to open in new window.]


arxiv-1806.01822d.png

[Image source. Click image to open in new window.]


  • Critique. While the DeepMind RMC model combined features of dynamic memory with an attention mechanism similar to Jason Weston’s DMN+ model (cited), they neither discuss nor compare the two models. Disappointingly, the DeepMind paper lacks ablation studies or other work needed to better understand their model: “… we cannot necessarily make any concrete claims as to the causal influence of our design choices on the model’s capacity for relational reasoning, or as to the computations taking place within the model and how they may map to traditional approaches for thinking about relational reasoning. Thus, we consider our results primarily as evidence of improved function – if a model can better solve tasks that require relational reasoning, then it must have an increased capacity for relational reasoning, even if we do not precisely know why it may have this increased capacity. ”

  • As shown in the first image, above, the RMC module employs multi-head dot product attention (MHDPA) – Google’s Transformer seq2seq self-attention mechanism.
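
For reference, below is a minimal PyTorch sketch of multi-head dot-product attention applied to a set of memory slots – the operation the RMC borrows from the Transformer. The head count and dimensions are illustrative, and the output projection and residual/gating machinery of the full RMC are omitted.

    # Multi-head dot-product attention (MHDPA) over N memory slots.
    import torch, torch.nn as nn

    def mhdpa(memory, n_heads, Wq, Wk, Wv):
        """memory: (N, d); Wq/Wk/Wv: Linear(d, d); returns an updated (N, d) memory."""
        N, d = memory.shape
        dh = d // n_heads
        q = Wq(memory).view(N, n_heads, dh).transpose(0, 1)     # (heads, N, dh)
        k = Wk(memory).view(N, n_heads, dh).transpose(0, 1)
        v = Wv(memory).view(N, n_heads, dh).transpose(0, 1)
        att = torch.softmax(q @ k.transpose(-2, -1) / dh ** 0.5, dim=-1)
        out = att @ v                                           # (heads, N, dh)
        return out.transpose(0, 1).reshape(N, d)                # concatenate heads

    d = 64
    Wq, Wk, Wv = (nn.Linear(d, d) for _ in range(3))
    new_memory = mhdpa(torch.randn(8, d), n_heads=4, Wq=Wq, Wk=Wk, Wv=Wv)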

An aside regarding this DeepMind Relational Recurrent Neural Networks paper: another DeepMind paper, Relational Deep Reinforcement Learning  [discussion] (released at the same time) introduced an approach to deep reinforcement learning that improved upon the efficiency, generalization capacity, and interpretability of conventional approaches through structured perception and relational reasoning. It used the computationally efficient MHDPA self-attention model to iteratively reason about the relations between entities in a scene and to guide a model-free policy. In these models entity-entity relations are explicitly computed when considering the messages passed between connected nodes of the graph (i.e. the relations between entities in a scene). MHDPA computes interactions between those entities (attention weights); an (underlying) graph defines the path to a solution, with the attention weights driving the solution. [Very cool.]

arxiv-1806.01830d.png

[Image source. Click image to open in new window.]


  • This takes a minute to explain, but it’s a very neat game/task.

“Box-World” is a perceptually simple but combinatorially complex environment that requires abstract relational reasoning and planning. It consists of a 12 x 12 pixel room with keys and boxes randomly scattered. The room also contains an agent, represented by a single dark gray pixel, which can move in four directions: up, down, left, right. Keys are represented by a single colored pixel. The agent can pick up a loose key (i.e., one not adjacent to any other colored pixel) by walking over it. Boxes are represented by two adjacent colored pixels – the pixel on the right represents the box’s lock and its color indicates which key can be used to open that lock; the pixel on the left indicates the content of the box which is inaccessible while the box is locked.

To collect the content of a box the agent must first collect the key that opens the box (the one that matches the lock’s color) and walk over the lock, which makes the lock disappear. At this point the content of the box becomes accessible and can be picked up by the agent. Most boxes contain keys that, if made accessible, can be used to open other boxes. One of the boxes contains a gem, represented by a single white pixel. The goal of the agent is to collect the gem by unlocking the box that contains it and picking it up by walking over it. Keys that an agent has in possession are depicted in the input observation as a pixel in the top-left corner.

arxiv-1806.01830a.png

[Image source. Click image to open in new window.]


In each level there is a unique sequence of boxes that need to be opened in order to reach the gem. Opening one wrong box (a distractor box) leads to a dead-end where the gem cannot be reached and the level becomes unsolvable. There are three user-controlled parameters that contribute to the difficulty of the level: (1) the number of boxes in the path to the goal (solution length); (2) the number of distractor branches; (3) the length of the distractor branches. In general, the task is computationally difficult for a few reasons. First, a key can only be used once, so the agent must be able to reason about whether a particular box is along a distractor branch or along the solution path. Second, keys and boxes appear in random locations in the room, emphasising a capacity to reason about keys and boxes based on their abstract relations, rather than based on their spatial positions.

Figure 4 shows a trial run along with the visualization of the attention weights. For one of the attention heads, each key attends mostly to the locks that can be unlocked with that key. In other words, the attention weights reflect the options available to the agent once a key is collected. For another attention head, each key attends mostly to the agent icon. This suggests that it is relevant to relate each object with the agent, which may, for example, provide a measure of relative position and thus influence the agent’s navigation.

arxiv-1806.01830c.png

[Image source. Click image to open in new window.]


An Interpretable Reasoning Network for Multi-Relation Question Answering (Jun 2018) [code] is another very interesting paper, which addressed multi-relation question answering via elaborated analysis of questions and reasoning over multiple fact triples in a knowledge base. They presented a novel Interpretable Reasoning Network (IRN) model that employed an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decided which part of the input question should be analyzed at each hop, for which the reasoning module predicted a knowledge base relation (relation triple) that corresponded to the current parsed result. The predicted relation was used to update the question representation as well as the state of the reasoning module, and helped the model perform the next reasoning hop. At each hop, an entity was predicted based on the current state of the reasoning module.
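To make the hop-by-hop process concrete, here is a heavily simplified Python sketch in the spirit of IRN. The relation scoring, question update and KB lookup below are my own illustrative stand-ins (hypothetical names and update rules), not the authors’ released code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def irn_style_reasoning(q_vec, start_entity, rel_emb, ent_emb, kb_follow, n_hops=3):
    """Very simplified hop-by-hop reasoning in the spirit of IRN.

    q_vec:     vector encoding of the question
    rel_emb:   dict relation_name -> vector
    ent_emb:   dict entity_name -> vector
    kb_follow: function (entity, relation) -> next entity (a KB lookup)
    """
    state = ent_emb[start_entity].copy()
    entity = start_entity
    trace = []
    for _ in range(n_hops):
        # score every KB relation against the part of the question not yet analyzed
        rels = list(rel_emb)
        scores = softmax(np.array([rel_emb[r] @ (q_vec + state) for r in rels]))
        relation = rels[int(scores.argmax())]      # predicted relation for this hop
        q_vec = q_vec - rel_emb[relation]          # remove the analyzed part of the question
        state = state + rel_emb[relation]          # update the reasoning-module state
        entity = kb_follow(entity, relation)       # predict the next entity
        trace.append((relation, entity))           # traceable intermediate prediction
    return entity, trace
```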

arxiv-1801.04726a.png

[Image source. Click image to open in new window.]


arxiv-1801.04726b.png

[Image source. Click image to open in new window.]


arxiv-1801.04726c.png

[Image source. Click image to open in new window.]


  • IRN yielded state of the art results on two datasets. More interestingly, and unlike previous models, IRN offered traceable and observable intermediate predictions (see their Fig. 3), facilitating reasoning analysis and failure diagnosis (thereby also allowing manual manipulation in answer prediction). Whereas single-relation questions such as “How old is Obama?” can be answered by finding one fact triple in a knowledge base/graph (a task that has been widely studied), this work addressed multi-relation QA. Reasoning over multiple fact triples was required to answer multi-relation questions such as “Name a soccer player who plays at forward position at the club Borussia Dortmund.”, where more than one entity and relation are mentioned.

    On the datasets evaluated, IRN outperformed other baseline models such as Weston’s MemN2N model (see Table 2 in the IRN paper). Through vector (space) representations, IRN could also establish reasonable mappings between knowledge base relations and natural language, such as linking “profession” to words like “working”, “profession”, and “occupation” (see their Table 4), which addresses the issue of out-of-vocabulary (OOV) words.

Working memory is an essential component of reasoning – the process of forming an answer to a new question by manipulating previously acquired knowledge. Memory modules are often implemented as a set of memory slots without explicit relational exchange of content, which does not naturally match multi-relational domains in which data is structured as graphs. Relational Dynamic Memory Networks (Aug 2018) designed a new model, the Relational Dynamic Memory Network (RDMN), to fill this gap. The memory could have single or multiple components, each of which realized a multi-relational graph of memory slots. The memory, dynamically updated during the reasoning process, was controlled by a central controller. The architecture is shown in their Fig. 1 (RDMN with single-component memory): at the first step, the controller reads the query and the memory is initialized from the input graph, one node embedding per memory cell; during the reasoning process, the controller iteratively reads from and writes to the memory; finally, the controller emits the output. RDMN performed well on several domains, including molecular bioactivity and chemical reactions.
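The controller/memory read-write loop can be caricatured as follows. This is a structural sketch only, under assumed attention-based reads and a crude neighbour-mixing write rule; it is not the RDMN implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rdmn_style_forward(query_vec, node_embeddings, adjacency, n_steps=4):
    """Schematic controller/memory loop: one memory cell per graph node."""
    memory = node_embeddings.copy()          # memory initialized from the input graph
    state = query_vec.copy()                 # controller initialized from the query
    for _ in range(n_steps):
        # read: attention of the controller over memory cells
        attn = softmax(memory @ state)
        read = attn @ memory
        # write: each cell mixes its content with its graph neighbours and the controller
        neighbour_mix = adjacency @ memory / np.maximum(adjacency.sum(1, keepdims=True), 1)
        memory = np.tanh(memory + neighbour_mix + np.outer(attn, state))
        # controller update (a crude stand-in for a gated RNN update)
        state = np.tanh(state + read)
    return state                              # fed to an output layer in the real model

# toy usage: a 5-node graph with 8-dimensional node embeddings
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) > 0.6).astype(float)
out = rdmn_style_forward(rng.normal(size=8), rng.normal(size=(5, 8)), A)
print(out.shape)  # (8,)
```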

  • Their Discussion provides an excellent summary (paraphrased here) that is relevant to this REVIEW:

    “The problem studied in this paper belongs to a broader program known as machine reasoning: unlike the classical focus on symbolic reasoning, here we aim for a learnable neural reasoning capability. We wish to emphasize that RDMN is a general model for answering any query about graph data. While the evaluation in this paper is limited to function calls graph, molecular bioactivity and chemical reaction, RDMN has a wide range of potential applications. For example, a drug (query) may act on the network of proteins as a whole (relational memory). In recommender systems, user can be modeled as a multi-relational graph (e.g., network between purchased items, and network of personal contacts); and query can be anything about them (e.g., preferred attributes or products). Similarly in healthcare, patient medical record can be modeled as multi-relational graphs about diseases, treatments, familial and social contexts; and query can be anything about the presence and the future of health conditions and treatments.”

    arxiv-1808.04247a.png

    [Image source. Click image to open in new window.]


    arxiv-1808.04247b.png

    [Image source. Click image to open in new window.]


    This work builds on preliminary work, described in Graph Memory Networks for Molecular Activity Prediction (Jan 2018) [non-author code].

Collectively, the works discussed above suggest that:

Memory-augmented neural networks such as MemN2N solve a compartmentalization problem with a slot-based memory matrix, but may have a harder time allowing memories to interact/relate with one another once they are encoded; LSTMs, by contrast, pack all information into a common hidden memory vector, potentially making compartmentalization and relational reasoning more difficult (Relational Recurrent Neural Networks).

Denny Britz provided an excellent discussion of attention vs. memory in Attention and Memory in Deep Learning and NLP (Jan 2016). Also, Attention in Long Short-Term Memory Recurrent Neural Networks (Jun 2017) discussed a limitation of LSTM-based encoder-decoder architectures (i.e., fixed-length internal representations of the input sequence – note, e.g., ELMo) that attention mechanisms overcome: allowing the network to learn where to pay attention in the input sequence for each item in the output sequence.

Particularly relevant to this REVIEW are the examples of attention in textual entailment (drawn from the DeepMind paper Reasoning about Entailment with Neural Attention (Mar 2016) [non-author code here and here]) and text summarization (drawn from Jason Weston’s A Neural Attention Model for Abstractive Sentence Summarization) – the benefits of which are immediately obvious upon reviewing that work.

arxiv-1509.06664a.png

[Image source. Click image to open in new window.]


arxiv-1509.06664b.png

[Image source. Click image to open in new window.]


Also relevant to this discussion, in Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum (May 2018) researchers at the Paul G. Allen School (University of Washington) discussed LSTM vs. self-attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTMs: LSTMs are a hybrid of S-RNN (simple RNN) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Thus, the LSTM gates themselves are powerful recurrent models that provide more representational power than previously realized (a minimal numerical sketch of this view follows the list below). They noted that:

  1. The LSTM weights are vectors, while attention typically computes scalar weights; i.e., a separate weighted sum is computed for every dimension of the LSTM’s memory cell;

  2. The weighted sum is accumulated with a dynamic program. This enables a linear rather than quadratic complexity in comparison to self-attention, but reduces the amount of parallel computation. This accumulation also creates an inductive bias of attending to nearby words, since the weights can only decrease over time.

  3. Attention has a probabilistic interpretation due to the softmax normalization, while the sum of weights in LSTM can grow up to the sequence length. In variants of the LSTM that tie the input and forget gate, such as coupled-gate LSTM and GRU, the memory cell instead computes a weighted average with a probabilistic interpretation. These variants compute locally normalized distributions via a product of sigmoids rather than globally normalized distributions via a single softmax.
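A minimal numerical check of the weighted-sum view (my own illustration, not the authors’ code): unrolling the memory-cell recurrence $\small c_t = f_t \odot c_{t-1} + i_t \odot \tilde{x}_t$ shows that the final cell state is an element-wise weighted sum of all candidate inputs, with per-dimension weights given by products of gate values (so the weights can only decrease over time, as noted in point 2 above).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 3                                  # sequence length, cell dimension
i = rng.random((T, d))                       # input-gate values  i_t
f = rng.random((T, d))                       # forget-gate values f_t
x_tilde = rng.normal(size=(T, d))            # candidate cell inputs (content)

# standard recurrence: c_t = f_t * c_{t-1} + i_t * x_tilde_t
c = np.zeros(d)
for t in range(T):
    c = f[t] * c + i[t] * x_tilde[t]

# equivalent "dynamically computed element-wise weighted sum":
# the weight on x_tilde_j is i_j * f_{j+1} * ... * f_{T-1} (a vector weight per dimension)
weights = np.stack([i[j] * np.prod(f[j + 1:], axis=0) for j in range(T)])
c_as_weighted_sum = (weights * x_tilde).sum(axis=0)

print(np.allclose(c, c_as_weighted_sum))     # True
```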

They concluded:

  • “Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicate that LSTMs suffer little to no performance loss when removing the S-RNN. This provides evidence that the gating mechanism is doing the heavy lifting in modeling context. We further ablate the recurrence in each gate and find that this incurs only a modest drop in performance, indicating that the real modeling power of LSTMs stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allows us to mathematically relate LSTMs and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enables the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.”


In the recent language modeling domain, whereas ELMo employs stacked Bi-LSTMs and ULMFiT employs stacked LSTMs [with no attention, shortcut connections (i.e., residual layers) or other sophisticated additions], OpenAI’s Finetuned Transformer LM is built on Google’s Transformer architecture – a simple network architecture based solely on attention mechanisms that entirely dispenses with recurrence and convolutions, yet attains state of the art results. The Transformer surpassed the state of the art on neural machine translation tasks and generalized well to other tasks; based entirely on attention, it replaced recurrence with multi-head attention consisting of multiple attention layers.

In July 2018, nearly a year after they introduced their original “Attention Is All You Need” Transformer architecture (Jun 2017; updated Dec 2017), Google Brain/DeepMind released an updated Universal Transformer version, discussed in the Google AI blog post Moving Beyond Translation with the Universal Transformer [Aug 2018;  discussion]:

arxiv-1807.03819b.png

[Image source. Click image to open in new window.]


arxiv-1807.03819a.png

[Image source (there is a more detailed schematic in Appendix A in that paper). Click image to open in new window.]


  • “In Universal Transformer [code, described in Tensor2Tensor for Neural Machine Translation] we extend the standard Transformer to be computationally universal (Turing complete) using a novel, efficient flavor of parallel-in-time recurrence which yields stronger results across a wider range of tasks. We built on the parallel structure of Transformer to retain its fast training speed, but we replaced Transformer’s fixed stack of different transformation functions with several applications of a single, parallel-in-time recurrent transformation function (i.e. the same learned transformation function is applied to all symbols in parallel over multiple processing steps, where the output of each step feeds into the next).

    “Crucially, where an RNN processes a sequence symbol-by-symbol (left to right), Universal Transformer processes all symbols at the same time (like the Transformer), but then refines its interpretation of every symbol in parallel over a variable number of recurrent processing steps using self-attention. This parallel-in-time recurrence mechanism is both faster than the serial recurrence used in RNN, and also makes the Universal Transformer more powerful than the standard feedforward Transformer. …”

The performance benchmarks for Universal Transformer on the bAbI dataset (especially the more difficult “10k examples” setting) are particularly impressive (Table 1 in their paper; note also the MemN2N comparison). Appendix C shows the bAbI attention visualizations, of which the last example is particularly impressive (requiring three supporting facts to solve).
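The parallel-in-time recurrence quoted above can be caricatured in a few lines of NumPy: one shared (self-attention + transition) block is applied to all symbols in parallel, repeatedly over processing steps. This is a structural sketch only (no per-position/per-step embeddings, no layer normalization, no dynamic halting), not the Tensor2Tensor implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    return softmax(Q @ K.T / np.sqrt(H.shape[-1])) @ V

def universal_transformer_style(H, Wq, Wk, Wv, W1, W2, n_steps=4):
    """Apply one shared (attention + transition) block repeatedly to all positions."""
    for _ in range(n_steps):                   # recurrence over processing steps, not positions
        H = H + self_attention(H, Wq, Wk, Wv)  # all symbols refined in parallel
        H = H + np.maximum(H @ W1, 0) @ W2     # shared position-wise transition function
    return H

# toy usage: 7 symbols, model width 16
rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(16, 32)) * 0.1, rng.normal(size=(32, 16)) * 0.1
print(universal_transformer_style(H, Wq, Wk, Wv, W1, W2).shape)  # (7, 16)
```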

In August 2018 Google AI followed their Universal Transformers paper with Character-Level Language Modeling with Deeper Self-Attention [discussion], which showed that a deep (64-layer) Transformer model with fixed context outperformed RNN variants by a large margin, achieving state of the art on two popular benchmarks.

arxiv-1808.04444a.png

[Image source. Click image to open in new window.]


arxiv-1808.04444b.png

[Image source. Click image to open in new window.]


  • LSTM and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts.

  • While code is not yet released (2018-08-16), it will likely appear in Google’s TensorFlow tensor2tensor GitHub repository, “home” of their Transformer code.

  • For reference, in Learning to Generate Reviews and Discovering Sentiment (Apr 2017) OpenAI also trained a byte (character)-level RNN-based language model (a single-layer multiplicative LSTM with 4096 units, trained for a single epoch on an Amazon product review dataset); even with data-parallelism across 4 Pascal Titan X GPUs, training took approximately one month.

    • However, RNNs handle input sequences sequentially, word by word, which is an obstacle to parallelization (CNNs can process positions within a layer in parallel, but need many stacked layers to relate distant positions). I am unsure how long it takes to train Google’s Transformer algorithm, which achieves parallelization by replacing recurrence with attention and encoding each symbol’s position in the sequence, leading to significantly shorter training times (The Transformer – Attention is All You Need).

      This GitHub Issue discusses parallelization over GPUs and training times, etc., indicating that results are GPU-number and batch size dependent. The Annotated Transformer also discusses this; under their setup (8 NVIDIA P100 GPUs; parametrization; …), they trained the base models for a total of 100,000 steps (12 hrs); big models were trained for 300,000 steps (3.5 days).

Like Google, Facebook AI Research has developed a seq2seq-based self-attention mechanism to model long-range context (Hierarchical Neural Story Generation (May 2018) [code/pretrained models;  discussion]), demonstrated via story generation. They found that standard seq2seq models applied to hierarchical story generation were prone to degenerating into language models that paid little attention to the writing prompt (a problem noted in other domains, such as dialogue response generation).

  • They tackled the challenges of story-telling with a hierarchical model, which first generated a sentence called “the prompt” (describing the topic for the story), and then “conditioned” on this prompt when generating the story. Conditioning on the prompt or premise made it easier to generate consistent stories, because they provided grounding for the overall plot. It also reduced the tendency of standard sequence models to drift off topic.

  • To improve the relevance of the generated story to its prompt, they adopted the fusion mechanism from Cold Fusion: Training Seq2Seq Models Together with Language Models:

    The cold fusion mechanism of Sriram et al. (2017) pretrains a language model and subsequently trains a seq2seq model with a gating mechanism that learns to leverage the final hidden layer of the language model during seq2seq training [their language model contained three layers of gated recurrent units (GRUs)]. The model showed, for the first time, that fusion mechanisms could help seq2seq models build dependencies between their input and output.

  • To improve over the pretrained model, the second model had to focus on the link between the prompt and the story. Since existing convolutional architectures only encode a bounded amount of context, they introduced a novel gated self-attention mechanism that allowed the model to condition on its previous outputs at different time-scales (i.e., to model long-range context).

  • Similar to Google’s Transformer, Facebook AI Research used multi-head attention to allow each head to attend to information at different positions. However, the queries, keys and values in their model were not given by linear projections (see Section 3.2.2 in the Transformer paper), but by more expressive gated deep neural nets with gated linear unit activations: gating lent the self-attention mechanism crucial capacity to make fine-grained selections.

    arxiv-1805.04833a.png

    [Image source. Click image to open in new window.]


    arxiv-1805.04833b.png

    [Image source. Click image to open in new window.]


    arxiv-1805.04833c.png

    [Image source. Click image to open in new window.]


Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding (Aug 2018) proposed a new self-attention mechanism for sentence embedding, Dynamic Self-Attention (DSA). They designed DSA by modifying dynamic routing in capsule networks for use in NLP. DSA attended to informative words with a dynamic weight vector, achieving new state of the art results among sentence encoding methods on the Stanford Natural Language Inference (SNLI) dataset – with the fewest parameters – while showing comparable results on the Stanford Sentiment Treebank (SST) dataset. The dynamic weight vector furnished the self-attention mechanism with flexibility, rendering it more effective for sentence embedding.

arxiv-1808.07383a.png

[Image source. Click image to open in new window.]


arxiv-1808.07383b.png

[Image source. Click image to open in new window.]


Learning to Compose Neural Networks for Question Answering (Jun 2016) [code;  author discussion] presented a compositional, attentional model for answering questions about a variety of world representations, including images and structured knowledge bases. The model used natural language strings to automatically assemble neural networks from a collection of composable modules. Parameters for these modules were learned jointly with network-assembly parameters via reinforcement learning, with only (world, question, answer) triples as supervision – no supervision of the network layouts was required. The approach, termed a Dynamic Neural Module Network, “translates” questions into dynamically assembled neural networks, then applies these networks to world representations (images or knowledge bases) to produce answers. The model has two components, trained jointly: a collection of neural “modules” that can be freely composed, and a network layout predictor that assembles modules into complete deep networks tailored to each question (see their Figure 1). It achieved state of the art performance on two markedly different question answering tasks: questions about natural images, and more compositional questions about United States geography.

arxiv-1601.01705d.png

[Image source. Click image to open in new window.]


Relevant to the following paragraph: in NLP parts of speech (POS), content words are words that name objects of reality and their qualities. They signify actual living things (dog, cat, etc.), family members (mother, father, sister, etc.), natural phenomena (snow, Sun, etc.), common actions (do, make, come, eat, etc.), characteristics (young, cold, dark, etc.), and so on. Content words consist mostly of nouns, lexical verbs and adjectives, but certain adverbs can also be content words. Content words contrast with function words, which have very little substantive meaning and primarily denote grammatical relationships between content words, such as prepositions (in, out, under, etc.), pronouns (I, you, he, who, etc.), and conjunctions (and, but, till, as, etc.).

Most models based on the seq2seq encoder-decoder framework are equipped with an attention mechanism, like Google’s Transformer. However, conventional attention mechanisms treat the decoding at each time step equally, with the same matrix, which is problematic since the softness of the attention should differ for different types of words (e.g. content words and function words). Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation (Aug 2018) [code: not yet available, 2018-10-10] addressed this issue, proposing a new model with a mechanism called Self-Adaptive Control of Temperature (SACT) to control the softness of attention by means of an attention temperature. They set a temperature parameter which could be learned by the model based on the attentions at the previous decoding time steps, as well as the output of the decoder at the current time step. With the temperature parameter, the model was able to automatically tune the degree of softness of the distribution of the attention scores: it could learn a soft, more uniform distribution of attention weights when generating function words, and a harder, sparser distribution when generating content words. On a neural machine translation task, they showed that SACT attended to the most relevant elements in the source-side contexts, generating translations of high quality.
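The effect of an attention temperature is easy to illustrate. The sketch below is generic (not the SACT model): it simply shows how dividing the attention logits by a temperature before the softmax makes the resulting distribution more uniform (high temperature, suited to function words) or sparser (low temperature, suited to content words); in SACT the temperature itself is predicted by the network at each decoding step.

```python
import numpy as np

def attention_weights(scores, temperature=1.0):
    """Softmax over attention logits, with a softness-controlling temperature."""
    z = scores / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])              # alignment scores over source positions

print(attention_weights(scores, temperature=2.0))    # softer / more uniform
print(attention_weights(scores, temperature=1.0))    # baseline
print(attention_weights(scores, temperature=0.3))    # harder / sparser
```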

Attention and Memory:

Additional Reading

  • Pay Less Attention with Lightweight and Dynamic Convolutions (ICLR 2019) [discussion]

    “We presented lightweight convolutions which perform competitively to the best reported results in the literature despite their simplicity. They have a very small parameter footprint and the kernel does not change over time-steps. This demonstrates that self-attention is not critical to achieve good accuracy on the language tasks we considered. Dynamic convolutions build on lightweight convolutions by predicting a different kernel at every time-step, similar to the attention weights computed by self-attention. The dynamic weights are a function of the current time-step only rather than the entire context. Our experiments show that lightweight convolutions can outperform a strong self-attention baseline on WMT’17 Chinese-English translation, IWSLT’14 German-English translation and CNN-DailyMail summarization. Dynamic convolutions improve further and achieve a new state of the art on the test set of WMT’14 English-German. Both lightweight convolution and dynamic convolution are 20% faster at runtime than self-attention. On Billion Word language modeling we achieve comparable results to self-attention. …”

    PayLessAttention-a.png

    [Image source. Click image to open in new window.]


    PayLessAttention-b.png

    [Image source. Click image to open in new window.]


  • Long Short-Term Attention (Oct 2018) [see also]

    “In order to learn effective features from temporal sequences, the long short-term memory (LSTM) network is widely applied. A critical component of LSTM is the memory cell, which is able to extract, process and store temporal information. Nevertheless, in LSTM, the memory cell is not directly enforced to pay attention to a part of the sequence. Alternatively, the attention mechanism can help to pay attention to specific information of data. In this paper, we present a novel neural model, called long short-term attention (LSTA), which seamlessly merges the attention mechanism into LSTM. More than processing long short term sequences, it can distill effective and valuable information from the sequences with the attention mechanism. Experiments show that LSTA achieves promising learning performance in various deep learning tasks.”

    arxiv1810.1275-f1+f2.png

    [Image source. Click image to open in new window.]


    arxiv1810.1275-t1+f5+f6+t2.png

    [Image source. Click image to open in new window.]


  • The same authors contemporaneously published a companion paper, Recurrent Attention Unit [see also].

    “Recurrent Neural Network (RNN) has been successfully applied in many sequence learning problems. Such as handwriting recognition, image description, natural language processing and video motion analysis. After years of development, researchers have improved the internal structure of the RNN and introduced many variants. Among others, Gated Recurrent Unit (GRU) is one of the most widely used RNN model. However, GRU lacks the capability of adaptively paying attention to certain regions or locations, so that it may cause information redundancy or loss during leaning. In this paper, we propose a RNN model, called Recurrent Attention Unit (RAU), which seamlessly integrates the attention mechanism into the interior of GRU by adding an attention gate. The attention gate can enhance GRU’s ability to remember long-term memory and help memory cells quickly discard unimportant content. RAU is capable of extracting information from the sequential data by adaptively selecting a sequence of regions or locations and pay more attention to the selected regions during learning. Extensive experiments on image classification, sentiment classification and language modeling show that RAU consistently outperforms GRU and other baseline methods.”

    arxiv1810.12754-f1+f2.png

    [Image source. Click image to open in new window.]


    arxiv1810.12754-t1+t2+t3.png

    [Image source. Click image to open in new window.]


  • You May Not Need Attention (Oct 2018) [code;   author’s discussion]

    “In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decoder. Our eager translation model is low-latency, writing target tokens as soon as it reads the first source token, and uses constant memory during decoding. It performs on par with the standard attention-based model of Bahdanau et al. (2014), and better on long sentences.”

    arxiv1810.13409-f1+f2+t1+t2+t3.png

    [Image source. Click image to open in new window.]


  • Convolutional Self-Attention Network (Oct 2018)

    “Self-attention network (SAN) has recently attracted increasing interest due to its fully parallelized computation and flexibility in modeling dependencies. It can be further enhanced with multi-headed attention mechanism by allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017). In this work, we propose a novel convolutional self-attention network (CSAN), which offers SAN the abilities to (1) capture neighboring dependencies, and (2) model the interaction between multiple attention heads. Experimental results on WMT14 English-to-German translation task demonstrate that the proposed approach outperforms both the strong Transformer baseline and other existing works on enhancing the locality of SAN. Comparing with previous work, our model does not introduce any new parameters.”

    arxiv1810.13320-f1.png

    [Image source. Click image to open in new window.]


    arxiv1810.13320-t1.png

    [Image source. Click image to open in new window.]


  • An Introductory Survey on Attention Mechanisms in NLP Problems (Nov 2018)

    “First derived from human intuition, later adapted to machine translation for automatic token alignment, attention mechanism, a simple method that can be used for encoding sequence data based on the importance score each element is assigned, has been widely applied to and attained significant improvement in various tasks in natural language processing, including sentiment classification, text summarization, question answering, dependency parsing, etc. In this paper, we survey through recent works and conduct an introductory summary of the attention mechanism in different NLP problems, aiming to provide our readers with basic knowledge on this widely used method, discuss its different variants for different tasks, explore its association with other techniques in machine learning, and examine methods for evaluating its performance.”


[Table of Contents]

Attention: Miscellaneous Applications

Although the following content is more NLP-task related, I wanted to group it close to the discussions of language models and attentional mechanisms in my “Attention and Memory” subsection. Recent applications of Google’s “Transformer” and other attentional architectures relevant to this REVIEW include their use in NLP-oriented tasks such as “slot filling” (relation extraction), question answering, and document summarization.

Position-aware Self-attention with Relative Positional Encodings for Slot Filling (Bilan and Roth, July 2018) applied self-attention with relative positional encodings to the task of relation extraction; their model relied solely on attention: no recurrent or convolutional layers were used. The authors employed Google’s Transformer seq2seq model, also known as multi-head dot product attention (MHDPA) or “self-attention.”

  • Despite citing Zhang et al. (Stanford University; coauthored by Christopher Manning)’s 2017 paper Position-aware Attention and Supervised Data Improve Slot Filling and using the TACRED relation extraction dataset introduced by Zhang et al. in their paper, Bilan and Roth claim

    “To the best of our knowledge, the transformer model has not yet been applied to relation classification as defined above (as selecting a relation for two given entities in context).”

    Furthermore, they provide no code, while Zhang et al. released their code, and included ablation studies in their work. The attention mechanism used by Zhang et al. differed significantly from the Google Transformer model in their use of the summary vector and position embeddings, and the way their attention weights were computed. While Zhang et al.’s $\small F_1$ scores (their Table 4) were slightly lower than Bilan and Roth’s on the TACRED dataset (see Bilan and Roth’s Table 1), the ensemble model used by Zhang et al. had the best scores. Sample relations extracted from a sentence are shown in Fig. 1 and Table 1 in Zhang et al.

    Bilan2018_Zhang2017.png

    [Click image to open in new window.]


Bidirectional Attention Flow for Machine Comprehension (Nov 2016; updated Jun 2018) introduced the BiDAF framework, a multi-stage hierarchical process that used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization. BiDAF was subsequently used in QA4IE: A Question Answering based Framework for Information Extraction (Apr 2018) (discussed jointly with the BiDAF paper), a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, also using a knowledge base (Wikipedia Ontology) for entity recognition.

Related to attention-based relation extraction, Neural Architectures for Open-Type Relation Argument Extraction (Sep 2018) redefined the problem of slot filling as the task of Open-type Relation Argument Extraction (ORAE): given a corpus, a query entity $\small Q$ and a knowledge base relation (e.g., “$\small Q$ authored notable work with title $\small X$”), the model had to extract an argument of “non-standard entity type” (entities that cannot be extracted by a standard named entity tagger) from the corpus – hence, “open-type argument extraction.” This work also employed the Transformer architecture, used as a multi-headed self-attention mechanism in their encoders for computing sentence representations suitable for argument extraction.

The approach for ORAE had two conceptual advantages. First, it was more general than slot-filling as it was also applicable to non-standard named entity types that could not be dealt with previously. Second, while the problem they defined was more difficult than standard slot filling, they eliminated an important source of errors: tagging errors that propagate throughout the pipeline and that are notoriously hard to correct downstream. A wide range of neural network architectures to solve ORAE were examined, each consisting of a sentence encoder, which computed a vector representation for every sentence position, and an argument extractor, which extracted the relation argument from that representation. The combination of a RNN encoder with a CRF extractor gave the best results, +7% absolute $\small \text{F-measure}$ better than a previously proposed adaptation of a state of the art question answering model (BiDAF). [“The dataset and code will be released upon publication.” – not available, 2018-10-10.]
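The encoder/extractor decomposition can be sketched schematically in PyTorch. The class names, dimensions and the plain per-position tagger below are illustrative stand-ins: the paper’s best extractor was a CRF over such per-position scores, and the query entity and relation conditioning are additional inputs omitted here.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Per-position sentence representations (the best-performing setup used an RNN encoder)."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, token_ids):                  # (batch, seq_len)
        h, _ = self.rnn(self.emb(token_ids))       # (batch, seq_len, 2*hidden)
        return h

class ArgumentExtractor(nn.Module):
    """Scores each position for BIO-style argument tags; a CRF layer would replace
    the independent per-position decision used in this simplified sketch."""
    def __init__(self, enc_dim=256, n_tags=3):     # tags: B, I, O
        super().__init__()
        self.out = nn.Linear(enc_dim, n_tags)

    def forward(self, encoded):                    # (batch, seq_len, enc_dim)
        return self.out(encoded)                   # (batch, seq_len, n_tags)

# toy usage (hypothetical vocabulary indices)
enc, ext = SentenceEncoder(vocab_size=1000), ArgumentExtractor()
tokens = torch.randint(0, 1000, (2, 12))           # batch of 2 sentences, 12 tokens each
print(ext(enc(tokens)).shape)                      # torch.Size([2, 12, 3])
```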

arxiv-1803.01707d.png

[Image source. Click image to open in new window.]


Generating Wikipedia by Summarizing Long Sequences (Jan 2018), by Google Brain, employed Wikipedia in a supervised machine learning task for multi-document summarization, using extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. They modified their Transformer architecture to only consist of a decoder, which performed better in the case of longer input sequences compared to RNN and Transformer encoder-decoder models. These improvements allowed them to generate entire Wikipedia articles.

Because the amount of text in input reference documents can be very large (see their Table 2), it was infeasible to train an end-to-end abstractive model given the memory constraints of current hardware. Hence, they first coarsely selected a subset of the input using extractive summarization. The second stage involved training an abstractive model that generated the Wikipedia text while conditioning on this extraction. This two-stage process was inspired by how humans might summarize multiple long documents: first highlighting pertinent information, then conditionally generating the summary based on the highlights.

Hierarchical Bi-Directional Attention-based RNNs for Supporting Document Classification on Protein-Protein Interactions Affected by Genetic Mutations (Jan 2018) [code] leveraged word embeddings trained on PubMed abstracts. The authors argued that the title of a paper usually contains important information that is more salient than a typical sentence in the abstract; they therefore proposed a shortcut connection (i.e., residual layer) that integrated the title vector representation directly into the final feature representation of the document. They concatenated the sentence vector representing the title and the vectors of the abstract to form the document feature vector used as input to the task classifier. This system ranked first in the Document Triage Task of the BioCreative VI Precision Medicine Track.

Fergadis2018a.png

[Image source. Click image to open in new window.]


Fergadis2018b.png

[Image source. Click image to open in new window.]


Critique:

  • The “spirit” of the BioCreative VI Track 4: Mining protein interactions and mutations for precision medicine (PM) is (bolded emphasis mine, below):

    “The precision medicine initiative (PMI) promises to identify individualized treatment depending on a patients’ genetic profile and their related responses. In order to help health professionals and researchers in the precision medicine endeavor, one goal is to leverage the knowledge available in the scientific published literature and extract clinically useful information that links genes, mutations, and diseases to specialized treatments. … Understanding how allelic variation and genetic background influence the functionality of these pathways is crucial for predicting disease phenotypes and personalized therapeutical approaches. A crucial step is the mapping of gene products functional regions through the identification and study of mutations (naturally occurring or synthetically induced) affecting the stability and affinity of molecular interactions.”

  • Against those criteria and despite the title of this paper and this excerpt from the paper,

    “In order to incorporate domain knowledge in our system, we annotate all biomedical named entities namely genes, species, chemical, mutations and diseases. Each entity mention is surround by its corresponding tags as in the following example: Mutations in <species>human</species> <gene>EYA1</gene> cause <disease>branchio-oto-renal (BOR) syndrome</disease> …”

    … there is no evidence that mutations (i.e. genomic variants) were actually tagged. Mutations/variants are not discussed, nor is there any mention of “mutant” or “mutation” in their GitHub repository/code nor the parent repo.

  • Richard Socher and colleagues [SalesForce: A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks (Jul 2017)] also used shortcut connections (i.e., residual layers) from higher layers to lower layers (lower-level task predictions), reflecting linguistic hierarchies.

Identifying interactions between proteins is important for understanding the underlying biological processes. Extracting protein-protein interactions (PPI) from raw text is often very difficult. Previous supervised learning methods have used handcrafted features on human-annotated data sets. Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention (Jul 2018) proposed a novel tree recurrent neural network with a structured attention architecture for the PPI task. Their architecture achieved state of the art results on the benchmark AIMed and BioInfer data sets; moreover, their models achieved significant improvement over previous best models without any explicit feature extraction. Experimental results showed that traditional recurrent networks had inferior performance compared to tree recurrent networks for the supervised PPI task.

“… we propose a novel neural net architecture for identifying protein-protein interactions from biomedical text using a Tree LSTM with structured attention. We provide an in depth analysis of traversing the dependency tree of a sentence through a child sum tree LSTM and at the same time learn this structural information through a parent selection mechanism by modeling non-projective dependency trees.”

arxiv1503.00075-f1+f2.png

[Image source (Kai Sheng Tai, Richard Socher, Christopher D. Manning). Click image to open in new window.]


arxiv1808.03227-f1.png

[Image source. Click image to open in new window.]


arxiv1808.03227-f2.png

[Image source. Click image to open in new window.]


arxiv1808.03227-t3.png

[Image source. Click image to open in new window.]


[Table of Contents]

Pointer Mechanisms; Pointer-Generators

Pointer-generator mechanisms were introduced by See et al. [Christopher Manning] in Get To The Point: Summarization with Pointer-Generator Networks (Apr 2017). Pointer-generator architectures can copy words from source texts via a pointer, and generate novel words from a vocabulary via a generator. With the pointing/copying mechanism, factual information can be reproduced accurately, and out-of-vocabulary (OOV) words can also be handled in the summaries [Neural Abstractive Text Summarization with Sequence-to-Sequence Models (Dec 2018)].
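The core of the pointing/copying idea is a mixture of two distributions, roughly $\small P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i:w_i=w} a_i$, where $\small a$ is the attention distribution over source positions and $\small p_{gen}$ is a learned switch. The sketch below is a generic, illustrative evaluation of that mixture with arbitrary toy numbers; it is not See et al.’s implementation.

```python
import numpy as np

def pointer_generator_distribution(p_gen, p_vocab, attention, source_tokens, vocab):
    """Mix a generator distribution over the vocabulary with a copy distribution
    obtained by scattering attention weights onto the source tokens."""
    extended_vocab = list(vocab) + [w for w in source_tokens if w not in vocab]  # handle OOV
    final = np.zeros(len(extended_vocab))
    final[:len(vocab)] = p_gen * p_vocab                     # generate from the vocabulary
    for a, w in zip(attention, source_tokens):               # copy from the source text
        final[extended_vocab.index(w)] += (1.0 - p_gen) * a
    return extended_vocab, final

# toy example: "bert" is out-of-vocabulary but can still be copied from the source
vocab = ["the", "model", "summarizes", "text"]
p_vocab = np.array([0.4, 0.3, 0.2, 0.1])                     # generator softmax over vocab
source = ["the", "bert", "model"]
attention = np.array([0.2, 0.6, 0.2])                        # attention over source positions

words, dist = pointer_generator_distribution(0.5, p_vocab, attention, source, vocab)
print(dict(zip(words, dist.round(3))))
print(dist.sum())                                            # 1.0
```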

arxiv1812.02303-f4.png

[Image source. Click image to open in new window.]


arxiv1704.04368-f3.png

[Image source. Click image to open in new window.]




  • Paulus et al. [Richard Socher] A Deep Reinforced Model for Abstractive Summarization (Nov 2017)

    • “To generate a token, our decoder uses either a token-generation softmax layer or a pointer mechanism to copy rare or unseen from the input sequence. We use a switch function that decides at each decoding step whether to use the token generation or the pointer (Gulcehre et al., 2016Nallapati et al., 2016).”

    • “Neural Encoder-Decoder Sequence Models. Neural encoder-decoder models are widely used in NLP applications such as machine translation, summarization, and question answering. These models use recurrent neural networks (RNN), such as long-short term memory network (LSTM) to encode an input sentence into a fixed vector, and create a new output sequence from that vector using another RNN. To apply this sequence-to-sequence approach to natural language, word embeddings are used to convert language tokens to vectors that can be used as inputs for these networks.

      Attention mechanisms make these models more performant and scalable, allowing them to look back at parts of the encoded input sequence while the output is generated. These models often use a fixed input and output vocabulary, which prevents them from learning representations for new words. One way to fix this is to allow the decoder network to point back to some specific words or sub-sequences of the input and copy them onto the output sequence. Gulcehre et al. (2016) and Merity et al. (2017) combine this pointer mechanism with the original word generation layer in the decoder to allow the model to use either method at each decoding step.

  • See et al. [Christopher Manning] Get To The Point: Summarization with Pointer-Generator Networks (Apr 2017)

    • “Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition.”

    • “The pointer network (Vinyals et al., 2015) is a sequence-to-sequence model that uses the soft attention distribution of Bahdanau et al. (2015) to produce an output sequence consisting of elements from the input sequence. … Our approach is considerably different from that of Gulcehre et al. (2016) and Nallapati et al. (2016). Those works train their pointer components to activate only for out-of-vocabulary words or named entities (whereas we allow our model to freely learn when to use the pointer), and they do not mix the probabilities from the copy distribution and the vocabulary distribution. We believe the mixture approach described here is better for abstractive summarization – in section 6 we show that the copy mechanism is vital for accurately reproducing rare but in-vocabulary words, and in section 7.2 we observe that the mixture model enables the language model and copy mechanism to work together to perform abstractive copying.”

    • “Our hybrid pointer-generator network facilitates copying words from the source text via pointing (Vinyals et al., 2015), which improves accuracy and handling of OOV words, while retaining the ability to generate new words. The network, which can be viewed as a balance between extractive and abstractive approaches, is similar to Gu et al.’s (2016) CopyNet and Miao and Blunsom’s (2016) Forced-Attention Sentence Compression, that were applied to short-text summarization. We propose a novel variant of the coverage vector (Tu et al., 2016) from Neural Machine Translation, which we use to track and control coverage of the source document. We show that coverage is remarkably effective for eliminating repetition.”

      arxiv1704.04368-f1.png

      [Image source. Click image to open in new window.]


      arxiv1704.04368-f3.png

      [Image source. Click image to open in new window.]


      arxiv1704.04368-f5.png

      [Image source. Click image to open in new window.]


  • Merity et al. [Richard Socher | MetaMind/Salesforce] Pointer Sentinel Mixture Models (Sep 2016):

    • See also the description of the dataset they created, WikiText-103.

    • “Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM.”

    • “Pointer networks (Vinyals et al., 2015) provide one potential solution for rare and out of vocabulary (OOV) words as a pointer network uses attention to select an element from the input as output. This allows it to produce previously unseen input tokens. While pointer networks improve performance on rare words and long-term dependencies they are unable to select words that do not exist in the input.

      “We introduce a mixture model, illustrated in Fig. 1, that combines the advantages of standard softmax classifiers with those of a pointer component for effective and efficient language modeling. Rather than relying on the RNN hidden state to decide when to use the pointer, as in the recent work of Gulcehre et al. (2016), we allow the pointer component itself to decide when to use the softmax vocabulary through a sentinel. The model improves the state of the art perplexity on the Penn Treebank. Since this commonly used dataset is small and no other freely available alternative exists that allows for learning long range dependencies, we also introduce a new benchmark dataset for language modeling called WikiText.”

      arxiv1609.07843-f1.png

      [Image source. Click image to open in new window.]


      arxiv1609.07843-f2.png

      [Image source. Click image to open in new window.]


      arxiv1609.07843-f7.png

      [Image source. Click image to open in new window.]


  • Nallapati et al. [Caglar Gulcehre] Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (Aug 2016)

    • “Often-times in summarization, the keywords or named-entities in a test document that are central to the summary may actually be unseen or rare with respect to training data. Since the vocabulary of the decoder is fixed at training time, it cannot emit these unseen words. Instead, a most common way of handling these out-of-vocabulary (OOV) words is to emit an ‘UNK’ token as a placeholder. However this does not result in legible summaries. In summarization, an intuitive way to handle such OOV words is to simply point to their location in the source document instead. We model this notion using our novel switching decoder/pointer architecture which is graphically represented in Figure 2. In this model, the decoder is equipped with a ‘switch’ that decides between using the generator or a pointer at every time-step. If the switch is turned on, the decoder produces a word from its target vocabulary in the normal fashion. However, if the switch is turned off, the decoder instead generates a pointer to one of the word-positions in the source. The word at the pointer-location is then copied into the summary. The switch is modeled as a sigmoid activation function over a linear layer based on the entire available context at each time-step as shown below. …”

    • “The pointer mechanism may be more robust in handling rare words because it uses the encoder’s hidden-state representation of rare words to decide which word from the document to point to. Since the hidden state depends on the entire context of the word, the model is able to accurately point to unseen words although they do not appear in the target vocabulary. [Even when the word does not exist in the source vocabulary, the pointer model may still be able to identify the correct position of the word in the source since it takes into account the contextual representation of the corresponding ‘UNK’ token encoded by the RNN. Once the position is known, the corresponding token from the source document can be displayed in the summary even when it is not part of the training vocabulary either on the source side or the target side.] …”

      arxiv1602.06023-f2.png

      [Image source. Click image to open in new window.]


      arxiv1602.06023-fig3.png

      [Image source. Click image to open in new window.]


      arxiv1602.06023-f4.png

      [Image source. Click image to open in new window.]


  • Gulcehre et al. [Yoshua Bengio], Pointing the Unknown Words (Aug 2016)

    • “The attention-based pointing mechanism is introduced first in the pointer networks (Vinyals et al., 2015). In the pointer networks, the output space of the target sequence is constrained to be the observations in the input sequence (not the input space). Instead of having a fixed dimension softmax output layer, softmax outputs of varying dimension is dynamically computed for each input sequence in such a way to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, we can see the pointer networks as a special case of our model where we always choose to point a context word.”

  • Vinyals O et al., Pointer Networks (Jun 2015; updated Jan 2017)

    • “We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). …”

      arxiv1506.03134-f1.png

      [Image source. Click image to open in new window.]


[Table of Contents]

Language Models

A recent and particularly exciting advance in NLP is the development of pretrained language models such as ELMo, ULMFiT, OpenAI’s Finetuned Transformer LM, and BERT (each discussed in the subsections below).

Those papers demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks.


ImageNet [dataset;  community challenges;  discussion (local copy)] and the related image classification, segmentation, and captioning challenges have had an enormous impact on the advancement of computer vision, deep learning, deep learning architectures, the use of pretrained models, transfer learning, attentional mechanisms, etc. Studies arising from the ImageNet dataset have also identified gaps in our understanding of that success, leading to work demystifying how those deep neural networks classify images (explained very well in the excellent video below), and to other issues including the vulnerability of deep learning to adversarial attacks. It is anticipated that pretrained language models will have a parallel impact in the NLP domain.

The availability of pretrained models is an important and practical advance in machine learning, as many of the current processing tasks in image processing and NLP language modeling are extremely computationally intensive. For example:

[Table of Contents]

ELMo

ELMo (“Embeddings from Language Models”) was introduced in Deep Contextualized Word Representations (Feb 2018; updated Mar 2018) [project;  tutorials here and here;  discussion here, here and here] by authors at the Allen Institute for Artificial Intelligence and the Paul G. Allen School of Computer Science & Engineering at the University of Washington. ELMo modeled both the complex characteristics of word use (e.g., syntax and semantics), and how these characteristics varied across linguistic contexts (e.g., to model polysemy: words or phrases with different, but related, meanings). These word vectors were learned functions of the internal states of a deep bidirectional language model (two Bi-LSTM layers), which was pretrained on a large text corpus.

Unlike most widely used word embeddings, ELMo word representations were deep, in that they were a function of the internal, hidden layers of the bi-directional Language Model (biLM), providing a very rich representation. More specifically, ELMo learned a linear combination of the vectors stacked above each input word for each end task, which markedly improved performance over using just the top LSTM layer. These word vectors could be easily added to existing models, significantly improving the state of the art across a broad range of challenging NLP problems including question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis. *The addition of ELMo representations alone significantly improved the state of the art in every case, including up to 20% relative error reductions.*
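The task-specific combination of biLM layers can be written, for token $\small k$, as $\small \textbf{ELMo}_k = \gamma \sum_j s_j h_{k,j}$, where $\small h_{k,j}$ is the hidden state of layer $\small j$ (layer 0 being the context-independent token representation), $\small s$ is a softmax-normalized vector of learned layer weights, and $\small \gamma$ is a learned task-specific scalar. The sketch below simply evaluates this formula with toy values; it is not the AllenNLP implementation.

```python
import numpy as np

def elmo_combination(layer_states, layer_logits, gamma):
    """Task-specific weighted sum of biLM layer representations for one token.

    layer_states: (n_layers, dim) hidden states for this token (layer 0 = token embedding)
    layer_logits: (n_layers,) learned, unnormalized layer-weight parameters
    gamma:        learned task-specific scalar
    """
    s = np.exp(layer_logits - layer_logits.max())
    s = s / s.sum()                              # softmax-normalized layer weights
    return gamma * (s[:, None] * layer_states).sum(axis=0)

# toy usage: 3 layers (token embedding + 2 Bi-LSTM layers), 8-dimensional states
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 8))
print(elmo_combination(h, layer_logits=np.array([0.1, 0.5, -0.2]), gamma=1.0).shape)  # (8,)
```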

ELMo.png

[Image source (based on data from Table 1 in arXiv:1802.05365). Click image to open in new window.
Tasks: SQuAD: question answering; SNLI: textual entailment; SRL: semantic role labeling; Coref: coreference resolution; NER: named entity recognition; SST-5: sentiment analysis. SOTA: state of the art.]


[Table of Contents]

ULMFiT

Jeremy Howard (fast.ai and the University of San Francisco) and Sebastian Ruder (Insight Centre for Data Analytics, NUI Galway, and Aylien Ltd.) described their Universal Language Model Fine-tuning (ULMFiT) model in Universal Language Model Fine-tuning for Text Classification (Jan 2018; updated May 2018) [project/code;  code here, here, here and here;  discussion here, here, here, here, here, here and here]. ULMFiT is a transfer learning method that can be applied to any task in NLP [although as of July 2018 they had only studied its use in classification tasks]; the paper also introduced key techniques for fine-tuning a language model. They also provided the fastai.text and fastai.lm_rnn modules necessary to train/use their ULMFiT models.

ULMFiT significantly outperformed the state of the art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, ULMFiT matched the performance of training from scratch on 100x more data.
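Among the fine-tuning techniques described in the paper are discriminative fine-tuning (a lower learning rate for lower layers, using $\small \eta^{l-1} = \eta^l / 2.6$) and slanted triangular learning rates (a short linear warm-up followed by a long linear decay). A minimal sketch of those two schedules is shown below; the hyperparameter values follow the paper’s reported defaults but should be treated as illustrative, and this is my own sketch rather than the fastai implementation.

```python
def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Per-layer learning rates: each lower layer gets the rate above it divided by `decay`."""
    return [top_lr / (decay ** depth) for depth in range(n_layers)]  # index 0 = top layer

def slanted_triangular_lr(step, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Short linear increase to lr_max, then a long linear decay (ULMFiT's STLR schedule)."""
    cut = int(total_steps * cut_frac)
    if step < cut:
        p = step / cut
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

print(discriminative_lrs(0.01, n_layers=4))
print([round(slanted_triangular_lr(t, 100), 5) for t in (0, 5, 10, 50, 99)])
```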

arxiv-1801.06146.png

[Image source. Click image to open in new window.]


[Table of Contents]

Finetuned Transformer LM

Finetuned Transformer LM  (Radford et al., Improving Language Understanding by Generative Pre-Training) (Jun 2018) [projectcode;  discussion: here and here] was introduced by Ilya Sutskever and colleagues at OpenAI. They demonstrated that large gains on diverse natural language understanding (NLU) tasks – such as textual entailment, question answering, semantic similarity assessment, and document classification – could be realized by a two stage training procedure: generative pretraining of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, they made use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

Finetuned Transformer LM [aka: OpenAI Transformer  |  OpenAI GPT] provided a convincing example that pairing supervised learning methods with unsupervised pretraining works very well, demonstrating the effectiveness of their approach on a wide range of NLU benchmarks. Their general task-agnostic model outperformed discriminatively trained models that used architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied (see the table
Source
at Improving Language Understanding with Unsupervised Learning). For instance, they achieved absolute improvements of 8.9% on commonsense reasoning, 5.7% on question answering, and 1.5% on textual entailment (natural language inference).

OpenAI_Transformer.png

[Image source. Click image to open in new window.]


  • The architecture employed in Improving Language Understanding by Generative Pre-Training, Finetuned Transformer LM, was Google’s Transformer, a seq2seq based self-attention mechanism. This model choice provided OpenAI with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, they utilized task-specific input adaptations derived from traversal-style approaches, which processed structured text input as a single contiguous sequence of tokens. As they demonstrated in their experiments, these adaptations enabled them to fine-tune effectively with minimal changes to the architecture of the pretrained model. OpenAI’s Finetuned Transformer LM model largely followed the original (Google’s Attention Is All You Need) Transformer work: OpenAI trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads) …

  • The Transformer architecture does not use RNN (LSTM), relying instead on the use of the self-attention mechanism. In Improving Language Understanding by Generative Pre-Training, the OpenAI authors asserted that the use of LSTM models employed in ELMo and ULMFiT restricted the prediction ability of those language models to a short range. In contrast, OpenAI’s choice of Transformer networks allowed them to capture longer-range linguistic structure. Regarding better understanding of why the pretraining of language models by Transformer architectures was effective, they hypothesized that the underlying generative model learned to perform many of the evaluated tasks in order to improve its language modeling capability, and that the more structured attentional memory of the Transformer assisted in transfer, compared to LSTM. A minimal sketch of the masked (causal) self-attention that such a decoder-only language model relies on follows below.
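
The following is a minimal single-head sketch of masked (“causal”) self-attention, in which position $\small i$ may only attend to positions $\small \le i$ – the property that lets a decoder-only Transformer be trained as a language model. The tiny dimensions are illustrative assumptions (GPT used 768-dimensional states and 12 heads).

```python
import math
import torch

def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention: future positions are hidden from each token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)            # (seq, seq)
    causal_mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(causal_mask, float("-inf"))       # mask out future tokens
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 5 tokens, 16-dimensional states.
x = torch.randn(5, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
out = masked_self_attention(x, w_q, w_k, w_v)   # (5, 16)
```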

[Table of Contents]

BERT

In October 2018 Google Language AI presented BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [code: Google Research | author’s discussion]. Unlike recent language representation models, BERT – which stands for Bidirectional Encoder Representations from Transformers – is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, pre-trained BERT representations can be fine-tuned with just one additional output layer to create state of the art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT obtained new state of the art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

  • For a description of the GLUE datasets used in this paper, refer here.

  • [Google AI Blog: Nov 2018 – short, very descriptive summary (local copy)] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

    “… What Makes BERT  Different? BERT builds upon recent work in pre-training contextual representations - including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia). …”

    “…The Strength of Bidirectionality. If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model. To solve this problem, we use the straightforward technique of masking out some of the words in the input and then condition each word bidirectionally to predict the masked words. For example:

    BERT_ex1.png

    [Image source. Click image to open in new window.]


    “While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network. BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: Given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:

    BERT_ex2.png

    [Image source. Click image to open in new window.]


    “… On SQuAD v1.1, BERT achieves 93.2% $\small F_1$ score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%. … BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them. …”

  • Community discussion here, here, here, and here: Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks

  • Non-author code

arxiv1810.04805a.png

[Image source. BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional. Click image to open in new window.]


arxiv1810.04805b.png

[Image source. Click image to open in new window.]


arxiv1810.04805c.png

[Image source. Click image to open in new window.]


In the following figure, note that the results in Table 2 were obtained on the SQuAD1.1 QA dataset, which is less challenging than SQuAD2.0:

arxiv1810.04805g.png

[Image source. Click image to open in new window.]


Some highlights, excerpted from Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks:

  • NLP researchers are exploiting today’s large amount of available language data and maturing transfer learning techniques to develop novel pre-training approaches. They first train a model architecture on one language modeling objective, and then fine-tune it for a supervised downstream task. Aylien Research Scientist Sebastian Ruder suggests in his blog that pre-trained models may have “the same wide-ranging impact on NLP as pretrained ImageNet models had on computer vision.”

  • The BERT model architecture is a bidirectional Transformer encoder. The use of [Google’s] Transformer comes as no surprise – this is a recent trend due to Transformers’ training efficiency and superior performance in capturing long-distance dependencies, compared to a recurrent neural network architecture. The bidirectional encoder meanwhile is a standout feature that differentiates BERT from OpenAI GPT [i.e. Finetuned Transformer LM | OpenAI Transformer – a left-to-right Transformer] and ELMo (a concatenation of independently trained left-to-right and right-to-left LSTMs).

  • BERT is a huge model, with 24 Transformer blocks, a hidden size of 1024, and 340M parameters.

  • The model is pre-trained for 40 epochs over a 3.3 billion word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words). Training was run on 16 Cloud TPUs.

  • In the pre-training process, the researchers randomly masked a percentage of the input tokens (15 percent) and trained the model to predict them, yielding a deep bidirectional representation. They refer to this method as a Masked Language Model (MLM); a sketch of this masking step follows this list.

  • A pre-trained language model does not by itself capture relationships between sentences, which is vital to language tasks such as question answering and natural language inference. The researchers therefore also pre-trained BERT on a binarized next-sentence prediction task that can be trivially generated from any monolingual corpus.

  • The fine-tuned model for different datasets improves the GLUE benchmark to 80.4 percent (7.6 percent absolute improvement), MultiNLI accuracy to 86.7 percent (5.6 percent absolute improvement), the SQuAD1.1 question answering test $\small F_1$ to 93.2 (1.5 absolute improvement), and so on over a total of 11 language tasks.
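
The masked language model step referenced above can be sketched in a few lines of Python. The 80/10/10 replacement split below (replace with [MASK], replace with a random token, keep unchanged) follows the BERT paper’s description of the procedure; the toy vocabulary and sentence are placeholders.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Masked-LM corruption in the style of BERT: select ~15% of tokens; of those,
    replace 80% with [MASK], 10% with a random token, and keep 10% unchanged.
    Returns the corrupted tokens plus (position, original_token) prediction targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # token not selected for prediction
        targets.append((i, tok))
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK
        elif roll < 0.9:
            corrupted[i] = rng.choice(TOY_VOCAB)
        # else: leave the token as-is; the model must still predict it
    return corrupted, targets

print(mask_for_mlm("the man went to the store to buy a gallon of milk".split()))
```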

[Table of Contents]

Transformer

In mid-2017, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration: the best performing models connected the encoder and decoder through an attention mechanism. In June 2017 Vaswani et al. at Google proposed a new simple network architecture, Transformer, that was based solely on attention mechanisms – thus dispensing with recurrence and convolutions entirely, also allowing for significantly more parallelization (Attention Is All You Need (Jun 2017; updated Dec 2017) [code]). [The inherently sequential nature of RNN precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.] Transformer has been shown to perform strongly on machine translation, document generation, syntactic parsing and other tasks. Experiments on two machine translation tasks showed these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Transformer also generalized well to other tasks; for example, it was successfully applied to English constituency parsing, both with large and limited training data.

arxiv-1706.03762a.png

[Image source. Click image to open in new window.]


arxiv-1706.03762b.png

[Image source. Click image to open in new window.]
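
For reference, the operation with which the Transformer replaces recurrence is scaled dot-product attention applied across several heads. A short usage sketch with PyTorch’s built-in module follows; the base-model dimensions (512-dimensional embeddings, 8 heads) are taken from the paper, while the toy sequence is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Self-attention over a toy "sentence": 10 tokens, 512-dim embeddings, 8 heads,
# mirroring the base configuration described in "Attention Is All You Need".
embed_dim, num_heads, seq_len = 512, 8, 10
self_attn = nn.MultiheadAttention(embed_dim, num_heads)

x = torch.randn(seq_len, 1, embed_dim)          # (seq_len, batch, embed_dim)
out, attn_weights = self_attn(x, x, x)          # query = key = value = x (self-attention)

print(out.shape)            # torch.Size([10, 1, 512])
print(attn_weights.shape)   # torch.Size([1, 10, 10]) - attention averaged over heads
```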


  • Transformer is discussed in Google AI’s August 2017 blog post Transformer: A Novel Neural Network Architecture for Language Understanding:

    • “… The animation below illustrates how we apply the Transformer to machine translation. Neural networks for machine translation typically contain an encoder reading the input sentence and generating a representation of it. A decoder then generates the output sentence word by word while consulting the representation generated by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled balls. This step is then repeated multiple times in parallel for all words, successively generating new representations. The decoder operates similarly, but generates one word at a time, from left to right. It attends not only to the other previously generated words, but also to the final representations generated by the encoder.”
        transform20fps.gif
        [Image source. Click image to open in new window.]


  • In the literature, Google’s Transformer self-attention is also referred to as multi-head dot-product attention (MHDPA), or simply “self-attention.”

  • Due to the absence of recurrent layers in the model, the Transformer model trained significantly faster and outperformed all previously reported ensembles.

  • Alexander Rush at HarvardNLP provides an excellent web page, The Annotated Transformer, complete with discussion and code (an “annotated” version of the paper in the form of a line-by-line implementation) [paper;  code]!

  • Attention Is All You Need coauthor Łukasz Kaiser posted slides describing this work (Tensor2Tensor Transformers New Deep Models for NLP)  [local copy;  discussion].

Later in 2018, Li et al. [Lukasz Kaiser; Samy Bengio | Google Research/Brain] described “Area Attention” (Oct 2018).

“Existing attention mechanisms are mostly item-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. These improvements are obtainable with a basic form of area attention that is parameter free. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.”

arxiv1810.10126-f1+f2.png

[Image source. Click image to open in new window.]
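
The “summed area tables” mentioned at the end of the abstract are 2-D prefix sums, which let the sum (and hence the mean) over any rectangular area of the memory be read off in constant time. The following numpy sketch shows only that building block, under the assumption of a 2-D memory; it is not the authors’ implementation of area attention itself.

```python
import numpy as np

def summed_area_table(x):
    """2-D prefix sums: sat[i, j] = sum of x[:i+1, :j+1]."""
    return x.cumsum(axis=0).cumsum(axis=1)

def area_sum(sat, r0, c0, r1, c1):
    """Sum of x[r0:r1+1, c0:c1+1] in O(1) via inclusion-exclusion on the corners."""
    total = sat[r1, c1]
    if r0 > 0:
        total -= sat[r0 - 1, c1]
    if c0 > 0:
        total -= sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]
    return total

x = np.arange(16, dtype=float).reshape(4, 4)            # toy 2-D "memory"
sat = summed_area_table(x)
assert area_sum(sat, 1, 1, 2, 3) == x[1:3, 1:4].sum()   # any rectangular area in O(1)
```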


[Table of Contents]

Trellis Networks

Trellis Networks for Sequence Modeling (Oct 2018; note also the Appendices) [code;  discussion] by authors at Carnegie Mellon University and Intel Labs presented trellis networks, a new architecture for sequence modeling. A trellis network is a temporal convolutional network with special structure, characterized by weight tying across depth and direct injection of the input into deep layers. The authors showed that truncated recurrent networks are equivalent to trellis networks with special sparsity structure in their weight matrices; thus trellis networks with general weight matrices generalize truncated recurrent networks. They leveraged those connections to design high-performing trellis networks that absorb structural and algorithmic elements from both recurrent and convolutional models. Experiments demonstrated that trellis networks outperform the current state of the art on a variety of challenging benchmarks, including word-level language modeling on Penn Treebank and WikiText-103, character-level language modeling on Penn Treebank, and stress tests designed to evaluate long-term memory retention.
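
A hedged sketch of the two structural ingredients named above – weight tying across depth and direct injection of the input into every layer – using a causal 1-D convolution in PyTorch. This is an illustration of the idea under simplifying assumptions (no gating, a plain tanh nonlinearity), not the authors’ released code.

```python
import torch
import torch.nn as nn

class TinyTrellisLikeNet(nn.Module):
    """Toy temporal ConvNet with two trellis-network traits: the same convolution
    weights are reused at every depth (weight tying), and the raw input sequence
    is concatenated into every layer (input injection)."""
    def __init__(self, in_dim, hidden_dim, depth=4, kernel_size=2):
        super().__init__()
        self.depth, self.kernel_size, self.hidden_dim = depth, kernel_size, hidden_dim
        # One shared convolution applied at every level of the "trellis".
        self.shared_conv = nn.Conv1d(in_dim + hidden_dim, hidden_dim, kernel_size)

    def forward(self, x):                                       # x: (batch, in_dim, time)
        batch, _, time = x.shape
        h = x.new_zeros(batch, self.hidden_dim, time)
        for _ in range(self.depth):
            z = torch.cat([x, h], dim=1)                        # inject the input at every layer
            z = nn.functional.pad(z, (self.kernel_size - 1, 0)) # left-pad => causal convolution
            h = torch.tanh(self.shared_conv(z))                 # same weights at every depth
        return h

out = TinyTrellisLikeNet(in_dim=8, hidden_dim=16)(torch.randn(2, 8, 50))  # (2, 16, 50)
```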

arxiv1810.06682-f1.png

[Image source. Click image to open in new window.]


arxiv1810.06682-f2.png

[Image source. Click image to open in new window.]


arxiv1810.06682-f3.png

[Image source. Click image to open in new window.]


arxiv1810.06682-t1+t2.png

[Image source. Click image to open in new window.]


“We presented trellis networks, a new architecture for sequence modeling. Trellis networks form a structural bridge between convolutional and recurrent models. …”

“There are many exciting opportunities for future work. First, we have not conducted thorough performance optimizations on trellis networks. … Future work can also explore acceleration schemes that speed up training and inference. Another significant opportunity is to establish connections between trellis networks and self-attention-based architectures (Transformers), thus unifying all three major contemporary approaches to sequence modeling. Finally, we look forward to seeing applications of trellis networks to industrial-scale challenges such as machine translation.”




Neural language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. In The Importance of Generation Order in Language Modeling, Google Brain studied the influence of token generation order on model quality via a novel two-pass language model that first produced partially-filled sentence “templates” and then filled in the missing tokens. The most effective strategy generated function words in the first pass, followed by content words in the second.

The Fine Tuning Language Models for Multilabel Prediction GitHub repository lists recent, leading language models, for which the authors examine the ability to use generative pretraining with language modeling objectives across a variety of languages to improve language understanding. Particular attention is paid to transfer learning to low-resource languages, where labeled data is scarce.

Adaptive Input Representations for Neural Language Modeling (Facebook AI Research; Oct 2018) [mentioned] introduced adaptive input embeddings, which extended the adaptive softmax of Grave et al. (2017) to input word representations of variable capacity. This factorization assigned more capacity to frequent words and reduced the capacity for less frequent words, with the benefit of reducing overfitting to rare words. There are several choices for how to factorize the input and output layers, and whether to model words, characters or sub-word units; the authors performed a systematic comparison of popular choices for a self-attentional architecture. Their experiments showed that models equipped with adaptive embeddings were more than twice as fast to train as the popular character-input CNN, while having fewer parameters. They achieved a new state of the art on the WikiText-103 benchmark of 20.51 perplexity, improving the next best known result by 8.7 perplexity, and a state of the art of 24.14 perplexity on the Billion Word Benchmark.
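
The core idea – frequent words get full-capacity embeddings while rarer words get smaller ones that are projected up to the shared model dimension – can be sketched as follows. The frequency-band boundaries, embedding sizes and model dimension below are made-up assumptions; this is not the released fairseq implementation.

```python
import torch
import torch.nn as nn

class AdaptiveInputSketch(nn.Module):
    """Variable-capacity input embeddings: the (frequency-sorted) vocabulary is split
    into bands, each band gets a progressively smaller embedding size, and a linear
    projection maps every band back up to the shared model dimension."""
    def __init__(self, band_sizes=(1000, 9000, 90000), dims=(512, 128, 32), d_model=512):
        super().__init__()
        self.band_sizes = band_sizes
        self.d_model = d_model
        self.embeds = nn.ModuleList(nn.Embedding(n, d) for n, d in zip(band_sizes, dims))
        self.projs = nn.ModuleList(nn.Linear(d, d_model, bias=False) for d in dims)

    def forward(self, token_ids):                     # (batch, seq) of frequency-sorted ids
        out = token_ids.new_zeros(*token_ids.shape, self.d_model, dtype=torch.float)
        offset = 0
        for size, emb, proj in zip(self.band_sizes, self.embeds, self.projs):
            in_band = (token_ids >= offset) & (token_ids < offset + size)
            if in_band.any():
                out[in_band] = proj(emb(token_ids[in_band] - offset))
            offset += size
        return out                                    # (batch, seq, d_model)

vectors = AdaptiveInputSketch()(torch.randint(0, 100000, (2, 12)))   # (2, 12, 512)
```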

arxiv-1809.10853.png

[Image source. Click image to open in new window.]


arxiv1809.10853-t1+t2.png

[Image source. Click image to open in new window.]


  • Grave et al. (Facebook AI Research) Efficient softmax approximation for GPUs (Sep 2016; updated Jun 2017) [code]

    “We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computation time. Our approach further reduces the computational time by exploiting the specificities of modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax.”

    arxiv1609.04309-f3.png

    [Image source. Note similarity to Fig. 1 / use in Adaptive Input Representations for Neural Language Modeling. Click image to open in new window.]


[Table of Contents]

Probing the Effectiveness of Pretrained Language Models

Contextual word representations derived from pretrained bidirectional language models (biLM) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks, including question answering, entailment and sentiment classification, constituency parsing, named entity recognition, and text classification. However, many questions remain as to how and why these models are so effective.

Deep RNNs Encode Soft Hierarchical Syntax (May 2018), by Terra Blevins, Omer Levy, and Luke Zettlemoyer at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, evaluated how well a simple feedforward classifier could detect syntax features (part of speech tags as well as various levels of constituent labels) from the word representations produced by the RNN layers of deep NLP models trained on the tasks of dependency parsing, semantic role labeling, machine translation, and language modeling. They demonstrated that deep RNNs trained on NLP tasks learned internal representations that captured soft hierarchical notions of syntax across different layers of the model (i.e., the representations taken from deeper layers of the RNNs perform better on higher-level syntax tasks than those from shallower layers), without explicit supervision. These results provided some insight as to why deep RNNs are able to model NLP tasks without annotated linguistic features. ELMo, for example, represents each word using a task-specific weighted sum of the language model’s hidden layers; i.e., rather than using only the top layer, ELMo selects which of the language model’s internal layers contain the most relevant information for the task at hand.
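
The probing setup used in this line of work is deliberately simple: freeze the pretrained network, extract the hidden states of one layer, and measure how well a small classifier can predict a syntactic label from them, repeating the exercise layer by layer. A hedged scikit-learn sketch of that recipe follows; a multinomial logistic regression stands in for the paper’s feedforward classifier, and the features and POS labels are random placeholders for real layer activations and annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholders: in a real probe these would be per-token hidden states taken from one
# layer of a frozen pretrained model, paired with gold POS tags for the same tokens.
rng = np.random.RandomState(0)
layer_activations = rng.randn(5000, 1024)          # (num_tokens, hidden_dim)
pos_tags = rng.randint(0, 12, size=5000)           # 12 placeholder POS classes

X_tr, X_te, y_tr, y_te = train_test_split(
    layer_activations, pos_tags, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy for this layer:", probe.score(X_te, y_te))

# Repeating this per layer and comparing accuracies is what reveals the "soft
# hierarchy": deeper layers tend to score higher on higher-level syntax tasks.
```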

An extremely interesting follow-on paper from Luke Zettlemoyer at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, with colleagues at the Allen Institute for Artificial Intelligence – Dissecting Contextual Word Embeddings: Architecture and Representation (August 2018) [note also the Appendices in that paper] – presented a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self-attention) influences both end task accuracy and qualitative properties of the representations that are learned. They showed there is a tradeoff between speed and accuracy, but all architectures learned high quality contextual representations that outperformed word embeddings for four challenging NLP tasks (natural language inference/textual entailment; semantic role labeling; constituency parsing; named entity recognition).

  • That study also showed that Deep biLM learned a rich hierarchy of contextual information, both at the word and span level, that was captured in three disparate types of network architectures (LSTM, CNN, or self-attention). In every case, the learned representations represented a rich hierarchy of contextual information throughout the layers of the network in an analogous manner to how deep CNNs trained for image classification learn a hierarchy of image features (Zeiler and Fergus, 2014). For example, they showed that in contrast to traditional word vectors which encode some semantic information, the word embedding layer of deep biLMs focused exclusively on word morphology. Moving upward in the network, the lowest contextual layers of biLM focused on local syntax, while the upper layers could be used to induce more semantic content such as within-sentence pronominal coreferent clusters. They also showed that the biLM activations could be used to form phrase representations useful for syntactic tasks.

    Together, these results suggest that large scale biLM, independent of architecture, are learning much more about the structure of language than previously appreciated.

    Regarding the following figure, note the more-or-less similar behavior of the three models on tasks that differ in difficulty/complexity; in particular, note the changes in accuracy across the depth (layers) of those models. Layer-wise quantitative data are provided in the Appendix of that paper.

    arxiv-1808.08949.png

    [Image source. Click image to open in new window.]


Evaluation of Sentence Embeddings in Downstream and Linguistic Probing Tasks (Jun 2018) surveyed recent unsupervised word embedding models, including fastText, ELMo, InferSent, and other models (discussed elsewhere in this REVIEW). They noted that two main challenges exist when learning high-quality representations: they should capture semantics, syntax, and the different meanings the word can represent in different contexts (polysemy).

ELMo addressed both of those issues. As in fastText, ELMo breaks the tradition of word embeddings by incorporating sub-word units, but ELMo also has some fundamental differences from previous shallow representations such as fastText or Word2Vec. ELMo uses a deep representation by incorporating the internal representations of the LSTM network, therefore capturing the meaning and syntactical aspects of words. Since ELMo is based on a language model, each token representation is a function of the entire input sentence, which can overcome the limitations of previous word embeddings where each word is usually modeled as an average of its multiple contexts. ELMo embeddings provide a better understanding of the contextual meaning of a word, as opposed to traditional word embeddings that are not only context-independent but have a very limited definition of context.

In that paper, it was also interesting to see how the different models performed on different tasks. For example:

  • As discussed in Section 5.1/Table 6
    Source
      (datasets
    Source
    ), ELMo (a language model that employs two Bi-LSTM layers), the Transformer (attention-based) version of USE (Universal Sentence Encoder), and InferSent (a Bi-LSTM trained on the SNLI dataset) generally performed well on downstream classification tasks (Table 6).

    “As seen in Table 6, although no method had a consistent performance among all tasks, ELMo achieved best results in 5 out of 9 tasks. Even though ELMo was trained on a language model objective, it is important to note that in this experiment a bag-of-words approach was employed. Therefore, these results are quite impressive, which lead us to believe that excellent results can be obtained by integrating ELMo and [its] trainable task-specific weighting scheme into InferSent. InferSent achieved very good results in the paraphrase detection as well as in the SICK-E (entailment). We hypothesize that these results were due to the similarity of these tasks to the tasks [that] InferSent was trained on (SNLI and MultiNLI). … The Universal Sentence Encoder (USE) model with the Transformer encoder also achieved good results on the product review (CR) and on the question-type (TREC) tasks. Given that the USE model was trained on SNLI as well as on web question-answer pages, it is possible that these results were also due to the similarity of these tasks to the training data employed by the USE model.”

  • Discussed in Section 5.2/Table 7
    Source
      (datasets
    Source
    ) USE-Transformer and InferSent performed the best on semantic relatedness and textual similarity tasks.

  • Discussed in Section 5.3/Table 8
    Source
      (datasets
    Source
    ) ELMo generally outperformed the other models on linguistic probing tasks.

  • Discussed in Section 5.4/Table 9
    Source
      (datasets
    Source
    ), InferSent outperformed the other models in information retrieval tasks.

Neural language models (LM) are more capable of detecting long distance dependencies than traditional n-gram models, serving as stronger models of natural language. However, it is unclear what kinds of properties of language these models encode, which prevents their use as explanatory models and makes it difficult to relate them to formal linguistic knowledge of natural language. There is increasing interest in investigating the kinds of linguistic information that are represented by LM, with a strong focus on their syntactic abilities, as well as semantic understanding, such as negative polarity items (NPI). NPI are a class of words that bear the special feature that they need to be licensed by a specific licensing context (LC). A common example of an NPI and LC in English are “any” and “not”, respectively: the sentence “He didn’t buy any books.” is correct, whereas “He did buy any books.” is not.

Do Language Models Understand Anything? On the Ability of LSTMs to Understand Negative Polarity Items (Aug 2018) discussed language models and negative polarity items, showing that the model found a relation between the licensing context and the negative polarity item, and appeared to be aware of the scope of this context, which they extracted from a parse tree of the sentence. This research paves the way for other studies linking formal linguistics to deep learning.

arxiv-1808.10627c.png

[Image source. Click image to open in new window.]


Character language models have access to surface morphological patterns, but it is not clear whether or how they learn abstract morphological regularities. Indicatements that Character Language Models Learn English Morpho-syntactic Units and Regularities (Aug 2018) studied a “wordless” character language model with several probes, finding that it could develop a specific unit to identify word boundaries and, by extension, morpheme boundaries, which allowed it to capture linguistic properties and regularities of these units. Their language model proved surprisingly good at identifying the selectional restrictions of English derivational morphemes, a task that required both morphological and syntactic awareness. They concluded that, when morphemes overlap extensively with the words of a language, a character language model can perform morphological abstraction.

A morpheme is a meaningful morphological unit of a language that cannot be further divided; e.g., “incoming” consists of the morphemes “in”, “come” and “-ing”. Another example: “dogs” consists of two morphemes and one syllable: “dog”, and “-s”. A morpheme may or may not stand alone, whereas a word, by definition, is freestanding.

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation (Sep 2018) [code] showed that each embedding model captured more information than is directly apparent, yet their potential performance is limited by the impossibility of optimally surfacing divergent linguistic information at the same time. For example, in word analogy experiments they were able to achieve significant improvements over the original embeddings, yet every improvement in semantic analogies came at the cost of a degradation in syntactic analogies and vice versa. At the same time, their work showed that the effect of this phenomenon is different for unsupervised systems that directly use embedding similarities and supervised systems that use pretrained embeddings as features, as the latter have enough expressive power to learn the optimal balance themselves.

Relevant to the language models domain (if not directly employed), Firearms and Tigers are Dangerous, Kitchen Knives and Zebras are Not: Testing whether Word Embeddings Can Tell (Sep 2018) presented an approach to investigating the nature of semantic information captured by word embeddings. They tested the ability of supervised classifiers (a logistic regression classifier, and a basic neural network) to identify semantic features in word embedding vectors, and compared this to a feature identification method based on full vector cosine similarity. The idea behind this method was that properties identified by the classifiers (but not through full vector comparison) are captured by the embeddings, while properties that cannot be identified by either method are not captured by the embeddings. Their results provided an initial indication that semantic properties relevant to the way entities interact (e.g. dangerous) were captured, while perceptual information (e.g. colors) was not.
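
A hedged sketch of the two detection strategies being contrasted – a supervised classifier over embedding dimensions versus a full-vector cosine-similarity test – is shown below. Random vectors stand in for real word embeddings and property labels, and the nearest-centroid variant of the similarity test is an assumption for illustration, not the paper’s exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
embeddings = rng.randn(2000, 300)               # stand-ins for word vectors
has_property = rng.randint(0, 2, size=2000)     # stand-in labels for a property, e.g. "dangerous"
train, test = slice(0, 1500), slice(1500, None)

# Strategy 1: supervised probe - can a classifier read the property off the vector?
clf = LogisticRegression(max_iter=1000).fit(embeddings[train], has_property[train])
print("classifier accuracy:", clf.score(embeddings[test], has_property[test]))

# Strategy 2: full-vector cosine similarity to the centroid of positive training examples.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

centroid = embeddings[train][has_property[train] == 1].mean(axis=0)
sims = np.array([cosine(v, centroid) for v in embeddings[test]])
preds = (sims > np.median(sims)).astype(int)
print("cosine-baseline accuracy:", (preds == has_property[test]).mean())

# Per the paper's logic: a property only the classifier detects is still "captured" by
# the embeddings; a property neither method detects is taken as not captured.
```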

Generative adversarial networks (GANs) are a promising approach for text generation that, unlike traditional language models (LM), does not suffer from the problem of “exposure bias.” However, a major hurdle for understanding the potential of GANs for text generation is the lack of a clear evaluation metric. Evaluating Text GANs as Language Models (Oct 2018) proposed approximating the distribution of text generated by a GAN, which permitted evaluating GANs with traditional probability-based LM metrics. They applied their approximation procedure to several GAN-based models, showing that they performed substantially worse than state of the art LMs. Their evaluation procedure promotes a better understanding of the relation between GANs and LMs, and could accelerate progress in GAN-based text generation.

arxiv1810.12686-f1.png

[Image source. Click image to open in new window.]


arxiv1810.12686-t1+f2.png

[Image source. Click image to open in new window.]


Language Models:

Additional Reading

  • Transformer-XL: Language Modeling with Longer-Term Dependency (ICLR 2019) [discussion;  mentioned]

    “We propose a novel architecture, Transformer-XL, for language modeling with longer-term dependency. Our main technical contributions include introducing the notion of recurrence in a purely self-attentive model and deriving a novel positional encoding scheme. Transformer-XL is the first self-attention model that achieves substantially better results than RNNs on both character-level and word-level language modeling. Transformer-XL is also able to model longer-term dependency than RNNs and Transformer.”

  • Improving Sentence Representations with Multi-view Frameworks

    “… we present two multi-view frameworks for learning sentence representations in an unsupervised fashion. One framework uses a generative objective and the other a discriminative one. In both frameworks, the final representation is an ensemble of two views, in which one view encodes the input sentence with a recurrent neural network (RNN) and the other view encodes it with a simple linear model. We show that, after learning, the vectors produced by our multi-view frameworks provide improved representations over their single-view learned counterparts, and the combination of different views gives representational improvement over each view and demonstrates solid transferability on standard downstream tasks.”

  • BioSentVec: creating sentence embeddings for biomedical texts (Oct 2018) [dataset]

    • “… Although pre-trained sentence encoders are available in the general domain, none exists for biomedical texts to date. In this work, we introduce BioSentVec: the first open set of sentence embeddings trained with over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. We evaluate BioSentVec embeddings in two sentence pair similarity tasks in different text genres. Our benchmarking results demonstrate that the BioSentVec embeddings can better capture sentence semantics compared to the other competitive alternatives and achieve state-of-the-art performance in both tasks. We expect BioSentVec to facilitate the research and development in biomedical text mining and to complement the existing resources in biomedical word embeddings.”

    • BioWordVec: biomedical word embeddings with fastText. We applied fastText to compute 200-dimensional word embeddings. We set the window size to be 20, learning rate 0.05, sampling threshold 1e-4, and negative examples 10. Both the word vectors and the model with hyperparameters are available for download below. The model file can be used to compute word vectors that are not in the dictionary (i.e. out-of-vocabulary terms). A fastText training call with these hyperparameters is sketched after this list.

    BioSentVec [1]: biomedical sentence embeddings with sent2vec. We applied sent2vec to compute the 700-dimensional sentence embeddings. We used the bigram model and set window size to be 20 and negative examples 10.

  • Do RNNs learn human-like abstract word order preferences? (Nov 2018) [code]

    “RNN language models have achieved state-of-the-art results on various tasks, but what exactly they are representing about syntax is as yet unclear. Here we investigate whether RNN language models learn humanlike word order preferences in syntactic alternations. We collect language model surprisal [← sic] scores for controlled sentence stimuli exhibiting major syntactic alternations in English: heavy NP shift, particle shift, the dative alternation, and the genitive alternation. We show that RNN language models reproduce human preferences in these alternations based on NP length, animacy, and definiteness. We collect human acceptability ratings for our stimuli, in the first acceptability judgment experiment directly manipulating the predictors of syntactic alternations. We show that the RNNs’ performance is similar to the human acceptability ratings and is not matched by an n-gram baseline model. Our results show that RNNs learn the abstract features of weight, animacy, and definiteness which underlie soft constraints on syntactic alternations.”

  • Natural language understanding for task oriented dialog in the biomedical domain in a low resources context (Nov 2018)

    “In the biomedical domain, the lack of sharable datasets often limit the possibility of developing natural language processing systems, especially dialogue applications and natural language understanding models. To overcome this issue, we explore data generation using templates and terminologies and data augmentation approaches. Namely, we report our experiments using paraphrasing and word representations learned on a large EHR corpus with fastText and ELMo, to learn a NLU model without any available dataset. We evaluate on a NLU task of natural language queries in EHRs divided in slot-filling and intent classification sub-tasks. On the slot-filling task, we obtain a F-score of 0.76 with the ELMo representation; and on the classification task, a mean F-score of 0.71. Our results show that this method could be used to develop a baseline system.”
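
For the BioWordVec bullet above, the stated fastText hyperparameters translate directly into a training call. The sketch below uses the fastText Python bindings under stated assumptions: the corpus path is a placeholder, the skipgram model type is assumed (the bullet does not specify it), and this is not the authors’ actual training script.

```python
import fasttext

# Hyperparameters as listed for BioWordVec above: 200-d vectors, window size 20,
# learning rate 0.05, sampling threshold 1e-4, 10 negative examples.
model = fasttext.train_unsupervised(
    input="biomedical_corpus.txt",   # placeholder path to a tokenized text corpus
    model="skipgram",                # assumption; not specified in the bullet above
    dim=200,
    ws=20,
    lr=0.05,
    t=1e-4,
    neg=10,
)

# Subword information lets fastText produce vectors even for out-of-vocabulary terms.
vector = model.get_word_vector("angiogenesis")
```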

[Table of Contents]

Probing Hierarchical Syntax Embedded in Pretrained Language Model Layers

Here I collate and summarize/paraphrase discussion that relates to the soft hierarchical syntax captured in various layers (embeddings) in pretrained language models. Very exciting and very powerful.



  • Richard Socher and colleagues [SalesForce: A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks (Jul 2017)] introduced a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. Higher layers included shortcut connections (i.e., residual layers) to lower-level task predictions to reflect linguistic hierarchies. They used a simple regularization term to allow for optimizing all model weights to improve the loss of a single task, without exhibiting catastrophic interference of the other tasks.

    • “… We presented a joint many-task model to handle multiple NLP tasks with growing depth in a single end-to-end model. Our model is successively trained by considering linguistic hierarchies, directly feeding word representations into all layers, explicitly using low-level predictions, and applying successive regularization. In experiments on five NLP tasks, our single model achieves the state-of-the-art or competitive results on chunking, dependency parsing, semantic relatedness, and textual entailment.

    arxiv1611.01587-fig1.png

    [Image source. Click image to open in new window.]


    arxiv1611.01587-t1_through_t8.png

    [Image source. Click image to open in new window.]


    arxiv1611.01587-t9+t10+t11+t12.png

    [Image source. Click image to open in new window.]


  • Deep RNNs Encode Soft Hierarchical Syntax (May 2018), by Terra Blevins, Omer Levy, and Luke Zettlemoyer at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, evaluated how well a simple feedforward classifier could detect syntax features (part of speech tags as well as various levels of constituent labels) from the word representations produced by the RNN layers of deep NLP models trained on the tasks of dependency parsing, semantic role labeling, machine translation, and language modeling. They demonstrated that deep RNNs trained on NLP tasks learned internal representations that captured soft hierarchical notions of syntax across different layers of the model (i.e., the representations taken from deeper layers of the RNNs perform better on higher-level syntax tasks than those from shallower layers), without explicit supervision. These results provided some insight as to why deep RNNs are able to model NLP tasks without annotated linguistic features. ELMo, for example, represents each word using a task-specific weighted sum of the language model’s hidden layers; i.e., rather than using only the top layer, ELMo selects which of the language model’s internal layers contain the most relevant information for the task at hand.

    • Specifically, they trained the models to predict POS tags as well as constituent labels at different depths of a parse tree. They found that all models indeed encoded a significant amount of syntax and – in particular – that language models learned some syntax.

    arxiv1805.04218-t1+f1+f2+f3.png

    [Image source. Click image to open in new window.]


    arxiv1805.04218-t2+t3.png

    [Image source. Click image to open in new window.]


  • Retrieve-and-Read: Multi-task Learning of Information Retrieval and Reading Comprehension (Aug 2018) presented an interesting approach, “machine reading at scale” (MRS) wherein, given a question, a system retrieves passages relevant to the question from a corpus (IR: information retrieval) and then extracts the answer span from the retrieved passages (RC: reading comprehension). …

    arxiv1808.10628b.png

    [Image source. Click image to open in new window.]


    • “Our Retrieve-and-Read model is based on the bi-directional attention flow (BiDAF ) model, which is a standard RC model. As shown in Figure 2 [above] it consists of six layers: … We note that the RC component trained with single-task learning is essentially equivalent to BiDAF, except for the word embedding layer that has been modified to improve accuracy. … Note that the original BiDAF uses a pre-trained GloVe and also trains character-level embeddings by using a CNN in order to handle out-of-vocabulary (OOV) or rare words. Instead of using GloVe and CNN, our model uses fastText for the fixed pre-trained word vectors and removes character-level embeddings. The fastText model takes into account subword information and can obtain valid representations even for OOV words.”

  • Much effort has been devoted to evaluating whether multitask learning can be leveraged to learn rich representations that can be used in various NLP downstream applications. However, there is still a lack of understanding of the settings in which multitask learning has a significant effect. A Hierarchical Multitask Approach for Learning Embeddings from Semantic Tasks (Nov 2018) [code;  demo;  media], by Sanh et al. [Sebastian Ruder], introduced a hierarchical model trained in a multitask learning setup on a set of carefully selected semantic tasks. The model was trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieved state of the art results on a number of tasks – named entity recognition, entity mention detection and relation extraction – without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induced a set of shared semantic representations at lower layers of the model. They showed that as they moved from the bottom to the top layers of the model, the hidden states of the layers tended to represent more complex semantic information.

    arxiv1811.06031-t1+t6.png

    [Image source. Click image to open in new window.]


    arxiv1811.06031-f1.png

    [Image source. Click image to open in new window.]


    arxiv1811.06031-t3+t4.png

    [Image source. Click image to open in new window.]


    Demo (note errors!):

    1811.06031-demo.png

    [Image source. Click image to open in new window.]


[Probing Hierarchical Syntax Embedded in Pretrained Language Model Layers:]

See also:

  • Gated Self-Matching Networks  (R-Net) (2017) – multilayer, end-to-end neural networks whose novelty lay in the use of a gated attention mechanism to provide different levels of importance to different parts of passages. It also used a self-matching attention for the context to aggregate evidence from the entire passage to refine the query-aware context representation obtained. The architecture contained character and word embedding layers, followed by question-passage encoding and matching layers, a passage self-matching layer and an output layer.

    Wang2017-f1.png

    [Image source. Click image to open in new window.]


  • Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. However, we still do not have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn.

    Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis compared four objectives – language modeling, translation, skip-thought, and autoencoding – on their ability to induce syntactic and part of speech information. They made a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. They found that representations from language models consistently performed best on their syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggested that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. They also found that the representations from randomly-initialized, frozen LSTMs performed strikingly well on their syntactic auxiliary tasks, but that effect disappeared when the amount of training data for the auxiliary tasks was reduced.

    arxiv1809.10040-t1.png

    [Image source. Click image to open in new window.]


    arxiv1809.10040-f2.png

    [Image source. Click image to open in new window.]


    arxiv1809.10040-f4+f5.png

    [Image source. Click image to open in new window.]


  • Hierarchical Multitask Learning for CTC-based Speech Recognition (Jul 2018) [Summary]:  “Previous work has shown that neural encoder-decoder speech recognition can be improved with hierarchical multitask learning, where auxiliary tasks are added at intermediate layers of a deep encoder. We explore the effect of hierarchical multitask learning in the context of connectionist temporal classification (CTC)-based speech recognition, and investigate several aspects of this approach. Consistent with previous work, we observe performance improvements on telephone conversational speech recognition (specifically the Eval2000 test sets) when training a subword-level CTC model with an auxiliary phone loss at an intermediate layer. We analyze the effects of a number of experimental variables (like interpolation constant and position of the auxiliary loss function), performance in lower-resource settings, and the relationship between pretraining and multitask learning. We observe that the hierarchical multitask approach improves over standard multitask training in our higher-data experiments, while in the low-resource settings standard multitask training works well. The best results are obtained by combining hierarchical multitask learning and pretraining, which improves word error rates by 3.4% absolute on the Eval2000 test sets.”

[Table of Contents]

RNN, CNN, or Self-Attention?

In the course of writing this REVIEW and in my other readings I often encountered discussions of RNN vs. CNN vs. self-attention architectures in regard to NLP and language models. Here, I collate and summarize/paraphrase some of those observations; green-colored URLs are internal hyperlinks to discussions of those items elsewhere in this REVIEW.

  • Dissecting Contextual Word Embeddings: Architecture and Representation (Aug 2018) discussed contextual word representations derived from pretrained bidirectional language models (biLM), showing that Deep biLM learned a rich hierarchy of contextual information that was captured in three disparate types of network architectures: LSTM, CNN, or self-attention. In every case, the learned representations represented a rich hierarchy of contextual information throughout the layers of the network.

  • Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers (Jun 2018) studied the role of linguistic context in predicting quantifiers (“few”, “all”). Overall, LSTM were the best-performing architectures, with CNN showing some potential in the handling of longer sequences.

    arxiv-1806.00354d.png

    [Image source. Click image to open in new window.]


  • Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum (May 2018) discussed LSTM vs. self attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTM: LSTM are a hybrid of a simple RNN (S-RNN) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicated that LSTM suffer little to no performance loss when removing the S-RNN. This provided evidence that the gating mechanism was doing the heavy lifting in modeling context. They further ablated the recurrence in each gate and found that this incurred only a modest drop in performance, indicating that the real modeling power of LSTM stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allowed them to mathematically relate LSTM and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enabled the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.
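
    The “element-wise weighted sum” view can be made explicit by unrolling the standard LSTM cell-state recursion (with $\small \odot$ denoting element-wise multiplication and $\small c_0 = 0$):

    $$\small c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t = \sum_{j=1}^{t} \Big( \prod_{k=j+1}^{t} f_k \Big) \odot i_j \odot \tilde{c}_j$$

    Each cell state is thus a sum of the candidate vectors $\small \tilde{c}_j$, weighted by gate products computed dynamically from the context. In the paper’s ablation, removing the S-RNN makes $\small \tilde{c}_j$ a function of the input $\small x_j$ alone, so $\small c_t$ becomes exactly a dynamically weighted sum of context-independent input transformations – which is the sense in which the LSTM can be read as an attention-like mechanism.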

  • While RNN are a cornerstone in learning latent representations from long text sequences, a purely convolutional and deconvolutional autoencoding framework may be employed, as described in Deconvolutional Paragraph Representation Learning (Sep 2018). That paper addressed the issue that the quality of sentences during RNN-based decoding (reconstruction) decreased with the length of the text. Compared to RNN, their framework was better at reconstructing and correcting long paragraphs. Note Table 1 in their paper (showing paragraphs reconstructed
    Source
    by LSTM and CNN, as well as the vastly superior BLEU / ROUGE scores in Table 2
    Source
    ); there is also additional NLP-related LSTM vs. CNN discussion in this Hacker News thread.

  • Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition (Aug 2018) compared the use of LSTM based and CNN based character level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus showed that the use of either type of character level word embeddings in conjunction with the BiLSTM-CRF models led to comparable state of the art performance. However, the models using CNN based character level word embeddings had a computational performance advantage, increasing training time over word based models by 25% while the LSTM based character level word embeddings more than doubled the required training time.

    arxiv-1808.08450a.png

    [Image source. Click image to open in new window.]


    arxiv-1808.08450b.png

    [Image source. Click image to open in new window.]


  • Recently, non-recurrent architectures (convolutional; self-attentional) have outperformed RNN in neural machine translation. CNN and self-attentional networks can connect distant words via shorter network paths than RNN, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument had not been tested empirically, nor had alternative explanations for their strong performance been explored in-depth. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (Aug 2018) hypothesized that the strong performance of CNN and self-attentional networks could be due to their ability to extract semantic features from the source text. They evaluated RNN, CNN and self-attention networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Experimental results showed that self-attentional networks and CNN did not outperform RNN in modeling subject-verb agreement over long distances, and that self-attentional networks performed distinctly better than RNN and CNN on word sense disambiguation.

    arxiv-1808.08946d.png

    [Image source. Click image to open in new window.]


  • Recent advances in network architectures for neural machine translation (NMT) have effectively replaced recurrent models with either convolutional or self-attentional approaches, such as the Transformer architecture. While the main innovation of Transformer was its use of self-attentional layers, there are several other aspects – such as attention with multiple heads, and the use of many attention layers – that distinguished the model from previous baselines. How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures (2018) [code] took a fine-grained look at the different architectures for NMT. They introduced an Architecture Definition Language (ADL) allowing for a flexible combination of common building blocks. Making use of that language, they showed that one can bring recurrent and convolutional models very close to the Transformer performance by borrowing concepts from the Transformer architecture, but not using self-attention. Additionally, they found that self-attention was much more important for the encoder than for the decoder, where in most settings it could be replaced by a RNN or CNN without a loss in performance. Surprisingly, even a model without any target side self-attention performed well.

    “We found that RNN based models benefit from multiple source attention mechanisms and residual feed-forward blocks. CNN based models on the other hand can be improved through layer normalization and also feed-forward blocks. These variations bring the RNN and CNN based models close to the Transformer. Furthermore, we showed that one can successfully combine architectures. We found that self-attention is much more important on the encoder side than it is on the decoder side, where even a model without self-attention performed surprisingly well. For the data sets we evaluated on, models with self-attention on the encoder side and either an RNN or CNN on the decoder side performed competitively to the Transformer model in most cases.”

  • Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (Aug 2018) proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral difference between convolutional and recurrent neural networks, they generated adversarial examples to confuse the model and compare to human performance. They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation function and formed majority-vote ensembles of the nine models with the highest validation accuracy. All the hierarchical single and ensemble models outperformed the previous state of the art on both the validation and test sets. With a test accuracy of 85.12, the RNN-LSTM ensemble achieved a new state of the art that is more than five percentage points above the previous best result. Furthermore, the RNN-LSTM aggregation function is superior to aggregation via CNNs, improving the validation accuracy by 1.5 percentage points.

    The hierarchical structure was crucial for the model’s success. Adding it to the CNN that operates only at word level caused a pronounced improvement on the validation set. It seems to be the case that the hierarchical structure helps the model to gain confidence, causing more models to make the correct prediction. In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

  • The architecture proposed in *QANet* : Combining Local Convolution with Global Self-Attention for Reading Comprehension (Apr 2018) did not require RNN: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD1.1 dataset their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data.

    • Likewise, a later paper, A Fully Attention-Based Information Retriever (Oct 2018), which also relied entirely on a (convolutional and/or) self-attentional model, achieved competitive results on SQuAD1.1 while having fewer parameters and being faster at both learning and inference than rival (largely RNN-based) methods. Their FABIR model was significantly outperformed by the highly similar – and non-cited – competing QANet model.

  • Another model, Reinforced Mnemonic Reader for Machine Reading Comprehension (Jun 2018), performed as well as QANet. Based on a Bi-LSTM, Reinforced Mnemonic Reader is an enhanced attention reader – suggesting perhaps that the improvements in QANet, Reinforced Mnemonic Reader, and the work described in Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (second preceding paragraph) were due to the attention mechanisms, rather than the RNN or CNN architectures.

  • Likewise, Constituency Parsing with a Self-Attentive Encoder (May 2018) [code] demonstrated that replacing an LSTM encoder with a self-attentive architecture could lead to improvements to a state of the art discriminative constituency parser. The use of attention made explicit the manner in which information was propagated between different locations in the sentence; for example, separating positional and content information in the encoder led to improved parsing accuracy. They evaluated a version of their model that used ELMo as the sole lexical representation, using publicly available ELMo weights. Trained on the Penn Treebank, their parser attained 93.55 $\small F_1$ without the use of any external data, and 95.13 $\small F_1$ when using pre-trained word representations. The gains came not only from incorporating more information (such as subword features or externally trained word representations), but also from structuring the architecture to separate different kinds of information from each other.

    arxiv1805.01052e.png

    [Image source. Click image to open in new window.]


    arxiv1805.01052d.png

    [Image source. Click image to open in new window.]


[Table of Contents]

LSTM, Attention and Gated (Recurrent) Units

Here I collate and summarize/paraphrase gated unit mechanism-related discussion from elsewhere in this REVIEW.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Sep 2014) by Kyunghyun Cho et al. [Yoshua Bengio] described a RNN encoder-decoder for statistical machine translation, that introduced a new type of hidden unit ($\small \mathcal{f}$ in the equation, below) – the gated recurrent unit (GRU) – that was motivated by the LSTM unit but was much simpler to compute and implement.

    Recurrent Neural Networks. A recurrent neural network (RNN) is a neural network that consists of a hidden state $\small \mathbf{h}$ and an optional output $\small \mathbf{y}$ which operates on a variable-length sequence $\small \mathbf{x} = (x_1, \ldots, x_T)$. At each time step $\small t$, the hidden state $\small \mathbf{h_{\langle t \rangle}}$ of the RNN is updated by

      $\small \mathbf{h_{\langle t \rangle}} = f (\mathbf{h_{\langle t-1 \rangle}}, x_t)$,
    where $\small \mathcal{f}$ is a non-linear activation function. $\small \mathcal{f}$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

    Hidden Unit that Adaptively Remembers and Forgets. ... we also propose a new type of hidden unit ($\small f$ in the equation, above) that has been motivated by the LSTM unit but is much simpler to compute and implement. [The LSTM unit has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit.]

    This figure shows the graphical depiction of the proposed hidden unit:

    arxiv1406.1078-f2.png
    [Image source. Click image to open in new window.]

    "In this formulation [see Section 2.3 in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation for details], when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.

    "On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013). As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active. ..."

In their very highly cited paper Neural Machine Translation by Jointly Learning to Align and Translate (Sep 2014; updated May 2016), Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio employed their “gated hidden unit” (a GRU) – introduced by Cho et al. (2014) (above) – for neural machine translation. Their model consisted of a forward and backward pair of RNN (BiRNN) for the encoder, and a decoder that emulated searching through a source sentence while decoding a translation.

From Appendix A in that paper:

“For the activation function $\small f$ of an RNN, we use the gated hidden unit recently proposed by Cho et al. (2014a). The gated hidden unit is an alternative to the conventional simple units such as an element-wise $\small \text{tanh}$. This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies. This is made possible by having computation paths in the unfolded RNN for which the product of derivatives is close to 1. These paths allow gradients to flow backward easily without suffering too much from the vanishing effect. It is therefore possible to use LSTM units instead of the gated hidden unit described here, as was done in a similar context by Sutskever et al. (2014).”

Discussed in Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention (Jul 2018):

“The attention mechanism has been a breakthrough in neural machine translation (NMT) in recent years. This mechanism calculates how much attention the network should give to each source word to generate a specific translated word. The context vector calculated by the attention mechanism mimics the syntactic skeleton of the input sentence precisely given a sufficient number of examples. Recent work suggests that incorporating explicit syntax alleviates the burden of modeling grammatical understanding and semantic knowledge from the model.”

GRUs have fewer parameters than LSTMs, as they lack an output gate. A GRU has two gates (an update gate and a reset gate), whereas an LSTM has three gates (input, forget and output gates) plus a separate memory cell. The GRU update gate decides how much information from the past should be let through, while the reset gate decides how much information from the past should be discarded. What motivates this? Although RNNs can theoretically capture long-term dependencies, they are actually very hard to train to do so [see this discussion]. GRUs are designed to have more persistent memory, thereby making it easier to capture long-term dependencies. Even though a GRU is computationally more efficient than an LSTM (owing to its reduced number of gates), it still tends to come second to the LSTM in terms of performance. GRUs are therefore often used when faster training matters and computational power is limited.
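
The parameter savings are easy to see directly; below is a quick sketch using PyTorch's built-in recurrent layers (the layer sizes are arbitrary, and exact counts depend on bias handling and the particular implementation):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 256, 512
gru = nn.GRU(input_size, hidden_size)    # 3 gate/candidate blocks
lstm = nn.LSTM(input_size, hidden_size)  # 4 gate/candidate blocks

# With a single layer and default biases, the LSTM has roughly 4/3 the parameters of the GRU:
print("GRU :", n_params(gru))    # 1,182,720 = 3 * (512*(256+512) + 2*512)
print("LSTM:", n_params(lstm))   # 1,576,960 = 4 * (512*(256+512) + 2*512)
```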

Counting in Language with RNNs (Oct 2018) examined a possible reason for the LSTM outperforming the GRU on language modeling and, more specifically, machine translation. They hypothesized that this had to do with counting – a consistent theme across the literature on long-term dependence, counting, and language modeling for RNNs. Using simplified forms of language – context-free and context-sensitive languages – they showed how the LSTM performs its counting based on its cell states during inference, and why the GRU cannot perform as well.

“As argued in the Introduction, we believe there is a lot of evidence supporting the claim that success at language modeling requires an ability to count. Since there is empirical support for the fact that the LSTM outperforms the GRU in language related tasks, we believe that our results showing how fundamental this inability to count is for the GRU, we believe we make a contribution to the study of both RNNs and their success on language related tasks. Our experiments along with the other recent paper by Weiss et al. [2017], show almost beyond reasonable doubt that the GRU is not able to count as well as the LSTM, furthering our hypothesis that there is a correlation between success at performance on language related tasks and the ability to count.”

Germane to this subsection (“LSTM, Attention and Gated (Recurrent) Units”) is the excellent companion blog post to When Recurrent Models Don't Need To Be Recurrent, in which coauthor John Miller discusses a very interesting paper by Dauphin et al., Language Modeling with Gated Convolutional Networks (Sep 2017). Some highlights from that paper:

  • “Gating has been shown to be essential for recurrent neural networks to reach state-of-the-art performance. Our gated linear units reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities (Section 5.2). We show that gated convolutional networks outperform other recently published language models such as LSTMs trained in a similar setting on the Google Billion Word Benchmark (Chelba et al., 2013). …

  • “Gating mechanisms control the path through which information flows in the network and have proven to be useful for recurrent neural networks. LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep. In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers. We show this mechanism to be useful for language modeling as it allows the model to select which words or features are relevant for predicting the next word. …

  • “Gated linear units are a simplified gating mechanism based on the work of Dauphin & Grangier [Predicting distributions with Linearizing Belief Networks (Nov 2015; updated May 2016)] for non-deterministic gates that reduce the vanishing gradient problem by having linear units coupled to the gates. This retains the non-linear capabilities of the layer while allowing the gradient to propagate through the linear unit without scaling. … We compare the different gating schemes experimentally in Section 5.2 and we find gated linear units allow for faster convergence to better perplexities.”  (A minimal sketch of a gated linear unit appears after this list.)

  • “The unlimited context offered by recurrent models is not strictly necessary for language modeling.” 

    “In other words, it’s possible you don’t need a large amount of context to do well on the prediction task on average. Recent theoretical work offers some evidence in favor of this view (Prediction with a Short Memory). Another explanation is given by Bai et al. (An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling):

  • “The ‘infinite memory’ advantage of RNNs is largely absent in practice.” 

    As Bai et al. report, even in experiments explicitly requiring long-term context, RNN variants were unable to learn long sequences. On the Billion Word Benchmark, an intriguing Google Technical Report suggests an LSTM $\small n$-gram model with $\small n=13$ words of memory is as good as an LSTM with arbitrary context (N-gram Language Modeling using Recurrent Neural Network Estimation). This evidence leads us to conjecture:

  • “Recurrent models trained in practice are effectively feedforward.” 

    This could happen either because truncated backpropagation through time cannot learn patterns significantly longer than $\small k$ steps, or, more provocatively, because models trainable by gradient descent cannot have long-term memory.
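
For concreteness, a gated linear unit of the kind described in the Dauphin et al. quotes above can be sketched in a few lines of NumPy; the weight names and shapes here are illustrative only:

```python
import numpy as np

def glu(x, W, b, V, c):
    """Gated linear unit: a linear transform modulated element-wise by a sigmoid gate.
    The linear half gives gradients an unscaled path; the sigmoid half does the gating."""
    gate = 1.0 / (1.0 + np.exp(-(x @ V + c)))
    return (x @ W + b) * gate

# toy usage: a batch of 2 vectors of dimension 4, mapped to dimension 3
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b, c = np.zeros(3), np.zeros(3)
print(glu(x, W, b, V, c).shape)   # (2, 3)
```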

Gated-Attention Readers for Text Comprehension (Jun 2016; updated Apr 2017) [code], by Ruslan Salakhutdinov and colleagues, employed the attention mechanism introduced by Yoshua Bengio and colleagues (Neural Machine Translation by Jointly Learning to Align and Translate) in their model, the Gated-Attention Reader (GA Reader). The GA Reader integrated a multi-hop architecture with a novel attention mechanism, which was based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enabled the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtained state of the art results on three benchmarks for this task. The effectiveness of multiplicative interaction was demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention.

“Deep learning models have been shown to outperform traditional shallow approaches on text comprehension tasks. The success of many recent models can be attributed primarily to two factors:

  1. Multi-hop architectures allow a model to scan the document and the question iteratively for multiple passes.
  2. Attention mechanisms, borrowed from the machine translation literature, allow the model to focus on appropriate subparts of the context document.

Intuitively, the multi-hop architecture allows the reader to incrementally refine token representations, and the attention mechanism re-weights different parts in the document according to their relevance to the query.

… In this paper, we focus on combining both in a complementary manner, by designing a novel attention mechanism which gates the evolving token representations across hops. … More specifically, unlike existing models where the query attention is applied either token-wise or sentence-wise to allow weighted aggregation, the Gated-Attention module proposed in this work allows the query to directly interact with each dimension of the token embeddings at the semantic-level, and is applied layer-wise as information filters during the multi-hop representation learning process. Such a fine-grained attention enables our model to learn conditional token representations with respect to the given question, leading to accurate answer selections.”
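
A rough sketch of this kind of multiplicative gated attention – each document token attends over the query, and the resulting query summary gates that token's representation element-wise – might look as follows. This is my own simplified rendering, not the authors' code; the actual GA Reader applies such a layer repeatedly inside a multi-hop recurrent architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(D, Q):
    """Multiplicative gated attention (sketch).
    D: (n_doc_tokens, dim) intermediate document token states.
    Q: (n_query_tokens, dim) query token embeddings."""
    scores = D @ Q.T                  # (n_doc, n_query) token/query compatibilities
    alpha = softmax(scores, axis=-1)  # attention over query tokens, per document token
    q_tilde = alpha @ Q               # (n_doc, dim) per-token query summaries
    return D * q_tilde                # element-wise (multiplicative) gating

rng = np.random.default_rng(0)
D, Q = rng.normal(size=(7, 16)), rng.normal(size=(3, 16))
print(gated_attention(D, Q).shape)    # (7, 16)
```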

arxiv1606.01549-f1.png

[Image source. Click image to open in new window.]


arxiv1606.01549-f3.png

[Image source. Click image to open in new window.]


arxiv1606.01549-t3.png

[Image source. Click image to open in new window.]


A recent review, Comparative Analysis of Neural QA Models on SQuAD (Jun 2018), reported that models based on a gated attention mechanism (R-Net ), or a GRU (DocQA ), performed well across a variety of tasks.

Gated Self-Matching Networks  (R-Net) – proposed by Wang et al. (2017) [code] – were multilayer, end-to-end neural networks whose novelty lay in the use of a gated attention mechanism to assign different levels of importance to different parts of passages. The model also used a self-matching attention mechanism for the context, to aggregate evidence from the entire passage and refine the query-aware context representation. The architecture contained character and word embedding layers, followed by question-passage encoding and matching layers, a passage self-matching layer and an output layer.

arxiv1806.06972-tables1+2.png

[Image source. Click image to open in new window.]


“… we present the gated self-matching networks for reading comprehension style question answering, which aims to answer questions from a given passage. We first match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation. Then we propose a self-matching attention mechanism to refine the representation by matching the passage against itself, which effectively encodes information from the whole passage. We finally employ the pointer networks to locate the positions of answers from the passages. We conduct extensive experiments on the SQuAD dataset. The single model achieves 71.3% on the evaluation metrics of exact match on the hidden test set, while the ensemble model further boosts the results to 75.9%. At the time of submission of the paper, our model holds the first place on the SQuAD Leaderboard for both single and ensemble model.”

  • “We choose to use Gated Recurrent Unit (GRU) (Cho et al., 2014) in our experiment since it performs similarly to LSTM (Hochreiter and Schmidhuber, 1997) but is computationally cheaper. … We propose a gated attention-based recurrent network to incorporate question information into passage representation. It is a variant of attention-based recurrent networks, with an additional gate to determine the importance of information in the passage regarding a question.”

Wang2017-f1.png

[Image source. Click image to open in new window.]


Wang2017-t2.png

[Image source. Click image to open in new window.]


Wang2017-f2.png

[Image source. Click image to open in new window.]


Microsoft Research recently published S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension (Jun 2017; updated Jan 2018), a novel approach to machine reading comprehension for the MS-MARCO dataset that aimed to answer a question from multiple passages via an extraction-then-synthesis framework to synthesize answers from extraction results. The Microsoft Research approach employed bidirectional gated recurrent units (BiGRU) rather than Bi-LSTM. The answer extraction model was first employed to predict the most important sub-spans from the passage as evidence, which the answer synthesis model took as additional features along with the question and passage to further elaborate the final answers. They built the answer extraction model for single passage reading comprehension, and proposed an additional task of ranking the single passages to help in answer extraction from multiple passages.

Facebook AI Research recently (May 2018) developed a seq2seq based self-attention mechanism to model long-range context (Hierarchical Neural Story Generation), demonstrated via story generation. They found that standard seq2seq models applied to hierarchical story generation were prone to degenerating into language models that paid little attention to the writing prompt (a problem noted in other domains, such as dialogue response generation). They tackled the challenges of story-telling with a hierarchical model, which first generated a sentence called “the prompt” (describing the topic for the story), and then “conditioned” on this prompt when generating the story. Conditioning on the prompt or premise made it easier to generate consistent stories, because it provided grounding for the overall plot. It also reduced the tendency of standard sequence models to drift off topic. To improve the relevance of the generated story to its prompt, they adopted a GRU-based fusion mechanism, which pretrained a language model and subsequently trained a seq2seq model with a gating mechanism that learned to leverage the final hidden layer of the language model during seq2seq training. The model showed, for the first time, that fusion mechanisms could help seq2seq models build dependencies between their input and output. (A minimal sketch of such a fusion gate follows the bullet points below.)

  • The gated self-attention mechanism allowed the model to condition on its previous outputs at different time-scales (i.e., to model long-range context).

  • Similar to Google’s Transformer, Facebook AI Research used multi-head attention to allow each head to attend to information at different positions. However, the queries, keys and values in their model were not given by linear projections (see Section 3.2.2 in the Transformer paper), but by more expressive gated deep neural nets with gated linear unit activations: gating lent the self-attention mechanism crucial capacity to make fine-grained selections.
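
Below is a minimal sketch of the fusion idea referenced above: gating a pretrained language model's hidden state before combining it with the seq2seq decoder state. The paper's exact formulation differs; the weight names and the simple concatenation scheme here are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_logits(h_dec, h_lm, W_g, W_o):
    """Gate the pretrained LM hidden state, then combine it with the
    seq2seq decoder state and project to output logits (illustration only)."""
    g = sigmoid(W_g @ np.concatenate([h_dec, h_lm]))   # learned gate over the LM features
    fused = np.concatenate([h_dec, g * h_lm])          # gated LM state joins the decoder state
    return W_o @ fused                                 # logits over the output vocabulary

rng = np.random.default_rng(0)
d, vocab = 8, 20
h_dec, h_lm = rng.normal(size=d), rng.normal(size=d)
W_g = rng.normal(size=(d, 2 * d))
W_o = rng.normal(size=(vocab, 2 * d))
print(fused_logits(h_dec, h_lm, W_g, W_o).shape)       # (20,)
```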

Researchers at Peking University (Junyang Lin et al.) recently developed a model that employed a Bi-LSTM decoder in a text summarization task [Global Encoding for Abstractive Summarization (Jun 2018)]. Their approach differed from a similar approach [not cited] by Richard Socher and colleagues at Salesforce, in that Lin et al. fed their encoder output at each time step into a convolutional gated unit, which with a self-attention mechanism allowed the encoder output at each time step to become a new representation vector, with further connection to the global source-side information. Self-attention encouraged the model to learn long-term dependencies, without creating much computational complexity. The gate (based on the generation from the CNN and self-attention module for the source representations from the RNN encoder) could perform global encoding on the encoder outputs. Based on the output of the CNN and self-attention, the logistic sigmoid function output a vector of values between 0 and 1 at each dimension. If the value was close to 0, the gate removed most of the information at the corresponding dimension of the source representation; if it was close to 1, it retained most of the information. The model thus performed neural abstractive summarization through a global encoding framework, which controlled the information flow from the encoder to the decoder based on the global information of the source context, generating summaries of higher quality while reducing repetition.
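
The core gating step just described amounts to an element-wise sigmoid filter over the encoder outputs. A minimal sketch follows; the CNN + self-attention module that produces the gate inputs is assumed to have been computed elsewhere, and names and shapes are illustrative.

```python
import numpy as np

def global_encoding_gate(encoder_outputs, gate_features):
    """Element-wise sigmoid gate over RNN encoder outputs (sketch).
    encoder_outputs: (T, dim) RNN encoder states.
    gate_features:   (T, dim) output of the CNN + self-attention module (assumed given).
    Gate values near 0 suppress a dimension of the source representation;
    values near 1 pass it through largely unchanged."""
    g = 1.0 / (1.0 + np.exp(-gate_features))
    return g * encoder_outputs

rng = np.random.default_rng(0)
enc, feat = rng.normal(size=(6, 10)), rng.normal(size=(6, 10))
print(global_encoding_gate(enc, feat).shape)   # (6, 10)
```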

In October 2018 Myeongjun Jang and Pilsung Kang at Korea University presented Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition, which introduced their P-thought  model. P-thought employed a seq2seq structure with a gated recurrent unit (GRU) cell. The encoder transformed the sequence of words from an input sentence into a fixed-sized representation vector, whereas the decoder generated the target sentence based on the given sentence representation vector. The P-thought model had two decoders: when the input sentence was given, the first decoder, named “auto-decoder,” generated the input sentence as-is. The second decoder, named “paraphrase-decoder,” generated the paraphrase sentence of the input sentence.

Biomedical event extraction is a crucial task in biomedical text mining. As the primary forum for international evaluation of different biomedical event extraction technologies, the BioNLP Shared Task represents a trend in biomedical text mining toward fine-grained information extraction. The 2016 BioNLP Shared Task (BioNLP-ST 2016) proposed three tasks, in which the “Bacteria Biotope” (BB) event extraction task was added to the previous BioNLP-ST. Biomedical event extraction based on GRU integrating attention mechanism (Aug 2018) proposed a novel gated recurrent unit network framework (integrating an attention mechanism) for extracting biomedical events between biotopes and bacteria from the biomedical literature, utilizing the corpus from the BioNLP-ST 2016 Bacteria Biotope task. The experimental results showed that the presented approach could achieve an $\small F$-score of 57.42% in the test set, outperforming previous state of the art official submissions to BioNLP-ST 2016.

PMID30367569-f1.png

[Image source. Click image to open in new window.]


PMID30367569-f2.png

[Image source. Click image to open in new window.]


PMID30367569-t2+t3.png

[Image source. Click image to open in new window.]


LSTM, Attention and Gated (Recurrent) Units:

Additional Reading

[Table of Contents]

Question Answering and Reading Comprehension

Question answering (QA), the identification of short accurate answers to users’ questions presented in natural language, has numerous applications in the biomedical and clinical sciences including directed search, interactive learning and discovery, clinical decision support, and recommendation. Due to the large size of the biomedical literature and a lack of efficient searching strategies, researchers and medical practitioners often struggle to obtain the information necessary for their needs. Moreover, even the most sophisticated search engines are not intelligent enough to interpret clinicians’ questions. Thus, there is an urgent need for information retrieval systems that accept queries in natural language and return accurate answers quickly and efficiently.

Question answering (a natural language understanding problem) and reading comprehension (the task of answering a natural language question about a paragraph) are of considerable interest in NLP motivating, for example, the Human-Computer Question Answering Competition (in the NIPS 2017 Competition Track), and the BioASQ Challenge in the BioNLP domain. Unlike generic text summarization, reading comprehension systems facilitate the answering of targeted questions about specific documents, efficiently extracting facts and insights (How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks).

The Stanford Question Answering Dataset / Leaderboard (SQuAD: developed at Stanford University) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment or span of text from the corresponding reading passage (or, the question might be unanswerable). There has been a rapid progress on the SQuAD dataset, and early in 2018 engineered systems started achieving and surpassing human level accuracy on the SQuAD1.1 Leaderboard  (discussion: AI Outperforms Humans in Question Answering: Review of three winning SQuAD systems). SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written by crowdworkers to look adversarially similar to answerable ones (ACL 2018 Best Short Paper: Know What You Don’t Know: Unanswerable Questions for SQuAD;  [project/code]). To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Carnegie Mellon University/Google Brain’s QANet : Combining Local Convolution with Global Self-Attention for Reading Comprehension (Apr 2018) [OpenReview | discussion/TensorFlow implementation | code] proposed a method (QANet) that did not require RNN: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD dataset (SQuAD1.1: see the leaderboard), their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data.

  • Note that A Fully Attention-Based Information Retriever (Oct 2018) – which failed to cite the earlier, more performant QANet work, which scores much higher on the SQuAD1.1 Leaderboard – also employed an entirely convolutional and/or self-attention architecture, which performed satisfactorily on the SQuAD1.1 dataset and was faster to train than RNN-based approaches.

arxiv1804.09541.png

[Image source. Click image to open in new window.]


adversarial_SQuAD.png

[Image sources: Table 6Table 3. Click image to open in new window.]


Another model, Reinforced Mnemonic Reader for Machine Reading Comprehension (May 2017; updated Jun 2018) [non-author implementations: MnemonicReader | MRC | MRC-models] performed as well as QANet, outperforming previous systems by over 6% in terms of both Exact Match (EM) and $\small F_1$ metrics on two adversarial SQuAD datasets. Reinforced Mnemonic Reader, based on Bi-LSTM, is an enhanced attention reader with two main contributions: (i) a reattention mechanism, introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures, and (ii) a dynamic-critical reinforcement learning approach, to address the convergence suppression problem that exists in traditional reinforcement learning methods.
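
Since Exact Match (EM) and $\small F_1$ scores are quoted throughout this section, here is a minimal sketch of how these SQuAD-style, token-level metrics are typically computed; the official evaluation script additionally normalizes articles and punctuation, which this sketch skips.

```python
from collections import Counter

def normalize(text):
    """Crude normalization; the official SQuAD script also strips articles and punctuation."""
    return text.lower().split()

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)   # multiset token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Normans", "The Normans"))                  # 1.0
print(round(f1_score("in the 10th century", "10th century"), 2))  # 0.67
```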

arxiv1705.02798c.png

[Image source. Click image to open in new window.]


arxiv1705.02798e.png

[Image source. Click image to open in new window.]


In April 2018 IBM Research introduced a new dataset for reading comprehension (DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension)  [project | data].  DuoRC is a large scale reading comprehension (RC) dataset of 186K human-generated QA pairs created from 7,680 pairs of parallel movie plots taken from Wikipedia and IMDb. By design, DuoRC ensures very little or no lexical overlap between the questions created from one version and segments containing answers in the other version. Essentially, this is a paraphrase dataset, which should be very useful for training reading comprehension models. For example, the authors observed that state of the art neural reading comprehension models that achieved near human performance on the SQuAD dataset exhibited very poor performance on the DuoRC dataset ($\small F_1$ scores of 37.42% on DuoRC vs. 86% on SQuAD), opening research avenues in which DuoRC could complement other RC datasets in the exploration of novel neural approaches to language understanding.

DuoRC might be a useful dataset for training sentence embedding approaches to natural language tasks such as machine translation, document classification, sentiment analysis, etc. In this regard, note that the Conclusions section in Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition stated: “The main limitation of the current work is that there are insufficient paraphrase sentences for training the models. ”

arxiv1804.07927.png

[Image source. Click image to open in new window.]


In a very thorough and thoughtful analysis, Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (Aug 2018) [code] proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral difference between convolutional and recurrent neural networks, they generated adversarial examples to confuse the model and compare to human performance.

arxiv-1808.08744.png

[Image source. Click image to open in new window.]


arxiv-1808.08744b.png

[Image source. Click image to open in new window.]


Highlights from this work are [substantially] paraphrased here:

  • They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation function and formed majority-vote ensembles of the nine models with the highest validation accuracy.

  • All the hierarchical single and ensemble models outperformed the previous state of the art on both the validation and test sets. With a test accuracy of 85.12, the RNN-LSTM ensemble achieved a new state of the art that is more than five percentage points above the previous best result. Furthermore, the RNN-LSTM aggregation function is superior to aggregation via CNNs, improving the validation accuracy by 1.5 percentage points.

  • The hierarchical structure was crucial for the model’s success. Adding it to the CNN that operates only at word level caused a pronounced improvement on the validation set. It seems to be the case that the hierarchical structure helps the model to gain confidence, causing more models to make the correct prediction.

  • The sentence attention allowed them to get more insight into the models’ inner state. For example, it allowed them to check whether the model actually focused on relevant sentences in order to answer the questions. Both model variants [CNN; RNN-LSTM] paid most attention to the relevant plot sentences for 70% of the cases. Identifying the relevant sentences was an important success factor: relevant sentences were ranked highest only in 35% of the incorrectly solved questions.

  • Textual entailment was required to solve 60% of the questions …

  • The process of elimination and heuristics proved essential to solve 44% of the questions …

  • Referential knowledge was presumed in 36% of the questions …

  • Furthermore, it was apparent that many questions expected a combination of various reasoning skills.

  • In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

  • Finally, their intensive analysis of the differences between model and human inference suggests that both models seem to learn matching patterns to select the right answer rather than performing plausible inferences as humans do. The results of these studies also imply that other human-like processing mechanisms such as referential relations, implicit real-world knowledge (i.e., entailment), and answering by elimination via ranking plausibility (Hummel and Holyoak, 2005) should be integrated into the system to further advance machine reading comprehension.




Collectively, those publications indicate the difficulty in achieving robust reading comprehension, and the need to develop new models that understand language more precisely. Addressing this challenge will require employing more difficult datasets (like SQuAD2.0) for various tasks, evaluation metrics that can distinguish real intelligent behavior from shallow pattern matching, a better understanding of the response to adversarial attack, and the development of more sophisticated models that understand language at a deeper level.

  • The need for more challenging datasets was echoed in the “Creating harder datasets” subsection in Sebastian Ruder’s ACL 2018 Highlights summary.

    In order to evaluate under such settings, more challenging datasets need to be created. Yejin Choi argued during the RepL4NLP panel discussion (a summary can be found here) that the community pays a lot of attention to easier tasks such as SQuAD or bAbI, which are close to solved. Yoav Goldberg even went so far as to say that “SQuAD is the MNIST of NLP ”.

    Instead, we should focus on solving harder tasks and develop more datasets with increasing levels of difficulty. If a dataset is too hard, people don’t work on it. In particular, the community should not work on datasets for too long as datasets are getting solved very fast these days; creating novel and more challenging datasets is thus even more important. Two datasets that seek to go beyond SQuAD for reading comprehension were presented at the conference.

    Richard Socher also stressed the importance of training and evaluating a model across multiple tasks during his talk during the Machine Reading for Question Answering workshop. In particular, he argues that NLP requires many types of reasoning, e.g. logical, linguistic, emotional, etc., which cannot all be satisfied by a single task.

  • Read + Verify: Machine Reading Comprehension with Unanswerable Questions (Sep 2018) proposed a novel read-then-verify system that combined a base neural reader with a sentence-level answer verifier trained to (further) validate whether the predicted answer was entailed by the input snippets. They also augmented their base reader with two auxiliary losses to better handle answer extraction and no-answer detection respectively, and investigated three different architectures for the answer verifier. On the SQuAD2.0 dataset their system achieved an $\small F_1$ score of 74.8 on the development set (ca. August 2018), outperforming the previous best published model by more than 7 points, and the best reported model by ~3.5 points (2018-08-20: SQuAD2.0 Leaderboard).

    arxiv1808.05759a.png

    [Image source. Click image to open in new window.]


    arxiv1808.05759b.png

    [Image source. Click image to open in new window.]


In addition to SQuAD2.0 and DuoRC, other recent datasets related to question-answering and reasoning include:

  • Facebook AI Research’s bAbI, a set of 20 tasks for testing text understanding and reasoning described in detail in the paper by Jason Weston et al., Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks (Dec 2015).

  • University of Pennsylvania’s MultiRC: Reading Comprehension over Multiple Sentences;  [project (2018) code], a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. The goal of this dataset is to encourage the research community to explore approaches that can do more than sophisticated lexical-level matching.

  • Allen Institute for Artificial Intelligence (AI2)’s Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (Mar 2018) [project | code], which presented a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI (Stanford Natural Language Inference Corpus). As noted in their Conclusions:

    “Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods. To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, and whose Challenge partition is hard for retrieval and co-occurence methods. We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD. Progress on ARC would thus be an impressive achievement, given its design, and be significant step forward for the community.”

    • ARC was recently used in Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Scientific Question Answering (Oct 2018) by authors at UC San Diego and Microsoft AI Research. Existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. In this paper, the authors proposed a retriever-reader model that learned to attend (via self-attention layers) to essential terms during the question answering process: an essential-term-aware “retriever” first identified the most important words in a question, then reformulated the queries and searched for related evidence, and an enhanced “reader” distinguished between essential terms and distracting words to predict the answer. On the ARC dataset their model outperformed the existing state of the art [e.g., BiDAF] by 8.1%.

      arxiv1808.09492b.png

      [Image source. Click image to open in new window.]


      arxiv1808.09492a.png

      [Image source. Click image to open in new window.]


      arxiv1808.09492c.png

      [Image source. Click image to open in new window.]





Among the many approaches to QA applied to textual sources, the attentional LSTM-based, Bi-LSTM-based and memory-based implementations of Richard Socher (Salesforce) are particularly impressive:

  • Ask Me Anything: Dynamic Memory Networks for Natural Language Processing (Jun 2015; updated Mar 2016) by Richard Socher (MetaMind) introduced the Dynamic Memory Network (DMN), a neural network architecture that processed input sequences and questions, formed episodic memories, and generated relevant answers. Questions triggered an iterative attention process that allowed the model to condition its attention on the inputs and the result of previous iterations. These results were then reasoned over in a hierarchical recurrent sequence model to generate answers. [For a good overview of the DMN approach, see slides 39-47 in Neural Architectures with Memory.]

    arxiv-1506.07285.png

    [Image source. Click image to open in new window.]


  • Based on analysis of the DMN (above), in 2016 Richard Socher/MetaMind (later acquired by Salesforce) proposed several improvements to the DMN memory and input modules. Their DMN+ model (Dynamic Memory Networks for Visual and Textual Question Answering (Mar 2016) [discussion]) improved the state of the art on visual and text question answering datasets, without supporting fact supervision. Non-author DMN+ code available on GitHub includes Theano (Improved-Dynamic-Memory-Networks-DMN-plus) and TensorFlow (Dynamic-Memory-Networks-in-TensorFlow) implementations.

    arxiv-1603.01417a.png

    [Image source. Click image to open in new window.]


    arxiv-1603.01417b.png

    [Image source. Click image to open in new window.]


    arxiv-1603.01417c.png

    [Image source. Click image to open in new window.]


  • Later in 2016, Dynamic Coattention Networks for Question Answering (Nov 2016; updated Mar 2018) [non-author code, on SQuAD2.0] by Richard Socher and colleagues at SalesForce introduced the Dynamic Coattention Network (DCN) for QA. DCN first fused co-dependent representations of the question and the document in order to focus on relevant parts of both, then a dynamic pointing decoder iterated over potential answer spans. This iterative procedure enabled the model to recover from the initial local maxima that correspond to incorrect answers. On the Stanford question answering dataset, a single DCN model improved the previous state of the art from 71.0% $\small F_1$ to 75.9%, while a DCN ensemble obtained a 80.4% $\small F_1$ score.

    arxiv1611.01604a.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604b.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604c.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604d.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604e.png

    [Image source. Click image to open in new window.]


  • Efficient and Robust Question Answering from Minimal Context Over Documents (May 2018) studied the minimal context required to answer a question and found that most questions in existing datasets could be answered with a small set of sentences. The authors (Socher and colleagues) proposed a simple sentence selector to select the minimal set of sentences to feed into the QA model, which allowed the system to achieve significant reductions in training (up to 15 times) and inference times (up to 13 times), with accuracy comparable to or better than the state of the art on SQuAD, NewsQA, TriviaQA and SQuAD-Open. Furthermore, the approach was more robust to adversarial inputs. (A sketch of the TF-IDF sentence-selection baseline mentioned below appears after this list.)

    Note the sentence selector in Fig. 2(a):

    “For each QA model, we experiment with three types of inputs. First, we use the full document (FULL). Next, we give the model the oracle sentence containing the groundtruth answer span (ORACLE). Finally, we select sentences using our sentence selector (MINIMAL), using both $\small \text{Top k}$ and $\small \text{Dyn}$. We also compare this last method with TF-IDF method for sentence selection, which selects sentences using n-gram TF-IDF distance between each sentence and the question.”

    arxiv1805.08092a.png

    [Image source. Click image to open in new window.]


    arxiv1805.08092b.png

    [Image source. Click image to open in new window.]


    arxiv1805.08092c.png

    [Image source. Click image to open in new window.]


    arxiv1805.08092d.png

    [Image source. Click image to open in new window.]


In a significant body of work, The Natural Language Decathlon: Multitask Learning as Question Answering (Jun 2018) [code | project], Richard Socher and colleagues at Salesforce presented an NLP challenge spanning 10 tasks,

  • question answering
  • machine translation
  • summarization
  • natural language inference
  • sentiment analysis
  • semantic role labeling
  • zero-shot relation extraction
  • goal-oriented dialogue
  • semantic parsing
  • commonsense pronoun resolution

    arxiv1806.08730-f1.png

    [Image source. Click image to open in new window.]


… as well as a new Multitask Question Answering Network (MQAN) model [code here and here] that jointly learned all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. The MQAN model took in a question and context document, encoded both with a Bi-LSTM, used dual coattention to condition the representations of each sequence on the other, compressed all of this information with another two Bi-LSTMs, applied self-attention to capture long-distance dependencies, and then used a final two Bi-LSTMs to obtain representations of the question and context. The multi-pointer-generator decoder used attention over the question, context, and previously outputted tokens to decide whether to copy from the question, copy from the context, or generate the answer from a limited vocabulary. MQAN showed improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification.
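
The final step of a multi-pointer-generator decoder can be sketched as a learned mixture over three distributions – generate from the vocabulary, copy from the question, copy from the context – all defined over a shared extended vocabulary. The gate values below are placeholders; in the actual model the mixture weights are themselves predicted by the network.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_output(p_vocab, p_copy_question, p_copy_context, gate_logits):
    """Mix generate-from-vocabulary and copy-from-question / copy-from-context
    distributions with a 3-way gate (all three distributions are assumed to be
    defined over the same extended vocabulary)."""
    g_vocab, g_question, g_context = softmax(gate_logits)
    return g_vocab * p_vocab + g_question * p_copy_question + g_context * p_copy_context

rng = np.random.default_rng(0)
V = 12                                   # toy extended vocabulary size
dists = [softmax(rng.normal(size=V)) for _ in range(3)]
p = mixture_output(*dists, gate_logits=np.array([0.2, 1.0, -0.5]))
print(round(p.sum(), 6))                 # 1.0: the mixture is still a valid distribution
```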

arxiv1806.08730-fig2.png

[Image source. Click image to open in new window.]


arxiv1806.08730-fig3.png

[Image source. Click image to open in new window.]


arxiv1806.08730-t1.png

[Image source. Click image to open in new window.]


arxiv1806.08730-t2.png

[Image source. Click image to open in new window.]


Understandably, Socher’s work has generated much interest among NLP and ML practitioners, leading to the acquisition of his startup, MetaMind, by Salesforce for $32.8 million in 2016 (Salesforce Reveals It Spent $75 Million on the Three Startups It Bought Last Quarter | Salesforce just bought a machine learning startup that was backed by its CEO Marc Benioff). While those authors will not release the code (per a comment by Richard Socher on reddit), using the search term “1506.07285” there appear to be four repositories on GitHub that attempt to implement his Ask Me Anything: Dynamic Memory Networks for Natural Language Processing model, while a GitHub search for “dynamic memory networks” or “DMN+” returns numerous repositories.

The MemN2N architecture was introduced by Jason Weston (Facebook AI Research) and colleagues in the highly-cited End-To-End Memory Networks paper [code;  non-author code here, here and here;  discussion here and here]. MemN2N, a recurrent attention model over a possibly large external memory, was trained end-to-end and hence required significantly less supervision during training, making it more generally applicable in realistic settings. The flexibility of the MemN2N model allowed the authors to apply it to tasks as diverse as synthetic question answering (QA) and language modeling (LM). For QA the approach was competitive with memory networks but with less supervision; for LM their approach demonstrated performance comparable to RNN and LSTM on the Penn Treebank and Text8 datasets.
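
A single memory hop of MemN2N can be sketched as follows. The embedding matrices, bag-of-words featurization and dimensions are illustrative only; the paper stacks several such hops and adds refinements (e.g., position and temporal encodings) that are omitted here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def memn2n_hop(story_bows, query_bow, A, B, C, W):
    """One hop of an end-to-end memory network (sketch).
    story_bows: (n_sentences, vocab) bag-of-words rows, one per memory sentence.
    query_bow:  (vocab,) bag-of-words for the question.
    A, B, C:    (vocab, dim) input / question / output embedding matrices.
    W:          (dim, vocab) final answer projection."""
    m = story_bows @ A              # memory (input) representations
    c = story_bows @ C              # memory (output) representations
    u = query_bow @ B               # question representation
    p = softmax(m @ u)              # soft attention over memories
    o = p @ c                       # attention-weighted output memory
    return softmax((o + u) @ W)     # distribution over candidate answer words

rng = np.random.default_rng(0)
vocab, dim, n_sent = 30, 8, 5
A, B, C = (rng.normal(scale=0.1, size=(vocab, dim)) for _ in range(3))
W = rng.normal(scale=0.1, size=(dim, vocab))
story = rng.integers(0, 2, size=(n_sent, vocab)).astype(float)
query = rng.integers(0, 2, size=vocab).astype(float)
print(memn2n_hop(story, query, A, B, C, W).shape)   # (30,)
```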

While Weston’s MemN2N model was surpassed (accuracy and tasks completed) on the bAbI English 10k dataset by Socher’s DMN+ – see the “E2E” (End to End) and DMN+ columns in Table 2 in the DMN+ paper – code is available (links above) for the MemN2N model.

A precaution with high-performing but heavily engineered systems is domain specificity: How well do those models transfer to other applications? I encountered this issue in my preliminary work (not shown) where I carefully examined the Turku Event Extraction System [TEES 2.2: Biomedical Event Extraction for Diverse Corpora (2015)]. TEES performed well but was heavily engineered to perform well in the various BioNLP Challenge tasks in which it participated. Likewise, a June 2018 comment in the AllenNLP GitHub repository, regarding end-to-end memory networks, is of interest:

  • “Why are you guys not using *Dynamic Memory Networks in any of your QA solutions?*

    I’m not a huge fan of the models called “memory networks” – in general they are too tuned to a completely artificial task, and they don’t work well on real data. I implemented the end-to-end memory network, for instance, and it has three separate embedding layers (which is absolutely absurd if you want to apply it to real data).

    @DeNeutoy implemented the DMN+. It’s not as egregious as the E2EMN [end-to-end memory network], but still, I’d look at actual papers, not blogs, when deciding what methods actually work. E.g., are there any memory networks on the SQuAD Leaderboard (https://rajpurkar.github.io/SQuAD-explorer/)? On the TriviaQA leaderboard? On the leaderboard of any recent, popular dataset?

    To be fair, more recent “memory networks” have modified their architectures so they’re a lot more similar to things like the gated attention reader, which has actually performed well on real data. But, it sure seems like no one is using them to accomplish state of the art QA on real data these days.”

I believe that the “gated attention reader” mentioned in that comment (above) refers to Gated-Attention Readers for Text Comprehension (Jun 2016; updated Apr 2017) by Ruslan Salakhutdinov and colleagues.



Retrieve-and-Read: Multi-task Learning of Information Retrieval and Reading Comprehension (Aug 2018) presented an interesting approach, “machine reading at scale” (MRS) wherein, given a question, a system retrieves passages relevant to the question from a corpus (IR: information retrieval) and then extracts the answer span from the retrieved passages (RC: reading comprehension). They proposed an approach that incorporated the IR and RC tasks using supervised multi-task learning, so that the IR component could be trained by considering answer spans. Their model directly minimized the joint loss of IR and RC so that the IR component, which shares hidden layers with the RC component, could also be trained with correct answer spans. In experiments on answering SQuAD questions using Wikipedia as the knowledge source, their model achieved state of the art performance [on par with BiDAF ].

arxiv1808.10628a.png

[Image source. Click image to open in new window.]


arxiv1808.10628b.png

[Image source. Click image to open in new window.]


  • “Our Retrieve-and-Read model is based on the bi-directional attention flow (BiDAF ) model, which is a standard RC model. As shown in Figure 2 [above] it consists of six layers: … We note that the RC component trained with single-task learning is essentially equivalent to BiDAF, except for the word embedding layer that has been modified to improve accuracy. … Note that the original BiDAF uses a pre-trained GloVe and also trains character-level embeddings by using a CNN in order to handle out-of-vocabulary (OOV) or rare words. Instead of using GloVe and CNN, our model uses fastText for the fixed pre-trained word vectors and removes character-level embeddings. The fastText model takes into account subword information and can obtain valid representations even for OOV words.”

In 2016 the Allen Institute for Artificial Intelligence introduced the Bi-Directional Attention Flow (BiDAF) framework (Bidirectional Attention Flow for Machine Comprehension (Nov 2016; updated Jun 2018) [project | code | demo]). BiDAF was a multi-stage hierarchical process that represented context at different levels of granularity and used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization.
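
The core of the attention-flow layer can be sketched roughly as follows: a similarity matrix between context and query tokens drives both context-to-query and query-to-context attention, and the results are concatenated into query-aware context vectors. In the paper the similarity function is a trainable function of the two vectors and their element-wise product; a plain dot product stands in for it here, so this is an illustrative simplification rather than the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U):
    """Bi-directional attention flow (sketch).
    H: (T, d) context token encodings; U: (J, d) query token encodings.
    Returns query-aware context vectors of shape (T, 4d)."""
    S = H @ U.T                           # (T, J) similarity matrix (dot product as a stand-in)
    a = softmax(S, axis=1)                # context-to-query attention, per context token
    U_tilde = a @ U                       # (T, d) attended query vectors
    b = softmax(S.max(axis=1))            # query-to-context attention over context tokens
    h_tilde = b @ H                       # (d,) single attended context vector
    H_tilde = np.tile(h_tilde, (H.shape[0], 1))
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)

rng = np.random.default_rng(0)
H, U = rng.normal(size=(9, 6)), rng.normal(size=(4, 6))
print(bidaf_attention(H, U).shape)        # (9, 24)
```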

BiDAF was subsequently used in QA4IE : A Question Answering based Framework for Information Extraction (Apr 2018) [note Table 7; code], a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, along with a knowledge base (Wikipedia Ontology) for entity recognition. QA4IE processed entire documents as a whole, rather than separately processing individual sentences. Because QA4IE was designed to produce sequence answers in IE settings, QA4IE was outperformed by BiDAF on the SQuAD dataset (Table 3 in QA4IE). Conversely, QA4IE outperformed QA systems – including BiDAF – across 6 datasets in IE settings (Table 4 in QA4IE).

BiDAF:

arxiv-1611.01603.png

[BiDAF. Image source. Click image to open in new window.]


QA4IE:

arxiv-1804.03396a.png

[QA4IE. Image source. Click image to open in new window.]


arxiv-1804.03396b.png

[QA4IE. Image source. Click image to open in new window.]


arxiv-1804.03396c.png

[QA4IE. Image source. Click image to open in new window.]


  • A major difference between question answering (QA) settings and information extraction settings is that in QA settings each query corresponds to an answer, while in the QA4IE framework the QA model takes a candidate entity-relation (or entity-property) pair as the query and it needs to tell whether an answer to the query can be found in the input text.

In other work relating to Bi-LSTM-based question answering, IBM Research and IBM Watson published a paper, Improved Neural Relation Detection for Knowledge Base Question Answering (May 2017), which focused on relation detection via deep residual Bi-LSTM networks to compare questions and relation names. The approach broke the relation names into word sequences for question-relation matching, built both relation-level and word-level relation representations, used deep Bi-LSTMs to learn different levels of question representations in order to match the different levels of relation information, and finally used a residual learning method for sequence matching. This made the model easier to train and resulted in more abstract (deeper) question representations, thus improving hierarchical matching. Several non-author implementations are available on GitHub (machine-comprehension; machine-reading-comprehension; and most recently, MSMARCO).

arxiv1704.06194-fig1.png

[Image source. Click image to open in new window.]


arxiv1704.06194-fig2.png

[Image source. Click image to open in new window.]


Making Neural QA as Simple as Possible but Not Simpler (Mar 2017; updated Jun 2017) introduced FastQA, a simple, context/type matching heuristic for extractive question answering. The paper posited that two simple ingredients are necessary for building a competitive QA system: (i) awareness of the question words while processing the context, and (ii) a composition function (such as recurrent neural networks) which goes beyond simple bag-of-words modeling. In follow-on work, these authors applied FastQA to the biomedical domain (Neural Domain Adaptation for Biomedical Question Answering;  [code]). Their system – which did not rely on domain-specific ontologies, parsers or entity taggers – achieved state of the art results on factoid questions, and competitive results on list questions.
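
The first ingredient – awareness of the question words while processing the context – is realized in FastQA as word-in-question features appended to each context token embedding. A minimal sketch of a binary version follows (the paper also describes a weighted, similarity-based variant); the example sentence and question are my own.

```python
def word_in_question_features(context_tokens, question_tokens):
    """Binary word-in-question feature per context token: 1.0 if the (lowercased)
    token also appears in the question, else 0.0."""
    question_set = {q.lower() for q in question_tokens}
    return [1.0 if tok.lower() in question_set else 0.0 for tok in context_tokens]

context = "The river Rhine flows through Basel".split()
question = "Which river flows through Basel ?".split()
print(word_in_question_features(context, question))
# [0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
```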

arxiv1703.04816-fig2.png

[Image source. Click image to open in new window.]


arxiv1703.04816-fig1.png

[Image source. Click image to open in new window.]


A recent review, Comparative Analysis of Neural QA Models on SQuAD (Jun 2018), reported that models based on a gated attention mechanism (R-Net) or a GRU (DocQA) performed well across a variety of tasks.

Microsoft Research recently published S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension (Jun 2017; updated Jan 2018), a novel approach to machine reading comprehension for the MS-MARCO dataset that answered a question from multiple passages via an extraction-then-synthesis framework (synthesizing answers from extraction results). Unlike the SQuAD dataset, which answers a question with exact text spans in a passage, the MS-MARCO dataset defines the task as answering a question from multiple passages, where the words in the answer are not necessarily present in the passages. The Microsoft Research approach employed bidirectional gated recurrent units (BiGRU) rather than Bi-LSTM recurrent layers. The answer extraction model was first employed to predict the most important sub-spans from the passage as evidence, which the answer synthesis model took as additional features along with the question and passage to further elaborate the final answers. They built the answer extraction model for single-passage reading comprehension, and proposed an additional task of ranking the single passages to help in answer extraction from multiple passages.

arxiv1706.04815-fig1.png

[Image source. Click image to open in new window.]


arxiv1706.04815-fig2.png

[Image source. Click image to open in new window.]


arxiv1706.04815-fig3.png

[Image source. Click image to open in new window.]


arxiv1706.04815-tables2+3.png

[Image source. Click image to open in new window.]


Likewise (regarding evidence-based answering), textual entailment with neural attention methods could also be applied; for example, as described in DeepMind’s Reasoning about Entailment with Neural Attention.

Robust and Scalable Differentiable Neural Computer for Question Answering (Jul 2018) [code] was designed as a general problem solver which could be used in a wide range of tasks. Their GitHub repository contains an implementation of an Advanced Differentiable Neural Computer (ADNC) for more robust and scalable usage in question answering (differentiable neural computers are discussed elsewhere in this REVIEW). The ADNC was applied to the 20 bAbI QA tasks, with state of the art mean results, and to the CNN Reading Comprehension Task with passable results without any adaptation or hyperparameter tuning. Coauthor Jörg Franke’s Master’s Thesis contains additional detail.

arxiv-1807.02658c.png

[Image source. Click image to open in new window.]


In March 2018 Studio Ousia published a question answering model, Studio Ousia’s Quiz Bowl Question Answering System  [slides; media]. The embedding approach described in that paper was very impressive, with the ability to “reason” over passages such as the one shown in Table 1 [presented in the summary images, below]. Built on their Wikipedia2Vec pretrained word embeddings (trained on Wikipedia), this model very convincingly won the Human-Computer Question Answering Competition (HCQA) at NIPS 2017, scoring more than double the combined human team score (465 to 200 points). Because Studio Ousia is a commercial entity, there was no code release.

arxiv1803.08652-table1.png

[Image source. Click image to open in new window.]


arxiv1803.08652-fig1+2.png

[Image source. Click image to open in new window.]


arxiv1803.08652-fig3.png

[Image source. Final: Human: 200 : Computer: 465 points. Click image to open in new window.]


In June 2018 Studio Ousia and colleagues at the Nara Institute of Science and Technology, RIKEN AIP, and Keio University published Representation Learning of Entities and Documents from Knowledge Base Descriptions  [code], which described TextEnt, a neural network model that learned distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, they trained their model to predict the entity that the document described, and map the document and its target entity close to each other in a continuous vector space. Their model, which was trained using a large number of documents extracted from Wikipedia, was evaluated using two tasks: (i) fine-grained entity typing, and (ii) multiclass text classification. The results demonstrated that their model achieved state of-the-art performance on both tasks.

arxiv1806.02960-fig1.png

[Image source. Click image to open in new window.]


arxiv1806.02960-fig2.png

[Image source. Click image to open in new window.]


arxiv1806.02960-table5.png

[Image source. Click image to open in new window.]


Based on the model architectures (above/below), it appears that the Studio Ousia Quiz Bowl system builds on their TextEnt work.

StudioOusia2018.png

[Image source; Image source 2. Click image to open in new window.]


In Question Answering and Reading Comprehension (Sep 2018) [code], the authors (Tencent AI Lab) posited that there are three modalities in the reading comprehension setting: question, answer and context. The task of question answering or question generation aims to infer an answer or a question when given the counterpart based on context. They presented a novel two-way neural sequence transduction model that connected the three modalities, allowing it to learn two tasks simultaneously that mutually benefitted one another. Their Dual Ask-Answer Network (DAANet) model architecture comprised a hierarchical process involving a neural sequence transduction model that received string sequences as input and processed them through four layers: embedding, encoding, attention and output. During training, the model received question-context-answer triplets as input and captured the cross-modal interactions via a hierarchical attention process. Unlike previous joint learning paradigms that leveraged the duality of question generation and question answering tasks at the data level, they addressed that duality at the architecture level by mirroring the network structure, and partially sharing components at the different layers. This enabled the knowledge to be transferred from one task to another, helping the model find a general representation for each modality. Evaluation on four datasets showed that their dual-learning model outperformed their mono-learning counterparts – as well as state of the art joint baseline models – on both question answering and question generation tasks.

arxiv1809.01997-fig1.png

[Image source. Click image to open in new window.]


arxiv1809.01997-fig2.png

[Image source. Click image to open in new window.]


arxiv1809.01997-fig5.png

[Image source. Click image to open in new window.]


arxiv1809.01997-table3.png

[Image source. Click image to open in new window.]


arxiv1809.01997-table4.png

[Image source. Click image to open in new window.]


Interpreting phrases such as “Who did what to whom?” is a major focus in natural language understanding, specifically in semantic role labeling. I Know What You Want: Semantic Learning for Text Comprehension (Sep 2018) employed semantic role labeling to enhance text comprehension and natural language inference by specifying verbal arguments and their corresponding semantic roles. Embeddings were enhanced by semantic role labels, giving more fine-grained semantics: the salient labels could be conveniently added to existing models, significantly improving deep learning models in challenging text comprehension tasks. This work showed the effectiveness of semantic role labeling in text comprehension and natural language inference, and proposed an easy and feasible scheme to integrate semantic role labeling information in neural models. Experiments on benchmark machine reading comprehension and inference datasets verified that the proposed semantic learning helped their system attain a significant improvement over state of the art baseline models. [“We will make our code and source publicly available soon.” Not available, 2018-10-16.]

arxiv1809.02794-fig1.png

[Image source. Click image to open in new window.]


arxiv1809.02794-fig2.png

[Image source. Click image to open in new window.]


arxiv1809.02794-fig3.png

[Image source. Click image to open in new window.]


ELMo word embeddings were employed. The dimension of the semantic role label embedding was a critical hyperparameter that influenced performance: too high a dimension caused severe overfitting, while too low a dimension caused underfitting; a 5-dimensional semantic role label embedding gave the best performance on both the SNLI and SQuAD datasets.
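
A minimal sketch of the integration scheme described above: a small trainable semantic-role-label embedding is concatenated with a contextual word embedding such as ELMo. Only the idea of a 5-dimensional label embedding comes from the paper; the class name, label inventory size, and word-vector dimension are illustrative.

```python
import torch
import torch.nn as nn

class SRLAugmentedEmbedding(nn.Module):
    """Concatenate a small, trainable semantic-role-label embedding with a
    pre-computed contextual word embedding (e.g. ELMo).  Only the 5-dim
    label embedding comes from the paper; everything else is illustrative."""
    def __init__(self, num_labels, label_dim=5, word_dim=1024):
        super().__init__()
        self.label_embed = nn.Embedding(num_labels, label_dim)
        self.out_dim = word_dim + label_dim

    def forward(self, word_vecs, label_ids):
        # word_vecs: (batch, seq_len, word_dim)  e.g. ELMo outputs
        # label_ids: (batch, seq_len)            predicted SRL tags per token
        return torch.cat([word_vecs, self.label_embed(label_ids)], dim=-1)

# Toy usage with random "ELMo" vectors and a small tag set (ARG0, ARG1, V, O, ...)
layer = SRLAugmentedEmbedding(num_labels=10)
words = torch.randn(2, 7, 1024)
tags = torch.randint(0, 10, (2, 7))
print(layer(words, tags).shape)   # torch.Size([2, 7, 1029])
```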

As employed in I Know What You Want: Semantic Learning for Text Comprehension, ELMo embeddings were also used by Ouchi et al. in A Span Selection Model for Semantic Role Labeling (Oct 2018) [code], a Bi-LSTM based, span-based model (as opposed to the more common IOB/BIO tagging approach). In the span-based setting, models first identify candidate argument spans (argument identification) and then classify each span into one of the semantic role labels (argument classification). In related recent work (Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling), He et al. (2018) also proposed a span-based SRL model similar to Ouchi et al.'s span selection model. While He et al. likewise used a Bi-LSTM to induce span representations in an end-to-end fashion, a main difference is the direction of the prediction: He et al. modeled $\small P(r | i,j)$, whereas Ouchi et al. modeled $\small P(i,j | r)$. In other words, He et al.'s model selected an appropriate label for each span (label selection), whereas Ouchi et al.'s model selected appropriate spans for each label (span selection).
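
The difference between the two factorizations can be illustrated with a toy scoring matrix: a label-selection model normalizes scores over labels for each span, whereas a span-selection model normalizes over candidate spans for each label. This is only an illustration of $\small P(r | i,j)$ versus $\small P(i,j | r)$, not either paper's architecture; the score matrix here is random.

```python
import torch
import torch.nn.functional as F

# scores[s, r]: compatibility of candidate span s = (i, j) with role label r,
# e.g. produced from Bi-LSTM span representations.  Values here are random.
num_spans, num_labels = 6, 4
scores = torch.randn(num_spans, num_labels)

# Label selection (He et al. 2018): for each span, a distribution over labels.
p_label_given_span = F.softmax(scores, dim=1)   # rows sum to 1:  P(r | i, j)

# Span selection (Ouchi et al. 2018): for each label, a distribution over spans.
p_span_given_label = F.softmax(scores, dim=0)   # columns sum to 1:  P(i, j | r)

print(p_label_given_span.sum(dim=1))  # all ones
print(p_span_given_label.sum(dim=0))  # all ones
```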

Ouchi et al.:

arxiv-1810.02245.png

[Image source. Click image to open in new window.]


He et al.:

arxiv-1805.04787.png

[Image source. Click image to open in new window.]


Model/results – Ouchi et al.:

arxiv1810.02245-fig1.png

[Image source. Click image to open in new window.]


arxiv1810.02245-table3.png

[Image source. Click image to open in new window.]


Model/results – He et al.:

arxiv1805.04787-fig2+3.png

[Image source. Click image to open in new window.]


arxiv1805.04787-fig4+tables1+2.png

[Image source. Click image to open in new window.]


Question Answering by Reasoning Across Documents with Graph Convolutional Networks (Aug 2018) introduced Entity-GCN, a method that reasons over information spread within and across documents by framing question answering as an inference problem on a graph representing the document collection. This approach differs from BiDAF and FastQA, which simply concatenate all documents into a single long text and train a standard reading comprehension model.
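
A minimal sketch of one graph-convolution step over entity-mention nodes, in the spirit of (but much simpler than) the Entity-GCN relational update; the edge types, gating, and query-conditioned node features used in the paper are omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One propagation step: each mention node aggregates the (transformed)
    states of its neighbours in the document graph and adds a self loop."""
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)
        self.w_neigh = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h:   (num_nodes, dim)        node states, e.g. contextualized mention spans
        # adj: (num_nodes, num_nodes)  0/1 adjacency (match / coref / same-doc edges)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # avoid divide-by-zero
        neigh = adj @ self.w_neigh(h) / deg               # mean over neighbours
        return torch.relu(self.w_self(h) + neigh)

h = torch.randn(5, 64)                 # 5 entity mentions across documents
adj = (torch.rand(5, 5) > 0.5).float()
layer = SimpleGCNLayer(64)
print(layer(h, adj).shape)             # torch.Size([5, 64])
```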

Machine reading comprehension with unanswerable questions is a new, challenging task for natural language processing. A key subtask is to reliably predict whether the question is unanswerable. U-Net: Machine Reading Comprehension with Unanswerable Questions (Oct 2018) proposed a unified model (U-Net) with three important components: answer pointer, no-answer pointer, and answer verifier. They introduced a universal node and thus processed the question and its context passage as a single contiguous sequence of tokens. The universal node encoded the fused information from both the question and passage, and played an important role in predicting whether the question was answerable. Unlike other state of the art pipeline models, U-Net could be trained in an end-to-end fashion. Experimental results on the SQuAD2.0 dataset showed that U-Net could effectively predict the unanswerability of questions, achieving an $\small F_1$ score of 71.7 on SQuAD2.0.
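
A bare sketch of the universal-node idea: the question and passage are encoded as one contiguous sequence with an extra learned node inserted between them, and that node's hidden state feeds the answerability prediction. The full U-Net answer pointer and answer verifier are not shown; dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class UniversalNodeEncoder(nn.Module):
    """Encode [question ; universal node ; passage] as one sequence and use
    the universal node's hidden state to score answerability (sketch only)."""
    def __init__(self, dim=128):
        super().__init__()
        self.universal = nn.Parameter(torch.randn(1, 1, dim))
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.answerable = nn.Linear(2 * dim, 1)

    def forward(self, q_emb, p_emb):
        # q_emb: (batch, q_len, dim), p_emb: (batch, p_len, dim)
        u = self.universal.expand(q_emb.size(0), -1, -1)
        seq = torch.cat([q_emb, u, p_emb], dim=1)
        enc, _ = self.encoder(seq)
        u_state = enc[:, q_emb.size(1), :]               # the universal node's state
        return torch.sigmoid(self.answerable(u_state))   # P(question is answerable)

model = UniversalNodeEncoder()
print(model(torch.randn(2, 12, 128), torch.randn(2, 80, 128)).shape)  # (2, 1)
```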

arxiv1810.06638-t1.png

[Image source. Click image to open in new window.]


arxiv1810.06638-f1.png

[Image source. Click image to open in new window.]


arxiv1810.06638-t2.png

[Image source. Click image to open in new window.]


“Our model achieves an $\small F_1$ score of 74.0 and an EM score of 70.3 on the development set, and an $\small F_1$ score of 72.6 and an EM score of 69.2 on Test set 1, as shown in Table 2. Our model outperforms most of the previous approaches. Comparing to the best-performing systems, our model has a simple architecture and is an end-to-end model. In fact, among all the end-to-end models, we achieve the best $\small F_1$ scores. We believe that the performance of the U-Net can be boosted with an additional post-processing step to verify answers using approaches such as (Hu et al. 2018).”

Text embeddings representing natural language documents in a semantic vector space can be used for document retrieval via nearest neighbor lookup. Text Embeddings for Retrieval From a Large Knowledge Base (Oct 2018; Christian Szegedy at Google Inc. and authors at the University of Arkansas) studied the feasibility of neural models specialized for retrieval in a semantically meaningful way, suggesting the use of SQuAD in an open-domain question answering context where the first task is to find paragraphs useful for answering a given question. They compared the quality of various text-embedding methods for retrieval, giving empirical comparisons of various non-augmented base embeddings with and without IDF weighting. Training deep residual neural models specifically for retrieval purposes yielded significant gains when used to augment existing embeddings, and they established that deeper models were better suited to this task. The best baseline embeddings, augmented by their learned neural approach, improved the system's top-1 paragraph recall by 14%.
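
The baseline retrieval setup the paper starts from (embed each paragraph as an IDF-weighted average of word vectors, then retrieve nearest neighbors by cosine similarity) can be sketched as follows; the learned residual re-embedding network that provides the reported gains is not included, and the toy embedding table is a placeholder.

```python
import numpy as np
from collections import Counter

def idf_weights(corpus_tokens):
    """Inverse document frequency over a list of tokenized paragraphs."""
    n = len(corpus_tokens)
    df = Counter(tok for doc in corpus_tokens for tok in set(doc))
    return {tok: np.log(n / df[tok]) for tok in df}

def embed_doc(tokens, embed, idf):
    """IDF-weighted average of word vectors (a non-augmented baseline)."""
    vecs = np.stack([idf.get(t, 0.0) * embed(t) for t in tokens])
    v = vecs.sum(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve(question_tokens, paragraph_matrix, embed, idf, top_k=1):
    q = embed_doc(question_tokens, embed, idf)
    sims = paragraph_matrix @ q            # cosine similarity (rows are unit norm)
    return np.argsort(-sims)[:top_k]

# Toy usage with random word vectors standing in for a real embedding table.
rng = np.random.default_rng(0)
table = {}
embed = lambda t: table.setdefault(t, rng.standard_normal(50))

paras = [p.split() for p in ["the cat sat on the mat",
                             "squad is a reading comprehension dataset",
                             "paris is the capital of france"]]
idf = idf_weights(paras)
P = np.stack([embed_doc(p, embed, idf) for p in paras])
print(retrieve("which city is the capital of france".split(), P, embed, idf))
```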

arxiv1810.10176-f1.png

[Image source. Click image to open in new window.]


arxiv1810.10176-f2.png

[Image source. Click image to open in new window.]


arxiv1810.10176-t9+t10.png

[Image source. Click image to open in new window.]


Improving Machine Reading Comprehension with General Reading Strategies (Oct 2018) proposed three simple domain-independent strategies aimed to improve non-extractive machine reading comprehension (MRC):

  • BACK AND FORTH READING, which considers both the original and reverse order of an input sequence,
  • HIGHLIGHTING, which adds a trainable embedding to the text embeddings of tokens that are relevant to the question and candidate answers (see the sketch after this list), and
  • SELF-ASSESSMENT, which generates practice questions and candidate answers directly from the text in an unsupervised manner.
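
A minimal sketch of the highlighting strategy referenced above: tokens flagged as relevant to the question or candidate answers receive an extra trainable "highlight" embedding that is added to their text embeddings. Everything beyond that idea (names, dimensions, how the mask is computed) is illustrative.

```python
import torch
import torch.nn as nn

class HighlightEmbedding(nn.Module):
    """Add a trainable 'highlight' vector to tokens flagged as relevant
    (e.g. tokens shared with the question or a candidate answer)."""
    def __init__(self, dim):
        super().__init__()
        # index 0 = not highlighted, index 1 = highlighted
        self.flag_embed = nn.Embedding(2, dim)

    def forward(self, token_embeddings, highlight_mask):
        # token_embeddings: (batch, seq_len, dim)
        # highlight_mask:   (batch, seq_len) of 0/1 flags
        return token_embeddings + self.flag_embed(highlight_mask)

layer = HighlightEmbedding(dim=768)
tokens = torch.randn(1, 6, 768)
mask = torch.tensor([[0, 1, 1, 0, 0, 1]])
print(layer(tokens, mask).shape)   # torch.Size([1, 6, 768])
```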

“By fine-tuning a pre-trained language model (Radford et al., 2018) [OpenAI’s Finetuned Transformer LM] with our proposed strategies on the largest existing general domain multiple-choice MRC dataset RACE, we obtain a 5.8% absolute increase in accuracy over the previous best result achieved by the same pre-trained model fine-tuned on RACE without the use of strategies. We further fine-tune the resulting model on a target task, leading to new state-of-the-art results on six representative non-extractive MRC datasets from different domains (i.e., ARC, OpenBookQA, MCTest, MultiRC, SemEval-2018, and ROCStories). These results indicate the effectiveness of the proposed strategies and the versatility and general applicability of our fine-tuned models that incorporate these strategies.”

arxiv1810.13441-f1.png

[Image source. Click image to open in new window.]


arxiv1810.13441-t2.png

[Image source. Click image to open in new window.]


Machine Reading Comprehension (MRC) with multiple-choice questions requires the machine to read a given passage and select the correct answer among several candidates. Convolutional Spatial Attention Model for Reading Comprehension with Multiple-Choice Questions (Nov 2018) proposed a novel approach called Convolutional Spatial Attention (CSA) which could better handle MRC with multiple-choice questions. The proposed model could fully extract the mutual information among the passage, the question, and the candidates to form enriched representations. Furthermore, to merge various attention results, they proposed to use convolutional operations to dynamically summarize the attention values within regions of different sizes. Experimental results showed that the proposed model could give substantial improvements over various state of the art systems on both the RACE and SemEval-2018 Task11 datasets.

arxiv1811.08610-f2.png

[Image source. Click image to open in new window.]


arxiv1811.08610-f1+f4+t1+t3.png

[Image source. Click image to open in new window.]


A Deep Cascade Model for Multi-Document Reading Comprehension (Nov 2018) developed a novel deep cascade learning model, which progressively evolved from document-level and paragraph-level ranking of candidate texts to more precise answer extraction with machine reading comprehension. Irrelevant documents and paragraphs were first filtered out with simple functions for efficiency. They then jointly trained three modules on the remaining texts to better locate the answer: document extraction, paragraph extraction and answer extraction. Experimental results showed that the proposed method outperformed the previous state of the art methods on two large-scale multi-document benchmark datasets: TriviaQA and DuReader. Their online system could stably serve millions of daily requests in less than 50ms.

arxiv1811.11374-f1+t1+t2.png

[Image source. Click image to open in new window.]


arxiv1811.11374-f2.png

[Image source. Click image to open in new window.]


Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering (Nov 2018) described a novel hierarchical attention network for reading comprehension style question answering, which aimed to answer questions for a given narrative paragraph. In the proposed method, attention and fusion were conducted horizontally and vertically across layers at different levels of granularity between question and paragraph. It first encoded the question and paragraph with fine-grained language embeddings, to better capture the respective representations at the semantic level. It proposed a multi-granularity fusion approach to fully fuse information from both global and attended representations. Finally, it introduced a hierarchical attention network to focus on the answer span progressively, with multi-level soft alignment. Extensive experiments on the large-scale SQuAD and TriviaQA datasets validated the effectiveness of the proposed method, which achieved a high rank on the SQuAD leaderboard for both single and ensemble models while also achieving state of the art results on the TriviaQA, AddSent and AddOneSent datasets.

arxiv1811.11934-f1+t1-thru-t6.png

[Image source. Click image to open in new window.]


Question Answering and Reading Comprehension:

Additional Reading

  • Textbook Question Answering with Knowledge Graph Understanding and Unsupervised Open-set Text Comprehension (Nov 2018)

    “In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task which describes more realistic QA problems compared to other recent tasks. We mainly focus on two related issues with analysis of TQA dataset. First, it requires to comprehend long lessons to extract knowledge. To tackle this issue of extracting knowledge features from long lessons, we establish knowledge graph from texts and incorporate graph convolutional network (GCN). Second, scientific terms are not spread over the chapters and data splits in TQA dataset. To overcome this so called `out-of-domain’ issue, we add novel unsupervised text learning process without any annotations before learning QA problems. The experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both methods of incorporating GCN for extracting knowledge from long lessons and our newly proposed unsupervised learning process are meaningful to solve this problem.”

    arxiv1811.00232-f1+f4.png

    [Image source. Click image to open in new window.]


    arxiv1811.00232-f3.png

    [Image source. Click image to open in new window.]


    arxiv1811.00232-t1.png

    [Image source. Click image to open in new window.]


    arxiv1811.00232-f5.png

    [Image source. Click image to open in new window.]
  • Exploiting Sentence Embedding for Medical Question Answering (Nov 2018)

    “Despite the great success of word embedding, sentence embedding remains a not-well-solved problem. In this paper, we present a supervised learning framework to exploit sentence embedding for the medical question answering task. The learning framework consists of two main parts: (1) a sentence embedding producing module, and (2) a scoring module. The former is developed with contextual self-attention and multi-scale techniques to encode a sentence into an embedding tensor. This module is shortly called Contextual self-Attention Multi-scale Sentence Embedding (CAMSE). The latter employs two scoring strategies: Semantic Matching Scoring (SMS) and Semantic Association Scoring (SAS). SMS measures similarity while SAS captures association between sentence pairs: a medical question concatenated with a candidate choice, and a piece of corresponding supportive evidence. The proposed framework is examined by two Medical Question Answering (MedicalQA) datasets which are collected from real-world applications: medical exam and clinical diagnosis based on electronic medical records (EMR). The comparison results show that our proposed framework achieved significant improvements compared to competitive baseline approaches. Additionally, a series of controlled experiments are also conducted to illustrate that the multi-scale strategy and the contextual self-attention layer play important roles for producing effective sentence embedding, and the two kinds of scoring strategies are highly complementary to each other for question answering problems.”

  • Multi-Task Learning with Multi-View Attention for Answer Selection and Knowledge Base Question Answering (Dec 2018)


[Table of Contents]

Probing the Nature (Transparency) of Reasoning Architectures

Compositional Attention Networks for Machine Reasoning (Apr 2018) [code] by Drew Hudson and Christopher Manning presented the MAC network, a novel fully differentiable neural network architecture designed to facilitate explicit and expressive reasoning. MAC moved away from monolithic black-box neural architectures toward a design that encouraged both transparency and versatility. The model approached problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintained a separation between control and memory. By stringing the cells together and imposing structural constraints that regulated their interaction, MAC effectively learned to perform iterative reasoning processes that were directly inferred from the data in an end-to-end approach. They demonstrated the model’s strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, they showed that the model was computationally efficient and data efficient, in particular requiring 5x less data than existing models to achieve strong results.
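
A heavily simplified, illustrative MAC-style cell is sketched below: a control state attends over the question to decide what the current reasoning step is about, a read step attends over the knowledge (context) conditioned on control and memory, and a write step folds the retrieved information into memory; chaining the cell gives the iterative reasoning process described above. The real MAC cell includes additional projections, gating, and self-attention over previous steps that are omitted here, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMACCell(nn.Module):
    """Heavily simplified MAC-style cell with separate control and memory."""
    def __init__(self, dim):
        super().__init__()
        self.ctrl_attn = nn.Linear(dim, 1)
        self.ctrl_proj = nn.Linear(2 * dim, dim)
        self.read_attn = nn.Linear(dim, 1)
        self.read_proj = nn.Linear(2 * dim, dim)
        self.write = nn.Linear(2 * dim, dim)

    def forward(self, control, memory, question_words, knowledge):
        # question_words: (batch, q_len, dim); knowledge: (batch, k_len, dim)
        # --- control unit: decide what the current reasoning step is about ---
        cq = self.ctrl_proj(torch.cat([control, memory], dim=-1)).unsqueeze(1)
        c_scores = self.ctrl_attn(cq * question_words).squeeze(-1)
        control = (F.softmax(c_scores, dim=1).unsqueeze(-1) * question_words).sum(1)
        # --- read unit: retrieve knowledge relevant to that step ---
        rq = self.read_proj(torch.cat([control, memory], dim=-1)).unsqueeze(1)
        r_scores = self.read_attn(rq * knowledge).squeeze(-1)
        retrieved = (F.softmax(r_scores, dim=1).unsqueeze(-1) * knowledge).sum(1)
        # --- write unit: fold the retrieved information into memory ---
        memory = self.write(torch.cat([memory, retrieved], dim=-1))
        return control, memory

cell = TinyMACCell(dim=64)
control, memory = torch.zeros(2, 64), torch.zeros(2, 64)
q, kb = torch.randn(2, 10, 64), torch.randn(2, 30, 64)
for _ in range(4):                      # four chained reasoning steps
    control, memory = cell(control, memory, q, kb)
print(memory.shape)                     # torch.Size([2, 64])
```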

arxiv1803.03067-f1.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f2.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f3.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f4.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f5.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f6.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f7.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f8.png

[Image source. Click image to open in new window.]


arxiv1803.03067-t1.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f11.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f13.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f14.png

[Image source. Click image to open in new window.]


Manning’s MAC Net is a compositional attention network designed for visual question answering (VQA). In a very similar approach [Compositional Attention Networks for Interpretability in Natural Language Question Answering (Oct 2018)], Saama AI Research (India) proposed a modified MAC Net architecture for natural language question answering. Question Answering typically requires language understanding and multistep reasoning. MAC Net’s unique architecture – the separation between memory and control – facilitated data-driven iterative reasoning, making it an ideal candidate for solving tasks that involve logical reasoning. Experiments with the 20 bAbI tasks demonstrated the value of MAC Net as a data efficient and interpretable architecture for natural language question answering. The transparent nature of MAC Net provided a highly granular view of the reasoning steps taken by the network in answering a query.

arxiv1810.12698-f1+f2+f3+f5.png

[Image source. Click image to open in new window.]


arxiv1810.12698-f4+f9+f10.png

[Image source. Click image to open in new window.]


[Table of Contents]

Probing the Shortcomings of Shallow Trained Language Models

While on the surface LSTM based approaches generally appear to perform well for memory and recall, upon deeper inspection they can also display significant limitations. For example, around mid-2018 I conducted a cursory examination of the BiDAF/SQuAD question answering model online demo  [alternate site], in which I found that their BiDAF model performed well on some queries but failed on other semantically and syntactically identical questions (e.g. with changes in character case, or punctuation), as well as queries on entities not present in the text. While BiDAF employed a hierarchical multi-stage process consisting of six layers (character embedding, word embedding, contextual embedding, attention flow, modeling and output layers), it used GloVe pretrained word vectors for the word embedding layer to map each word to a high-dimensional vector space (a fixed embedding of each word). This led me to suspect that the shallow embeddings encoded in the GloVe pretrained word vectors failed to capture the nuances of the processed text.
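
That failure mode is easy to reproduce with any fixed, case-sensitive word-vector table: the lookup is keyed on the surface form, so a change in capitalization or punctuation returns a different vector (or an out-of-vocabulary fallback). The snippet below illustrates this with a toy table; it is not the BiDAF demo's actual pipeline, and the vectors are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a pretrained GloVe table (in practice, loaded from disk).
glove = {"einstein": rng.standard_normal(50),
         "Einstein": rng.standard_normal(50)}
unk = np.zeros(50)

def lookup(token):
    # A fixed embedding layer maps each surface form independently; nothing
    # relates "Einstein", "einstein" and "Einstein?" to one another.
    return glove.get(token, unk)

for tok in ["Einstein", "einstein", "Einstein?"]:
    same = np.allclose(lookup(tok), lookup("Einstein"))
    print(f"{tok!r:14} -> identical to 'Einstein' vector: {same}")
```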

[excerpted/paraphrased from NLP’s ImageNet moment has arrived]:

“Pretrained word vectors have brought NLP a long way. Proposed in 2013 as an approximation to language modeling, word2vec found adoption through its efficiency and ease of use … word embeddings pretrained on large amounts of unlabeled data via algorithms such as word2vec and GloVe are used to initialize the first layer of a neural network, the rest of which is then trained on data of a particular task. … Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.

“Word2vec and related methods are shallow approaches that trade expressivity for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words. This is the core aspect of language understanding, and it requires modeling complex language phenomena such as compositionality, polysemy, anaphora, long-term dependencies, agreement, negation, and many more. It should thus come as no surprise that NLP models initialized with these shallow representations still require a huge number of examples to achieve good performance.

“At the core of the recent advances of ULMFiT, ELMo, and the Finetuned Transformer LM is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.”

Recent discussions by Stanford University researchers (Adversarial Examples for Evaluating Reading Comprehension Systems (Jul 2017) [code; discussion]) are also highly apropos to this issue, motivating related research jointly by investigators at the University of Chicago, Google, and Google Brain (Did the Model Understand the Question? (May 2018) [code]). Adversarial challenges to SQuAD1.1 (e.g., adding adversarially inserted sentences to text passages without changing the correct answer or misleading humans) easily distracted recurrent neural network/attention-based algorithms like BiDAF and LSTM, leading to incorrect answers. Additionally, although deep learning networks were quite successful overall, they often ignored important question terms and were easily perturbed by adversarially modified content – again giving incorrect answers.

Stanford (2017):

arxiv1707.07328-fig1.png

[Image source. Click image to open in new window.]


arxiv1707.07328-fig2.png

[Image source. Click image to open in new window.]


Google (2018):

arxiv1805.05492-table4.png

[Image source. Click image to open in new window.]


Unquestionably, LSTM based language models have been important drivers of progress in NLP.

LSTMs are commonly employed for textual summarization, question answering, natural language understanding, natural language inference, and commonsense reasoning tasks. Increasingly, however, NLP researchers and practitioners have questioned both the relevance and the performance of RNN/LSTM architectures as models for learning natural language. In this regard, Sebastian Ruder included these comments in his recent post, ACL 2018 highlights:

Another way to gain a better understanding of a [NLP] model is to analyze its inductive bias. The “Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP” (RELNLP) sought to explore how useful it is to incorporate linguistic structure into our models. One of the key points of Chris Dyer’s talk during the workshop was whether RNNs have a useful inductive bias for NLP. In particular, he argued that there are several pieces of evidence indicating that RNNs prefer sequential recency, namely:

  • Gradients become attenuated across time. LSTMs or GRUs may help with this, but they also forget.
  • People have used training regimes like reversing the input sequence for machine translation.
  • People have used enhancements like attention to have direct connections back in time.
  • For modeling subject-verb agreement, the error rate increases with the number of attractors [Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies (Nov 2016)]

According to Chomsky, sequential recency is not the right bias for learning human language. RNNs thus don’t seem to have the right bias for modeling language, which in practice can lead to statistical inefficiency and poor generalization behaviour. Recurrent neural network grammars, a class of models that generates both a tree and a sequence sequentially by compressing a sentence into its constituents, instead have a bias for syntactic (rather than sequential) recency [Recurrent Neural Network Grammars (Oct 2016)]. However, it can often be hard to identify whether a model has a useful inductive bias. For identifying subject-verb agreement, Chris hypothesizes that LSTM language models learn a non-structural “first noun” heuristic that relies on matching the verb to the first noun in the sentence. In general, perplexity (and other aggregate metrics) are correlated with syntactic/structural competence, but are not particularly sensitive at distinguishing structurally sensitive models from models that use a simpler heuristic.

Understanding the failure modes of LSTMs

Better understanding representations was also a theme at the Representation Learning for NLP workshop. During his talk, Yoav Goldberg detailed some of the efforts of his group to better understand representations of RNNs. In particular, he discussed his recent work on extracting a finite state automaton from an RNN in order to better understand what the model has learned [Weiss, Goldberg & Yahav: Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples (Jun 2018) … “In this work, however, we will focus on GRUs (Cho et al., 2014; Chung et al., 2014) and LSTMs (Hochreiter & Schmidhuber, 1997), as they are more widely used in practice.”]. He also reminded the audience that LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data. Even when a model has been trained using a domain-adversarial loss to produce representations that are invariant of a certain aspect, the representations will be still slightly predictive of said attribute. It can thus be a challenge to completely remove unwanted information from encoded language data and even seemingly perfect LSTM models may have hidden failure modes. On the topic of failure modes of LSTMs, a statement that also fits well in this theme was uttered by this year’s recipient of the ACL lifetime achievement award, Mark Steedman. He asked ‘LSTMs work in practice, but can they work in theory?’

A UC-Berkeley paper by John Miller and Moritz Hardt, When Recurrent Models Don’t Need To Be Recurrent (May 2018) [author’s discussion; discussion], studied the gap between recurrent and feedforward models trained using gradient descent. They proved that stable RNNs (those whose gradients cannot explode) are well approximated by feedforward networks for the purposes of both inference and training by gradient descent: feedforward and stable recurrent models trained by gradient descent are equivalent in the sense of making identical predictions at test time. [Of course, not all models trained in practice are stable; they also gave empirical evidence that the stability condition could be imposed on certain recurrent models without loss in performance.]

Autoregressive, feed-forward model: Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts $\small y_t$ using only the $\small k$ most recent inputs, $\small x_{t-k+1}, \ldots, x_t$. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the $\small k$ most recent inputs. Google’s WaveNet nicely illustrates this general principle. [Source: When Recurrent Models Don’t Need to be Recurrent]

WaveNet.gif

[Image source. Click image to open in new window.]
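
A minimal sketch of that conditional-independence assumption: an autoregressive feed-forward predictor that sees only the $\small k$ most recent inputs, implemented here as a single causal 1-D convolution (the same idea WaveNet scales up with stacked, dilated convolutions). Sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class WindowedAutoregressor(nn.Module):
    """Predict y_t from the k most recent inputs x_{t-k+1}, ..., x_t only."""
    def __init__(self, in_dim, hidden, out_dim, k=25):
        super().__init__()
        self.k = k
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=k)   # no recurrence
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):
        # x: (batch, seq_len, in_dim)
        x = x.transpose(1, 2)                      # (batch, in_dim, seq_len)
        x = nn.functional.pad(x, (self.k - 1, 0))  # left-pad => causal window
        h = torch.relu(self.conv(x))               # (batch, hidden, seq_len)
        return self.out(h.transpose(1, 2))         # (batch, seq_len, out_dim)

model = WindowedAutoregressor(in_dim=32, hidden=64, out_dim=10, k=25)
print(model(torch.randn(4, 100, 32)).shape)        # torch.Size([4, 100, 10])
```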


Recurrent models feature flexibility and expressivity that come at a cost. Empirical experience shows that RNNs are often more delicate to tune and more brittle to train than standard feedforward architectures. Recurrent architectures can also introduce significant computational burden compared with feedforward implementations. In response to these shortcomings, a growing line of empirical research demonstrates that replacing recurrent models by feedforward models is effective in important applications including translation, speech synthesis, and language modeling (When Recurrent Models Don't Need To Be Recurrent). In contrast to an RNN, the limited context of a feedforward model means that it cannot capture patterns that extend more than $\small k$ steps. Although it appears that the trainability and parallelization for feedforward models comes at the price of reduced accuracy, there have been several recent examples showing that feedforward networks can actually achieve the same accuracies as their recurrent counterparts on benchmark tasks, including language modeling, machine translation, and speech synthesis.

With regard to language modeling – in which the goal is to predict the next word in a document given all of the previous words – feedforward models make predictions using only the $\small k$ most recent words, whereas recurrent models can potentially use the entire document. The gated-convolutional language model is a feedforward autoregressive model that is competitive with large LSTM baseline models. Despite using a truncation length of $\small k = 25$, the model outperforms a large LSTM on the WikiText-103 benchmark, which is designed to reward models that capture long-term dependencies. On the Billion Word Benchmark, the model is slightly worse than the largest LSTM, but is faster to train and uses fewer resources. This is perplexing, since recurrent models seem to be more powerful a priori.

When Recurrent Models Don't Need To Be Recurrent coauthor John Miller continues this discussion in his excellent blog post:

  • One explanation for this phenomenon is given by Dauphin et al. in Language Modeling with Gated Convolutional Networks (Sep 2017):

    arxiv1612.08083-fig1.png

    [Image source. Click image to open in new window.]


    • From that paper:

      Gating has been shown to be essential for recurrent neural networks to reach state-of-the-art performance. Our gated linear units reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities (Section 5.2). We show that gated convolutional networks outperform other recently published language models such as LSTMs trained in a similar setting on the Google Billion Word Benchmark (Chelba et al., 2013). …

      “Gating mechanisms control the path through which information flows in the network and have proven to be useful for recurrent neural networks. LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep. In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers. We show this mechanism to be useful for language modeling as it allows the model to select which words or features are relevant for predicting the next word. …

      “Gated linear units are a simplified gating mechanism based on the work of Dauphin & Grangier [Predicting distributions with Linearizing Belief Networks (Nov 2015; updated May 2016)] for non-deterministic gates that reduce the vanishing gradient problem by having linear units coupled to the gates. This retains the non-linear capabilities of the layer while allowing the gradient to propagate through the linear unit without scaling. … We compare the different gating schemes experimentally in Section 5.2 and we find gated linear units allow for faster convergence to better perplexities.”

  • Another explanation is given by Bai et al. (Apr 2018): “The unlimited context offered by recurrent models is not strictly necessary for language modeling.”

    In other words, it’s possible you don’t need a large amount of context to do well on the prediction task on average. Recent theoretical work offers some evidence in favor of this view (Prediction with a Short Memory). Bai et al. (An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling) make a further observation:

  • “The ‘infinite memory’ advantage of RNNs is largely absent in practice.”

    As Bai et al. report, even in experiments explicitly requiring long-term context, RNN variants were unable to learn long sequences. On the Billion Word Benchmark, an intriguing Google Technical Report suggests an LSTM $\small n$-gram model with $\small n=13$ words of memory is as good as an LSTM with arbitrary context (N-gram Language Modeling using Recurrent Neural Network Estimation). This evidence leads us to conjecture:

  • “Recurrent models trained in practice are effectively feedforward.”

    This could happen either because truncated backpropagation time cannot learn patterns significantly longer than $\small k$ steps, or, more provocatively, because models trainable by gradient descent cannot have long-term memory.

We know very little about how neural language models (LM) use prior linguistic context. A recent paper by Dan Jurafsky and colleagues at Stanford University, Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context (May 2018), investigated the role of context in an LSTM based LM through ablation studies. On two standard datasets (Penn Treebank and WikiText-2) they found that the model was capable of using about 200 tokens of context on average, but sharply distinguished nearby context (the most recent 50 tokens) from the distant history. The model was highly sensitive to the order of words within the most recent sentence, but ignored word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. They further found that the neural caching model (Improving Neural Language Models with a Continuous Cache) especially helped the LSTM copy words from within this distant context. Paraphrased from that paper:

  • “In this analytic study, we have empirically shown that a standard LSTM language model can effectively use about 200 tokens of context on two benchmark datasets, regardless of hyperparameter settings such as model size. It is sensitive to word order in the nearby context, but less so in the long-range context. In addition, the model is able to regenerate words from nearby context, but heavily relies on caches to copy words from far away.”

  • The neural cache model [Improving Neural Language Models with a Continuous Cache (Dec 2016)] augments neural language models with a longer-term memory that dynamically updates the word probabilities based on the long-term context. The neural cache stores the previous hidden states in memory cells for use as keys to retrieve their corresponding (next) word (sketched below). A neural cache can be added on top of a pretrained language model at negligible cost.
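
A sketch of the cache computation just described: previous hidden states act as keys, the words that followed them act as values, and the resulting cache distribution is interpolated with the base language-model distribution. The mixing weight, scaling temperature, and shapes are illustrative.

```python
import numpy as np

def cache_distribution(h_t, past_hiddens, past_next_words, vocab_size, theta=0.3):
    """p_cache(w) is proportional to sum_i 1[w_i = w] * exp(theta * h_t . h_i)."""
    scores = np.exp(theta * past_hiddens @ h_t)          # one score per stored state
    p = np.zeros(vocab_size)
    np.add.at(p, past_next_words, scores)                # accumulate per word id
    return p / p.sum()

def cached_lm_probs(p_model, h_t, past_hiddens, past_next_words, lam=0.2):
    """Interpolate the base LM distribution with the cache distribution."""
    p_cache = cache_distribution(h_t, past_hiddens, past_next_words, len(p_model))
    return (1 - lam) * p_model + lam * p_cache

# Toy usage: 5 cached (hidden state, next word) pairs over a 10-word vocabulary.
rng = np.random.default_rng(0)
p_model = rng.dirichlet(np.ones(10))
h_t = rng.standard_normal(16)
H = rng.standard_normal((5, 16))
w = np.array([3, 7, 3, 1, 9])
print(cached_lm_probs(p_model, h_t, H, w).sum())         # ~ 1.0
```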

While LSTM has been successfully used to model sequential data of variable length, LSTM can experience difficulty in capturing long-term dependencies. Long Short-Term Memory with Dynamic Skip Connections (Nov 2018) tried to alleviate this problem by introducing a dynamic skip connection, which could learn to directly connect two dependent words. Since there was no dependency information in the training data, they proposed a novel reinforcement learning-based method to model the dependency relationship and connect dependent words. The proposed model computed the recurrent transition functions based on the skip connections, which provided a dynamic skipping advantage over RNNs that always tackle entire sentences sequentially. Experimental results on three NLP tasks demonstrated that the proposed method could achieve better performance than existing methods, and in a number prediction experiment the proposed model outperformed LSTM with respect to accuracy by nearly 20%.

arxiv1811.03873-f2.png

[Image source. Click image to open in new window.]


arxiv1811.03873-t1+t2+t3+t5+f6.png

[Image source. Click image to open in new window.]


Question Answering and Reading Comprehension:

Additional Reading

  • A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC (Sep 2018)

    “Across all of the datasets, there exists at least one other dataset that significantly improves performance on a target dataset. These experiments do not support that direct transfer is possible, but that pretraining is at least somewhat effective. QuAC appears to transfer the least to any of other datasets, likely because questioners were not allowed to see underlying context documents while formulating questions. Since transfer is effective between these related tasks, we recommend that future work indicate any pretraining.”

  • Stochastic Answer Networks for SQuAD 2.0 (Sep 2018) [code]

    arxiv1809.09194-fig1.png

    [Image source. Click image to open in new window.]


    arxiv1809.09194-fig2.png

    [Image source. Click image to open in new window.]


    “To sum up, we proposed a simple yet efficient model based on SAN [Stochastic Answer Network]. It showed that the joint learning algorithm boosted the performance on SQuAD2.0. We also would like to incorporate ELMo into our model in future.”

  • AUEB at BioASQ 6: Document and Snippet Retrieval (Sep 2018) [code]

    arxiv1809.06366-fig6.png

    [Image source. Click image to open in new window.]


    “We presented the models, experimental set-up, and results of AUEB’s submissions to the document and snippet retrieval tasks of the sixth year of the BioASQ challenge. Our results show that deep learning models are not only competitive in both tasks, but in aggregate were the top scoring systems. This is in contrast to previous years where traditional IR systems tended to dominate. In future years, as deep ranking models improve and training data sets get larger, we expect to see bigger gains from deep learning models.”

  • A Knowledge Hunting Framework for Common Sense Reasoning (Oct 2018) [MILA/McGill University; Microsoft Research Montreal] [code]

    “We developed a knowledge-hunting framework to tackle the Winograd Schema Challenge (WSC), a task that requires common-sense knowledge and reasoning. Our system involves a semantic representation schema and an antecedent selection process that acts on web-search results. We evaluated the performance of our framework on the original set of WSC instances, achieving F1-performance that significantly exceeded the previous state of the art. A simple port of our approach to COPA [Choice of Plausible Alternatives] suggests that it has the potential to generalize. In the future we will study how this commonsense reasoning technique can contribute to solving ‘edge cases’ and difficult examples in more general coreference tasks.”

    arxiv-1810.01375a.png

    [Image source. Click image to open in new window.]


    arxiv-1810.01375b.png

    [Image source. Click image to open in new window.]


  • A Fully Attention-Based Information Retriever (Oct 2018) [code]

    • “Recurrent neural networks are now the state-of-the-art in natural language processing because they can build rich contextual representations and process texts of arbitrary length. However, recent developments on attention mechanisms have equipped feedforward networks with similar capabilities, hence enabling faster computations due to the increase in the number of operations that can be parallelized. We explore this new type of architecture in the domain of question-answering and propose a novel approach that we call Fully Attention Based Information Retriever (FABIR). We show that FABIR achieves competitive results in the Stanford Question Answering Dataset (SQuAD) while having fewer parameters and being faster at both learning and inference than rival methods.”

      “The experiments validate that attention mechanisms alone are enough to power an effective question-answering model. Above all, FABIR proved roughly five times faster at both training and inference than BiDAF, a competing RNN-based model with similar performance. … Although FABIR is still far from surpassing the models at the top of the SQuAD leaderboard (Table III), we believe that its faster and lighter architecture already make it an attractive alternative to RNN-based models, especially for applications with limited processing power or that require low-latency.”

    • Critique.

      Like FABIR (which is also evaluated with the attention module alone, without convolution, giving satisfactory results), QANet (Apr 2018) is a QA architecture that consists entirely of convolution and self-attention; on the SQuAD dataset it was 3x to 13x faster in training and 4x to 9x faster in inference than the state of the art at that time, and it places highly on the SQuAD1.1 Leaderboard (2018-10-23). However, the FABIR paper [A Fully Attention-Based Information Retriever (Oct 2018)] fails to cite the earlier, more performant QANet work [QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension].

      Carnegie Mellon University/Google Brain’s QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (Apr 2018) begins to address the issue of adversarial challenges to QA. QANet does not require recurrence; its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD1.1 dataset [indicated as such on the Leaderboard] their model was 3-13x faster in training and 4-9x faster in inference, while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data. More significantly for this discussion, on the adversarial SQuAD test set QANet achieved significantly improved $\small F_1$ scores compared to BiDAF and other models (Table 6
      Source
      ), demonstrating the robustness of QANet to adversarial examples.

    arxiv1810.09580-f1.png

    [Image source. Click image to open in new window.]


    arxiv1810.09580-f2.png

    [Image source. Click image to open in new window.]


    arxiv1810.09580-t1.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Natural Language Inference

Natural language inference (NLI), also known as “recognizing textual entailment” (RTE), is the task of identifying the relationship (entailment, contradiction, or neutral) that holds between a premise $\small p$ (e.g. a piece of text) and a hypothesis $\small h$. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI) Corpus, contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of NLI. A newer Multi-Genre Natural Language Inference (MultiNLI) corpus is also available: a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The MultiNLI corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.

NLI was one of the 10 tasks proposed in The Natural Language Decathlon: Multitask Learning as Question Answering, a NLP challenge spanning 10 tasks introduced by Richard Socher and colleagues at Salesforce.

Google's A Decomposable Attention Model for Natural Language Inference (Parikh et al., Sep 2016) likewise proposed a simple neural architecture for natural language inference that used attention to decompose the problem into subproblems that could be solved separately, thus making it trivially parallelizable. Their use of attention was based purely on word embeddings, essentially consisting of feedforward networks that operated largely independently of word order. On the Stanford Natural Language Inference (SNLI) dataset, they obtained state of the art results with almost an order of magnitude fewer parameters than previous work, without relying on any word-order information. The approach outperformed considerably more complex neural methods aimed at text understanding, suggesting that – at least for this task – pairwise comparisons are relatively more important than global sentence-level representations.
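
A compact sketch of the attend / compare / aggregate pipeline, loosely following the decomposable attention idea (intra-sentence attention is omitted, and all sizes are illustrative): each sentence is softly aligned to the other using feedforward projections of the word embeddings alone, aligned pairs are compared, and the comparison vectors are summed and classified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposableAttention(nn.Module):
    """Attend / compare / aggregate over word embeddings (illustrative sizes)."""
    def __init__(self, dim, hidden, num_classes=3):
        super().__init__()
        mlp = lambda i, o: nn.Sequential(nn.Linear(i, hidden), nn.ReLU(),
                                         nn.Linear(hidden, o))
        self.attend = mlp(dim, hidden)                  # F
        self.compare = mlp(2 * dim, hidden)             # G
        self.aggregate = mlp(2 * hidden, num_classes)   # H

    def forward(self, premise, hypothesis):
        # premise: (batch, m, dim), hypothesis: (batch, n, dim) word embeddings
        e = self.attend(premise) @ self.attend(hypothesis).transpose(1, 2)  # (b,m,n)
        beta = F.softmax(e, dim=2) @ hypothesis     # hypothesis aligned to premise
        alpha = F.softmax(e, dim=1).transpose(1, 2) @ premise  # premise aligned to hyp.
        v1 = self.compare(torch.cat([premise, beta], dim=-1)).sum(dim=1)
        v2 = self.compare(torch.cat([hypothesis, alpha], dim=-1)).sum(dim=1)
        return self.aggregate(torch.cat([v1, v2], dim=-1))  # entail/contradict/neutral

model = DecomposableAttention(dim=300, hidden=200)
print(model(torch.randn(8, 12, 300), torch.randn(8, 9, 300)).shape)  # (8, 3)
```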

arxiv1606.0193-fig1.png

[Image source. Click image to open in new window.]


arxiv1606.0193-tables1+2.png

[Image source. Click image to open in new window.]


However,

  • that same model (Parikh et al. 2016; see Table 3 in the image below),
  • and also one based on a Bi-LSTM-based single sentence-encoding model without attention (ibid.),
  • and a hybrid TreeLSTM-based and Bi-LSTM-based model with an inter-sentence attention mechanism to align words across sentences (ibid.)

… all performed poorly on the newer “Breaking NLI” test set, indicating the difficulty of the task (and reiterating the need for ever more challenging datasets). The new examples were simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, performance on the new test set was substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences. That finding recalls my earlier discussion of adversarial challenges to BiDAF/SQuAD-based QA.

arxiv1805.02266-table3b.png

[Image source. Click image to open in new window.]


arxiv1808.03894-fig1.png

[Image source. Click image to open in new window.]


Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. However, efforts to obtain embeddings for larger chunks of text, such as sentences, have not been as successful: several attempts at learning unsupervised representations of sentences have not reached performance satisfactory enough to be widely adopted. For a long time, supervised learning of sentence embeddings was thought to give lower-quality embeddings than unsupervised approaches, but this assumption has recently been overturned, in part following the publication of the InferSent model by Facebook AI Research (Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (May 2017; updated Jul 2018) [code;  discussion: A Walkthrough of InferSent – Supervised Learning of Sentence Embeddings, and reddit]). The authors showed that universal sentence representations trained using the supervised data of the Stanford Natural Language Inference (SNLI) dataset could consistently outperform unsupervised methods, like SkipThought vectors, on a wide range of transfer tasks. Much like how computer vision used ImageNet to obtain features that could then be transferred to other tasks, this work indicated the suitability of natural language inference for transfer learning to other NLP tasks.

arxiv1705.02364-f1.png

[Image source. Click image to open in new window.]


arxiv1705.02364-f4.png

[Image source. Click image to open in new window.]


  • InferSent was an interesting approach owing to the simplicity of its architecture: a bi-directional LSTM with a max-pooling operator as the sentence encoder. InferSent used the SNLI dataset (a set of 570k pairs of sentences labeled with 3 categories: neutral, contradiction and entailment) to train a classifier on top of the sentence encoder. Both sentences were encoded using the same encoder, while the classifier was trained on a pair representation constructed from the two sentence embeddings (see the sketch after this list).

  • Investigating the Effects of Word Substitution Errors on Sentence Embeddings (Nov 2018) investigated the effects of word substitution errors, such as those coming from automatic speech recognition errors (ASR), on several state of the art sentence embedding methods. Their results showed that pre-trained encoders such as InferSent were both robust to ASR errors and performed well on textual similarity tasks after errors were introduced.
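
A minimal sketch of the InferSent-style setup described in the first bullet above: a Bi-LSTM with max pooling over time encodes each sentence, and the classifier operates on the pair representation [u; v; |u - v|; u * v]. Dimensions and the classifier head are illustrative.

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """Sentence encoder in the InferSent style: Bi-LSTM followed by max pooling
    over time (dimensions are illustrative)."""
    def __init__(self, word_dim=300, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                     # x: (batch, seq_len, word_dim)
        out, _ = self.lstm(x)                 # (batch, seq_len, 2*hidden)
        return out.max(dim=1).values          # max pooling over time

class NLIClassifier(nn.Module):
    """Combine the two sentence embeddings as [u; v; |u-v|; u*v], then classify."""
    def __init__(self, sent_dim=1024, hidden=512, num_classes=3):
        super().__init__()
        self.encoder = BiLSTMMaxEncoder()
        self.mlp = nn.Sequential(nn.Linear(4 * sent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, premise, hypothesis):
        u, v = self.encoder(premise), self.encoder(hypothesis)
        pair = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.mlp(pair)                 # entailment / contradiction / neutral

model = NLIClassifier()
logits = model(torch.randn(4, 20, 300), torch.randn(4, 15, 300))
print(logits.shape)                           # torch.Size([4, 3])
```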

In a very similar architecture to InferSent (compare the images above/below), Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture (Aug 2018) [code] from the University of Helsinki yielded state of the art results for SNLI sentence encoding-based models and the SciTail dataset, and provided strong results for the MultiNLI dataset. [The SciTail dataset is an NLI dataset created from multiple-choice science exams consisting of 27k sentence pairs. Each question and the correct answer choice have been converted into an assertive statement to form the hypothesis.] The sentence embeddings could be utilized in a wide variety of transfer learning tasks, outperforming InferSent on 7/10 and SkipThought on 8/9 SentEval sentence embedding evaluation tasks. Furthermore, their model beat InferSent in 8/10 recently published SentEval probing tasks designed to evaluate the ability of sentence embeddings to capture some of the important linguistic properties of sentences.

arxiv1808.08762-f1.png

[Image source. Click image to open in new window.]


arxiv1808.08762-f2.png

[Image source. Click image to open in new window.]


arxiv1808.08762-t9.png

[Image source. Click image to open in new window.]


“The success of the proposed hierarchical architecture raises a number of additional interesting questions. First, it would be important to understand what kind of semantic information the different layers are able to capture. Second, a detailed and systematic comparison of different hierarchical architecture configurations, combining Bi-LSTM and max pooling in different ways, could lead to even stronger results, as indicated by the results we obtained on the SciTail dataset with the modified 4-layered model. Also, as the sentence embedding approaches for NLI focus mostly on the sentence encoder, we think that more should be done to study the classifier part of the overall NLI architecture. There is not enough research on classifiers for NLI and we hypothesize that further improvements can be achieved by a systematic study of different classifier architectures, starting from the way the two sentence embeddings are combined before passing on to the classifier.”

Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. Bridging Knowledge Gaps in Neural Entailment via Symbolic Models (Sep 2018) focused on filling these knowledge gaps in the Science Entailment task by leveraging an external structured knowledge base (KB) of science facts. Their architecture (NSnet) combined standard neural entailment models with a knowledge lookup module. To facilitate this lookup, they proposed decomposing the hypothesis into facts and verifying the resulting sub-facts against both the textual premise and the structured KB. NSnet learned to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperformed a simpler combination of the two predictions by 3% and the base entailment model by 5%.
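
As a schematic illustration of this hybrid idea (not the paper's actual components), the sketch below decomposes a hypothesis into sub-facts, scores each against the textual premise with a neural model and against a structured KB with a symbolic lookup, and aggregates the results. The decomposition, scoring, and aggregation functions are placeholders; NSnet learns its aggregation, whereas here a fixed weighted average and a minimum over sub-facts stand in for it.

```python
from typing import Callable, List

def hybrid_entailment_score(
    premise: str,
    hypothesis: str,
    decompose: Callable[[str], List[str]],      # hypothesis -> sub-facts (placeholder)
    neural_score: Callable[[str, str], float],  # neural entailment model (placeholder)
    kb_score: Callable[[str], float],           # structured-KB lookup (placeholder)
    kb_weight: float = 0.5,
) -> float:
    """Schematic NSnet-style aggregation: verify each sub-fact of the
    hypothesis against both the textual premise and a knowledge base,
    then combine the per-fact scores into a single entailment score."""
    sub_facts = decompose(hypothesis)
    if not sub_facts:
        # Fall back to the plain neural entailment model if no decomposition is found.
        return neural_score(premise, hypothesis)

    fact_scores = []
    for fact in sub_facts:
        textual = neural_score(premise, fact)   # support from the premise text
        symbolic = kb_score(fact)               # support from the structured KB
        fact_scores.append((1 - kb_weight) * textual + kb_weight * symbolic)

    # A hypothesis is entailed only if all of its sub-facts are supported,
    # so aggregate with the weakest (minimum) per-fact score.
    return min(fact_scores)
```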

arxiv1808.09333-f1.png

[Image source. Click image to open in new window.]


arxiv1808.09333-f2.png

[Image source. Click image to open in new window.]


arxiv1808.09333-t3.png

[Image source. Click image to open in new window.]


arxiv1808.09333-t1.png

[Image source. Click image to open in new window.]


[Natural Language Inference:]

Additional Reading

  • Improving Natural Language Inference Using External Knowledge in the Science Questions Domain (Nov 2018)

    “Natural Language Inference (NLI) is fundamental to many Natural Language Processing (NLP) applications including semantic search and question answering. The NLI problem has gained significant attention thanks to the release of large scale, challenging datasets. Present approaches to the problem largely focus on learning-based methods that use only textual information in order to classify whether a given premise entails, contradicts, or is neutral with respect to a given hypothesis. Surprisingly, the use of methods based on structured knowledge – a central topic in artificial intelligence – has not received much attention vis-a-vis the NLI problem. While there are many open knowledge bases that contain various types of reasoning information, their use for NLI has not been well explored. To address this, we present a combination of techniques that harness knowledge graphs to improve performance on the NLI problem in the science questions domain. We present the results of applying our techniques on text, graph, and text-to-graph based models, and discuss implications for the use of external knowledge in solving the NLI problem. Our model achieves the new state-of-the-art performance on the NLI problem over the SciTail science questions dataset.”

    arxiv1809.05724-t1+desr+f1.png

    [Image source.  SciTail dataset.  Click image to open in new window.]


    arxiv1809.05724-f2.png

    [Image source. Click image to open in new window.]