Copyright Notice / Citation

Copyright © 2018-present, Victoria A. Stuart

Please cite this work as:

    Stuart, Victoria A. (August 2018) "Biomedical Knowledge Discovery in Networks Through Language/Graphical Models and Machine Learning" https://persagen.com/2018/08/08/biokdd.html.

BibTeX:

@TechReport{Stuart2018biokdd,
  author      = {Stuart, Victoria A.},
  title       = {Biomedical Knowledge Discovery in Networks Through Language/Graphical Models and Machine Learning},
  institution = {Persagen Consulting (Persagen.com)},
  year        = {2018},
  month       = aug,
  note        = {Technical Report},
  keywords    = {biomedical knowledge discovery; graph convolutional networks; machine learning; natural language understanding; natural language processing; language models; commonsense reasoning; graph signal processing; molecular genetics; cellular metabolism; cellular signaling; cancer; cancer biology; bioinformatics; personalized medicine; networks; graphical models; knowledge graphs; representation learning},
  owner       = {Dr. Victoria A. Stuart, Ph.D.},
  url         = {https://persagen.com/2018/08/08/biokdd.html},
}



[Table of Contents]

PREFACE

The past several years have seen stunning advances in machine learning (ML) and natural language processing (NLP). In this TECHNICAL REVIEW I survey leading approaches to ML and NLP applicable to the biomedical domain, with a particular focus on:

  • construction of structured knowledge stores (commonsense knowledge):

    • textual knowledge stores (ideal for storing reference material)

    • knowledge graphs (ideal for storing relational data)

  • natural language understanding:

    • word embedding (applicable to representation learning, relation extraction/link prediction, language understanding, and knowledge discovery);

    • natural language models (to better understand and leverage natural language and text);

    • natural language inference (recognizing textual entailment: identifying the relationship between a premise and a hypothesis);

    • reading comprehension (ability to process text, understand its meaning, and integrate it with preexisting knowledge);

    • commonsense reasoning (ability to make presumptions about the type and essence of ordinary situations);

    • question answering and recommendation (addressing user inquiry);

  • information overload (including text classification and summarization);

  • transfer learning and multi-task learning (leveraging existing knowledge and models for new tasks);

  • explainable/interpretable models (rendering machine learning decisions/output transparent to human understanding);

  • in silico modeling (using computer models to model biochemical, biomolecular, pharmacological and physiological processes).

Particular attention will be placed on advances in NLP and ML that are applicable to biomolecular and biomedical studies, and clinical science.

Please note the following.

  • These are solely presented as my personal summary notes, not a research paper: my intent here is to summarize the recent literature relevant to the subject areas indicated in the Table of Contents. While this REVIEW is comprehensive, it is not an exhaustive survey of that literature, as it reflects my personal interests.

  • The terms “machine learning” and “neural networks” are used interchangeably.

  • For convenience I will often repeat key abbreviation definitions in various subsections. Also, I generally do not pluralize abbreviations (e.g., RNN, not RNNs) unless not doing so leads to ambiguity.

  • I will paraphrase references inline with relevant URLs provided, generally forgoing the use of author names etc. but occasionally mentioning key individuals and dates.

  • Internal references to other parts of this REVIEW (and my Glossary) are presented as green hyperlinks.

  • Much less frequently you will encounter orange-brown hyperlinks, which are “mouseover” images (supplementary content that I think is important, but I do not want to have prominently displayed unless you move your cursor over that link). Example.

  • Please refer to my Glossary (and this glossary) for descriptions of some of the terms and concepts used in this REVIEW.

A recent review (Stanford University, July 2018) that provides an excellent introduction and overview of many of the topics discussed below is Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities.

[Table of Contents]

NATURAL LANGUAGE PROCESSING

Natural language processing (NLP), a branch of machine learning (ML), is foundational to all information extraction and natural language tasks. Recent reviews of NLP relevant to this TECHNICAL REVIEW include:

Regarding the latter review, note my comments in the reddit thread Recent Trends in Deep Learning Based Natural Language Processing, which indicate an “issue” affecting any review (or proposed work) in the NLP and machine learning domains: the extraordinarily rapid rate of progress. During the course of preparing this REVIEW, highly relevant literature and developments appeared almost daily on arXiv.org, my RSS feeds, and other sources. I firmly believe that this rapid pace of progress represents outstanding research opportunities rather than barriers (e.g., proposing ML research that may quickly become “dated”).

Lastly, high-profile Ph.D. student/blogger Sebastian Ruder actively tracks progress in numerous NLP subdomains at NLP Progress (alternate link).

Basic steps associated with NLP include text retrieval followed by a series of preprocessing steps, such as:

Additional NLP preprocessing steps may be included (or some of the steps above may be omitted), and the order of some of those steps may vary slightly.
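
Purely as an illustration of such a pipeline, here is a minimal sketch in Python using NLTK (one common choice; spaCy is another). The particular steps and their ordering here are assumptions for this example, per the caveat above:

    # Assumes the punkt, stopwords, and wordnet resources have been downloaded:
    #   import nltk; nltk.download(["punkt", "stopwords", "wordnet"])
    import string
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import sent_tokenize, word_tokenize

    def preprocess(text):
        """Sentence segmentation -> tokenization -> lowercasing ->
        punctuation/stopword removal -> lemmatization."""
        lemmatizer = WordNetLemmatizer()
        stops = set(stopwords.words("english"))
        processed = []
        for sentence in sent_tokenize(text):
            tokens = [t.lower() for t in word_tokenize(sentence)]
            tokens = [t for t in tokens
                      if t not in stops and t not in string.punctuation]
            processed.append([lemmatizer.lemmatize(t) for t in tokens])
        return processed

    print(preprocess("The deletion mutation on exon-19 of EGFR gene "
                     "was present in 16 patients."))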

Some recent ML approaches to NLP tasks include:

Again, that is not an exhaustive list – merely some articles that I have recently encountered that are relevant to my interests.

[Table of Contents]

NLP: Selected Papers

Cross-sentence $\small n$-ary relation extraction detects relations among $\small n$ entities across multiple sentences. Typical methods formulate an input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The prior state of the art method split the input graph into two DAG [directed acyclic graph] subgraphs, adopting a DAG-structured LSTM for each. Though able to model rich linguistic knowledge by leveraging graph edges, that splitting procedure can lose important information. Song et al. (August 2018: N-ary Relation Extraction using Graph State LSTM [code]) proposed a graph-state LSTM model, which used a parallel state to model each word, recurrently enriching state values via message passing. Compared with DAG LSTM, their graph LSTM kept the original graph structure, and sped up computation by allowing more parallelization. For example, given

“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.”

… their model conveyed the fact that cancers caused by the L858E mutation in the EGFR gene can respond to the anticancer drug gefitinib: the three entity mentions appeared in separate sentences yet formed a ternary relation. On a standard benchmark, their model outperformed a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state of the art system of Peng et al. (2017) by 1.2%.

Song et al.’s code was an implementation of Peng et al.’s Cross-Sentence N-ary Relation Extraction with Graph LSTMs (a different project; project/code), modified with regard to the edge labels (discussed by Song et al. in their paper).
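
To make the message-passing idea concrete, here is a toy sketch of mine – plain tanh updates over a small graph, not Song et al.’s gated LSTM implementation – showing how every word state is enriched in parallel from its graph neighbors:

    import numpy as np

    rng = np.random.default_rng(0)
    n_nodes, dim = 5, 8                      # e.g., words in a document graph

    # Toy adjacency: in a real document graph the edges would come from
    # dependency arcs, sentence order, coreference/discourse links, etc.
    adj = np.zeros((n_nodes, n_nodes))
    adj[0, 1] = adj[1, 2] = adj[2, 3] = adj[3, 4] = 1.0
    adj = adj + adj.T                        # undirected, for simplicity

    h = rng.normal(size=(n_nodes, dim))      # initial per-word states
    W_self = rng.normal(size=(dim, dim)) * 0.1
    W_msg = rng.normal(size=(dim, dim)) * 0.1

    for _ in range(3):                       # a few rounds of message passing:
        messages = adj @ (h @ W_msg)         # sum transformed neighbor states
        h = np.tanh(h @ W_self + messages)   # update all word states in parallel

    # After k rounds, each state encodes its k-hop graph neighborhood.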

arxiv-1808.09101a.png

[Image source. Click image to open in new window.]


arxiv-1808.09101b.png

[Image source. Click image to open in new window.]


arxiv-1808.09101c.png

[Image source. Click image to open in new window.]


arxiv-1808.09101d.png

[Image source. Click image to open in new window.]


Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction (Nov 2018) proposed a combined model of long short-term memory and convolutional neural networks (LSTM-CNN) that exploited word embeddings and positional embeddings for cross-sentence $\small n$-ary relation extraction. The proposed model brought together the properties of both LSTM and CNN, simultaneously exploiting long-range sequential information and capturing the most informative features – both essential for cross-sentence $\small n$-ary relation extraction. Evaluated on standard datasets for this task, the LSTM-CNN model significantly outperformed baselines such as CNN, LSTM, and a combined CNN-LSTM model, as well as the then current state of the art methods.

  • “However, relations can exist between more than two entities that appear across consecutive sentences. For example, in the text span comprising the two consecutive sentences in $\small \text{LISTING 1}$, there exists a ternary relation response across three entities – $\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitnib}$ – appearing across sentences. This relation extraction task, focusing on identifying relations between more than two entities – appearing either in a single sentence or across sentences – is known as cross-sentence $\small n$-ary relation extraction.

      $\small \text{Listing 1: Text span of two consecutive sentences}$

      'The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10.   All patients were treated with gefitnib and showed a partial response. '

    “This paper focuses on the cross-sentence $\small n$-ary relation extraction task. Formally, let $\small \{e_1, \ldots, e_n\}$ be the set of entities in a text span $\small S$ containing $\small t$ number of consecutive sentences. For example, in the text span comprising 2 sentences ($\small t = 2$) in $\small \text{Listing 1}$ above, given cancer patients with mutation $\small v$ (L858E) in gene $\small g$ (EGFR), the patients showed a partial response to drug $\small d$ (gefitnib). Thus, a ternary relation response ($\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitinib}$) exists among the three entities spanning across the two sentences in $\small \text{Listing 1}$.”

arxiv1811.00845-f1.png

[Image source. Click image to open in new window.]


arxiv1811.00845-t4+t5.png

[Image source. Click image to open in new window.]


Likewise, Neural Relation Extraction Within and Across Sentence Boundaries (Oct 2018) proposed an architecture for relation extraction in entity pairs spanning multiple sentences: inter-sentential dependency-based neural networks (iDepNN). iDepNN modeled the shortest and augmented dependency paths via recurrent and recursive neural networks to extract relationships within (intra-) and across (inter-) sentence boundaries. Compared to SVM and neural network baselines, iDepNN was more robust to false positives in relationships spanning sentences. The authors evaluated their model on four datasets from newswire (MUC6) and medical (BioNLP shared tasks) domains, achieving state of the art performance and showing a better balance of precision and recall for inter-sentential relationships – performing better than the 11 teams participating in the BioNLP Shared Task 2016, with a gain of 5.2% (0.587 vs. 0.558) in $\small F_1$ over the winning team. They also released cross-sentence annotations for MUC6.

arxiv1810.05102a.png

[Image source. Click image to open in new window.]


arxiv1810.05102b.png

[Image source. Click image to open in new window.]


arxiv1810.05102c.png

[Image source. Click image to open in new window.]


Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) proposed a novel segmental hypergraph representation to model overlapping entity mentions that are prevalent in many practical datasets. They showed that their model was able to capture features and interactions (the model was robust in handling both overlapping and non-overlapping mentions) that could not be captured by previous models, while maintaining a low time complexity for inference.

In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (above), Learning to Recognize Discontiguous Entities (Oct 2018) focused on recognizing discontiguous entities, which may also overlap. They proposed a novel hypergraph representation to jointly encode discontiguous entities of unbounded length, which could overlap with one another. Empirical results showed that their model achieved significantly better results when evaluated on standard data with many discontiguous entities.

Most modern Information Extraction (IE) systems are implemented as sequential taggers and focus on modelling local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. GraphIE: A Graph-Based Framework for Information Extraction (Oct 2018) is a framework that operates over a graph representing both local and non-local dependencies between textual units (i.e. words or sentences). The algorithm propagated information between connected nodes through graph convolutions and exploited the richer representation to improve word level predictions. Results on three different tasks – social media, textual and visual information extraction – showed that GraphIE outperformed a competitive baseline (BiLSTM+CRF) in all tasks by a significant margin.

arxiv1810.13083-f1.png

[Image source. Click image to open in new window.]


arxiv1810.13083-f2.png

[Image source. Click image to open in new window.]


arxiv1810.13083-t5+t6.png

[Image source. Click image to open in new window.]


While character-based neural models have proven useful for many NLP tasks, there is a gap in sophistication between methods for learning representations of sentences and of words: most character models for learning sentence representations are deep and complex, while models for learning word representations are shallow and simple. It is also still not clear which kind of architecture best captures character-to-word representations. To address these questions, Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018) first investigated the gaps between methods for learning word and sentence representations. They then proposed IntNet, a funnel-shaped, wide convolutional neural architecture with no down-sampling that learns representations of the internal structure of words by composing their characters from limited, supervised training corpora. They evaluated their model on six sequence labeling datasets, including named entity recognition, part-of-speech tagging, and syntactic chunking. IntNet significantly outperformed other character embedding models, obtaining new state of the art performance without relying on any external knowledge or resources.

arxiv1810.12443-f1.png

[Image source. Click image to open in new window.]


arxiv1810.12443-t1.png

[Image source. Click image to open in new window.]


arxiv1810.12443-t2.png

[Image source. Click image to open in new window.]


arxiv1810.12443-t3+t4.png

[Image source. Click image to open in new window.]


Natural Language Processing:

Additional Reading

  • Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (2016) [code; discussion]

    “Relation classification is an important semantic processing task in the field of natural language processing (NLP). State of the art systems still rely on lexical resources such as WordNet or NLP systems like dependency parser and named entity recognizers (NER) to get high-level features. Another challenge is that important information can appear at any position in the sentence. To tackle these problems, we propose Attention-Based Bidirectional Long Short-Term Memory Networks (Att-BLSTM) to capture the most important semantic information in a sentence. The experimental results on the SemEval-2010 relation classification task show that our method outperforms most of the existing methods, with only word vectors.”

    Zhou2016attention-f1.png

    [Image source. Click image to open in new window.]

    Zhou2016attention-t1.png

    [Image source. Click image to open in new window.]

  • Gradual Machine Learning for Entity Resolution (Oct 2018)

    “Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning [GML], which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.”

    arxiv1810.12125-f1+f4.png

    [Image source. Click image to open in new window.]


    arxiv1810.12125-t3.png

    [Image source. GML: gradual machine learning (this paper); UR: unsupervised rule-based; UC: unsupervised clustering; SVM: support vector machine; DNN: deep neural network. Click image to open in new window.]

    “Our evaluation is conducted on three real datasets, which are described as follows:

    • DS (DBLP-Scholar): the DS dataset contains the publication entities from DBLP and the publication entities from Google Scholar. The experiments match the DBLP entries with the Scholar entries.
    • AB (Abt-Buy): the AB dataset contains the product entities from both Abt.com and Buy.com. The experiments match the Abt entries with the Buy entries.
    • SG (Songs): the SG dataset contains song entities, some of which refer to the same songs. The experiments match the song entries in the same table.”
  • Neural CRF Transducers for Sequence Labeling (Nov 2018)

    “Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) are developed to implement the non-linear node potentials in CRFs, but still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consists of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experiment results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all the four tasks, and can improve state-of-the-art results.”

    arxiv1811.01382-t1+f1.png

    [Image source. Click image to open in new window.]


    arxiv1811.01382-t2+t3+t4+t5.png

    [Image source. Click image to open in new window.]
  • Comparison of Named Entity Recognition Methodologies in Biomedical Documents (Nov 2018)

    “Background. Biomedical named entity recognition (Bio-NER) is a fundamental task in handling biomedical text terms, such as RNA, protein, cell type, cell line, and DNA. Bio-NER is one of the most elementary and core tasks in biomedical knowledge discovery from texts. The system described here is developed by using the BioNLP/NLPBA 2004 shared task. Experiments are conducted on a training and evaluation set provided by the task organizers.
    Results. Our results show that, compared with a baseline having a 70.09% $\small F_1$ score, the RNN Jordan-type and Elman-type algorithms have $\small F_1$ scores of approximately 60.53% and 58.80%, respectively. When we use CRF as a machine learning algorithm, CCA, GloVe, and Word2Vec have $\small F_1$ scores of 72.73%, 72.74%, and 72.82%, respectively.”

    “In this paper, we use five categories (protein, DNA, RNA, cell type, and cell line) instead of the categories used in the ordinary NER process. An example of the NER tagged sentence is as follows: ‘IL-2 [ B-protein ] responsiveness requires three distinct elements [ B-DNA ] within the enhancer [ B-DNA ].’”

    PMID30396340-f1+f4+f5+t2.png

    [Image source. Click image to open in new window.]

[Table of Contents]

NATURAL LANGUAGE UNDERSTANDING

Machine learning is particularly well suited to assisting and even supplanting many standard NLP approaches (for a good review, see Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities (Jun 2018)). Language models, for example, provide improved understanding of the semantic content and latent (hidden) relationships in documents. Machine-based natural language understanding (NLU) is a fundamental requirement for robust, human-level performance in tasks such as information retrieval, text summarization, question answering, textual entailment, sentiment analysis, reading comprehension, commonsense reasoning, recommendation, etc.

arxiv-1807.00123a.png

[Image source. Click image to open in new window.]


arxiv-1807.00123b.png

[Image source. Click image to open in new window.]


arxiv-1807.00123c.png

[Image source. Click image to open in new window.]


Advances in NLU offer tremendous promise for the analysis of biomedical and clinical text, which due to the use of technical, domain-specific jargon is particularly challenging for traditional NLP approaches. Some of these challenges and difficulties are described in the August 2018 post NLP’s Generalization Problem, and How Researchers are Tackling It  [discussion].

Recent developments in NLP and ML that I believe are particularly important to advancing NLU include:

  • understanding the susceptibility of QA systems to adversarial challenge;

  • the development of deeply-trained/pretrained language models;

  • transfer learning and multitask learning;

  • reasoning over graphs;

  • the development of more advanced memory and attention-based architectures; and,

  • incorporating external memory mechanisms; e.g., a differentiable neural computer, which is essentially an updated version of a neural Turing machine (What Is the Difference between Differentiable Neural Computers and Neural Turing Machines?). Relational database management systems (RDBMS), textual knowledge stores (TKS) and knowledge graphs (KG) also represent external knowledge stores that could potentially be leveraged as external memory resources in architectures suitable for NLP and ML.

DeepMind’s recent paper Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies (Aug 2018) addressed preserving and reusing past knowledge (memory) via unsupervised representation learning using a variational autoencoder: VASE (Variational Autoencoder with Shared Embeddings). VASE automatically detected shifts in data distributions and allocated spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting:

    "... thanks to learning a generative model of the observed environments, we can prevent **catastrophic forgetting** by periodically "hallucinating" (i.e. generating samples) from past environments using a snapshot of VASE, and making sure that the current version of VASE is still able to model these samples. A similar "dreaming" feedback loop was used in Lifelong Generative Modeling, ..."

arxiv-1808.06508.png

[Image source. Click image to open in new window.]


  • For similar, prior work by other authors (cited) that also used a variational autoencoder, see Lifelong Generative Modeling, below.

  • As noted, each of the papers cited above addressed the issue of catastrophic forgetting. Interestingly, the Multitask Question Answering Network (MQAN), described in Richard Socher’s “decaNLP/MQAN” paper, attained robust multitask learning, performing nearly as well or better in the multitask setting as in the single task setting for each task despite being capped at the same number of trainable parameters in both. … This suggested that MQAN successfully used trainable parameters more efficiently in the multitask setting by learning to pack or share parameters in a way that limited catastrophic forgetting.

Lifelong learning is the problem of learning multiple consecutive tasks in a sequential manner, where knowledge gained from previous tasks is retained and used for future learning. It is essential to the development of intelligent machines that can adapt to their surroundings. Lifelong Generative Modeling (Sep 2018), by authors at the University of Geneva and the Geneva School of Business Administration, presented a lifelong learning approach to generative modeling in which newly observed distributions were continuously incorporated into the learnt model. They did so through a student-teacher variational autoencoder architecture, which allowed them to learn and preserve all the distributions seen to that point without the need to retain past data or past models. Through the introduction of a novel cross-model regularizer inspired by a Bayesian update rule, the student model leveraged the information learnt by the teacher, which acted as a summary of everything seen to that point. The regularizer had the additional benefit of reducing the effect of catastrophic interference, which appears when sequences of distributions are learned. They demonstrated the model's efficacy in learning sequentially observed distributions, as well as its ability to learn a common latent representation across a complex transfer learning scenario.

arxiv1705.09847-f1.png

[Image source. Click image to open in new window.]


arxiv1705.09847-f2.png

[Image source. Click image to open in new window.]


Continual learning is the ability to learn sequentially over time by accommodating new knowledge while retaining previously learned experiences. Neural networks can learn multiple tasks when trained on them jointly, but cannot maintain performance on previously learned tasks when tasks are presented one at a time – a problem called catastrophic forgetting. Continual Classification Learning Using Generative Models (Oct 2018), by authors at the University of Geneva and the Geneva School of Business Administration, proposed a classification model that learns continuously from sequentially observed tasks while preventing catastrophic forgetting. They built on the lifelong generative capabilities of their Lifelong Generative Modeling work (above), extending it to the classification setting by deriving a new variational bound on the joint log-likelihood, $\small \log p(x,y)$.
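
For reference, a generic variational bound on a joint likelihood of this form – the standard starting point for such derivations (assuming $\small x$ and $\small y$ are conditionally independent given the latent code $\small z$), not necessarily the exact bound derived in the paper – is

    $\small \log p(x,y) \geq \mathbb{E}_{q(z \vert x)} \left[ \log p(x \vert z) + \log p(y \vert z) \right] - \text{KL} \left( q(z \vert x) \, \Vert \, p(z) \right)$

where $\small q(z \vert x)$ is the encoder, $\small p(x \vert z)$ is the reconstruction term, and $\small p(y \vert z)$ is the classification term.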

arxiv1810.10612-f1.png

[Image source. Click image to open in new window.]


arxiv1810.10612-f2.png

[Image source. Click image to open in new window.]


Google Brain’s A Simple Method for Commonsense Reasoning (Jun 2018) [code; slides; discussion here and here] presented a simple method for commonsense reasoning with neural networks, using unsupervised learning. Key to the method was the use of an array of large RNN language models that operated at word or character level, trained on a massive amount of unlabeled data, to score multiple choice questions posed by commonsense reasoning tests.

arxiv-1806.02847.png

[Image source. Click image to open in new window.]


  • This paper was subsequently savaged in an October 2018 commentary, A Simple Machine Learning Method for Commonsense Reasoning? A Short Commentary on Trinh & Le (2018):

    “A Concluding Remark. The data-driven approach in AI has without a doubt gained considerable notoriety in recent years, and there are a multitude of reasons that led to this fact. While the data-driven approach can provide some useful techniques for practical problems that require some level of natural language processing (text classification and filtering, search, etc.), extrapolating the relative success of this approach into problems related to commonsense reasoning, the kind that is needed in true language understanding, is not only misguided, but may also be harmful, as this might seriously hinder the field, scientifically and technologically.”

A Simple Neural Network Module for Relational Reasoning (Jun 2017) [DeepMind blog; non-author code here and here; discussion here, here and here] by DeepMind described Relation Networks, a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning, including visual question answering, text-based question answering using the bAbI suite of tasks, and complex reasoning about dynamic physical systems. They showed that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with relational networks, to implicitly discover and learn to reason about entities and their relations.

arxiv-1706.01427.png

[Image source. Click image to open in new window.]


While Relation Networks – introduced by Santoro et al. (2017) in DeepMind’s “A Simple Neural Network Module for Relational Reasoning,” above – demonstrated strong relational reasoning capabilities, their rather shallow (single-layer) architecture only considered pairs of information objects, making them unsuitable for problems requiring reasoning across a higher number of facts. To overcome this limitation, authors at the University of Lübeck proposed Multi-layer Relation Networks (Nov 2018) [code], a multi-layer relation network architecture which enabled successive refinements of relational information through multiple layers. They showed that the increased depth allowed for more complex relational reasoning; applying it to the bAbI 20 QA dataset, they solved all 20 tasks with joint training, surpassing the state of the art results.

arxiv1811.01838-f1+f2.png

[Image source. Click image to open in new window.]


arxiv1811.01838-t1.png

[Image source. Click image to open in new window.]


arxiv1811.01838-tA1.png

[Image source. Click image to open in new window.]


Natural Language Understanding:

Additional Reading

  • On the Evaluation of Common-Sense Reasoning in Natural Language Understanding (Nov 2018) [datasets]

    “The NLP and ML communities have long been interested in developing models capable of common-sense reasoning, and recent works have significantly improved the state of the art on benchmarks like the Winograd Schema Challenge (WSC). Despite these advances, the complexity of tasks designed to test common-sense reasoning remains under-analyzed. In this paper, we make a case study of the Winograd Schema Challenge and, based on two new measures of instance-level complexity, design a protocol that both clarifies and qualifies the results of previous work. Our protocol accounts for the WSC’s limited size and variable instance difficulty, properties common to other common-sense benchmarks. Accounting for these properties when assessing model results may prevent unjustified conclusions.”


[Table of Contents]

Word Embeddings

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from a vocabulary are mapped to vectors of real numbers (Sebastian Ruder provides a good overview; see also this excellent post, Introduction to Word Embeddings). Conceptually it involves a mathematical embedding from a sparse, highly dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Mikolov et al. (Sep 2013) Efficient Estimation of Word Representations in Vector Space – the “word2vec” paper].

arxiv-1301.3781.png

[Image source. Click image to open in new window.]


cbo_vs_skipgram.png

[Image source. Click image to open in new window.]


Word embeddings are widely used in predictive NLP modeling, particularly in deep learning applications (Word Embeddings: A Natural Language Processing Crash Course). Word embeddings enable the identification of similarities between words and phrases, on a large scale, based on their context. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically; e.g.,

    $\small v_{Berlin} - v_{Germany} + v_{France} \approx v_{Paris}$
    $\small v_{king} - v_{man} + v_{woman} \approx v_{queen}$
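
Such analogies can be queried directly from pretrained vectors; a minimal sketch with gensim (the vector file name is illustrative – any word2vec-format file will do):

    from gensim.models import KeyedVectors

    # Path is illustrative; any word2vec-format vector file works.
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # v_king - v_man + v_woman ~ v_queen
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=3))
    # Typically ranks 'queen' first.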

The original work for generating word embeddings was presented by Bengio et al. in 2003 (A Neural Probabilistic Language Model, which builds on his 2001 (NIPS 2000) “feature vectors” paper of the same name), who trained the embeddings in a neural language model together with the model’s parameters.

Despite the assertion by Sebastian Ruder in An Overview of Word Embeddings and their Connection to Distributional Semantic Models that Bengio coined the phrase “word embeddings” in his 2003 paper, the term “embedding” does not appear in that paper. The abstract does state the concept, however: “We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.” The correct attribution is likely Bengio’s similarly-named 2006 paper Neural Probabilistic Language Models, which states (bottom of p. 162): “Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes.” The full reference is: Y. Bengio et al. (2006) Neural Probabilistic Language Models. StudFuzz 194:137-186.

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

Likewise – vis-à-vis my previous comment (I’m being rather critical here) – the 2008 Collobert and Weston paper, above, mentions “embedding” (but not “word embedding”) and cites Bengio’s 2001 (NIPS 2000) paper, while Mikolov’s 2013 paper does not mention “embedding” and cites Bengio’s 2003 paper.

For a theoretical discussion of word vectors, see Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline [code; discussion], which is a critique/extension of A Latent Variable Model Approach to PMI-based Word Embeddings. In addition to proposing a new generative model – a dynamic version of the log-linear topic model of Mnih and Hinton (2007) [Three New Graphical Models for Statistical Language Modelling] – the paper provided a theoretical justification for nonlinear models like PMI, word2vec, and GloVe. It also helped explain why low dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013) [see the algebraic examples, above]. Experimental support was provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.

Relatedly, Sebastian Ruder recently provided a summary of ACL 2018 highlights, including a subsection entitled Understanding Representations: “It was very refreshing to see that rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture.”

Word embeddings are a particularly striking example of learning a representation, i.e. representation learning (Bengio et al., Representation Learning: A Review and New Perspectives (April 2014); see also the excellent blog posts Deep Learning, NLP, and Representations by Chris Olah, and An introduction to representation learning by Michael Alcorn). Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. While traditional unsupervised learning techniques are staples of machine learning, representation learning has emerged as an alternative approach to feature extraction (An Introduction to Representation Learning).

In representation learning, features are extracted from unlabeled data by training a neural network on a secondary, supervised learning task. Word2vec is a good example of representation learning, simultaneously learning several language concepts:

  • the meanings of words;
  • how words are combined to form concepts (i.e., syntax); and,
  • how concepts relate to the task at hand.
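
As a concrete illustration of representation learning, here is a minimal sketch of training skip-gram word2vec embeddings with gensim (gensim ≥ 4.0 argument names; the toy corpus is obviously far too small to learn meaningful vectors):

    from gensim.models import Word2Vec

    corpus = [                                       # toy corpus
        ["deletion", "mutation", "egfr", "gene"],
        ["patients", "treated", "with", "gefitinib"],
        ["patients", "showed", "partial", "response"],
    ]

    model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                     min_count=1, sg=1, epochs=50)   # sg=1 -> skip-gram

    vec = model.wv["gefitinib"]                      # the learned representation
    print(model.wv.most_similar("gefitinib", topn=3))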

Reasoning about implied relationships (e.g. paraphrastic, common sense, encyclopedic) between pairs of words is crucial for many cross-sentence inference problems. pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference (Oct 2018) [discussion], by the Paul G. Allen School of Computer Science and Engineering and Facebook AI Research, proposed new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. Their pairwise embeddings were computed as a compositional function of each word’s representation, learned by maximizing the pointwise mutual information (PMI) with the contexts in which the two words co-occurred. They added these representations to the cross-sentence attention layer of existing inference models (e.g. BiDAF for QA, ESIM for NLI), instead of extending or replacing existing word embeddings. Experiments showed a gain of 2.72% on the recently released SQuAD 2.0 and 1.3% on MultiNLI. Their representations also aided generalization, with gains of around 6-7% on adversarial SQuAD datasets and 8.8% on the adversarial entailment test set of Glockner et al.

arxiv1810.08854-t1.png

[Image source. Click image to open in new window.]


arxiv1810.08854-f1.png

[Image source. Click image to open in new window.]


arxiv1810.08854-t2+t3+t4.png

[Image source. Click image to open in new window.]


Word Embeddings:

Additional Reading

  • Towards Understanding Linear Word Analogies (Oct 2018)

    • “A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why – and when – linear operators correspond to non-linear embedding models such as Skip-Gram with Negative Sampling (SGNS). We provide a rigorous explanation of this phenomenon without making the strong assumptions that past work has made about the vector space and word distribution. Our theory has several implications. Past work has often conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel theoretical justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, providing rigorous justification for its use in capturing word dissimilarity.”

    • [Section 5] Even though vector algebra is surprisingly effective at solving word analogies, the csPMI Theorem reveals two reasons for why an analogy may be unsolvable in a given embedding space: polysemy and corpus bias. …

  • Dynamic Meta-Embeddings for Improved Sentence Representations (Kyunghyun Cho and colleagues at Facebook AI Research; Sep 2018) [project; code; discussion]

    • “While one of the first steps in many NLP systems is selecting what pre-trained word embeddings to use, we argue that such a step is better left for neural networks to figure out by themselves. To that end, we introduce dynamic meta-embeddings, a simple yet effective method for the supervised learning of embedding ensembles, which leads to state-of-the-art performance within the same model class on a variety of tasks. We subsequently show how the technique can be used to shed new light on the usage of word embeddings in NLP systems.”

      “We argue that the decision of which word embeddings to use in what setting should be left to the neural network. While people usually pick one type of word embeddings for their NLP systems and then stick with it, we find that dynamically learned meta-embeddings lead to improved results. In addition, we showed that the proposed mechanism leads to better interpretability and insightful linguistic analysis. We showed that the network learns to select different embeddings for different data, different domains and different tasks. We also investigated embedding specialization and examined more closely whether contextualization helps. To our knowledge, this work constitutes the first effort to incorporate multi-modal information on the language side of image-caption retrieval models; and the first attempt at incorporating meta-embeddings into large-scale sentence-level NLP tasks.”

    arxiv1804.07983-t1.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Addressing Hypernymy and Polysemy with Word Embeddings

Word embeddings have many uses in NLP. Polysemy – words or phrases with different, but related, meanings [e.g. “Washington” may refer to “Washington, DC” (location) or “George Washington” (person)] – poses one of many challenges to NLP. Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal.

In Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation (Jun 2016) [code] the authors offered a solution to the polysemy problem. They proposed a novel embedding method specifically designed for named entity disambiguation that jointly mapped words and entities into the same continuous vector space. Since similar words and entities were placed close to one another in vector space in this model, the similarity between any pair of items (e.g. words, entities, and a word and an entity) could be measured by simply computing their cosine similarity.

Though not cited in that paper, the code for that work (by coauthor and Studio Ousia employee Ikuya Yamada) was made available on GitHub in the Wikipedia2Vec repository; the Wikipedia2Vec project page contains the pretrained embeddings (models).
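
The similarity computation itself is straightforward; a minimal sketch, with the vectors below as stand-ins for word and entity embeddings loaded from, e.g., the pretrained Wikipedia2Vec models:

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Stand-ins for a word vector and an entity vector from the joint space:
    v_word = np.array([0.20, 0.90, -0.40])     # e.g., the word "washington"
    v_entity = np.array([0.25, 0.85, -0.35])   # e.g., entity "George Washington"
    print(cosine(v_word, v_entity))            # higher -> more likely referent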

A probabilistic extension of fastText (discussed elsewhere in this REVIEW) – Probabilistic FastText for Multi-Sense Word Embeddings (Jun 2018) – can produce accurate representations of rare, misspelt, and unseen words. Probabilistic FastText achieved state of the art performance on benchmarks that measure the ability to discern different meanings. The proposed model was the first to achieve multi-sense representations while having enriched semantics on rare words:

  • “Our multimodal word representation can also disentangle meanings, and is able to separate different senses in foreign polysemies. In particular, our models attain state-of-the-art performance on SCWS, a benchmark to measure the ability to separate different word meanings, achieving 1.0% improvement over a recent density embedding model W2GM (Athiwaratkun and Wilson, 2017). To the best of our knowledge, we are the first to develop multi-sense embeddings with high semantic quality for rare words.”

  • “… we show that our probabilistic representation with subword mean vectors with the simplified energy function outperforms many word similarity baselines and provides disentangled meanings for polysemies.”

  • “We show that our embeddings learn the word semantics well by demonstrating meaningful nearest neighbors. Table 1 shows the nearest neighbors of polysemous words such as ‘rock’, ‘star’, and ‘cell’. We note that subword embeddings prefer words with overlapping characters as nearest neighbors. For instance, ‘rock-y’, ‘rockn’, and ‘rock’ are both close to the word ‘rock’. For the purpose of demonstration, we only show words with meaningful variations and omit words with small character-based variations previously mentioned. However, all words shown are in the top-100 nearest words. We observe the separation in meanings for the multi-component case; for instance, one component of the word ‘bank’ corresponds to a financial bank whereas the other component corresponds to a river bank. The single-component case also has interesting behavior. We observe that the subword embeddings of polysemous words can represent both meanings. For instance, both ‘lava-rock’ and ‘rock-pop’ are among the closest words to ‘rock’.”

Wasserstein is All you Need (Aug 2018) [discussion] proposed a unified framework for building unsupervised representations of individual objects or entities (and their compositions) by associating with each object both a distributional as well as a point estimate (vector embedding). Their method gives a novel perspective for building rich and powerful feature representations that simultaneously capture uncertainty (via a distributional estimate) and interpretability (with the optimal transport map). Among their various applications (e.g. entailment detection; semantic similarity), they proposed representing sentences as probability distributions to better capture the inherent uncertainty and polysemy, arguing that histograms (or probability distributions) over embeddings capture more of this information than point-wise embeddings alone. They discuss hypernymy detection in Section 7; for this purpose, they relied on a recently proposed model which explicitly modeled what information is known about a word, by interpreting each entry of the embedding as the degree to which a certain feature is present.

arxiv-1808.09663a.png

[Image source. Click image to open in new window.]


This image is particularly illustrative (click, and click again, to enlarge):

arxiv-1808.09663b.png

[Image source. Click image to open in new window.]


  • “While existing methods represent each entity of interest (e.g., a word) as a single point in space (e.g., its embedding vector), we here propose a fundamentally different approach. We represent each entity based on the histogram of contexts (co-occurring with it), with the contexts themselves being points in a suitable metric space. This allows us to cast the distance between histograms associated with the entities as an instance of the optimal transport problem [see Section 3 for a background on optimal transport]. For example, in the case of words as entities, the resulting framework then intuitively seeks to minimize the cost of moving the set of contexts of a given word to the contexts of another [note their Fig. 1]. Note that the contexts here can be words, phrases, sentences, or general entities co-occurring with our objects to be represented, and these objects further could be any type of events extracted from sequence data …”

  • Regarding semantic embedding, or word sense disambiguation (not explicitly discussed in the paper), their Fig. 2 [illustration of three words, each with their distributional estimates (left), the point estimates of the relevant contexts (middle), and the joint representation (right)] is very interesting: words in vector space, along with a histogram of their probability distributions over those embedded spaces.

  • “Software Release. We plan to make all our code (for all these parts) and our pre-computed histograms (for the mentioned datasets) publicly available on GitHub soon.”  [Not available: 2018-10-07]
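
To make the optimal transport formulation concrete, here is a minimal sketch of mine (not the authors’ unreleased implementation) of the distance between two words, each represented as a histogram over its context embeddings; it assumes the POT (Python Optimal Transport) library and random stand-in data:

    import numpy as np
    import ot                                  # POT: Python Optimal Transport
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    ctx_a = rng.normal(size=(4, 3))            # context embeddings for word A
    ctx_b = rng.normal(size=(5, 3))            # context embeddings for word B
    hist_a = np.full(4, 1 / 4)                 # histogram (weights) over A's contexts
    hist_b = np.full(5, 1 / 5)                 # histogram over B's contexts

    M = cdist(ctx_a, ctx_b)                    # ground cost between contexts
    print(ot.emd2(hist_a, hist_b, M))          # optimal transport distance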

Early in 2018 pretrained language models such as ELMo (Deep Contextualized Word Representations;  discussed elsewhere in this REVIEW) offered another approach to solve the polysemy problem.

[Table of Contents]

Word Sense Disambiguation

Related to polysemy and named entity disambiguation is word sense disambiguation (WSD). Learning Graph Embeddings from WordNet-based Similarity Measures (Aug 2018) (discussed elsewhere in this REVIEW) described a new approach, path2vec, for learning graph embeddings that relied on structural measures of node similarities for generation of training data. Evaluations of the proposed model on semantic similarity and WSD tasks showed that path2vec yielded state of the art results.

In January 2018 Ruslan Salakhutdinov and colleagues proposed a probabilistic graphical model (discussed elsewhere in this REVIEW) that leveraged a topic model to design a WSD system (WSD-TM) that scaled linearly with the number of words in the context (Knowledge-based Word Sense Disambiguation using Topic Models). Their logistic normal topic model – a variant of latent Dirichlet allocation in which the topic proportions for a document were replaced by WordNet synsets (sets of synonyms) – incorporated semantic information about synsets as its priors. WSD-TM outperformed state of the art knowledge-based WSD systems.

[Table of Contents]

Probing the Role of Attention in Word Sense Disambiguation

Recent work has shown that the encoder-decoder attention mechanisms in neural machine translation (NMT) are different from the word alignment in statistical machine translation. An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in Neural Machine Translation (Oct 2018) focused on analyzing encoder-decoder attention mechanisms, in the case of word sense disambiguation (WSD) in NMT models. They hypothesized that attention mechanisms pay more attention to context tokens when translating ambiguous words, and explored the attention distribution patterns when translating ambiguous nouns. Counterintuitively, they found that attention mechanisms were likely to distribute more attention to the ambiguous noun itself rather than context tokens, in comparison to other nouns. They concluded that the attention mechanism was not the main mechanism used by NMT models to incorporate contextual information for WSD. The experimental results suggested that NMT models learned to encode the contextual information necessary for WSD in the encoder hidden states. For the attention mechanism in Transformer models, they revealed that the first few layers gradually learn to “align” source and target tokens, and the last few layers learn to extract features from the related but unaligned context tokens.

arxiv1810.07595-f1.png

[Image source. Click image to open in new window.]


arxiv1810.07595-f2.png

[Image source. Click image to open in new window.]


arxiv1810.07595-t3.png

[Image source. Click image to open in new window.]


[Table of Contents]

Applications of Embeddings in the Biological Sciences

While predicting protein 3D structure from primary amino acid sequences has been a long-standing objective in bioinformatics, definitive solutions remain to be found [discussion]. The most reliable approaches currently available involve homology modeling, which allows assigning a known protein structure to an unknown protein, provided that there is detectable sequence similarity between the two. When homology modeling is not viable, de novo techniques based on physics-based potentials or knowledge-based potentials are needed. Unfortunately, proteins are very large molecules, and the huge number of possible conformations, even for relatively small proteins, makes it prohibitive to fold them even on customized computer hardware.

To address this challenge, knowledge-based potentials can be learned with statistical or machine learning methods to infer useful information from known examples of protein structures. This information can be used to constrain the problem, greatly reducing the number of samples that need to be evaluated compared with dealing exclusively with physics-based potentials. A multiple sequence alignment (MSA) consists of aligned sequences homologous to the target protein, compressed into position-specific scoring matrices (PSSM, also called sequence profiles) using the fraction of occurrences of different amino acids in the alignment at each position in the sequence. More recently, contact map prediction methods have been at the center of renewed interest; however, their impressive performance is correlated with the number of sequences in the MSA, and is not as reliable when few sequences are related to the target.
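
As a concrete illustration of the sequence-profile idea, here is a minimal sketch that compresses a toy MSA into per-position amino acid frequencies (the basis of a PSSM); real profiles additionally handle gaps, pseudocounts, and log-odds scoring:

    from collections import Counter

    msa = ["MKTA",      # toy aligned homologs of the target sequence;
           "MKSA",      # "-" would denote an alignment gap
           "MRTA"]

    profile = []
    for column in zip(*msa):                   # iterate over alignment columns
        counts = Counter(column)
        profile.append({aa: round(n / len(msa), 2)
                        for aa, n in counts.items()})

    print(profile)
    # -> [{'M': 1.0}, {'K': 0.67, 'R': 0.33}, {'T': 0.67, 'S': 0.33}, {'A': 1.0}]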

rawMSA: proper Deep Learning makes protein sequence profiles and feature extraction obsolete introduced a new approach, called rawMSA, for the de novo prediction of structural properties of proteins. The core idea behind rawMSA was to borrow the word2vec word embedding technique of Mikolov et al. (Efficient Estimation of Word Representations in Vector Space), using it to convert each character (amino acid residue) in the MSA into a floating point vector of variable size, thereby representing the residues by the structural property they were trying to predict. Test results from deep neural networks based on this concept showed that rawMSA matched or outperformed the state of the art on three tasks: predicting secondary structure, relative solvent accessibility, and residue-residue contact maps.

[Table of Contents]

Probing the Effectiveness of Word Embeddings

A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why – and when – linear operators correspond to non-linear embedding models such as Skip-Gram with Negative Sampling (SGNS). Towards Understanding Linear Word Analogies (Oct 2018) provided a rigorous explanation of this phenomenon without making the strong assumptions that past work has made about the vector space and word distribution. “Past work has often conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel theoretical justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, providing rigorous justification for its use in capturing word dissimilarity.”

  • “In this paper, we provided a rigorous explanation of why – and when – word analogies can be solved using vector algebra. More specifically, we proved that an analogy holds over a set of word pairs in an SGNS or GloVe embedding space with no reconstruction error iff the co-occurrence shifted PMI is the same for every word pair. Our theory had three implications. … [See comments below.] … Most importantly, our theory did not make the unrealistic assumptions that past theories have made about the word distribution and vector space, making it much more tenable than previous explanations.”

  • [discussion: Hacker News]: “Hi, first author here! Feel free to ask any questions. TL;DR: We prove that linear word analogies hold over a set of ordered pairs (e.g., $\small \text{(Paris, France), (Ottawa, Canada), …}$) in an SGNS or GloVe embedding space with no reconstruction error when $\small \text{PMI}(x,y) + \log p(x,y)$ is the same for every word pair $\small (x,y)$. We call this term the csPMI (co-occurrence shifted PMI). This has a number of interesting implications:

    1. It implies that Pennington et al. [Socher; Manning] (authors of GloVe) had the right intuition about why these analogies hold.
    2. Adding two word vectors together to compose them makes sense, because you’re implicitly downweighting the more frequent word – like TF-IDF or SIF would do explicitly.
    3. Using Euclidean distance to measure word dissimilarity makes sense because the Euclidean distance is a linear function of the negative csPMI.” A toy demonstration of the underlying analogy arithmetic follows below.
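A minimal sketch of analogy solving by vector arithmetic, assuming pretrained SGNS/GloVe vectors in a matrix `vecs` (one row per word) with a word-to-row map `vocab` – both hypothetical names here:

```python
import numpy as np

def analogy(a, b, c, vocab, vecs):
    """Solve a : b :: c : ? via b - a + c, returning the nearest
    neighbor by cosine similarity (the input words are excluded)."""
    target = vecs[vocab[b]] - vecs[vocab[a]] + vecs[vocab[c]]
    target = target / np.linalg.norm(target)
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = normed @ target
    words = list(vocab)
    for i in np.argsort(-sims):
        if words[i] not in (a, b, c):
            return words[i]

# vocab = {"paris": 0, "france": 1, "ottawa": 2, "canada": 3, ...}
# vecs  = np.array(...)  # pretrained embedding matrix, one row per word
# analogy("paris", "france", "ottawa", vocab, vecs)  # -> "canada", ideally
```

Per the paper, this recipe succeeds exactly when the csPMI is (approximately) the same for every pair in the analogy set.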

[Table of Contents]

Memory Based Architectures

For a more detailed description of how neural networks “learn,” see my blog post How do Neural Networks "Remember"? In essence, the answer is that memory forms during the training of the parameters: the matrices of trained weights are the memory.



Memory (the ability to recall previous facts and knowledge) is a crucial requirement for natural language understanding, for reasoning (the process of forming an answer to a new question by manipulating previously acquired knowledge), and for guiding decision making. Without memory, agents must act reflexively according only to their immediate percepts, and cannot execute plans that span extended time intervals (Neural Map: Structured Memory for Deep Reinforcement Learning).

Broadly speaking, computational approaches to memory include:

  • internal, volatile “short-term” memories algorithmically generated within RNN, LSTM, and self-attention modules;

  • external, volatile memories algorithmically generated by neural Turing machines, memory networks, and differential neural computers; and

  • external, permanent long-term “memories” embedded within knowledge bases and knowledge graphs (for relevant discussion, see my Text Grounding: Mapping Text to Knowledge Graphs; External Knowledge Lookup subsection).

Neural Architectures with Memory  [local copy] provides an excellent overview of neural memory architectures.

Short-term memory architectures are commonly employed in the various models discussed in this REVIEW: RNN, LSTM, dynamic memory networks (DMN), etc. serve as “working memory” in summarization, question answering and other tasks. Long short-term memory (LSTM) networks are a specialized type of recurrent neural network (RNN) capable of learning long-term dependencies as well as short-term memories of recent inputs.

However, most machine learning models lack an easy way to read and write to part of a (potentially very large) long-term memory component, and to combine this seamlessly with inference. While RNN can be trained to predict the next word to output after reading a stream of words, their memory (encoded by hidden states and weights) is typically too small and is not compartmentalized enough to accurately remember facts from the past: the knowledge is compressed into dense vectors, from which those memories are not easily retrieved. RNN are also known to have difficulty with memorization, for example the simple copying task of outputting the same input sequence they have just read.

Neural networks that utilize external memories can be classified into two main categories: memories with write operators, and those without (Neural Map: Structured Memory for Deep Reinforcement Learning). Regarding the latter type, memory networks (MemNN, introduced by Jason Weston et al. at Facebook AI Research and discussed below) are a class of deep networks that jointly learn how to reason with inference components combined with a long-term memory component that can be written to and read from, with the goal of using it for prediction. Instead of using a recurrent matrix to retain information through time, memory networks learn how to operate effectively with the memory component.

Memory networks employ explicit addressable memory, that fixes which memories are stored. For example, at each time step, the memory network would store the past $\small M$ states that have been seen in an environment. Therefore, what is learned by the network is how to access or read from this fixed memory pool, rather than what contents to store within it. In sidestepping the difficulty of learning what information to store in memory, memory networks introduce two main disadvantages: storing a potentially significant amount of redundant information; and, relying on domain experts to choose what to store in the memory (Neural Map: Structured Memory for Deep Reinforcement Learning). The memory network approach has been successful in language modeling and question answering, and was shown to be a successful memory for deep reinforcement learning agents in complex 3D environments (Neural Map: Structured Memory for Deep Reinforcement Learning and references therein).

  • Tracking the World State with Recurrent Entity Networks (May 2017) [OpenReview; non-author code here and here], by Jason Weston and Yann LeCun, introduced the Recurrent Entity Network (EntNet). EntNet was equipped with a dynamic long-term memory, which allowed it to maintain and update a representation of the state of the world as it received new data. For language understanding tasks, it could reason on the fly as it read text, not just when it was required to answer a question or respond, as was the case for the MemN2N memory network (Weston’s End-To-End Memory Networks, discussed elsewhere in this REVIEW). Like a neural Turing machine or differentiable neural computer, EntNet maintained a fixed size memory and could learn to perform location and content-based read and write operations. However, unlike those models, it had a simple parallel architecture in which several memory locations could be updated simultaneously. EntNet set a new state of the art on the bAbI tasks, and was the first method to solve all the tasks in the 10k training examples setting. Weston and LeCun also demonstrated that EntNet could solve a reasoning task which required a large number of supporting facts, which other methods were not able to solve, and could generalize past its training horizon.

In contrast to memory networks, external neural memories with write operations are potentially far more efficient, since they can learn to store salient information for unbounded time steps and ignore other, useless information, without needing any a priori knowledge of what to store. A prominent research direction on write-based architectures has been recurrent architectures that mimic computer memory systems by explicitly separating memory from computation, analogous to how a CPU (processor/controller) interacts with external memory (tape; RAM) in digital computers. One such model, the Differentiable Neural Computer (DNC) – and its predecessor, the Neural Turing Machine (NTM) – structures the architecture to explicitly separate memory from computation. The DNC has a recurrent neural controller that can access an external memory resource by executing differentiable read and write operations. This allows the DNC to act and memorize in a structured manner resembling a computer processor, where read and write operations are sequential and data is stored distinctly from computation. The DNC has been used successfully to solve complicated algorithmic tasks, such as finding shortest paths in a graph or querying a database for entity relations.
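A single content-based read – the differentiable addressing primitive shared by the NTM and DNC – can be sketched in a few lines. The key strength $\small \beta$, the dimensions, and the random contents are illustrative; the full models add location-based addressing, erase/add write vectors and, in the DNC, temporal linkage and usage tracking:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(memory, key, beta):
    """Cosine similarity between a controller-emitted key and each memory
    row, sharpened by beta and normalized into differentiable read weights."""
    mem_norm = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    key_norm = key / np.linalg.norm(key)
    weights = softmax(beta * (mem_norm @ key_norm))
    return weights @ memory                    # weighted sum over memory rows

memory = np.random.randn(128, 20)              # 128 slots, 20 dimensions each
key = np.random.randn(20)                      # emitted by the controller
read_vector = content_read(memory, key, beta=5.0)
```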

NTM-DNC.png

There has been extensive work in the NLP domain regarding the use of neural Turing machines (NTM) and, to a lesser extent, differentiable neural computers (DNC). For a slightly dated (current to ~2017) summary listing of NTM and DNC papers, see my web page (a huge file: on slow connections, wait for the page to fully load). Notable among those papers are the following items.

  • Survey of Reasoning using Neural Networks (Mar 2017) provided an excellent summary (including relevant background) of neural network approaches to reasoning and inference, with a focus on the need for memory networks (e.g. the MemN2N end-to-end memory network, discussed elsewhere in this REVIEW) and large external memories. Among the algorithms surveyed and compared were an LSTM, an NTM with an LSTM controller, and an NTM with a feedforward controller (demonstrating the superior performance of the NTM over the LSTM).

  • Robust and Scalable Differentiable Neural Computer for Question Answering (Jul 2018) (discussed elsewhere in this REVIEW) was designed as a general problem solver applicable to a wide range of tasks. Their GitHub repository contains an implementation of an Advanced Differentiable Neural Computer (ADNC), providing more robust and scalable use in question answering.

    arxiv-1807.02658c.png

    [Image source. Click image to open in new window.]


LSTM were used in Augmenting End-to-End Dialog Systems with Commonsense Knowledge (Feb 2018), which investigated the impact of providing commonsense knowledge about concepts (integrated as external memory) on human-computer conversation. Their method was based on a NIPS 2015 workshop paper, Incorporating Unstructured Textual Knowledge Sources into Neural Dialogue Systems, which described a method to leverage additional information about a topic using a simple combination of hashing and TF-IDF to quickly identify the most relevant portions of text from the external knowledge source, based on the current context of the dialogue. In that work, three recurrent neural networks (RNNs) were trained: one to encode the selected external information, one to encode the context of the conversation, and one to encode a response to the context. Outputs of these modules were combined to produce the probability that the response was the actual next utterance given the context.

[Table of Contents]

Attention and Memory

Jason Weston et al. (Facebook AI Research) introduced Memory Networks (MemNN) in Oct 2014 (updated Nov 2015).

arxiv-1410.3916.png

[Image source. Click image to open in new window.]


Although that paper lacked a schematic, the memory network architecture is well described in the paper and in this image:

memory_network

["memory network (MemNN)" (image source; click image to open in new window)]


  • A memory network consists of a memory $\small \mathbf{m}$ (an array of objects (for example an array of vectors or an array of strings) indexed by $\small \mathbf{m}_i$) and four (potentially learned) components $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$ and $\small \mathbf{R}$ as follows:

    • $\small \mathbf{I}$ (input feature map): converts the incoming input to the internal feature representation.
    • $\small \mathbf{G}$ (generalization): updates old memories given the new input. We call this generalization as there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use.
    • $\small \mathbf{O}$ (output feature map): produces a new output (in the feature representation space), given the new input and the current memory state.
    • $\small \mathbf{R}$ (response): converts the output into the response format desired. For example, a textual response or an action.
  • $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$ and $\small \mathbf{R}$ can all potentially be learned components and make use of any ideas from the existing machine learning literature. In question answering systems, for example, the components may be instantiated as follows:

    • $\small \mathbf{I}$ can make use of standard pre-processing such as parsing, coreference, and entity resolution. It could also encode the input into an internal feature representation by converting from text to a sparse or dense feature vector.
    • The simplest form of $\small \mathbf{G}$ is to store $\small \mathbf{I}(\mathbf{x})$ in a “slot” in the memory: $\small \mathbf{m}_{\mathbf{H}(\mathbf{x})} = \mathbf{I}(\mathbf{x})$, where $\small \mathbf{H}(\cdot)$ is a function selecting the slot. That is, $\small \mathbf{G}$ updates the index $\small \mathbf{H}(\mathbf{x})$ of $\small \mathbf{m}$, but all other parts of the memory remain untouched.
      Restated yet again: the simplest form of $\small \mathbf{G}$ is to introduce a function $\small \mathbf{H}$ which maps the internal feature representation produced by $\small \mathbf{I}$ to an individual memory slot, and just updates the memory at $\small \mathbf{H(I(x))}$.
    • $\small \mathbf{O}$ reads from memory and performs inference to deduce the set of relevant memories needed to perform a good response.
    • $\small \mathbf{R}$ would produce the actual wording of the question-answer based on the memories found by $\small \mathbf{O}$. For example, $\small \mathbf{R}$ could be an RNN conditioned on the output of $\small \mathbf{O}$.
  • Note that the original memory network (MemNN, above) lacked an attention mechanism.

  • When the components $\small \mathbf{I}$, $\small \mathbf{G}$, $\small \mathbf{O}$ and $\small \mathbf{R}$ (above) were neural networks, the authors (Weston et al.) described the resulting system as a memory neural network (MemNN), which they built for QA (question answering) problems. A minimal structural sketch of the four components follows below.
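A skeleton of that $\small \mathbf{I}/\mathbf{G}/\mathbf{O}/\mathbf{R}$ decomposition, using the simplest slot-writing form of $\small \mathbf{G}$ described above. The component callables are placeholders for what would, in a MemNN, be learned neural modules:

```python
class MemNNSkeleton:
    """Structural sketch of Weston et al.'s four components; not the
    trained model. H is the (possibly trivial) slot-selection function."""
    def __init__(self, num_slots, encode, H, score, respond):
        self.memory = [None] * num_slots
        self.I, self.H = encode, H             # input feature map; slot selector
        self.score, self.R = score, respond    # relevance scorer; response module

    def G(self, x):                            # simplest G: write I(x) into slot H(I(x))
        feat = self.I(x)
        self.memory[self.H(feat)] = feat

    def O(self, q_feat, k=1):                  # retrieve the k most relevant memories
        slots = [m for m in self.memory if m is not None]
        return sorted(slots, key=lambda m: -self.score(q_feat, m))[:k]

    def answer(self, question):
        q = self.I(question)
        return self.R(q, self.O(q))
```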

The highly cited MemN2N architecture (End-To-End Memory Networks (Nov 2015) [code;  non-author code here, here and here;  discussion here and here]), introduced by Jason Weston and colleagues at Facebook AI Research, is a recurrent attention model over an external memory. The model involved multiple computational steps (termed “hops”) per output symbol. In this RNN architecture, the recurrence read from a possibly large external memory multiple times before outputting a symbol. The architecture was trained end-to-end and hence required significantly less supervision during training; the flexibility of the model allowed them to apply it to tasks as diverse as synthetic question answering and language modeling.

arxiv-1503.08895d.png

[Image source. Click image to open in new window.]


For question answering MemN2N was competitive with memory networks but with less supervision; for language modeling, MemN2N demonstrated performance comparable to RNN and LSTM on the Penn Treebank and Text8 datasets. In both cases they showed that the key concept of multiple computational hops yielded improved results. Unlike a traditional RNN, the average activation weight of memory positions during the memory hops did not decay exponentially: it had roughly the same average activation across the entire memory (Fig. 3 in the image, above), which may have been the source of the observed improvement in language modeling.

“We also vary the number of hops and memory size of our MemN2N, showing the contribution of both to performance; note in particular that increasing the number of hops helps. In Fig. 3, we show how MemN2N operates on memory with multiple hops. It shows the average weight of the activation of each memory position over the test set. We can see that some hops concentrate only on recent words, while other hops have more broad attention over all memory locations, which is consistent with the idea that successful language models consist of a smoothed n-gram model and a cache. Interestingly, it seems that those two types of hops tend to alternate. Also note that unlike a traditional RNN, the cache does not decay exponentially: it has roughly the same average activation across the entire memory. This may be the source of the observed improvement in language modeling.”

MemN2N.png

[Image source. Click image to open in new window)]


Here is the MemN2N architecture, from the End-To-End Memory Networks paper:

MemN2N-arxiv-1503.08895.png

[Image source. Click image to open in new window]


  • “Our model takes a discrete set of inputs $\small x_1, \ldots, x_n$ that are to be stored in the memory, a query $\small q$, and outputs an answer $\small a$. Each of the $\small x_i$, $\small q$, and $\small a$ contains symbols coming from a dictionary with $\small V$ words. The model writes all $\small x$ to the memory up to a fixed buffer size, and then finds a continuous representation for the $\small x$ and $\small q$. The continuous representation is then processed via multiple hops to output $\small a$. This allows backpropagation of the error signal through multiple memory accesses back to the input during training.”

  • “… The entire set of $\small \{x_i\}$ are converted into memory vectors $\small \{m_i\}$ of dimension $\small d$ computed by embedding each $\small x_i$ in a continuous space, in the simplest case, using an embedding matrix $\small A$ (of size $\small d \times V$). …”  ←  i.e., the vectorized input is stored as external memory; a single “hop” of this read mechanism is sketched below.
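A toy single hop, following the quoted notation for the embedding matrices $\small A$, $\small B$ and $\small C$ (the bag-of-words inputs, dimensions, and random initialization are illustrative; the real model learns these matrices end-to-end and stacks several hops):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 50, 16, 6                            # vocab size, embedding dim, # inputs

A = rng.normal(size=(d, V))                    # input (memory) embedding
C = rng.normal(size=(d, V))                    # output embedding
B = rng.normal(size=(d, V))                    # query embedding

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x_bow = rng.integers(0, 2, size=(n, V)).astype(float)  # n bag-of-words sentences
q_bow = rng.integers(0, 2, size=V).astype(float)       # the query

m = x_bow @ A.T                                # memory vectors  m_i = A x_i
c = x_bow @ C.T                                # output vectors  c_i = C x_i
u = B @ q_bow                                  # internal query state

p = softmax(m @ u)                             # attention over memories
o = p @ c                                      # response vector: one "hop"
u_next = u + o                                 # input to the next hop (or to the
                                               # final softmax that predicts the answer)
```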

A recent paper from DeepMind, Relational Recurrent Neural Networks (Jun 2018) [code; discussion here and here], is also of interest with regard to language modeling and reasoning over natural language text. While memory-based neural networks model temporal data by leveraging an ability to remember information for long periods, it is unclear whether they also have an ability to perform complex relational reasoning with the information they remember. In this paper the authors first confirmed their intuition that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected (i.e., tasks involving relational reasoning). They then addressed these deficits with a new memory module, the Relational Memory Core (RMC; Fig. 1 in that paper), which showed large gains in reinforcement learning domains as well as in language modeling (their Sections 4.3 and 5.4).

arxiv-1806.01822a.png

[Image source. Click image to open in new window.]


arxiv-1806.01822d.png

[Image source. Click image to open in new window.]


  • Critique. While the DeepMind RMC model combined features of dynamic memory with an attention mechanism similar to Jason Weston’s DMN+ model (cited), they neither discuss nor compare the two models. Disappointingly, the DeepMind paper lacks ablation studies or other work needed to better understand their model: “… we cannot necessarily make any concrete claims as to the causal influence of our design choices on the model’s capacity for relational reasoning, or as to the computations taking place within the model and how they may map to traditional approaches for thinking about relational reasoning. Thus, we consider our results primarily as evidence of improved function – if a model can better solve tasks that require relational reasoning, then it must have an increased capacity for relational reasoning, even if we do not precisely know why it may have this increased capacity. ”

  • As shown in the first image above, the RMC module employs multi-head dot product attention (MHDPA) – Google’s Transformer seq2seq self-attention mechanism.

An aside regarding this DeepMind Relational Recurrent Neural Networks paper: another DeepMind paper, Relational Deep Reinforcement Learning  [discussion] (released at the same time) introduced an approach to deep reinforcement learning that improved upon the efficiency, generalization capacity, and interpretability of conventional approaches through structured perception and relational reasoning. It used the computationally efficient MHDPA self-attention model to iteratively reason about the relations between entities in a scene and to guide a model-free policy. In these models entity-entity relations are explicitly computed when considering the messages passed between connected nodes of the graph (i.e. the relations between entities in a scene). MHDPA computes interactions between those entities (attention weights); an (underlying) graph defines the path to a solution, with the attention weights driving the solution. [Very cool.]

arxiv-1806.01830d.png

[Image source. Click image to open in new window.]


  • This takes a minute to explain, but it’s a very neat game/task.

“Box-World” is a perceptually simple but combinatorially complex environment that requires abstract relational reasoning and planning. It consists of a 12 x 12 pixel room with keys and boxes randomly scattered. The room also contains an agent, represented by a single dark gray pixel, which can move in four directions: up, down, left, right. Keys are represented by a single colored pixel. The agent can pick up a loose key (i.e., one not adjacent to any other colored pixel) by walking over it. Boxes are represented by two adjacent colored pixels – the pixel on the right represents the box’s lock and its color indicates which key can be used to open that lock; the pixel on the left indicates the content of the box which is inaccessible while the box is locked.

To collect the content of a box the agent must first collect the key that opens the box (the one that matches the lock’s color) and walk over the lock, which makes the lock disappear. At this point the content of the box becomes accessible and can be picked up by the agent. Most boxes contain keys that, if made accessible, can be used to open other boxes. One of the boxes contains a gem, represented by a single white pixel. The goal of the agent is to collect the gem by unlocking the box that contains it and picking it up by walking over it. Keys that an agent has in possession are depicted in the input observation as a pixel in the top-left corner.

arxiv-1806.01830a.png

[Image source. Click image to open in new window.]


In each level there is a unique sequence of boxes that need to be opened in order to reach the gem. Opening one wrong box (a distractor box) leads to a dead-end where the gem cannot be reached and the level becomes unsolvable. There are three user-controlled parameters that contribute to the difficulty of the level: (1) the number of boxes in the path to the goal (solution length); (2) the number of distractor branches; (3) the length of the distractor branches. In general, the task is computationally difficult for a few reasons. First, a key can only be used once, so the agent must be able to reason about whether a particular box is along a distractor branch or along the solution path. Second, keys and boxes appear in random locations in the room, emphasising a capacity to reason about keys and boxes based on their abstract relations, rather than based on their spatial positions.

Figure 4 shows a trial run along with the visualization of the attention weights. For one of the attention heads, each key attends mostly to the locks that can be unlocked with that key. In other words, the attention weights reflect the options available to the agent once a key is collected. For another attention head, each key attends mostly to the agent icon. This suggests that it is relevant to relate each object with the agent, which may, for example, provide a measure of relative position and thus influence the agent’s navigation.

arxiv-1806.01830c.png

[Image source. Click image to open in new window.]


An Interpretable Reasoning Network for Multi-Relation Question Answering (Jun 2018) [code] is another very interesting paper, which addressed multi-relation question answering via elaborated analysis of questions and reasoning over multiple fact triples in a knowledge base. It presented a novel Interpretable Reasoning Network (IRN) model that employed an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decided which part of an input question should be analyzed at each hop, and the reasoning module predicted a knowledge base relation (relation triple) corresponding to the current parsed result. The predicted relation was used to update the question representation as well as the state of the reasoning module, helping the model perform the next hop of reasoning. At each hop, an entity was predicted based on the current state of the reasoning module.

arxiv-1801.04726a.png

[Image source. Click image to open in new window.]


arxiv-1801.04726b.png

[Image source. Click image to open in new window.]


arxiv-1801.04726c.png

[Image source. Click image to open in new window.]


  • IRN yielded state of the art results on two datasets. More interestingly, and unlike previous models, IRN offered traceable and observable intermediate predictions (see their Fig. 3), facilitating reasoning analysis and failure diagnosis (thereby also allowing manual intervention in answer prediction). Whereas single-relation questions such as “How old is Obama?” can be answered by finding one fact triple in a knowledge base/graph (a task that has been widely studied), this work addressed multi-relation QA: answering questions such as “Name a soccer player who plays at forward position at the club Borussia Dortmund.”, in which more than one entity and relation are mentioned, requires reasoning over multiple fact triples.

    On the datasets evaluated, IRN outperformed other baseline models such as Weston’s MemN2N model (see Table 2 in the IRN paper). Through vector (space) representations, IRN could also establish reasonable mappings between knowledge base relations and natural language, such as linking “profession” to words like “working”, “profession”, and “occupation” (see their Table 4), which addresses the issue of out-of-vocabulary (OOV) words.

Working memory is an essential component of reasoning – the process of forming an answer to a new question by manipulating previously acquired knowledge. Memory modules are often implemented as a set of memory slots without explicit relational exchange of content, which does not naturally match multi-relational domains in which data is structured. Relational Dynamic Memory Networks (Aug 2018) designed a new model, the Relational Dynamic Memory Network (RDMN), to fill this gap. The memory can have single or multiple components, each of which realizes a multi-relational graph of memory slots. The memory, dynamically updated during the reasoning process, is controlled by a central controller. The architecture is shown in their Fig. 1 (RDMN with a single-component memory): at the first step the controller reads the query, and the memory is initialized from the input graph, one node embedding per memory cell; during the reasoning process the controller iteratively reads from and writes to the memory; finally, the controller emits the output (a toy sketch of this read/write loop appears after the excerpt below). RDMN performed well on several domains, including molecular bioactivity and chemical reactions.

  • Their Discussion provides an excellent summary (paraphrased here) that is relevant to this REVIEW:

    “The problem studied in this paper belongs to a broader program known as machine reasoning: unlike the classical focus on symbolic reasoning, here we aim for a learnable neural reasoning capability. We wish to emphasize that RDMN is a general model for answering any query about graph data. While the evaluation in this paper is limited to function calls graph, molecular bioactivity and chemical reaction, RDMN has a wide range of potential applications. For example, a drug (query) may act on the network of proteins as a whole (relational memory). In recommender systems, user can be modeled as a multi-relational graph (e.g., network between purchased items, and network of personal contacts); and query can be anything about them (e.g., preferred attributes or products). Similarly in healthcare, patient medical record can be modeled as multi-relational graphs about diseases, treatments, familial and social contexts; and query can be anything about the presence and the future of health conditions and treatments.”
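A minimal, illustrative read/write loop in the spirit of that controller-memory interaction (the plain softmax read, the tanh updates, and all dimensions are my own simplifications, not the paper's equations):

```python
import numpy as np

def rdmn_sketch(query_vec, node_embeddings, adjacency, steps=3):
    """Toy RDMN-style loop: memory holds one cell per graph node; the
    controller reads the query once, then alternates attention-weighted
    reads with relational writes that mix memory cells along graph edges."""
    memory = node_embeddings.copy()            # init: one node embedding per cell
    state = query_vec.copy()                   # controller reads the query first
    for _ in range(steps):
        scores = memory @ state
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        read = attn @ memory                   # read from memory
        state = np.tanh(state + read)          # controller update
        memory = np.tanh(memory + adjacency @ memory)  # relational memory update
    return state                               # emitted as the output representation

out = rdmn_sketch(np.random.randn(16), np.random.randn(5, 16), np.eye(5))
```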

arxiv-1808.04247a.png

[Image source. Click image to open in new window.]


arxiv-1808.04247b.png

[Image source. Click image to open in new window.]


Collectively, the works discussed above suggest that:

Memory-augmented neural networks such as MemN2N solve a compartmentalization problem with a slot-based memory matrix but may have a harder time allowing memories to interact/relate with one another once they are encoded, whereas LSTM pack all information into a common hidden memory vector, potentially making compartmentalization and relational reasoning more difficult (Relational Recurrent Neural Networks).

Denny Britz provided an excellent discussion of attention vs. memory in Attention and Memory in Deep Learning and NLP (Jan 2016). Also, Attention in Long Short-Term Memory Recurrent Neural Networks (Jun 2017) discussed a limitation of LSTM-based encoder-decoder architectures (i.e., fixed-length internal representations of the input sequence – note, e.g., ELMo) that attention mechanisms overcome: allowing the network to learn where to pay attention in the input sequence for each item in the output sequence.

Particularly relevant to this REVIEW are the examples of attention in textual entailment (drawn from the DeepMind paper Reasoning about Entailment with Neural Attention (Mar 2016) [non-author code here and here]) and text summarization (drawn from Jason Weston’s A Neural Attention Model for Abstractive Sentence Summarization) – the benefits of which are immediately obvious upon reviewing that work.

arxiv-1509.06664a.png

[Image source. Click image to open in new window.]


arxiv-1509.06664b.png

[Image source. Click image to open in new window.]


Also relevant to this discussion: in Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum (May 2018), researchers at the Paul G. Allen School (University of Washington) discussed LSTM vs. self-attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTM: LSTM are a hybrid of S-RNN (simple RNN) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Thus, the LSTM gates themselves are powerful recurrent models that provide more representational power than previously realized. They noted that:

  1. The LSTM weights are vectors, while attention typically computes scalar weights; i.e., a separate weighted sum is computed for every dimension of the LSTM’s memory cell;

  2. The weighted sum is accumulated with a dynamic program. This enables linear rather than quadratic complexity in comparison to self-attention, but reduces the amount of parallel computation. The accumulation also creates an inductive bias of attending to nearby words, since the weights can only decrease over time (see the worked expansion below).

  3. Attention has a probabilistic interpretation due to the softmax normalization, while the sum of weights in LSTM can grow up to the sequence length. In variants of the LSTM that tie the input and forget gate, such as coupled-gate LSTM and GRU, the memory cell instead computes a weighted average with a probabilistic interpretation. These variants compute locally normalized distributions via a product of sigmoids rather than globally normalized distributions via a single softmax.

They concluded:

“Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicate that LSTMs suffer little to no performance loss when removing the S-RNN. This provides evidence that the gating mechanism is doing the heavy lifting in modeling context. We further ablate the recurrence in each gate and find that this incurs only a modest drop in performance, indicating that the real modeling power of LSTMs stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allows us to mathematically relate LSTMs and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enables the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.”
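To make point 2 above concrete, the memory-cell recurrence can be unrolled (standard LSTM notation: $\small i_t$ and $\small f_t$ are the input and forget gates, $\small \tilde{c}_t$ the content layer, $\small \odot$ element-wise multiplication):

$\small c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t = \sum_{j=1}^{t} \Big( i_j \odot \prod_{k=j+1}^{t} f_k \Big) \odot \tilde{c}_j$

Each cell state is thus an element-wise weighted sum of all previous content-layer outputs, with vector-valued weights accumulated by the dynamic program; and since every factor $\small f_k$ lies in $\small (0,1)$ element-wise, the effective weight on $\small \tilde{c}_j$ can only shrink as $\small t$ grows – exactly the inductive bias toward nearby words noted above.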




In the recent language modeling domain, whereas ELMo employs stacked Bi-LSTM and ULMFiT employs stacked LSTM (with no attention, shortcut connections or other sophisticated additions), OpenAI’s Finetuned Transformer LM builds on Google’s Transformer: a simple network architecture based solely on attention mechanisms that entirely dispenses with recurrence and convolutions, replacing the RNN with multi-head attention consisting of multiple attention layers. The Transformer surpassed the state of the art on neural machine translation tasks, and the Finetuned Transformer LM generalized well to a range of language understanding tasks, attaining state of the art results.
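For reference, the core of that architecture is scaled dot-product attention, $\small \text{Attention}(Q,K,V) = \text{softmax}(QK^T/\sqrt{d_k})V$ (Vaswani et al., 2017), with multi-head attention running $\small h$ such heads over learned projections and concatenating the results. A minimal numpy sketch with toy random inputs (the dimensions and the absence of masking and the output projection are simplifications):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, row-wise over the query positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
seq_len, d_model, h = 5, 32, 4                 # toy sizes
x = rng.normal(size=(seq_len, d_model))

heads = []
for _ in range(h):                             # one learned projection triple per head
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model // h)) for _ in range(3))
    heads.append(scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv))

out = np.concatenate(heads, axis=-1)           # (5, 32); a final linear layer follows
```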

In July 2018, nearly a year after introducing the original “Attention Is All You Need” Transformer architecture (Jun 2017; updated Dec 2017), Google Brain/DeepMind released an updated Universal Transformer version, discussed in the Google AI blog post Moving Beyond Translation with the Universal Transformer [Aug 2018;  discussion]:

arxiv-1807.03819b.png

[Image source. Click image to open in new window.]


arxiv-1807.03819a.png

[Image source (there is a more detailed schematic in Appendix A in that paper). Click image to open in new window.]


  • “In Universal Transformer [code, described in Tensor2Tensor for Neural Machine Translation] we extend the standard Transformer to be computationally universal (Turing complete) using a novel, efficient flavor of parallel-in-time recurrence which yields stronger results across a wider range of tasks. We built on the parallel structure of Transformer to retain its fast training speed, but we replaced Transformer’s fixed stack of different transformation functions with several applications of a single, parallel-in-time recurrent transformation function (i.e. the same learned transformation function is applied to all symbols in parallel over multiple processing steps, where the output of each step feeds into the next).

    “Crucially, where an RNN processes a sequence symbol-by-symbol (left to right), Universal Transformer processes all symbols at the same time (like the Transformer), but then refines its interpretation of every symbol in parallel over a variable number of recurrent processing steps using self-attention. This parallel-in-time recurrence mechanism is both faster than the serial recurrence used in RNN, and also makes the Universal Transformer more powerful than the standard feedforward Transformer. …”

The performance benchmarks for Universal Transformer on the bAbI dataset (especially the more difficult “10k examples” setting) are particularly impressive (Table 1 in their paper; note also the MemN2N comparison). Appendix C shows the bAbI attention visualizations, of which the last example is particularly impressive (requiring three supporting facts to solve).

In August 2018 Google AI followed their Universal Transformers paper with Character-Level Language Modeling with Deeper Self-Attention [discussion], which showed that a deep (64-layer) Transformer model with fixed context outperformed RNN variants by a large margin, achieving state of the art on two popular benchmarks.

arxiv-1808.04444a.png

[Image source. Click image to open in new window.]


arxiv-1808.04444b.png

[Image source. Click image to open in new window.]


  • LSTM and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts.

  • While code is not yet released (2018-08-16), it will likely appear in Google’s TensorFlow tensor2tensor GitHub repository, “home” of their Transformer code.

  • For reference, in Learning to Generate Reviews and Discovering Sentiment (Apr 2017) OpenAI also trained a byte (character)-level RNN-based language model (a single-layer multiplicative LSTM with 4096 units, trained for a single epoch on an Amazon product review dataset) for which, even with data parallelism across 4 Pascal Titan X GPU, training took approximately one month.

    • However, RNN/CNN handle input sequences sequentially word-by-word, which is an obstacle to parallelization. I am unsure how long it takes to train Google’s Transformer algorithm, which achieves parallelization by replacing recurrence with attention and encoding the symbol position in the sequence, leading to significantly shorter training times (The Transformer – Attention is All You Need).

      This GitHub Issue discusses parallelization over GPU and training times, indicating that results depend on the number of GPU and on batch size. The Annotated Transformer also discusses this: under their setup (8 NVIDIA P100 GPU, their parameterization, etc.) the base models trained for a total of 100,000 steps (12 hrs), while the big models trained for 300,000 steps (3.5 days).

Like Google, Facebook AI Research has also developed a seq2seq based self-attention mechanism to model long-range context (Hierarchical Neural Story Generation (May 2018) [code/pretrained models;  discussion]), demonstrated via story generation. They found that standard seq2seq models applied to hierarchical story generation were prone to degenerating into language models that paid little attention to the writing prompt (a problem noted in other domains, such as dialogue response generation).

  • They tackled the challenges of story-telling with a hierarchical model, which first generated a sentence called “the prompt” (describing the topic for the story), and then “conditioned” on this prompt when generating the story. Conditioning on the prompt or premise made it easier to generate consistent stories, because they provided grounding for the overall plot. It also reduced the tendency of standard sequence models to drift off topic.

  • To improve the relevance of the generated story to its prompt, they adopted the fusion mechanism from Cold Fusion: Training Seq2Seq Models Together with Language Models:

    The cold fusion mechanism of Sriram et al. (2017) pretrains a language model and subsequently trains a seq2seq model with a gating mechanism that learns to leverage the final hidden layer of the language model during seq2seq training [their language model contained three layers of gated recurrent units (GRUs)]. The model showed, for the first time, that fusion mechanisms could help seq2seq models build dependencies between their input and output.

  • To improve over the pretrained model, the second model had to focus on the link between the prompt and the story. Since existing convolutional architectures only encode a bounded amount of context, they introduced a novel gated self-attention mechanism that allowed the model to condition on its previous outputs at different time-scales (i.e., to model long-range context).

  • Similar to Google’s Transformer, Facebook AI Research used multi-head attention to allow each head to attend to information at different positions. However, the queries, keys and values in their model were not given by linear projections (see Section 3.2.2 in the Transformer paper), but by more expressive gated deep neural nets with gated linear unit activations: gating lent the self-attention mechanism crucial capacity to make fine-grained selections.

    arxiv-1805.04833a.png

    [Image source. Click image to open in new window.]


    arxiv-1805.04833b.png

    [Image source. Click image to open in new window.]


    arxiv-1805.04833c.png

    [Image source. Click image to open in new window.]


Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding (Aug 2018) proposed a new self-attention mechanism for sentence embedding, Dynamic Self-Attention (DSA). The authors designed DSA by modifying dynamic routing in capsule networks for use in NLP. DSA attended to informative words with a dynamic weight vector, achieving new state of the art results among sentence-encoding methods on the Stanford Natural Language Inference (SNLI) dataset – with the fewest parameters – while showing comparable results on the Stanford Sentiment Treebank (SST) dataset. The dynamic weight vector lends the self-attention mechanism flexibility, rendering it more effective for sentence embedding.

arxiv-1808.07383a.png

[Image source. Click image to open in new window.]


arxiv-1808.07383b.png

[Image source. Click image to open in new window.]


Learning to Compose Neural Networks for Question Answering (Jun 2016) [code;  author discussion] presented a compositional, attentional model for answering questions about a variety of world representations, including images and structured knowledge bases. The model used natural language strings to automatically assemble neural networks from a collection of composable modules: it “translates” from questions to dynamically assembled neural networks, then applies these networks to world representations (images or knowledge bases) to produce answers. The model has two components, trained jointly: a collection of neural “modules” that can be freely composed, and a network layout predictor that assembles modules into complete deep networks tailored to each question (see their Figure 1). Parameters for the modules were learned jointly with network-assembly parameters via reinforcement learning, with only (world, question, answer) triples as supervision; the approach required no supervision of the network layouts. This approach, termed a Dynamic Neural Model Network, achieved state of the art performance on two markedly different question answering tasks: questions about natural images, and more compositional questions about United States geography.

arxiv-1601.01705d.png

[Image source. Click image to open in new window.]


Relevant to the following paragraph, in NLP parts of speech (POS) content words are words that name objects of reality and their qualities. They signify actual living things (dog, cat, etc.), family members (mother, father, sister, etc.), natural phenomena (snow, Sun, etc.) common actions (do, make, come, eat, etc.), characteristics (young, cold, dark, etc.), etc. Content words consist mostly of nouns, lexical verbs and adjectives, but certain adverbs can also be content words. Content words contrast with function words, which are words that have very little substantive meaning and primarily denote grammatical relationships between content words, such as prepositions (in, out, under, etc.), pronouns (I, you, he, who, etc.), conjunctions (and, but, till, as, etc.), etc.

Most models based on the seq2seq encoder-decoder framework are equipped with an attention mechanism, like Google’s Transformer mechanism. However, conventional attention mechanisms treat the decoding at each time step equally, with the same matrix, which is problematic since the softness of the attention for different types of words (e.g., content words vs. function words) should differ. Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation (Aug 2018) [code: not yet available, 2018-10-10] addressed this issue, proposing a mechanism called Self-Adaptive Control of Temperature (SACT) that controls the softness of attention by means of an attention temperature. The temperature parameter is learned by the model based on the attentions in the previous decoding time steps, as well as the output of the decoder at the current time step. With the temperature parameter, the model automatically tunes the degree of softness of the distribution of the attention scores: it can learn a softer, more uniform distribution of attention weights for generating function words, and a harder, sparser distribution for generating content words. In a neural machine translation task, they showed that SACT attended to the most relevant elements in the source-side contexts, generating translations of high quality.
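The underlying idea reduces to a temperature-scaled softmax; a toy illustration (generic temperature scaling only – SACT's actual parameterization, in which the temperature is predicted from the previous attention and the current decoder state, is in the paper):

```python
import numpy as np

def attention_with_temperature(scores, tau):
    """Low tau -> sharper ("hard") attention, as wanted for content words;
    high tau -> more uniform ("soft") attention, as wanted for function words."""
    scaled = scores / tau
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])
print(attention_with_temperature(scores, tau=0.5))  # sparse, near one-hot
print(attention_with_temperature(scores, tau=5.0))  # close to uniform
```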

Attention and Memory:

Additional Reading

  • Pay Less Attention with Lightweight and Dynamic Convolutions (ICLR 2019) [discussion]

    “We presented lightweight convolutions which perform competitively to the best reported results in the literature despite their simplicity. They have a very small parameter footprint and the kernel does not change over time-steps. This demonstrates that self-attention is not critical to achieve good accuracy on the language tasks we considered. Dynamic convolutions build on lightweight convolutions by predicting a different kernel at every time-step, similar to the attention weights computed by self-attention. The dynamic weights are a function of the current time-step only rather than the entire context. Our experiments show that lightweight convolutions can outperform a strong self-attention baseline on WMT’17 Chinese-English translation, IWSLT’14 German-English translation and CNN-DailyMail summarization. Dynamic convolutions improve further and achieve a new state of the art on the test set of WMT’14 English-German. Both lightweight convolution and dynamic convolution are 20% faster at runtime than self-attention. On Billion Word language modeling we achieve comparable results to self-attention.”

    PayLessAttention-a.png

    [Image source. Click image to open in new window.]


    PayLessAttention-b.png

    [Image source. Click image to open in new window.]


  • Long Short-Term Attention (Oct 2018) [see also]

    “In order to learn effective features from temporal sequences, the long short-term memory (LSTM) network is widely applied. A critical component of LSTM is the memory cell, which is able to extract, process and store temporal information. Nevertheless, in LSTM, the memory cell is not directly enforced to pay attention to a part of the sequence. Alternatively, the attention mechanism can help to pay attention to specific information of data. In this paper, we present a novel neural model, called long short-term attention (LSTA), which seamlessly merges the attention mechanism into LSTM. More than processing long short term sequences, it can distill effective and valuable information from the sequences with the attention mechanism. Experiments show that LSTA achieves promising learning performance in various deep learning tasks.”

    arxiv1810.1275-f1+f2.png

    [Image source. Click image to open in new window.]


    arxiv1810.1275-t1+f5+f6+t2.png

    [Image source. Click image to open in new window.]


  • The same authors contemporaneously published a companion paper, Recurrent Attention Unit. [see also]

    “Recurrent Neural Network (RNN) has been successfully applied in many sequence learning problems. Such as handwriting recognition, image description, natural language processing and video motion analysis. After years of development, researchers have improved the internal structure of the RNN and introduced many variants. Among others, Gated Recurrent Unit (GRU) is one of the most widely used RNN model. However, GRU lacks the capability of adaptively paying attention to certain regions or locations, so that it may cause information redundancy or loss during leaning. In this paper, we propose a RNN model, called Recurrent Attention Unit (RAU), which seamlessly integrates the attention mechanism into the interior of GRU by adding an attention gate. The attention gate can enhance GRU’s ability to remember long-term memory and help memory cells quickly discard unimportant content. RAU is capable of extracting information from the sequential data by adaptively selecting a sequence of regions or locations and pay more attention to the selected regions during learning. Extensive experiments on image classification, sentiment classification and language modeling show that RAU consistently outperforms GRU and other baseline methods.”

    arxiv1810.12754-f1+f2.png

    [Image source. Click image to open in new window.]


    arxiv1810.12754-t1+t2+t3.png

    [Image source. Click image to open in new window.]


  • You May Not Need Attention (Oct 2018) [code;   author’s discussion]

    “In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decoder. Our eager translation model is low-latency, writing target tokens as soon as it reads the first source token, and uses constant memory during decoding. It performs on par with the standard attention-based model of Bahdanau et al. (2014), and better on long sentences.”

    arxiv1810.13409-f1+f2+t1+t2+t3.png

    [Image source. Click image to open in new window.]


  • Convolutional Self-Attention Network (Oct 2018)

    “Self-attention network (SAN) has recently attracted increasing interest due to its fully parallelized computation and flexibility in modeling dependencies. It can be further enhanced with multi-headed attention mechanism by allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017). In this work, we propose a novel convolutional self-attention network (CSAN), which offers SAN the abilities to (1) capture neighboring dependencies, and (2) model the interaction between multiple attention heads. Experimental results on WMT14 English-to-German translation task demonstrate that the proposed approach outperforms both the strong Transformer baseline and other existing works on enhancing the locality of SAN. Comparing with previous work, our model does not introduce any new parameters.”

    arxiv1810.13320-f1.png

    [Image source. Click image to open in new window.]


    arxiv1810.13320-t1.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Attention: Miscellaneous Applications

Although the following content is more NLP-task related, I wanted to group it close to the discussions of language models and attentional mechanisms in my “Attention and Memory” subsection. Recent applications of Google’s “Transformer” and other attentional architectures relevant to this REVIEW include their use in NLP-oriented tasks such as “slot filling” (relation extraction), question answering, and document summarization.

Position-aware Self-attention with Relative Positional Encodings for Slot Filling (Bilan and Roth, July 2018) applied self-attention with relative positional encodings to the task of relation extraction; their model relied solely on attention: no recurrent or convolutional layers were used. The authors employed Google’s Transformer seq2seq model, also known as multi-head dot product attention (MHDPA) or “self-attention.”

  • Despite citing the 2017 paper Position-aware Attention and Supervised Data Improve Slot Filling by Zhang et al. (Stanford University; coauthored by Christopher Manning), and despite using the TACRED relation extraction dataset introduced in that paper, Bilan and Roth claim:

    To the best of our knowledge, the transformer model has not yet been applied to relation classification as defined above (as selecting a relation for two given entities in context). ”

    Furthermore, Bilan and Roth provide no code, whereas Zhang et al. released their code and included ablation studies in their work. The attention mechanism used by Zhang et al. differed significantly from the Google Transformer model in its use of the summary vector and position embeddings, and in the way the attention weights were computed. While Zhang et al.’s $\small F_1$ scores (their Table 4) were slightly lower than Bilan and Roth’s on the TACRED dataset (see Bilan and Roth’s Table 1), the ensemble model used by Zhang et al. had the best scores. Sample relations extracted from a sentence are shown in Fig. 1 and Table 1 in Zhang et al.

    Bilan2018_Zhang2017.png

    [Click image to open in new window.]


Discussed elsewhere in this REVIEW, Bidirectional Attention Flow for Machine Comprehension (Nov 2016; updated Jun 2018) introduced the BiDAF framework, a multi-stage hierarchical process that used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization. BiDAF was subsequently used in QA4IE: A Question Answering based Framework for Information Extraction (Apr 2018) (jointly discussed with the BiDAF paper discussion), a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, also using a knowledge base (Wikipedia Ontology) for entity recognition.

Related to attention-based relation extraction, Neural Architectures for Open-Type Relation Argument Extraction (Sep 2018) redefined the problem of slot filling to the task of Open-type Relation Argument Extraction (ORAE): given a corpus, a query entity $\small Q$ and a knowledge base relation (e.g., “$\small Q$ authored notable work with title $\small X$”), the model had to extract an argument of “non-standard entity type” (entities that cannot be extracted by a standard named entity tagger) from the corpus – hence, “open-type argument extraction.” This work also employed the Transformer architecture, used as a multi-headed self-attention mechanism in their encoders for computing sentence representations suitable for argument extraction.

The approach for ORAE had two conceptual advantages. First, it was more general than slot filling, as it was also applicable to non-standard named entity types that could not be dealt with previously. Second, while the problem they defined was more difficult than standard slot filling, they eliminated an important source of errors: tagging errors that propagate throughout the pipeline and that are notoriously hard to correct downstream. A wide range of neural network architectures to solve ORAE were examined, each consisting of a sentence encoder, which computed a vector representation for every sentence position, and an argument extractor, which extracted the relation argument from that representation. The combination of an RNN encoder with a CRF extractor gave the best results, +7% absolute $\small \text{F-measure}$ better than a previously proposed adaptation of a state of the art question answering model (BiDAF). [“The dataset and code will be released upon publication.” – not available, 2018-10-10.]

arxiv-1803.01707d.png

[Image source. Click image to open in new window.]


Discussed elsewhere in this REVIEW and briefly mentioned here, Generating Wikipedia by Summarizing Long Sequences (Jan 2018) by Google Brain employed Wikipedia in a supervised machine learning task for multi-document summarization, using extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. They modified their Transformer architecture to consist only of a decoder, which performed better on longer input sequences than RNN and Transformer encoder-decoder models. These improvements allowed them to generate entire Wikipedia articles.

Because the amount of text in the input reference documents can be very large (see their Table 2), it was infeasible to train an end-to-end abstractive model given the memory constraints of current hardware. Hence, they first coarsely selected a subset of the input using extractive summarization. The second stage involved training an abstractive model that generated the Wikipedia text while conditioning on this extraction. This two-stage process was inspired by how humans might summarize multiple long documents: first highlighting pertinent information, then conditionally generating the summary based on the highlights.

Hierarchical Bi-Directional Attention-based RNNs for Supporting Document Classification on Protein-Protein Interactions Affected by Genetic Mutations (Jan 2018) [code] leveraged word embeddings trained on PubMed abstracts. The authors argued that the title of a paper usually contains important information that is more salient than a typical sentence in the abstract; they therefore proposed a shortcut connection that integrates the title vector representation directly into the final feature representation of the document: the sentence vector representing the title is concatenated with the vectors of the abstract to form the document feature vector used as input to the task classifier. This system ranked first in the Document Triage Task of the BioCreative VI Precision Medicine Track.
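The shortcut itself is just a concatenation. A toy sketch in which a mean-of-word-vectors encoder stands in for the paper's attention-based bi-directional RNN encoders (all names and the random embedding table are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
table = {}                                     # toy word-embedding table

def encode(tokens):
    """Stand-in sentence encoder: mean of (randomly initialized) word vectors."""
    vecs = [table.setdefault(t, rng.normal(size=dim)) for t in tokens]
    return np.mean(vecs, axis=0)

title_vec = encode("mutations in human eya1 cause bor syndrome".split())
abstract_vecs = [encode(s.split()) for s in ("sentence one .", "sentence two .")]

# the shortcut connection: concatenate the title representation directly
# onto the document feature vector that feeds the task classifier
doc_features = np.concatenate([title_vec, np.mean(abstract_vecs, axis=0)])
```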

Fergadis2018a.png

[Image source. Click image to open in new window.]


Fergadis2018b.png

[Image source. Click image to open in new window.]


Critique:

  • The “spirit” of the BioCreative VI Track 4: Mining protein interactions and mutations for precision medicine (PM) is (bolded emphasis mine, below):

    “The precision medicine initiative (PMI) promises to identify individualized treatment depending on a patients’ genetic profile and their related responses. In order to help health professionals and researchers in the precision medicine endeavor, one goal is to leverage the knowledge available in the scientific published literature and extract clinically useful information that links genes, mutations, and diseases to specialized treatments. … Understanding how allelic variation and genetic background influence the functionality of these pathways is crucial for predicting disease phenotypes and personalized therapeutical approaches. A crucial step is the mapping of gene products functional regions through the identification and study of mutations (naturally occurring or synthetically induced) affecting the stability and affinity of molecular interactions.”

  • Against those criteria and despite the title of this paper and this excerpt from the paper,

    “In order to incorporate domain knowledge in our system, we annotate all biomedical named entities namely genes, species, chemical, mutations and diseases. Each entity mention is surrounded by its corresponding tags as in the following example: Mutations in <species>human</species> <gene>EYA1</gene> cause <disease>branchio-oto-renal (BOR) syndrome</disease> …”

    … there is no evidence that mutations (i.e. genomic variants) were actually tagged. Mutations/variants are not discussed, nor is there any mention of “mutant” or “mutation” in their GitHub repository/code nor the parent repo.

  • Richard Socher and colleagues (Salesforce) also used shortcut connections [i.e., “residual layers” (Deep Residual Learning for Image Recognition)] in their paper A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks, feeding lower-level task predictions from lower layers to higher layers to reflect linguistic hierarchies.

Identifying interactions between proteins is important to understand underlying biological processes. Extracting protein-protein interactions (PPI) from raw text is often very difficult, and previous supervised learning methods have relied on handcrafted features over human-annotated data sets. Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention (Jul 2018) proposed a novel tree recurrent neural network with a structured attention architecture for PPI extraction. Their architecture achieved state of the art results on the benchmark AIMed and BioInfer data sets; moreover, their models achieved significant improvement over previous best models without any explicit feature extraction. Experimental results showed that traditional recurrent networks had inferior performance compared to tree recurrent networks for the supervised PPI task.

“… we propose a novel neural net architecture for identifying protein-protein interactions from biomedical text using a Tree LSTM with structured attention. We provide an in depth analysis of traversing the dependency tree of a sentence through a child sum tree LSTM and at the same time learn this structural information through a parent selection mechanism by modeling non-projective dependency trees.”
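
For reference, a minimal numpy sketch of the Child-Sum Tree-LSTM node (Tai et al., 2015) underlying such architectures; weights, shapes, and names are illustrative, not the authors’ implementation:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def child_sum_treelstm_node(x, child_h, child_c, W, U, b):
    """One Child-Sum Tree-LSTM node. x: input vector (d,); child_h/child_c:
    (k, d) hidden/cell states of the k children; W, U: dicts of (d, d)
    weights for gates 'i', 'f', 'o', 'u'; b: dict of (d,) biases."""
    h_tilde = child_h.sum(axis=0)                       # summed child hidden states
    i = sigmoid(W['i'] @ x + U['i'] @ h_tilde + b['i']) # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_tilde + b['o']) # output gate
    u = np.tanh(W['u'] @ x + U['u'] @ h_tilde + b['u']) # candidate update
    # one forget gate per child, conditioned on that child's own hidden state
    f = sigmoid(W['f'] @ x + (child_h @ U['f'].T) + b['f'])  # (k, d)
    c = i * u + (f * child_c).sum(axis=0)
    h = o * np.tanh(c)
    return h, c

d, k = 4, 2
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((d, d)) for g in 'ifou'}
U = {g: rng.standard_normal((d, d)) for g in 'ifou'}
b = {g: np.zeros(d) for g in 'ifou'}
h, c = child_sum_treelstm_node(rng.standard_normal(d),
                               rng.standard_normal((k, d)),
                               rng.standard_normal((k, d)), W, U, b)
```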

arxiv1503.00075-f1+f2.png

[Image source (Kai Sheng Tai, Richard Socher, Christopher D. Manning). Click image to open in new window.]


arxiv1808.03227-f1.png

[Image source. Click image to open in new window.]


arxiv1808.03227-f2.png

[Image source. Click image to open in new window.]


arxiv1808.03227-t3.png

[Image source. Click image to open in new window.]


[Table of Contents]

Pointer Mechanisms; Pointer-Generators

  • Paulus et al. [Richard Socher] A Deep Reinforced Model for Abstractive Summarization (Nov 2017)

    • “To generate a token, our decoder uses either a token-generation softmax layer or a pointer mechanism to copy rare or unseen tokens from the input sequence. We use a switch function that decides at each decoding step whether to use the token generation or the pointer (Gulcehre et al., 2016; Nallapati et al., 2016).” [A minimal sketch of this generator/pointer mixture appears at the end of this section.]

    • “Neural Encoder-Decoder Sequence Models. Neural encoder-decoder models are widely used in NLP applications such as machine translation, summarization, and question answering. These models use recurrent neural networks (RNN), such as long-short term memory network (LSTM) to encode an input sentence into a fixed vector, and create a new output sequence from that vector using another RNN. To apply this sequence-to-sequence approach to natural language, word embeddings are used to convert language tokens to vectors that can be used as inputs for these networks.

      Attention mechanisms make these models more performant and scalable, allowing them to look back at parts of the encoded input sequence while the output is generated. These models often use a fixed input and output vocabulary, which prevents them from learning representations for new words. One way to fix this is to allow the decoder network to point back to some specific words or sub-sequences of the input and copy them onto the output sequence. Gulcehre et al. (2016) and Merity et al. (2017) combine this pointer mechanism with the original word generation layer in the decoder to allow the model to use either method at each decoding step.”

  • See et al. [Christopher Manning] Get To The Point: Summarization with Pointer-Generator Networks (Apr 2017)

    • “Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition.”

    • “The pointer network (Vinyals et al., 2015) is a sequence-to-sequence model that uses the soft attention distribution of Bahdanau et al. (2015) to produce an output sequence consisting of elements from the input sequence. … Our approach is considerably different from that of Gulcehre et al. (2016) and Nallapati et al. (2016). Those works train their pointer components to activate only for out-of-vocabulary words or named entities (whereas we allow our model to freely learn when to use the pointer), and they do not mix the probabilities from the copy distribution and the vocabulary distribution. We believe the mixture approach described here is better for abstractive summarization – in section 6 we show that the copy mechanism is vital for accurately reproducing rare but in-vocabulary words, and in section 7.2 we observe that the mixture model enables the language model and copy mechanism to work together to perform abstractive copying.”

    • “Our hybrid pointer-generator network facilitates copying words from the source text via pointing (Vinyals et al., 2015), which improves accuracy and handling of OOV words, while retaining the ability to generate new words. The network, which can be viewed as a balance between extractive and abstractive approaches, is similar to Gu et al.’s (2016) CopyNet and Miao and Blunsom’s (2016) Forced-Attention Sentence Compression, that were applied to short-text summarization. We propose a novel variant of the coverage vector (Tu et al., 2016) from Neural Machine Translation, which we use to track and control coverage of the source document. We show that coverage is remarkably effective for eliminating repetition.”

      arxiv1704.04368-f1.png

      [Image source. Click image to open in new window.]


      arxiv1704.04368-f3.png

      [Image source. Click image to open in new window.]


      arxiv1704.04368-f5.png

      [Image source. Click image to open in new window.]


  • Merity et al. [Richard Socher | MetaMind/Salesforce] Pointer Sentinel Mixture Models (Sep 2016):

    • See also the description of the dataset they created, WikiText-103.

    • “Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM.”

    • “Pointer networks (Vinyals et al., 2015) provide one potential solution for rare and out of vocabulary (OOV) words as a pointer network uses attention to select an element from the input as output. This allows it to produce previously unseen input tokens. While pointer networks improve performance on rare words and long-term dependencies they are unable to select words that do not exist in the input.

      “We introduce a mixture model, illustrated in Fig. 1, that combines the advantages of standard softmax classifiers with those of a pointer component for effective and efficient language modeling. Rather than relying on the RNN hidden state to decide when to use the pointer, as in the recent work of Gulcehre et al. (2016), we allow the pointer component itself to decide when to use the softmax vocabulary through a sentinel. The model improves the state of the art perplexity on the Penn Treebank. Since this commonly used dataset is small and no other freely available alternative exists that allows for learning long range dependencies, we also introduce a new benchmark dataset for language modeling called WikiText.”

      arxiv1609.07843-f1.png

      [Image source. Click image to open in new window.]


      arxiv1609.07843-f2.png

      [Image source. Click image to open in new window.]


      arxiv1609.07843-f7.png

      [Image source. Click image to open in new window.]


  • Nallapati et al. [Caglar Gulcehre] Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (Aug 2016)

    • “Often-times in summarization, the keywords or named-entities in a test document that are central to the summary may actually be unseen or rare with respect to training data. Since the vocabulary of the decoder is fixed at training time, it cannot emit these unseen words. Instead, a most common way of handling these out-of-vocabulary (OOV) words is to emit an ‘UNK’ token as a placeholder. However this does not result in legible summaries. In summarization, an intuitive way to handle such OOV words is to simply point to their location in the source document instead. We model this notion using our novel switching decoder/pointer architecture which is graphically represented in Figure 2. In this model, the decoder is equipped with a ‘switch’ that decides between using the generator or a pointer at every time-step. If the switch is turned on, the decoder produces a word from its target vocabulary in the normal fashion. However, if the switch is turned off, the decoder instead generates a pointer to one of the word-positions in the source. The word at the pointer-location is then copied into the summary. The switch is modeled as a sigmoid activation function over a linear layer based on the entire available context at each time-step as shown below. …”

    • “The pointer mechanism may be more robust in handling rare words because it uses the encoder’s hidden-state representation of rare words to decide which word from the document to point to. Since the hidden state depends on the entire context of the word, the model is able to accurately point to unseen words although they do not appear in the target vocabulary. [Even when the word does not exist in the source vocabulary, the pointer model may still be able to identify the correct position of the word in the source since it takes into account the contextual representation of the corresponding ‘UNK’ token encoded by the RNN. Once the position is known, the corresponding token from the source document can be displayed in the summary even when it is not part of the training vocabulary either on the source side or the target side.] …”

      arxiv1602.06023-f2.png

      [Image source. Click image to open in new window.]


      arxiv1602.06023-fig3.png

      [Image source. Click image to open in new window.]


      arxiv1602.06023-f4.png

      [Image source. Click image to open in new window.]


  • Gulcehre et al. [Yoshua Bengio], Pointing the Unknown Words (Aug 2016)

    • “The attention-based pointing mechanism is introduced first in the pointer networks (Vinyals et al., 2015). In the pointer networks, the output space of the target sequence is constrained to be the observations in the input sequence (not the input space). Instead of having a fixed dimension softmax output layer, softmax outputs of varying dimension is dynamically computed for each input sequence in such a way to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, we can see the pointer networks as a special case of our model where we always choose to point a context word.”

  • Vinyals O et al., Pointer Networks (Jun 2015; updated Jan 2017)

    • “We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). …”

      arxiv1506.03134-f1.png

      [Image source. Click image to open in new window.]
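
To make the switch/mixture mechanics described above concrete, here is a minimal numpy sketch of one pointer-generator decoding step in the spirit of See et al. (2017); all names and numbers are illustrative:

```python
import numpy as np

def pointer_generator_step(p_vocab, attention, src_ids, vocab_size, p_gen):
    """One decoding step of a pointer-generator mixture: the final distribution
    is p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on source positions
    holding w). src_ids holds extended-vocabulary ids of the source tokens,
    so OOV source words (ids >= vocab_size) become copyable."""
    extended = np.zeros(vocab_size + max(0, int(src_ids.max()) + 1 - vocab_size))
    extended[:vocab_size] = p_gen * p_vocab                   # generator share
    np.add.at(extended, src_ids, (1.0 - p_gen) * attention)   # copy distribution
    return extended

p_vocab = np.full(10, 0.1)            # toy generator distribution over V = 10
attn = np.array([0.7, 0.2, 0.1])      # attention over 3 source tokens
dist = pointer_generator_step(p_vocab, attn, np.array([3, 11, 4]), 10, p_gen=0.8)
print(dist[11])  # probability of copying the OOV source word: 0.04
```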


[Table of Contents]

Language Models

A recent and particularly exciting advance in NLP is the development of pretrained language models such as ELMo, ULMFiT, OpenAI’s Finetuned Transformer LM, and BERT (each described below).

Those papers demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks.

ImageNet [dataset;  community challenges;  discussion  (local copy)] and related image classification, segmentation, and captioning challenges have had an enormous impact on the advancement of computer vision: deep learning, deep learning architectures, the use of pretrained models, transfer learning, attentional mechanisms, etc. Studies arising from the ImageNet dataset have also identified gaps in our understanding of that success, leading to work demystifying how those deep neural networks classify images (explained very well in an excellent video), and other issues including the vulnerability of deep learning to adversarial attacks. It is anticipated that pretrained language models will have a parallel impact in the NLP domain.

The availability of pretrained models is an important and practical advance in machine learning, as many current image processing and NLP language modeling tasks are extremely computationally intensive; the pretrained models described below are cases in point.

ELMo

ELMo (“Embeddings from Language Models”) was introduced in Deep Contextualized Word Representations (Feb 2018; updated Mar 2018) [project;  tutorials here and here;  discussion here, here and here] by authors at the Allen Institute for Artificial Intelligence and the Paul G. Allen School of Computer Science & Engineering at the University of Washington. ELMo modeled both the complex characteristics of word use (e.g., syntax and semantics), and how these characteristics varied across linguistic contexts (e.g., to model polysemy: words or phrases with different, but related, meanings). These word vectors were learned functions of the internal states of a deep bidirectional language model (two Bi-LSTM layers), which was pretrained on a large text corpus.

Unlike most widely used word embeddings, ELMo word representations were deep, in that they were a function of the internal, hidden layers of the bi-directional Language Model (biLM), providing a very rich representation. More specifically, ELMo learned a linear combination of the vectors stacked above each input word for each end task, which markedly improved performance over using just the top LSTM layer. These word vectors could be easily added to existing models, significantly improving the state of the art across a broad range of challenging NLP problems including question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis. The addition of ELMo representations alone significantly improved the state of the art in every case, including up to 20% relative error reductions.
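
A minimal sketch of that task-specific layer combination (Eq. 1 of arXiv:1802.05365), with illustrative dimensions:

```python
import numpy as np

def elmo_representation(layer_activations, w, gamma):
    """ELMo-style task-specific combination: softmax-normalized scalar
    weights s_j over the biLM layers, times a task-specific scale gamma.
    layer_activations: (L, d) stack of the biLM's layer outputs for one
    token; w: (L,) learned logits; gamma: learned scalar."""
    s = np.exp(w) / np.exp(w).sum()     # softmax over layers
    return gamma * (s[:, None] * layer_activations).sum(axis=0)

layers = np.random.randn(3, 1024)       # token layer + 2 biLSTM layers
print(elmo_representation(layers, np.zeros(3), gamma=1.0).shape)  # (1024,)
```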

ELMo.png

[Image source (based on data from Table 1 in arXiv:1802.05365). Click image to open in new window.
Tasks: SQuAD: question answering; SNLI: textual entailment; SRL: semantic role labeling; Coref: coreference resolution; NER: named entity recognition; SST-5: sentiment analysis. SOTA: state of the art.]


ULMFiT

Jeremy Howard (fast.ai and the University of San Francisco) and Sebastian Ruder (Insight Centre for Data Analytics, NUI Galway, and Aylien Ltd.) described their Universal Language Model Fine-tuning (ULMFiT) model in Universal Language Model Fine-tuning for Text Classification (Jan 2018; updated May 2018) [project/code;  code here, here, here and here;  discussion here, here, here, here, here, here and here]. ULMFiT is a transfer learning method that could be applied to any task in NLP [though as of July 2018 they had only studied its use in classification tasks], together with key techniques for fine-tuning a language model. They also provided the fastai.text and fastai.lm_rnn modules necessary to train/use their ULMFiT models.

ULMFiT significantly outperformed the state of the art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, ULMFiT matched the performance of training from scratch on 100x more data.
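
One of those fine-tuning techniques, the slanted triangular learning rate schedule, is simple enough to sketch directly (default hyperparameters as given in the paper):

```python
import math

def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate from the ULMFiT paper: a short,
    steep linear warm-up (the first cut_frac of training) followed by a
    long linear decay toward eta_max / ratio."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

# learning rate peaks at iteration cut, then decays for the rest of training
schedule = [slanted_triangular_lr(t, T=1000) for t in range(1000)]
```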

arxiv-1801.06146.png

[Image source. Click image to open in new window.]


Finetuned Transformer LM

Finetuned Transformer LM  (Radford et al., Improving Language Understanding by Generative Pre-Training) (Jun 2018) [project;  code;  discussion here and here] was introduced by Ilya Sutskever and colleagues at OpenAI. They demonstrated that large gains on diverse natural language understanding (NLU) tasks – such as textual entailment, question answering, semantic similarity assessment, and document classification – could be realized by a two-stage training procedure: generative pretraining of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, they made use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

Finetuned Transformer LM [aka: OpenAI Transformer  |  OpenAI GPT] provided a convincing example that pairing supervised learning methods with unsupervised pretraining works very well, demonstrating the effectiveness of their approach on a wide range of NLU benchmarks. Their general task-agnostic model outperformed discriminatively trained models that used architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied (see the table at Improving Language Understanding with Unsupervised Learning). For instance, they achieved absolute improvements of 8.9% on commonsense reasoning, 5.7% on question answering, and 1.5% on textual entailment (natural language inference).

OpenAI_Transformer.png

[Image source. Click image to open in new window.]


  • The architecture employed in Improving Language Understanding by Generative Pre-Training, Finetuned Transformer LM, was Google’s Transformer, a seq2seq based self-attention mechanism. This model choice provided OpenAI with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, they utilized task-specific input adaptations derived from traversal-style approaches, which processed structured text input as a single contiguous sequence of tokens. As they demonstrated in their experiments, these adaptations enabled them to fine-tune effectively with minimal changes to the architecture of the pretrained model. OpenAI’s Finetuned Transformer LM model largely followed the original (Google’s Attention Is All You Need) Transformer work: OpenAI trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads) …

  • The Transformer architecture does not use RNN (LSTM), relying instead on the use of the self-attention mechanism. In Improving Language Understanding by Generative Pre-Training, the OpenAI authors asserted that the use of LSTM models employed in ELMo and ULMFiT restricted the prediction ability of those language models to a short range. In contrast, OpenAI’s choice of Transformer networks allowed them to capture longer-range linguistic structure. Regarding better understanding of why the pretraining of language models by Transformer architectures was effective, they hypothesized that the underlying generative model learned to perform many of the evaluated tasks in order to improve its language modeling capability, and that the more structured attentional memory of the Transformer assisted in transfer, compared to LSTM.

BERT

In October 2018 Google Language AI presented BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [code: Google Research | author’s discussion]. Unlike recent language representation models, BERT – which stands for Bidirectional Encoder Representations from Transformers – is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, pre-trained BERT representations can be fine-tuned with just one additional output layer to create state of the art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT obtained new state of the art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

  • For a description of the GLUE datasets used in this paper, refer here.

  • [Google AI Blog: Nov 2018 – short, very descriptive summary (local copy)] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

    “… What Makes BERT  Different? BERT builds upon recent work in pre-training contextual representations - including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia). …”

    “… The Strength of Bidirectionality. If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model. To solve this problem, we use the straightforward technique of masking out some of the words in the input and then condition each word bidirectionally to predict the masked words. For example:

    BERT_ex1.png

    [Image source. Click image to open in new window.]


    “While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network. BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: Given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:

    BERT_ex2.png

    [Image source. Click image to open in new window.]


    “… On SQuAD v1.1, BERT achieves 93.2% $\small F_1$ score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%. … BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them. …”

  • Community discussion here, here, here, and here: Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks

  • Non-author code

arxiv1810.04805a.png

[Image source. BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional. Click image to open in new window.]


arxiv1810.04805b.png

[Image source. Click image to open in new window.]


arxiv1810.04805c.png

[Image source. Click image to open in new window.]


In the following figure, note that the results in Table 2 were obtained on the less challenging (vis-à-vis SQuAD2.0) SQuAD1.1 QA dataset:

arxiv1810.04805g.png

[Image source. Click image to open in new window.]


Some highlights, excerpted from Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks:

  • NLP researchers are exploiting today’s large amount of available language data and maturing transfer learning techniques to develop novel pre-training approaches. They first train a model architecture on one language modeling objective, and then fine-tune it for a supervised downstream task. Aylien Research Scientist Sebastian Ruder suggests in his blog that pre-trained models may have “the same wide-ranging impact on NLP as pretrained ImageNet models had on computer vision.”

  • The BERT model architecture is a bidirectional Transformer encoder. The use of [Google’s] Transformer comes as no surprise – this is a recent trend due to Transformers’ training efficiency and superior performance in capturing long-distance dependencies, compared to recurrent neural network architectures. The bidirectional encoder meanwhile is a standout feature that differentiates BERT from OpenAI GPT [i.e. Finetuned Transformer LM | OpenAI Transformer – a left-to-right Transformer] and ELMo (a concatenation of independently trained left-to-right and right-to-left LSTM).

  • BERT is a huge model, with 24 Transformer blocks, a hidden size of 1024, and 340M parameters.

  • The model is pre-trained over 40 epochs on a 3.3 billion word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words), and was trained on 16 Cloud TPUs.

  • In the pre-training process, researchers took an approach which involved randomly masking a percentage of the input tokens (15 percent) to train a deep bidirectional representation. They refer to this method as a Masked Language Model (MLM); a minimal sketch of this masking procedure follows this list.

  • A pre-trained language model cannot by itself capture relationships between sentences, which is vital to language tasks such as question answering and natural language inference. Researchers therefore also pre-trained on a binarized next-sentence prediction task that can be trivially generated from any monolingual corpus.

  • The fine-tuned model for different datasets improves the GLUE benchmark to 80.4 percent (7.6 percent absolute improvement), MultiNLI accuracy to 86.7 percent (5.6 percent absolute improvement), the SQuAD1.1 question answering test $\small F_1$ to 93.2 (1.5 absolute improvement), and so on over a total of 11 language tasks.
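
As promised above, a minimal sketch of BERT’s masked-LM corruption; the 80%/10%/10% treatment of selected positions follows the paper, while the toy vocabulary and token handling are illustrative:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of positions (independently, for simplicity); of the
    selected positions, 80% become [MASK], 10% a random word, and 10% are
    left unchanged. Returns the corrupted sequence plus prediction targets."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                         # the model must predict this
            r = random.random()
            if r < 0.8:
                corrupted[i] = '[MASK]'
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # random replacement
            # else: keep the original token unchanged
    return corrupted, targets

print(mask_tokens("the man went to the store".split(),
                  vocab=["dog", "ran", "store"]))
```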

Transformer

In mid-2017, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration: the best performing models connected the encoder and decoder through an attention mechanism. In June 2017 Vaswani et al. at Google proposed a new simple network architecture, Transformer, that was based solely on attention mechanisms – thus dispensing with recurrence and convolutions entirely, also allowing for significantly more parallelization (Attention Is All You Need (Jun 2017; updated Dec 2017) [code]). [The inherently sequential nature of RNN precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.] Transformer has been shown to perform strongly on machine translation, document generation, syntactic parsing and other tasks. Experiments on two machine translation tasks showed these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Transformer also generalized well to other tasks; for example, it was successfully applied to English constituency parsing, both with large and limited training data.
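
The core computation is scaled dot-product attention, $\small \text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k})V$. A minimal single-head numpy sketch (the causal flag corresponds to the masked self-attention used in decoder-only variants such as OpenAI GPT):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V. With causal=True, each position may
    attend only to itself and earlier positions (decoder-only / GPT setting);
    causal=False is the fully bidirectional setting."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        keep = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(keep, scores, -1e9)     # mask out future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V

T, d = 5, 16
X = np.random.randn(T, d)
out = scaled_dot_product_attention(X, X, X)       # self-attention: Q = K = V = X
```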

arxiv-1706.03762a.png

[Image source. Click image to open in new window.]


arxiv-1706.03762b.png

[Image source. Click image to open in new window.]


  • Transformer is discussed in Google AI’s August 2017 blog post Transformer: A Novel Neural Network Architecture for Language Understanding:

    • “… The animation below illustrates how we apply the Transformer to machine translation. Neural networks for machine translation typically contain an encoder reading the input sentence and generating a representation of it. A decoder then generates the output sentence word by word while consulting the representation generated by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled balls. This step is then repeated multiple times in parallel for all words, successively generating new representations. The decoder operates similarly, but generates one word at a time, from left to right. It attends not only to the other previously generated words, but also to the final representations generated by the encoder.”
        transform20fps.gif
        [Image source. Click image to open in new window.]


  • In the literature, Google’s Transformer is also referred to as multi-head dot product attention (MHDPA), or simply “self-attention.”

  • Due to the absence of recurrent layers in the model, the Transformer model trained significantly faster and outperformed all previously reported ensembles.

  • Alexander Rush at HarvardNLP provides an excellent web page, The Annotated Transformer, complete with discussion and code (an “annotated” version of the paper in the form of a line-by-line implementation) [papercode]!

  • Attention Is All You Need coauthor Łukasz Kaiser posted slides describing this work (Tensor2Tensor Transformers: New Deep Models for NLP) [local copy;  discussion].

Later in 2018, Li et al. [Lukasz Kaiser; Samy Bengio | Google Research/Brain] described “Area Attention” (Oct 2018).

“Existing attention mechanisms are mostly item-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. These improvements are obtainable with a basic form of area attention that is parameter free. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.”
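
The summed area table the authors leverage makes any rectangular area sum an O(1) lookup after one cumulative-sum pass; a small numpy illustration (not the authors’ code):

```python
import numpy as np

def summed_area_table(x):
    """2-D summed area table with one row/column of zero padding:
    sat[i, j] = sum of x[:i, :j]."""
    sat = np.zeros((x.shape[0] + 1, x.shape[1] + 1))
    sat[1:, 1:] = x.cumsum(axis=0).cumsum(axis=1)
    return sat

def area_sum(sat, i1, j1, i2, j2):
    """Sum of x[i1:i2, j1:j2] via four table lookups."""
    return sat[i2, j2] - sat[i1, j2] - sat[i2, j1] + sat[i1, j1]

x = np.arange(12.0).reshape(3, 4)
sat = summed_area_table(x)
assert area_sum(sat, 1, 1, 3, 3) == x[1:3, 1:3].sum()
```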

arxiv1810.10126-f1+f2.png

[Image source. Click image to open in new window.]


Trellis Networks

Trellis Networks for Sequence Modeling (Oct 2018; note also the Appendices) [code;  discussion] by authors at Carnegie Mellon University and Intel Labs presented trellis networks, a new architecture for sequence modeling. A trellis network is a temporal convolutional network with special structure, characterized by weight tying across depth and direct injection of the input into deep layers. The authors showed that truncated recurrent networks are equivalent to trellis networks with special sparsity structure in their weight matrices; thus, trellis networks with general weight matrices generalize truncated recurrent networks. They leveraged those connections to design high-performing trellis networks that absorb structural and algorithmic elements from both recurrent and convolutional models. Experiments demonstrated that trellis networks outperform the current state of the art on a variety of challenging benchmarks, including word-level language modeling on Penn Treebank and WikiText-103, character-level language modeling on Penn Treebank, and stress tests designed to evaluate long-term memory retention.
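
A minimal numpy sketch of the weight-tying and input-injection idea: the same weights are applied at every depth, and the raw input is re-injected into each layer (a plain tanh stands in for the gated, LSTM-style activation the paper actually uses; all shapes and names are illustrative):

```python
import numpy as np

def trellis_layer(z, x, W1, W2, V, activation=np.tanh):
    """One trellis-network level: a causal width-2 temporal convolution over
    the previous level's activations, plus a direct injection of the input x.
    z: (T, d) previous level; x: (T, d_in) raw input sequence."""
    z_prev = np.vstack([np.zeros((1, z.shape[1])), z[:-1]])  # shift right by one step
    return activation(z_prev @ W1.T + z @ W2.T + x @ V.T)    # (T, d)

T, d_in, d = 6, 8, 16
x = np.random.randn(T, d_in)
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
V = np.random.randn(d, d_in)
z = np.zeros((T, d))
for _ in range(4):            # depth: the SAME weights are reused at every level
    z = trellis_layer(z, x, W1, W2, V)
```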

arxiv1810.06682-f1.png

[Image source. Click image to open in new window.]


arxiv1810.06682-f2.png

[Image source. Click image to open in new window.]


arxiv1810.06682-f3.png

[Image source. Click image to open in new window.]


arxiv1810.06682-t1+t2.png

[Image source. Click image to open in new window.]


“We presented trellis networks, a new architecture for sequence modeling. Trellis networks form a structural bridge between convolutional and recurrent models. …”

“There are many exciting opportunities for future work. First, we have not conducted thorough performance optimizations on trellis networks. … Future work can also explore acceleration schemes that speed up training and inference. Another significant opportunity is to establish connections between trellis networks and self-attention-based architectures (Transformers), thus unifying all three major contemporary approaches to sequence modeling. Finally, we look forward to seeing applications of trellis networks to industrial-scale challenges such as machine translation.”




Neural language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. In The Importance of Generation Order in Language Modeling, Google Brain studied the influence of token generation order on model quality via a novel two-pass language model that produced partially-filled sentence “templates” and then filled in missing tokens. The most effective strategy generated function words in the first pass, followed by content words in the second.

The Fine Tuning Language Models for Multilabel Prediction GitHub repository lists recent, leading language models, for which the authors examine the ability to use generative pretraining with language modeling objectives across a variety of languages to improve language understanding. Particular attention is paid to transfer learning to low-resource languages, where labeled data is scarce.

Adaptive Input Representations for Neural Language Modeling (Facebook AI Research; Oct 2018) [mentioned] introduced adaptive input embeddings, which extended the adaptive softmax of Grave et al. (2017) to input word representations of variable capacity. This factorization assigned more capacity to frequent words and reduced the capacity for less frequent words, with the benefit of reducing overfitting to rare words. There were several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units; they performed a systematic comparison of popular choices for a self-attentional architecture. Their experiments showed that models equipped with adaptive embeddings were more than twice as fast to train as the popular character-input CNN, while having fewer parameters. They achieved a new state of the art on the WikiText-103 benchmark of 20.51 perplexity, improving the next best known result by 8.7 perplexity, and a state of the art of 24.14 perplexity on the Billion Word Benchmark.
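
The adaptive softmax this work extends is available directly in PyTorch as nn.AdaptiveLogSoftmaxWithLoss. A minimal usage sketch with illustrative sizes and cutoffs (the paper’s contribution, adaptive input embeddings, applies the same frequency-based factorization on the input side):

```python
import torch
import torch.nn as nn

# Cutoffs split the vocabulary (sorted by frequency) into a small, expensive
# head of frequent words and progressively larger, cheaper tail clusters.
V, d = 50_000, 256
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=d, n_classes=V, cutoffs=[2_000, 10_000], div_value=4.0)

hidden = torch.randn(32, d)              # 32 decoder hidden states
targets = torch.randint(0, V, (32,))     # gold next-word ids
out = adaptive(hidden, targets)          # named tuple: (output, loss)
print(out.loss)
```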

arxiv-1809.10853.png

[Image source. Click image to open in new window.]


arxiv1809.10853-t1+t2.png

[Image source. Click image to open in new window.]


  • Grave et al. (Facebook AI Research) Efficient softmax approximation for GPUs (Sep 2016; updated Jun 2017) [code]

    “We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computation time. Our approach further reduces the computational time by exploiting the specificities of modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax.”

    arxiv1609.04309-f3.png

    [Image source. Note similarity to Fig. 1 / use in Adaptive Input Representations for Neural Language Modeling. Click image to open in new window.]


[Table of Contents]

Probing the Effectiveness of Pretrained Language Models

Contextual word representations derived from pretrained bidirectional language models (biLM) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks, including question answering, entailment and sentiment classification, constituency parsing, named entity recognition, and text classification. However, many questions remain as to how and why these models are so effective.

Deep RNNs Encode Soft Hierarchical Syntax (May 2018), from Luke Zettlemoyer and colleagues at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, evaluated how well a simple feedforward classifier could detect syntax features (part of speech tags, as well as various levels of constituent labels) from the word representations produced by the RNN layers of deep NLP models trained on the tasks of dependency parsing, semantic role labeling, machine translation, and language modeling. They demonstrated that deep RNN trained on NLP tasks learned internal representations that captured soft hierarchical notions of syntax across the different layers of the model (i.e., representations taken from deeper layers of the RNN performed better on higher-level syntax tasks than those from shallower layers), without explicit supervision. These results provide some insight into why deep RNN are able to model NLP tasks without annotated linguistic features. ELMo, for example, represents each word using a task-specific weighted sum of the language model’s hidden layers; i.e., rather than using only the top layer, ELMo selects which of the language model’s internal layers contain the most relevant information for the task at hand.
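
The probing methodology is straightforward to sketch: freeze the network, harvest per-token activations from one layer, and fit a simple diagnostic classifier (here a logistic regression stands in for the paper’s feedforward probe; the data below are random and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_layer(layer_reps, labels):
    """Fit a diagnostic classifier on frozen per-token representations from
    one network layer; held-out accuracy indicates how decodable the
    syntactic label (e.g. a POS tag) is at that depth."""
    n = len(labels)
    split = int(0.8 * n)
    clf = LogisticRegression(max_iter=1000).fit(layer_reps[:split], labels[:split])
    return clf.score(layer_reps[split:], labels[split:])

reps = np.random.randn(500, 64)            # toy "layer activations"
tags = np.random.randint(0, 5, size=500)   # toy POS labels
print(probe_layer(reps, tags))             # chance-level here; real layers differ
```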

An extremely interesting follow-on paper from Luke Zettlemoyer at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, with colleagues at the Allen Institute for Artificial Intelligence – Dissecting Contextual Word Embeddings: Architecture and Representation (August 2018) [note also the Appendices in that paper] – presented a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. They showed there is a tradeoff between speed and accuracy, but all architectures learned high quality contextual representations that outperformed word embeddings for four challenging NLP tasks (natural language inference/textual entailment; semantic role labeling; constituency parsing; named entity recognition).

That study also showed that deep biLM learned a rich hierarchy of contextual information, both at the word and span level, that was captured in three disparate types of network architectures (LSTM, CNN, or self attention). In every case, the learned representations captured a rich hierarchy of contextual information throughout the layers of the network, analogous to how deep CNN trained for image classification learn a hierarchy of image features (Zeiler and Fergus, 2014). For example, they showed that in contrast to traditional word vectors, which encode some semantic information, the word embedding layer of deep biLM focused exclusively on word morphology. Moving upward in the network, the lowest contextual layers of the biLM focused on local syntax, while the upper layers could be used to induce more semantic content, such as within-sentence pronominal coreferent clusters. They also showed that the biLM activations could be used to form phrase representations useful for syntactic tasks. Together, these results suggest that large scale biLM, independent of architecture, are learning much more about the structure of language than previously appreciated.

Regarding the following figure, note the more-or-less similar behavior of the three models on the various tasks, which differ in difficulty/complexity; particularly note the changes in accuracy throughout the depth (layers) of those models. Layer-wise quantitative data are provided in the Appendix of that paper.

arxiv-1808.08949.png

[Image source. Click image to open in new window.]


Evaluation of Sentence Embeddings in Downstream and Linguistic Probing Tasks (Jun 2018) surveyed recent unsupervised word embedding models, including fastText, ELMo, InferSent, and other models (discussed elsewhere in this REVIEW). They noted that two main challenges exist when learning high-quality representations: they should capture semantics/syntax, and the different meanings a word can represent in different contexts (polysemy).

ELMo addressed both of those issues. As in fastText, ELMo breaks the tradition of word embeddings by incorporating sub-word units, but ELMo also has some fundamental differences from previous shallow representations such as fastText or Word2Vec. ELMo uses a deep representation by incorporating the internal representations of the LSTM network, therefore capturing both the meaning and the syntactic aspects of words. Since ELMo is based on a language model, each token representation is a function of the entire input sentence; this overcomes the limitations of previous word embeddings, where each word is usually modeled as an average of its multiple contexts. ELMo embeddings thus provide a better understanding of the contextual meaning of a word, as opposed to traditional word embeddings that are not only context-independent but have a very limited definition of context.

In that paper, it was also interesting to see how the different models performed on different tasks. For example:

  • As discussed in Section 5.1/Table 6 (datasets), ELMo (a language model that employs two Bi-LSTM layers), the Transformer (attention-based) version of USE (Universal Sentence Encoder), and InferSent (a Bi-LSTM trained on the SNLI dataset) generally performed well on downstream classification tasks (Table 6).

    “As seen in Table 6, although no method had a consistent performance among all tasks, ELMo achieved best results in 5 out of 9 tasks. Even though ELMo was trained on a language model objective, it is important to note that in this experiment a bag-of-words approach was employed. Therefore, these results are quite impressive, which lead us to believe that excellent results can be obtained by integrating ELMo and [its] trainable task-specific weighting scheme into InferSent. InferSent achieved very good results in the paraphrase detection as well as in the SICK-E (entailment). We hypothesize that these results were due to the similarity of these tasks to the tasks [where] InferSent was trained on (SNLI and MultiNLI). … The Universal Sentence Encoder (USE) model with the Transformer encoder also achieved good results on the product review (CR) and on the question-type (TREC) tasks. Given that the USE model was trained on SNLI as well as on web question-answer pages, it is possible that these results were also due to the similarity of these tasks to the training data employed by the USE model.”

  • As discussed in Section 5.2/Table 7 (datasets), USE-Transformer and InferSent performed the best on semantic relatedness and textual similarity tasks.

  • As discussed in Section 5.3/Table 8 (datasets), ELMo generally outperformed the other models on linguistic probing tasks.

  • As discussed in Section 5.4/Table 9 (datasets), InferSent outperformed the other models on information retrieval tasks.

Neural language models (LM) are more capable of detecting long distance dependencies than traditional n-gram models, making them stronger models of natural language. However, it is unclear what kinds of properties of language these models encode, which prevents their use as explanatory models and hinders relating them to formal linguistic knowledge of natural language. There is increasing interest in investigating the kinds of linguistic information represented by LM, with a strong focus on their syntactic abilities as well as semantic understanding, such as negative polarity items (NPI). NPI are a class of words bearing the special feature that they need to be licensed by a specific licensing context (LC). A common example of an NPI and LC in English are “any” and “not,” respectively: the sentence “He didn’t buy any books.” is correct, whereas “He did buy any books.” is not.

Do Language Models Understand Anything? On the Ability of LSTMs to Understand Negative Polarity Items (Aug 2018) examined language models and negative polarity items, showing that the model found a relation between the licensing context and the negative polarity item, and appeared to be aware of the scope of this context, which they extracted from a parse tree of the sentence. This research paves the way for other studies linking formal linguistics to deep learning.

arxiv-1808.10627c.png

[Image source. Click image to open in new window.]


Character language models have access to surface morphological patterns, but it is not clear whether or how they learn abstract morphological regularities. Indicatements that Character Language Models Learn English Morpho-syntactic Units and Regularities (Aug 2018) studied a “wordless” character language model with several probes, finding that it could develop a specific unit to identify word boundaries and, by extension, morpheme boundaries, which allowed it to capture linguistic properties and regularities of these units. Their language model proved surprisingly good at identifying the selectional restrictions of English derivational morphemes, a task that required both morphological and syntactic awareness. They concluded that, when morphemes overlap extensively with the words of a language, a character language model can perform morphological abstraction.

A morpheme is a meaningful morphological unit of a language that cannot be further divided; e.g., “incoming” consists of the morphemes “in”, “come” and “-ing”. Another example: “dogs” consists of two morphemes and one syllable: “dog”, and “-s”. A morpheme may or may not stand alone, whereas a word, by definition, is freestanding.

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation (Sep 2018) [code] showed that each embedding model captured more information than is directly apparent, yet their potential performance is limited by the impossibility of optimally surfacing divergent linguistic information at the same time. For example, in word analogy experiments they were able to achieve significant improvements over the original embeddings, yet every improvement in semantic analogies came at the cost of a degradation in syntactic analogies, and vice versa. At the same time, their work showed that the effect of this phenomenon differs between unsupervised systems that directly use embedding similarities and supervised systems that use pretrained embeddings as features, as the latter have enough expressive power to learn the optimal balance themselves.

Relevant to the language models domain (if not directly employed), Firearms and Tigers are Dangerous, Kitchen Knives and Zebras are Not: Testing whether Word Embeddings Can Tell (Sep 2018) presented an approach to investigating the nature of semantic information captured by word embeddings. They tested the ability of supervised classifiers (a logistic regression classifier, and a basic neural network) to identify semantic features in word embedding vectors, and compared this to a feature identification method based on full vector cosine similarity. The idea behind this method was that properties identified by classifiers (but not through full vector comparison) are captured by embeddings; properties that cannot be identified by either method are not captured by embeddings. Their results provided an initial indication that semantic properties relevant to the way entities interact (e.g. dangerous) were captured, while perceptual information (e.g. colors) was not.

Generative Adversarial Networks (GANs) are a promising approach for text generation that, unlike traditional language models (LM), does not suffer from the problem of “exposure bias.” However, a major hurdle for understanding the potential of GANs for text generation is the lack of a clear evaluation metric. Evaluating Text GANs as Language Models (Oct 2018) proposed approximating the distribution of text generated by a GAN, which permitted evaluating them with traditional probability-based LM metrics. They applied their approximation procedure to several GAN-based models, showing that they performed substantially worse than state of the art LMs. Their evaluation procedure promotes a better understanding of the relation between GANs and LMs, and could accelerate progress in GAN-based text generation.

arxiv1810.12686-f1.png

[Image source. Click image to open in new window.]


arxiv1810.12686-t1+f2.png

[Image source. Click image to open in new window.]


Language Models: Additional Reading

  • Transformer-XL: Language Modeling with Longer-Term Dependency (ICLR 2019) [discussion;  mentioned]

    “We propose a novel architecture, Transformer-XL, for language modeling with longer-term dependency. Our main technical contributions include introducing the notion of recurrence in a purely self-attentive model and deriving a novel positional encoding scheme. Transformer-XL is the first self-attention model that achieves substantially better results than RNNs on both character-level and word-level language modeling. Transformer-XL is also able to model longer-term dependency than RNNs and Transformer.”

  • Improving Sentence Representations with Multi-view Frameworks

    “… we present two multi-view frameworks for learning sentence representations in an unsupervised fashion. One framework uses a generative objective and the other a discriminative one. In both frameworks, the final representation is an ensemble of two views, in which one view encodes the input sentence with a recurrent neural network (RNN) and the other view encodes it with a simple linear model. We show that, after learning, the vectors produced by our multi-view frameworks provide improved representations over their single-view learned counterparts, and the combination of different views gives representational improvement over each view and demonstrates solid transferability on standard downstream tasks.”

  • BioSentVec: creating sentence embeddings for biomedical texts (Oct 2018) [dataset]

    • “… Although pre-trained sentence encoders are available in the general domain, none exists for biomedical texts to date. In this work, we introduce BioSentVec: the first open set of sentence embeddings trained with over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. We evaluate BioSentVec embeddings in two sentence pair similarity tasks in different text genres. Our benchmarking results demonstrate that the BioSentVec embeddings can better capture sentence semantics compared to the other competitive alternatives and achieve state-of-the-art performance in both tasks. We expect BioSentVec to facilitate the research and development in biomedical text mining and to complement the existing resources in biomedical word embeddings.”

    • BioWordVec: biomedical word embeddings with fastText. We applied fastText to compute 200-dimensional word embeddings. We set the window size to be 20, learning rate 0.05, sampling threshold 1e-4, and negative examples 10. Both the word vectors and the model with hyperparameters are available for download below. The model file can be used to compute word vectors that are not in the dictionary (i.e. out-of-vocabulary terms).

    BioSentVec [1]: biomedical sentence embeddings with sent2vec. We applied sent2vec to compute the 700-dimensional sentence embeddings. We used the bigram model and set window size to be 20 and negative examples 10. [A minimal usage sketch appears at the end of this list.]

  • Do RNNs learn human-like abstract word order preferences? (Nov 2018) [code]

    “RNN language models have achieved state-of-the-art results on various tasks, but what exactly they are representing about syntax is as yet unclear. Here we investigate whether RNN language models learn humanlike word order preferences in syntactic alternations. We collect language model surprisal [← sic] scores for controlled sentence stimuli exhibiting major syntactic alternations in English: heavy NP shift, particle shift, the dative alternation, and the genitive alternation. We show that RNN language models reproduce human preferences in these alternations based on NP length, animacy, and definiteness. We collect human acceptability ratings for our stimuli, in the first acceptability judgment experiment directly manipulating the predictors of syntactic alternations. We show that the RNNs’ performance is similar to the human acceptability ratings and is not matched by an n-gram baseline model. Our results show that RNNs learn the abstract features of weight, animacy, and definiteness which underlie soft constraints on syntactic alternations.”
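
As noted above, a minimal usage sketch for such pretrained sentence embeddings, assuming the epfml/sent2vec Python bindings and a locally downloaded BioSentVec model (the file name shown is illustrative):

```python
import sent2vec  # Python bindings from the epfml/sent2vec repository

# Hypothetical local path to the released 700-dimensional BioSentVec model.
model = sent2vec.Sent2vecModel()
model.load_model('BioSentVec_PubMed_MIMICIII-bigram_d700.bin')

emb = model.embed_sentence('breast cancer patients were treated with tamoxifen')
print(emb.shape)  # (1, 700)
```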

[Table of Contents]

RNN, CNN, or Self-Attention?

In the course of writing this REVIEW and in my other readings I often encountered discussions of RNN vs. CNN vs. self-attention architectures in regard to NLP and language models. Here, I collate and summarize/paraphrase some of those observations; green-colored URL are internal hyperlinks to discussions of those items elsewhere in this REVIEW.

  • Dissecting Contextual Word Embeddings: Architecture and Representation (Aug 2018) discussed contextual word representations derived from pretrained bidirectional language models (biLM), showing that Deep biLM learned a rich hierarchy of contextual information that was captured in three disparate types of network architectures: LSTM, CNN, or self attention. In every case, the learned representations represented a rich hierarchy of contextual information throughout the layers of the network.

  • Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers (Jun 2018) studied the role of linguistic context in predicting quantifiers (“few”, “all”). Overall, LSTM were the best-performing architectures, with CNN showing some potential in the handling of longer sequences.

    arxiv-1806.00354d.png

    [Image source. Click image to open in new window.]


  • Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum (May 2018) discussed LSTM vs. self attention. In a very interesting ablation study, they presented an alternate view to explain the success of LSTM: LSTM are a hybrid of a simple RNN (S-RNN) and a gated model that dynamically computes weighted sums of the S-RNN outputs. Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicated that LSTM suffer little to no performance loss when removing the S-RNN. This provided evidence that the gating mechanism was doing the heavy lifting in modeling context. They further ablated the recurrence in each gate and found that this incurred only a modest drop in performance, indicating that the real modeling power of LSTM stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs. This realization allowed them to mathematically relate LSTM and other gated RNNs to attention-based models. Casting an LSTM as a dynamically-computed attention mechanism enabled the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.
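
    The paper's central identity is easy to verify numerically: unrolling the memory-cell recurrence $\small c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ shows that the cell state is an element-wise weighted sum of the content vectors (which, once the S-RNN is ablated, depend only on the current input). A small numpy sketch with random stand-in gate activations:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 4
    c_tilde = rng.normal(size=(T, d))  # content vectors (input-only if S-RNN is ablated)
    i_gate = rng.uniform(size=(T, d))  # input-gate activations in (0, 1)
    f_gate = rng.uniform(size=(T, d))  # forget-gate activations in (0, 1)

    # Standard recurrence: c_t = f_t * c_{t-1} + i_t * c~_t
    c = np.zeros(d)
    for t in range(T):
        c = f_gate[t] * c + i_gate[t] * c_tilde[t]

    # Equivalent element-wise weighted sum over all timesteps:
    # c_T = sum_j (i_j * prod_{k>j} f_k) * c~_j
    c_sum = np.zeros(d)
    for j in range(T):
        w = i_gate[j] * np.prod(f_gate[j + 1:], axis=0)
        c_sum += w * c_tilde[j]

    assert np.allclose(c, c_sum)  # the memory cell is a weighted sum
    ```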

  • While RNN are a cornerstone in learning latent representations from long text sequences, a purely convolutional and deconvolutional autoencoding framework may be employed, as described in Deconvolutional Paragraph Representation Learning (Sep 2018). That paper addressed the issue that the quality of sentences during RNN-based decoding (reconstruction) decreased with the length of the text. Compared to RNN, their framework was better at reconstructing and correcting long paragraphs. Note Table 1 in their paper (showing paragraphs reconstructed
    Source
    by LSTM and CNN, as well as the vastly superior BLEU / ROUGE scores in Table 2
    Source
    ); there is also additional NLP-related LSTM vs. CNN discussion in this Hacker News thread.

  • Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition (Aug 2018) compared the use of LSTM based and CNN based character level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus showed that the use of either type of character level word embeddings in conjunction with the BiLSTM-CRF models led to comparable state of the art performance. However, the models using CNN based character level word embeddings had a computational performance advantage, increasing training time over word based models by 25% while the LSTM based character level word embeddings more than doubled the required training time.
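
    For reference, a minimal sketch of the CNN-based character-level word embedding component used in such taggers, assuming PyTorch; the dimensions are illustrative, not those of the paper:

    ```python
    import torch
    import torch.nn as nn

    class CharCNNEmbedding(nn.Module):
        """CNN-based character-level word embedding: embed characters,
        convolve, then max-pool over character positions."""
        def __init__(self, n_chars=100, char_dim=25, n_filters=50, kernel=3):
            super().__init__()
            self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
            self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=kernel,
                                  padding=kernel // 2)

        def forward(self, char_ids):          # (batch, max_word_len)
            x = self.emb(char_ids)            # (batch, len, char_dim)
            x = self.conv(x.transpose(1, 2))  # (batch, n_filters, len)
            return x.max(dim=2).values        # pool over characters

    # A word encoded as (hypothetical) character indices:
    word = torch.tensor([[12, 8, 13, 11, 27]])
    vec = CharCNNEmbedding()(word)            # (1, 50) word representation
    ```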

    arxiv-1808.08450a.png

    [Image source. Click image to open in new window.]


    arxiv-1808.08450b.png

    [Image source. Click image to open in new window.]


  • Recently, non-recurrent architectures (convolutional; self-attentional) have outperformed RNN in neural machine translation. CNN and self-attentional networks can connect distant words via shorter network paths than RNN, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument had not been tested empirically, nor had alternative explanations for their strong performance been explored in-depth. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (Aug 2018) hypothesized that the strong performance of CNN and self-attentional networks could be due to their ability to extract semantic features from the source text. They evaluated RNN, CNN and self-attention networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Experimental results showed that self-attentional networks and CNN did not outperform RNN in modeling subject-verb agreement over long distances, and that self-attentional networks performed distinctly better than RNN and CNN on word sense disambiguation.

    arxiv-1808.08946d.png

    [Image source. Click image to open in new window.]


  • Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (Aug 2018) proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral difference between convolutional and recurrent neural networks, they generated adversarial examples to confuse the model and compare to human performance. They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation function and formed majority-vote ensembles of the nine models with the highest validation accuracy. All the hierarchical single and ensemble models outperformed the previous state of the art on both the validation and test sets. With a test accuracy of 85.12, the RNN-LSTM ensemble achieved a new state of the art that is more than five percentage points above the previous best result. Furthermore, the RNN-LSTM aggregation function is superior to aggregation via CNNs, improving the validation accuracy by 1.5 percentage points.

    The hierarchical structure was crucial for the model’s success. Adding it to the CNN that operates only at word level caused a pronounced improvement on the validation set. It seems to be the case that the hierarchical structure helps the model to gain confidence, causing more models to make the correct prediction. In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

  • The architecture proposed in QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (Apr 2018) did not require RNN: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD1.1 dataset their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data.

    • Likewise, a later paper, A Fully Attention-Based Information Retriever (Oct 2018) that also relied entirely on a (convolutional and/or) self-attentional model achieved competitive results on SQuAD1.1 while having fewer parameters and being faster at both learning and inference than rival (largely RNN-based) methods. Their FABIR model was significantly outperformed by the highly similar – and non-cited – competing QANet model.
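
    To make the recurrence-free design concrete, here is a minimal sketch (assuming PyTorch) of an encoder block that combines depthwise-separable convolution (local interactions) with multi-head self-attention (global interactions), in the spirit of QANet but omitting its positional encodings, feed-forward sublayer, and stacked repetitions:

    ```python
    import torch
    import torch.nn as nn

    class ConvSelfAttnBlock(nn.Module):
        """Depthwise-separable convolution (local) + self-attention (global)."""
        def __init__(self, d_model=128, kernel=7, n_heads=8):
            super().__init__()
            self.depthwise = nn.Conv1d(d_model, d_model, kernel,
                                       padding=kernel // 2, groups=d_model)
            self.pointwise = nn.Conv1d(d_model, d_model, 1)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):                  # (batch, seq_len, d_model)
            h = self.norm1(x).transpose(1, 2)  # conv expects (B, C, L)
            x = x + self.pointwise(self.depthwise(h)).transpose(1, 2)
            h = self.norm2(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            return x

    x = torch.randn(2, 30, 128)   # 2 passages, 30 tokens each
    y = ConvSelfAttnBlock()(x)    # same shape; no recurrence anywhere
    ```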

  • Another model, Reinforced Mnemonic Reader for Machine Reading Comprehension (Jun 2018), performed as well as QANet. Based on a Bi-LSTM, Reinforced Mnemonic Reader is an enhanced attention reader – suggesting perhaps that the improvements in QANet, Reinforced Mnemonic Reader, and the work described in Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (second preceding paragraph) were due to the attention mechanisms, rather than the RNN or CNN architectures.

  • Likewise, Constituency Parsing with a Self-Attentive Encoder (May 2018) [code] demonstrated that replacing a LSTM encoder with a self-attentive architecture could lead to improvements to a state of the art discriminative constituency parser. The use of attention made explicit the manner in which information was propagated between different locations in the sentence; for example, separating positional and content information in the encoder led to improved parsing accuracy. They evaluated a version of their model that used ELMo as the sole lexical representation, using publicly available ELMo weights. Trained on the Penn Treebank, their parser attained
    Source
    93.55 $\small F_1$ without the use of any external data, and 95.13 $\small F_1$ when using pre-trained word representations. The gains came not only from incorporating more information (such as subword features or externally trained word representations), but also from structuring the architecture to separate different kinds of information from each other.

    arxiv1805.01052e.png

    [Image source. Click image to open in new window.]


    arxiv1805.01052d.png

    [Image source. Click image to open in new window.]


[Table of Contents]

LSTM, Attention and Gated (Recurrent) Units

Here I collate and summarize/paraphrase gated unit mechanism-related discussion from elsewhere in this REVIEW.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Sep 2014) by Kyunghyun Cho et al. [Yoshua Bengio] described a RNN encoder-decoder for statistical machine translation that introduced a new type of hidden unit ($\small \mathcal{f}$ in the equation, below) – the gated recurrent unit (GRU) – which was motivated by the LSTM unit but was much simpler to compute and implement.

    Recurrent Neural Networks. A recurrent neural network (RNN) is a neural network that consists of a hidden state $\small \mathbf{h}$ and an optional output $\small \mathbf{y}$ which operates on a variable-length sequence $\small \mathbf{x} = (x_1, \ldots, x_T)$. At each time step $\small t$, the hidden state $\small \mathbf{h_{\langle t \rangle}}$ of the RNN is updated by

      $\small \mathbf{h_{\langle t \rangle}} = f (\mathbf{h_{\langle t-1 \rangle}}, x_t)$,
    where $\small \mathcal{f}$ is a non-linear activation function. $\small \mathcal{f}$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

    Hidden Unit that Adaptively Remembers and Forgets. ... we also propose a new type of hidden unit ($\small f$ in the equation, above) that has been motivated by the LSTM unit but is much simpler to compute and implement. [The LSTM unit has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit.]

    This figure shows the graphical depiction of the proposed hidden unit:

    arxiv1406.1078-f2.png
    [Image source. Click image to open in new window.]

    "In this formulation [see Section 2.3 in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation for details], when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.

    "On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013). As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active. ..."

In their very highly cited paper Neural Machine Translation by Jointly Learning to Align and Translate (Sep 2014; updated May 2016), Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio employed their “gated hidden unit” (a GRU) – introduced by Cho et al. (2014) (above) – for neural machine translation. Their model consisted of a forward and backward pair of RNN (BiRNN) for the encoder, and a decoder that emulated searching through a source sentence while decoding a translation.

From Appendix A in that paper:

“For the activation function $\small f$ of an RNN, we use the gated hidden unit recently proposed by Cho et al. (2014a). The gated hidden unit is an alternative to the conventional simple units such as an element-wise $\small \text{tanh}$. This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies. This is made possible by having computation paths in the unfolded RNN for which the product of derivatives is close to 1. These paths allow gradients to flow backward easily without suffering too much from the vanishing effect. It is therefore possible to use LSTM units instead of the gated hidden unit described here, as was done in a similar context by Sutskever et al. (2014).”

Discussed in Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention (Jul 2018):

“The attention mechanism has been a breakthrough in neural machine translation (NMT) in recent years. This mechanism calculates how much attention the network should give to each source word to generate a specific translated word. The context vector calculated by the attention mechanism mimics the syntactic skeleton of the input sentence precisely given a sufficient number of examples. Recent work suggests that incorporating explicit syntax alleviates the burden of modeling grammatical understanding and semantic knowledge from the model.”

GRUs have fewer parameters than LSTM, as they lack an output gate. A GRU has two gates (an update gate and a reset gate), while an LSTM has three gates (input, forget and output gates). The GRU update gate decides how much information from the past should be let through, while the reset gate decides how much information from the past should be discarded. What motivates this? Although RNNs can theoretically capture long-term dependencies, they are actually very hard to train to do this [see this discussion]. GRUs are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Although a GRU is computationally more efficient than an LSTM network due to its reduced number of gates, it still trails the LSTM network in terms of performance. Therefore, GRUs are often used when faster training is needed and computational resources are limited.

Counting in Language with RNNs (Oct 2018) examined a possible reason for LSTM outperforming GRU on language modeling and more specifically machine translation. They hypothesized that this had to do with counting – a consistent theme across the literature of long term dependence, counting, and language modeling for RNNs. Using the simplified forms of language – context-free and context-sensitive languages – they showed how the LSTM performs its counting based on their cell states during inference, and why the GRU cannot perform as well.

“As argued in the Introduction, we believe there is a lot of evidence supporting the claim that success at language modeling requires an ability to count. Since there is empirical support for the fact that the LSTM outperforms the GRU in language related tasks, we believe that our results showing how fundamental this inability to count is for the GRU, we believe we make a contribution to the study of both RNNs and their success on language related tasks. Our experiments along with the other recent paper by Weiss et al. [2017], show almost beyond reasonable doubt that the GRU is not able to count as well as the LSTM, furthering our hypothesis that there is a correlation between success at performance on language related tasks and the ability to count.”

Germane to this subsection (“LSTM, Attention and Gated (Recurrent) Units”) is the excellent companion blog post to When Recurrent Models Don't Need To Be Recurrent, in which coauthor John Miller discusses a very interesting paper by Dauphin et al., Language Modeling with Gated Convolutional Networks (Sep 2017). Some highlights from that paper:

  • “Gating has been shown to be essential for recurrent neural networks to reach state-of-the-art performance. Our gated linear units reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities (Section 5.2). We show that gated convolutional networks outperform other recently published language models such as LSTMs trained in a similar setting on the Google Billion Word Benchmark (Chelba et al., 2013). …

  • “Gating mechanisms control the path through which information flows in the network and have proven to be useful for recurrent neural networks. LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep. In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers. We show this mechanism to be useful for language modeling as it allows the model to select which words or features are relevant for predicting the next word. …

  • “Gated linear units are a simplified gating mechanism based on the work of Dauphin & Grangier [Predicting distributions with Linearizing Belief Networks (Nov 2015; updated May 2016)] for non-deterministic gates that reduce the vanishing gradient problem by having linear units coupled to the gates. This retains the non-linear capabilities of the layer while allowing the gradient to propagate through the linear unit without scaling. … We compare the different gating schemes experimentally in Section 5.2 and we find gated linear units allow for faster convergence to better perplexities.”
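
    A minimal sketch of a causal gated convolutional layer in this style, assuming PyTorch (whose `F.glu` implements exactly the split-and-gate $\small a \otimes \sigma(b)$):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedConv1d(nn.Module):
        """Gated linear unit over a causal convolution (Dauphin et al. style)."""
        def __init__(self, d_model=64, kernel=4):
            super().__init__()
            self.pad = kernel - 1     # causal: never look at future tokens
            # Twice the channels: half become values, half become gates
            self.conv = nn.Conv1d(d_model, 2 * d_model, kernel)

        def forward(self, x):                       # (batch, d_model, seq)
            h = self.conv(F.pad(x, (self.pad, 0)))  # left-pad only
            return F.glu(h, dim=1)                  # a * sigmoid(b)

    x = torch.randn(2, 64, 20)  # a batch of 20-token sequences
    y = GatedConv1d()(x)        # same shape; gradients pass through `a` unscaled
    ```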

  • “The unlimited context offered by recurrent models is not strictly necessary for language modeling.” 

    “In other words, it’s possible you don’t need a large amount of context to do well on the prediction task on average. Recent theoretical work offers some evidence in favor of this view (Prediction with a Short Memory). Another explanation is given by Bai et al. (An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling):

  • “The ‘infinite memory’ advantage of RNNs is largely absent in practice.” 

    As Bai et al. report, even in experiments explicitly requiring long-term context, RNN variants were unable to learn long sequences. On the Billion Word Benchmark, an intriguing Google Technical Report suggests an LSTM $\small n$-gram model with $\small n=13$ words of memory is as good as an LSTM with arbitrary context (N-gram Language Modeling using Recurrent Neural Network Estimation). This evidence leads us to conjecture:

  • “Recurrent models trained in practice are effectively feedforward.” 

    This could happen either because truncated backpropagation through time cannot learn patterns significantly longer than $\small k$ steps, or, more provocatively, because models trainable by gradient descent cannot have long-term memory.

Gated-Attention Readers for Text Comprehension (Jun 2016; updated Apr 2017) [code], by Ruslan Salakhutdinov and colleagues, employed the attention mechanism introduced by Yoshua Bengio and colleagues (Neural Machine Translation by Jointly Learning to Align and Translate) in their model, the Gated-Attention Reader (GA Reader). The GA Reader integrated a multi-hop architecture with a novel attention mechanism, which was based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enabled the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtained state of the art results on three benchmarks for this task. The effectiveness of multiplicative interaction was demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention.

“Deep learning models have been shown to outperform traditional shallow approaches on text comprehension tasks. The success of many recent models can be attributed primarily to two factors:

  1. Multi-hop architectures allow a model to scan the document and the question iteratively for multiple passes.
  2. Attention mechanisms, borrowed from the machine translation literature, allow the model to focus on appropriate subparts of the context document.

Intuitively, the multi-hop architecture allows the reader to incrementally refine token representations, and the attention mechanism re-weights different parts in the document according to their relevance to the query.

… In this paper, we focus on combining both in a complementary manner, by designing a novel attention mechanism which gates the evolving token representations across hops. … More specifically, unlike existing models where the query attention is applied either token-wise or sentence-wise to allow weighted aggregation, the Gated-Attention module proposed in this work allows the query to directly interact with each dimension of the token embeddings at the semantic-level, and is applied layer-wise as information filters during the multi-hop representation learning process. Such a fine-grained attention enables our model to learn conditional token representations with respect to the given question, leading to accurate answer selections.”
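
Schematically, the gated-attention step reduces to: for each document token, attend over the query words, then gate the token element-wise with the resulting token-specific query summary. A numpy sketch with random stand-in encodings:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 64))   # document token representations
Q = rng.normal(size=(8, 64))    # query token representations

gated = np.empty_like(D)
for i, d_i in enumerate(D):
    alpha = softmax(Q @ d_i)    # attention of token i over query words
    q_i = alpha @ Q             # query summary specific to this token
    gated[i] = d_i * q_i        # multiplicative gated attention

# In the full GA Reader, `gated` feeds the next recurrent layer, and the
# gate-then-read step is repeated over multiple hops.
```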

arxiv1606.01549-f1.png

[Image source. Click image to open in new window.]


arxiv1606.01549-f3.png

[Image source. Click image to open in new window.]


arxiv1606.01549-t3.png

[Image source. Click image to open in new window.]


A recent review, Comparative Analysis of Neural QA Models on SQuAD (Jun 2018), reported that models based on a gated attention mechanism (R-Net) or a GRU (DocQA) performed well across a variety of tasks.

  • “Gated Self-Matching Networks (R-Net). This model, proposed by Wang et al. (2017), is a multi-layer end-to-end neural network whose novelty lies in the use of a gated attention mechanism so as to give different levels of importance to different passage parts. It also uses self-matching attention for the context to aggregate evidence from the entire passage to refine the query-aware context representation obtained. The architecture contains character and word embedding layers, followed by question-passage encoding and matching layers, a passage self-matching layer and an output layer. The implementation we used had some minor changes for increased efficiency.”

arxiv1806.06972-tables1+2.png

[Image source. Click image to open in new window.]


Gated Self-Matching Networks (R-Net) (mentioned above) – proposed by Wang et al. (2017) [code] – is a multi-layer end-to-end neural network whose novelty lies in the use of a gated attention mechanism, which gives different levels of importance to different passage parts. It also uses self-matching attention for the context to aggregate evidence from the entire passage to refine the query-aware context representation. The architecture contains character and word embedding layers, followed by question-passage encoding and matching layers, a passage self-matching layer and an output layer.

“… we present the gated self-matching networks for reading comprehension style question answering, which aims to answer questions from a given passage. We first match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation. Then we propose a self-matching attention mechanism to refine the representation by matching the passage against itself, which effectively encodes information from the whole passage. We finally employ the pointer networks to locate the positions of answers from the passages. We conduct extensive experiments on the SQuAD dataset. The single model achieves 71.3% on the evaluation metrics of exact match on the hidden test set, while the ensemble model further boosts the results to 75.9%. At the time of submission of the paper, our model holds the first place on the SQuAD Leaderboard for both single and ensemble model.”

  • “We choose to use Gated Recurrent Unit (GRU) (Cho et al., 2014) in our experiment since it performs similarly to LSTM (Hochreiter and Schmidhuber, 1997) but is computationally cheaper. … We propose a gated attention-based recurrent network to incorporate question information into passage representation. It is a variant of attention-based recurrent networks, with an additional gate to determine the importance of information in the passage regarding a question.”

Wang2017-f1.png

[Image source. Click image to open in new window.]


Wang2017-t2.png

[Image source. Click image to open in new window.]


Wang2017-f2.png

[Image source. Click image to open in new window.]


Microsoft Research recently published S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension (Jun 2017; updated Jan 2018), a novel approach to machine reading comprehension for the MS-MARCO dataset that aimed to answer a question from multiple passages via an extraction-then-synthesis framework to synthesize answers from extraction results. The Microsoft Research approach employed bidirectional gated recurrent units (BiGRU) rather than Bi-LSTM. The answer extraction model was first employed to predict the most important sub-spans from the passage as evidence, which the answer synthesis model took as additional features along with the question and passage to further elaborate the final answers. They built the answer extraction model for single passage reading comprehension, and proposed an additional task of ranking the single passages to help in answer extraction from multiple passages.

Facebook AI Research recently (May 2018) developed a seq2seq based self-attention mechanism to model long-range context (Hierarchical Neural Story Generation), demonstrated via story generation. They found that standard seq2seq models applied to hierarchical story generation were prone to degenerating into language models that paid little attention to the writing prompt (a problem noted in other domains, such as dialogue response generation). They tackled the challenges of story-telling with a hierarchical model, which first generated a sentence called “the prompt” (describing the topic for the story), and then “conditioned” on this prompt when generating the story. Conditioning on the prompt or premise made it easier to generate consistent stories, because they provided grounding for the overall plot. It also reduced the tendency of standard sequence models to drift off topic. To improve the relevance of the generated story to its prompt, they adopted a GRU-based “fusion mechanism,” which pretrains a language model and subsequently trains a seq2seq model with a gating mechanism that learns to leverage the final hidden layer of the language model during seq2seq training. The model showed, for the first time, that fusion mechanisms could help seq2seq models build dependencies between their input and output.

  • The gated self-attention mechanism allowed the model to condition on its previous outputs at different time-scales (i.e., to model long-range context).

  • Similar to Google’s Transformer, Facebook AI Research used multi-head attention to allow each head to attend to information at different positions. However, the queries, keys and values in their model were not given by linear projections (see Section 3.2.2 in the Transformer paper), but by more expressive gated deep neural nets with gated linear unit activations: gating lent the self-attention mechanism crucial capacity to make fine-grained selections.

Researchers at Peking University (Junyang Lin et al.) recently developed a model that employed a Bi-LSTM decoder in a text summarization task [Global Encoding for Abstractive Summarization (Jun 2018)]. Their approach differed from a similar approach [not cited] by Richard Socher and colleagues at Salesforce, in that Lin et al. fed their encoder output at each time step into a convolutional gated unit that, with a self-attention mechanism, allowed the encoder output at each time step to become a new representation vector with further connection to the global source-side information. Self-attention encouraged the model to learn long-term dependencies, without creating much computational complexity. The gate (based on the generation from the CNN and self-attention module for the source representations from the RNN encoder) could perform global encoding on the encoder outputs. Based on the output of the CNN and self-attention, the logistic sigmoid function outputted a vector of values between 0 and 1 at each dimension. If the value was close to 0, the gate removed most of the information at the corresponding dimension of the source representation, and if it was close to 1 it preserved most of the information. The model thus performed neural abstractive summarization through a global encoding framework, which controlled the information flow from the encoder to the decoder based on the global information of the source context, generating summaries of higher quality while reducing repetition.
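
The gating step itself is simple; a schematic numpy illustration in which the gate logits are random stand-ins for the output of the CNN/self-attention module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
T, d = 12, 8
h = rng.normal(size=(T, d))   # RNN encoder outputs, one row per time step

# Stand-in for the convolutional gated unit + self-attention output;
# in the paper this is computed from the full sequence of encoder states.
g_logits = rng.normal(size=(T, d))

gate = sigmoid(g_logits)      # one value in (0, 1) per dimension
h_filtered = h * gate         # ~0 removes, ~1 preserves information
```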

In October 2018 Myeongjun Jang and Pilsung Kang at Korea University presented Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition, which introduced their P-thought  model. P-thought employed a seq2seq structure with a gated recurrent unit (GRU) cell. The encoder transformed the sequence of words from an input sentence into a fixed-sized representation vector, whereas the decoder generated the target sentence based on the given sentence representation vector. The P-thought model had two decoders: when the input sentence was given, the first decoder, named “auto-decoder,” generated the input sentence as-is. The second decoder, named “paraphrase-decoder,” generated the paraphrase sentence of the input sentence.

Biomedical event extraction is a crucial task in biomedical text mining. As the primary forum for international evaluation of different biomedical event extraction technologies, the BioNLP Shared Task represents a trend in biomedical text mining toward fine-grained information extraction. The 2016 BioNLP Shared Task (BioNLP-ST 2016) proposed three tasks, in which the “Bacteria Biotope” (BB) event extraction task was added to the previous BioNLP-ST. Biomedical event extraction based on GRU integrating attention mechanism (Aug 2018) proposed a novel gated recurrent unit network framework (integrating an attention mechanism) for extracting biomedical events between biotopes and bacteria from the biomedical literature, utilizing the corpus from the BioNLP-ST 2016 Bacteria Biotope task. The experimental results showed that the presented approach could achieve an $\small F$-score of 57.42% in the test set, outperforming previous state of the art official submissions to BioNLP-ST 2016.

PMID30367569-f1.png

[Image source. Click image to open in new window.]


PMID30367569-f2.png

[Image source. Click image to open in new window.]


PMID30367569-t2+t3.png

[Image source. Click image to open in new window.]


LSTM, Attention and Gated (Recurrent) Units:

Additional Reading

[Table of Contents]

Question Answering and Reading Comprehension

Question answering (QA), the identification of short accurate answers to users’ questions presented in natural language, has numerous applications in the biomedical and clinical sciences including directed search, interactive learning and discovery, clinical decision support, and recommendation. Due to the large size of the biomedical literature and a lack of efficient searching strategies, researchers and medical practitioners often struggle to obtain the information necessary for their needs. Moreover, even the most sophisticated search engines are not intelligent enough to interpret clinicians’ questions. Thus, there is an urgent need for information retrieval systems that accept queries in natural language and return accurate answers quickly and efficiently.

Question answering (a natural language understanding problem) and reading comprehension (the task of answering a natural language question about a paragraph) are of considerable interest in NLP motivating, for example, the Human-Computer Question Answering Competition (in the NIPS 2017 Competition Track), and the BioASQ Challenge in the BioNLP domain. Unlike generic text summarization, reading comprehension systems facilitate the answering of targeted questions about specific documents, efficiently extracting facts and insights (How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks).

The Stanford Question Answering Dataset / Leaderboard (SQuAD: developed at Stanford University) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment or span of text from the corresponding reading passage (or, the question might be unanswerable). There has been rapid progress on the SQuAD dataset, and early in 2018 engineered systems started achieving and surpassing human level accuracy on the SQuAD1.1 Leaderboard (discussion: AI Outperforms Humans in Question Answering: Review of three winning SQuAD systems). SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written by crowdworkers to look adversarially similar to answerable ones (ACL 2018 Best Short Paper: Know What You Don’t Know: Unanswerable Questions for SQuAD;  [project/code]). To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Carnegie Mellon University/Google Brain’s QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (Apr 2018) [OpenReview | discussion/TensorFlow implementation | code] proposed a method (QANet) that did not require RNN: its encoder consisted exclusively of convolution and self-attention, where convolution modeled local interactions and self-attention modeled global interactions. On the SQuAD dataset (SQuAD1.1: see the leaderboard), their model was 3-13x faster in training and 4-9x faster in inference while achieving accuracy equivalent to recurrent models – allowing them to train their model with much more data.

  • Note that A Fully Attention-Based Information Retriever (Oct 2018) – which failed to cite the earlier, more performant QANet work, which scores much higher on the SQuAD1.1 Leaderboard – also employed an entirely convolutional and/or self-attention architecture, which performed satisfactorily on the SQuAD1.1 dataset and was faster to train than RNN-based approaches.

arxiv1804.09541.png

[Image source. Click image to open in new window.]


adversarial_SQuAD.png

[Image sources: Table 6Table 3. Click image to open in new window.]


Another model, Reinforced Mnemonic Reader for Machine Reading Comprehension (May 2017; updated Jun 2018) [non-author implementations: MnemonicReader | MRC | MRC-models] performed as well as QANet, outperforming previous systems by over 6% in terms of both Exact Match (EM) and $\small F_1$ metrics on two adversarial SQuAD datasets. Reinforced Mnemonic Reader, based on Bi-LSTM, is an enhanced attention reader with two main contributions: (i) a reattention mechanism, introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures, and (ii) a dynamic-critical reinforcement learning approach, to address the convergence suppression problem that exists in traditional reinforcement learning methods.

arxiv1705.02798c.png

[Image source. Click image to open in new window.]


arxiv1705.02798e.png

[Image source. Click image to open in new window.]


In April 2018 IBM Research introduced a new dataset for reading comprehension (DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension) [project | data]. DuoRC is a large scale reading comprehension (RC) dataset of 186K human-generated QA pairs created from 7,680 pairs of parallel movie plots taken from Wikipedia and IMDb. By design, DuoRC ensures very little or no lexical overlap between the questions created from one version and segments containing answers in the other version. Essentially, this is a paraphrase dataset, which should be very useful for training reading comprehension models. For example, the authors observed that state of the art neural reading comprehension models that achieved near human performance on the SQuAD dataset exhibited very poor performance on the DuoRC dataset ($\small F_1$ scores of 37.42% on DuoRC vs. 86% on SQuAD), opening research avenues in which DuoRC could complement other RC datasets in the exploration of novel neural approaches to studying language understanding.

DuoRC might be a useful dataset for training sentence embedding approaches to natural language tasks such as machine translation, document classification, sentiment analysis, etc. In this regard, note that the Conclusions section in Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition stated: “The main limitation of the current work is that there are insufficient paraphrase sentences for training the models. ”

arxiv1804.07927.png

[Image source. Click image to open in new window.]


In a very thorough and thoughtful analysis, Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension (Aug 2018) [code] proposed a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieved state of the art results on the MovieQA question answering dataset. To investigate the limitations of their model as well as the behavioral difference between convolutional and recurrent neural networks, they generated adversarial examples to confuse the model and compare to human performance.

arxiv-1808.08744.png

[Image source. Click image to open in new window.]


arxiv-1808.08744b.png

[Image source. Click image to open in new window.]


Highlights from this work are [substantially] paraphrased here:

  • They trained 11 models with different random initializations for both the CNN and RNN-LSTM aggregation function and formed majority-vote ensembles of the nine models with the highest validation accuracy.

  • All the hierarchical single and ensemble models outperformed the previous state of the art on both the validation and test sets. With a test accuracy of 85.12, the RNN-LSTM ensemble achieved a new state of the art that is more than five percentage points above the previous best result. Furthermore, the RNN-LSTM aggregation function is superior to aggregation via CNNs, improving the validation accuracy by 1.5 percentage points.

  • The hierarchical structure was crucial for the model’s success. Adding it to the CNN that operates only at word level caused a pronounced improvement on the validation set. It seems to be the case that the hierarchical structure helps the model to gain confidence, causing more models to make the correct prediction.

  • The sentence attention allowed them to get more insight into the models’ inner state. For example, it allowed them to check whether the model actually focused on relevant sentences in order to answer the questions. Both model variants [CNN; RNN-LSTM] paid most attention to the relevant plot sentences for 70% of the cases. Identifying the relevant sentences was an important success factor: relevant sentences were ranked highest only in 35% of the incorrectly solved questions.

  • Textual entailment was required to solve 60% of the questions …

  • The process of elimination and heuristics proved essential to solve 44% of the questions …

  • Referential knowledge was presumed in 36% of the questions …

  • Furthermore, it was apparent that many questions expected a combination of various reasoning skills.

  • In general, RNN-LSTM models outperformed CNN models, but their results for sentence-level black-box [adversarial] attacks indicated they might share the same weaknesses.

  • Finally, their intensive analysis of the differences between model and human inference suggests that both models seem to learn matching patterns to select the right answer rather than performing plausible inferences as humans do. The results of these studies also imply that other human-like processing mechanisms – such as referential relations, implicit real-world knowledge (i.e., entailment), and answering by elimination via ranking plausibility (Hummel and Holyoak, 2005) – should be integrated into the system to further advance machine reading comprehension.




Collectively, those publications indicate the difficulty in achieving robust reading comprehension, and the need to develop new models that understand language more precisely. Addressing this challenge will require employing more difficult datasets (like SQuAD2.0) for various tasks, evaluation metrics that can distinguish real intelligent behavior from shallow pattern matching, a better understanding of the response to adversarial attack, and the development of more sophisticated models that understand language at a deeper level.

  • The need for more challenging datasets was echoed in the “Creating harder datasets” subsection in Sebastian Ruder’s ACL 2018 Highlights summary.

    In order to evaluate under such settings, more challenging datasets need to be created. Yejin Choi argued during the RepL4NLP panel discussion (a summary can be found here) that the community pays a lot of attention to easier tasks such as SQuAD or bAbI, which are close to solved. Yoav Goldberg even went so far as to say that “SQuAD is the MNIST of NLP”.

    Instead, we should focus on solving harder tasks and develop more datasets with increasing levels of difficulty. If a dataset is too hard, people don’t work on it. In particular, the community should not work on datasets for too long as datasets are getting solved very fast these days; creating novel and more challenging datasets is thus even more important. Two datasets that seek to go beyond SQuAD for reading comprehension were presented at the conference.

    Richard Socher also stressed the importance of training and evaluating a model across multiple tasks during his talk during the Machine Reading for Question Answering workshop. In particular, he argues that NLP requires many types of reasoning, e.g. logical, linguistic, emotional, etc., which cannot all be satisfied by a single task.

  • Read + Verify: Machine Reading Comprehension with Unanswerable Questions (Sep 2018) proposed a novel read-then-verify system that combined a base neural reader with a sentence-level answer verifier trained to (further) validate if the predicted answer was entailed by input snippets. They also augmented their base reader with two auxiliary losses to better handle answer extraction and no-answer detection respectively, and investigated three different architectures for the answer verifier. On the SQuAD2.0 dataset their system achieved an $\small F_1$ score of 74.8 on the development set (ca. August 2018), outperforming the previous best published model by more than 7 points, and the best reported model by ~3.5 points (2018-08-20: SQuAD2.0 Leaderboard).

    arxiv1808.05759a.png

    [Image source. Click image to open in new window.]


    arxiv1808.05759b.png

    [Image source. Click image to open in new window.]


In addition to SQuAD2.0 and DuoRC, other recent datasets related to question-answering and reasoning include:

  • Facebook AI Research’s bAbI, a set of 20 tasks for testing text understanding and reasoning described in detail in the paper by Jason Weston et al., Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks (Dec 2015).

  • University of Pennsylvania’s MultiRC: Reading Comprehension over Multiple Sentences [project (2018) | code], a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. The goal of this dataset is to encourage the research community to explore approaches that can do more than sophisticated lexical-level matching.

  • Allen Institute for Artificial Intelligence (AI2)’s Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (Mar 2018) [project | code], which presented a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI (Stanford Natural Language Inference Corpus). As noted in their Conclusions:

    “Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods. To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, and whose Challenge partition is hard for retrieval and co-occurence methods. We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD. Progress on ARC would thus be an impressive achievement, given its design, and be a significant step forward for the community.”

    • ARC was recently used in Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Scientific Question Answering (Oct 2018) by authors at UC San Diego and Microsoft AI Research. Existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. In that paper, the authors proposed a retriever-reader model that learned to attend on [via self-attention layers] essential terms during the question answering process via an essential-term-aware “retriever” which first identified the most important words in a question, then reformulated the queries and searched for related evidence, and an enhanced “reader” to distinguish between essential terms and distracting words to predict the answer. On the ARC dataset their model outperformed the existing state of the art [e.g., BiDAF] by 8.1%.

      arxiv1808.09492b.png

      [Image source. Click image to open in new window.]


      arxiv1808.09492a.png

      [Image source. Click image to open in new window.]


      arxiv1808.09492c.png

      [Image source. Click image to open in new window.]




Among the many approaches to QA applied to textual sources, the attentional long short-term memory (LSTM)-based and Bi-LSTM memory-based implementations from Richard Socher and colleagues (Salesforce) are particularly impressive:

  • Ask Me Anything: Dynamic Memory Networks for Natural Language Processing (Jun 2015; updated Mar 2016) by Richard Socher (MetaMind) introduced the Dynamic Memory Network (DMN), a neural network architecture that processed input sequences and questions, formed episodic memories, and generated relevant answers. Questions triggered an iterative attention process that allowed the model to condition its attention on the inputs and the result of previous iterations. These results were then reasoned over in a hierarchical recurrent sequence model to generate answers. [For a good overview of the DMN approach, see slides 39-47 in Neural Architectures with Memory.]

    arxiv-1506.07285.png

    [Image source. Click image to open in new window.]


  • Based on analysis of DMN (above), in 2016 Richard Socher/MetaMind (later acquired by Salesforce) proposed several improvements to the DMN memory and input modules. Their DMN+ model (Dynamic Memory Networks for Visual and Textual Question Answering (Mar 2016) [discussion]) improved the state of the art on visual and text question answering datasets, without supporting fact supervision. Non-author DMN+ code available on GitHub includes Theano (Improved-Dynamic-Memory-Networks-DMN-plus) and TensorFlow (Dynamic-Memory-Networks-in-TensorFlow) implementations.

    arxiv-1603.01417a.png

    [Image source. Click image to open in new window.]


    arxiv-1603.01417b.png

    [Image source. Click image to open in new window.]


    arxiv-1603.01417c.png

    [Image source. Click image to open in new window.]


  • Later in 2016, Dynamic Coattention Networks for Question Answering (Nov 2016; updated Mar 2018) [non-author code, on SQuAD2.0] by Richard Socher and colleagues at SalesForce introduced the Dynamic Coattention Network (DCN) for QA. DCN first fused co-dependent representations of the question and the document in order to focus on relevant parts of both, then a dynamic pointing decoder iterated over potential answer spans. This iterative procedure enabled the model to recover from the initial local maxima that correspond to incorrect answers. On the Stanford question answering dataset, a single DCN model improved the previous state of the art from 71.0% $\small F_1$ to 75.9%, while a DCN ensemble obtained a 80.4% $\small F_1$ score.
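
    A simplified, row-vector rendering of the coattention computation (numpy, random stand-in encodings; the full DCN also adds sentinel vectors and further encoding layers):

    ```python
    import numpy as np

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    D = rng.normal(size=(40, 64))   # document encoding (40 words)
    Q = rng.normal(size=(10, 64))   # question encoding (10 words)

    L = D @ Q.T                     # affinity between every word pair
    A_q = softmax(L, axis=0)        # document attention per question word
    A_d = softmax(L, axis=1)        # question attention per document word

    C_q = A_q.T @ D                 # question-aware document summaries
    # Coattention: re-attend over both Q and C_q with the document weights
    C_d = A_d @ np.concatenate([Q, C_q], axis=1)   # (40, 128)
    ```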

    arxiv1611.01604a.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604b.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604c.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604d.png

    [Image source. Click image to open in new window.]


    arxiv1611.01604e.png

    [Image source. Click image to open in new window.]


  • Efficient and Robust Question Answering from Minimal Context Over Documents (May 2018) studied the minimal context required to answer a question and found that most questions in existing datasets could be answered with a small set of sentences. The authors (Socher and colleagues) proposed a simple sentence selector to select the minimal set of sentences to feed into the QA model, which allowed the system to achieve significant reductions in training (up to 15 times) and inference times (up to 13 times), with accuracy comparable to or better than the state of the art on SQuAD, NewsQA, TriviaQA and SQuAD-Open. Furthermore, the approach was more robust to adversarial inputs.

    Note the sentence selector in Fig. 2(a):

    “For each QA model, we experiment with three types of inputs. First, we use the full document (FULL). Next, we give the model the oracle sentence containing the groundtruth answer span (ORACLE). Finally, we select sentences using our sentence selector (MINIMAL), using both $\small \text{Top k}$ and $\small \text{Dyn}$. We also compare this last method with TF-IDF method for sentence selection, which selects sentences using n-gram TF-IDF distance between each sentence and the question.”
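
    The TF-IDF baseline selector they compare against is straightforward to reproduce; a minimal sketch using scikit-learn (the learned MINIMAL selector is a trained neural module, not shown):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def select_sentences(question, sentences, k=2):
        """Rank sentences by TF-IDF cosine similarity to the question."""
        vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
        tfidf = vec.fit_transform([question] + sentences)
        sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
        top = sims.argsort()[::-1][:k]
        return [sentences[i] for i in sorted(top)]  # keep document order

    doc = ["The BRCA1 gene is located on chromosome 17.",
           "It was first isolated in 1994.",
           "Mutations in BRCA1 increase breast cancer risk."]
    print(select_sentences("Where is BRCA1 located?", doc, k=1))
    ```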

    arxiv1805.08092a.png

    [Image source. Click image to open in new window.]


    arxiv1805.08092b.png

    [Image source. Click image to open in new window.]


    arxiv1805.08092c.png

    [Image source. Click image to open in new window.]


    arxiv1805.08092d.png

    [Image source. Click image to open in new window.]


In a significant body of work, The Natural Language Decathlon: Multitask Learning as Question Answering (Jun 2018) [codeproject], Richard Socher and colleagues at SalesForce presented a NLP challenge spanning 10 tasks,

  • question answering
  • machine translation
  • summarization
  • natural language inference
  • sentiment analysis
  • semantic role labeling
  • zero-shot relation extraction
  • goal-oriented dialogue
  • semantic parsing
  • commonsense pronoun resolution

    arxiv1806.08730-f1.png

    [Image source. Click image to open in new window.]


… as well as a new Multitask Question Answering Network (MQAN) model [code here and here] that jointly learned all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. The MQAN model took in a question and context document, encoded both with a Bi-LSTM, used dual coattention to condition representations for both sequences on the other, compressed all of this information with another two Bi-LSTM, applied self-attention to collect long-distance dependencies, and then used a final two Bi-LSTM to get representations of the question and context. The multi-pointer-generator decoder used attention over the question, context, and previously outputted tokens to decide whether to copy from the question, copy from the context, or generate the answer from a limited vocabulary. MQAN showed improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification.
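
Schematically, the multi-pointer-generator's final distribution over the next output token is a learned mixture of three distributions. In the toy numpy sketch below the two mixture switches are fixed scalars and the copy distributions are collapsed onto a shared toy vocabulary; in MQAN the switches are computed from the decoder state and attention, and the copy distributions live over source positions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V = 6                                      # toy vocabulary size
p_vocab = softmax(rng.normal(size=V))      # generate from the vocabulary
p_question = softmax(rng.normal(size=V))   # copy from the question
p_context = softmax(rng.normal(size=V))    # copy from the context

gamma, lam = 0.3, 0.6                      # stand-ins for learned switches
p = gamma * p_vocab + (1 - gamma) * (lam * p_context + (1 - lam) * p_question)
assert abs(p.sum() - 1.0) < 1e-9           # still a valid distribution
```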

arxiv1806.08730-fig2.png

[Image source. Click image to open in new window.]


arxiv1806.08730-fig3.png

[Image source. Click image to open in new window.]


arxiv1806.08730-t1.png

[Image source. Click image to open in new window.]


arxiv1806.08730-t2.png

[Image source. Click image to open in new window.]


Understandably, Socher’s work has generated much interest among NLP and ML practitioners, leading to the acquisition of his startup, MetaMind, by Salesforce for $32.8 million in 2016 (Salesforce Reveals It Spent $75 Million on the Three Startups It Bought Last Quarter | Salesforce just bought a machine learning startup that was backed by its CEO Marc Benioff). While those authors will not release the code (per a comment by Richard Socher on reddit), using the search term “1506.07285” there appear to be four repositories on GitHub that attempt to implement his Ask Me Anything: Dynamic Memory Networks for Natural Language Processing model, while a GitHub search for “dynamic memory networks” or “DMN+” returns numerous repositories.

The MemN2N architecture was introduced by Jason Weston (Facebook AI Research) in his highly-cited End-To-End Memory Networks paper [code;  non-author code here, here and here;  discussion here and here]. MemN2N, a recurrent attention model over a possibly large external memory, was trained end-to-end and hence required significantly less supervision during training, making it more generally applicable in realistic settings. The flexibility of the MemN2N model allowed the authors to apply it to tasks as diverse as synthetic question answering (QA) and language modeling (LM). For QA the approach was competitive with memory networks but with less supervision; for LM their approach demonstrated performance comparable to RNN and LSTM on the Penn Treebank and Text8 datasets.
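
The core of a MemN2N hop is compact enough to state directly: attend over the memories with the question representation, read out a weighted sum, and update the controller state. A numpy sketch with random stand-in embeddings (the real model learns the memory embeddings and ties weights across hops):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_mem, d = 10, 32
A = rng.normal(size=(n_mem, d))   # input memory (embedded sentences)
C = rng.normal(size=(n_mem, d))   # output memory (second embedding)
u = rng.normal(size=d)            # embedded question

for _ in range(3):                # three memory hops
    p = softmax(A @ u)            # attention over memories
    o = p @ C                     # weighted sum of output memories
    u = u + o                     # updated controller state

# In the full model, a softmax over (W @ u) predicts the answer word.
```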

While Weston’s MemN2N model was surpassed (accuracy and tasks completed) on the bAbI English 10k dataset by Socher’s DMN+ – see the “E2E” (End to End) and DMN+ columns in Table 2
Source
in the DMN+ paper – code is available (links above) for the MemN2N model.

A precaution with high-performing but heavily engineered systems is domain specificity: How well do those models transfer to other applications? I encountered this issue in my preliminary work (not shown) where I carefully examined the Turku Event Extraction System [TEES 2.2: Biomedical Event Extraction for Diverse Corpora (2015)]. TEES performed well but was heavily engineered to perform well in the various BioNLP Challenge tasks in which it participated. Likewise, a June 2018 comment in the AllenNLP GitHub repository, regarding end-to-end memory networks, is of interest:

  • “Why are you guys not using *Dynamic Memory Networks in any of your QA solutions?*

    I’m not a huge fan of the models called “memory networks” – in general they are too tuned to a completely artificial task, and they don’t work well on real data. I implemented the end-to-end memory network, for instance, and it has three separate embedding layers (which is absolutely absurd if you want to apply it to real data).

    @DeNeutoy implemented the DMN+. It’s not as egregious as the E2EMN [end-to-end memory network], but still, I’d look at actual papers, not blogs, when deciding what methods actually work. E.g., are there any memory networks on the SQuAD Leaderboard (https://rajpurkar.github.io/SQuAD-explorer/)? On the TriviaQA leaderboard? On the leaderboard of any recent, popular dataset?

    To be fair, more recent “memory networks” have modified their architectures so they’re a lot more similar to things like the gated attention reader, which has actually performed well on real data. But, it sure seems like no one is using them to accomplish state of the art QA on real data these days.”

I believe that the “gated attention reader” mentioned in that comment (above) refers to Gated-Attention Readers for Text Comprehension (Jun 2016; updated Apr 2017) by Ruslan Salakhutdinov and colleagues.



Retrieve-and-Read: Multi-task Learning of Information Retrieval and Reading Comprehension (Aug 2018) presented an interesting approach, “machine reading at scale” (MRS) wherein, given a question, a system retrieves passages relevant to the question from a corpus (IR: information retrieval) and then extracts the answer span from the retrieved passages (RC: reading comprehension). They proposed a supervised multi-task learning approach that directly minimized the joint loss of IR and RC, so that the IR component – which shared its hidden layers with the RC component – could be trained with correct answer spans. In experiments answering SQuAD questions using Wikipedia as the knowledge source, their model achieved state of the art performance [on par with BiDAF].
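Schematically, the multi-task setup amounts to two task heads over one shared encoder, with gradients from both losses flowing into the shared layers. A toy numpy sketch (the pooling, head shapes and gold indices below are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder output for one (question, passage) pair: T tokens x d dims.
H = rng.standard_normal((20, 8))

def nll(logits, gold):
    """Cross-entropy (negative log-likelihood) for a single gold index."""
    z = logits - logits.max()
    return -(z[gold] - np.log(np.exp(z).sum()))

# IR head: a relevance logit from a pooled representation of the passage.
w_ir = rng.standard_normal(8)
relevance_logits = np.array([H.mean(axis=0) @ w_ir, 0.0])   # [relevant, not-relevant]

# RC head: start/end span logits over the same shared hidden states.
w_start, w_end = rng.standard_normal(8), rng.standard_normal(8)
start_logits, end_logits = H @ w_start, H @ w_end

# Joint objective: gradients from all three terms flow into the shared
# encoder, so the retriever is effectively trained with answer-span supervision.
loss = nll(relevance_logits, 0) + nll(start_logits, 5) + nll(end_logits, 9)
```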

arxiv1808.10628a.png

[Image source. Click image to open in new window.]


arxiv1808.10628b.png

[Image source. Click image to open in new window.]


  • “Our Retrieve-and-Read model is based on the bi-directional attention flow (BiDAF) model, which is a standard RC model. As shown in Figure 2, it consists of six layers: … [see image above] … We note that the RC component trained with single-task learning is essentially equivalent to BiDAF, except for the word embedding layer that has been modified to improve accuracy. … Note that the original BiDAF uses a pre-trained GloVe and also trains character-level embeddings by using a CNN in order to handle out-of-vocabulary (OOV) or rare words. Instead of using GloVe and CNN, our model uses fastText for the fixed pre-trained word vectors and removes character-level embeddings. The fastText model takes into account subword information and can obtain valid representations even for OOV words.”
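The quoted point about fastText and OOV words rests on subword composition: a word vector is assembled from (hashed) character n-gram vectors, so even an unseen word decomposes into n-grams that were seen during training. A toy numpy sketch of that mechanism (the table size, dimensions and hashing scheme here are simplified assumptions):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style character n-grams of a word wrapped in boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

rng = np.random.default_rng(0)
DIM, TABLE_ROWS = 50, 1000                      # toy sizes; real tables are far larger
ngram_table = rng.standard_normal((TABLE_ROWS, DIM)) * 0.01  # stand-in for trained vectors

def oov_vector(word):
    """Compose a vector for an out-of-vocabulary word by averaging its hashed
    character n-gram vectors -- the mechanism that lets fastText return a
    sensible representation even for words never seen during training."""
    idx = [hash(g) % TABLE_ROWS for g in char_ngrams(word)]
    return ngram_table[idx].mean(axis=0)

vec = oov_vector("angiogenesis")                # unseen word, valid representation
```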

In 2016 the Allen Institute for Artificial Intelligence introduced the Bi-Directional Attention Flow (BiDAF) framework (Bidirectional Attention Flow for Machine Comprehension (Nov 2016; updated Jun 2018) [project; code; demo]). BiDAF was a multi-stage hierarchical process that represented context at different levels of granularity and used a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization.

BiDAF was subsequently used in QA4IE: A Question Answering based Framework for Information Extraction (Apr 2018) [note Table 7; code], a novel information extraction (IE) framework that leveraged QA approaches to produce high quality relation triples across sentences from input documents, along with a knowledge base (Wikipedia Ontology) for entity recognition. QA4IE processed entire documents as a whole, rather than separately processing individual sentences. Because QA4IE was designed to produce sequence answers in IE settings, QA4IE was outperformed by BiDAF on the SQuAD dataset (Table 3 in QA4IE). Conversely, QA4IE outperformed QA systems – including BiDAF – across 6 datasets in IE settings (Table 4 in QA4IE).

BiDAF:

arxiv-1611.01603.png

[BiDAF. Image source. Click image to open in new window.]


QA4IE:

arxiv-1804.03396a.png

[QA4IE. Image source. Click image to open in new window.]


arxiv-1804.03396b.png

[QA4IE. Image source. Click image to open in new window.]


arxiv-1804.03396c.png

[QA4IE. Image source. Click image to open in new window.]


  • A major difference between question answering (QA) settings and information extraction settings is that in QA settings each query corresponds to an answer, while in the QA4IE framework the QA model takes a candidate entity-relation (or entity-property) pair as the query and it needs to tell whether an answer to the query can be found in the input text.

In other work relating to Bi-LSTM-based question answering, IBM Research and IBM Watson published a paper, Improved Neural Relation Detection for Knowledge Base Question Answering (May 2017), which focused on relation detection via deep residual Bi-LSTM networks to compare questions and relation names. The approach broke the relation names into word sequences for question-relation matching, built both relation-level and word-level relation representations, used deep Bi-LSTM to learn different levels of question representations in order to match the different levels of relation information, and finally used a residual learning method for sequence matching. This made the model easier to train and resulted in more abstract (deeper) question representations, thus improving hierarchical matching. Several third-party implementations are available on GitHub (machine-comprehension; machine-reading-comprehension; and most recently, MSMARCO).
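The residual element of that design is simply a shortcut connection around each stacked sequence encoder, so deeper layers refine rather than replace the shallower question representations. A toy numpy sketch (the "BiLSTM" here is a stand-in mixing function, not a gated recurrence):

```python
import numpy as np

rng = np.random.default_rng(1)

def bilstm_layer(x, w_f, w_b):
    """Stand-in for a BiLSTM: causal and anti-causal mixing of the sequence.
    (A toy surrogate -- a real implementation would use gated recurrence.)"""
    fwd = np.cumsum(x @ w_f, axis=0) / np.arange(1, len(x) + 1)[:, None]
    bwd = (np.cumsum((x @ w_b)[::-1], axis=0) / np.arange(1, len(x) + 1)[:, None])[::-1]
    return np.tanh(fwd + bwd)

def residual_stack(x, layers):
    """Deep residual sequence encoder: each layer's output is added to its
    input, so representations at different depths coexist and question-relation
    matching can happen at multiple levels of abstraction."""
    h = x
    for w_f, w_b in layers:
        h = h + bilstm_layer(h, w_f, w_b)    # residual (shortcut) connection
    return h

T, d = 7, 16
x = rng.standard_normal((T, d))
layers = [(rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1)
          for _ in range(4)]
h = residual_stack(x, layers)                # (T, d) deep yet easy-to-train encoding
```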

arxiv1704.06194-fig1.png

[Image source. Click image to open in new window.]


arxiv1704.06194-fig2.png

[Image source. Click image to open in new window.]


Making Neural QA as Simple as Possible but Not Simpler (Mar 2017; updated Jun 2017) introduced FastQA, a simple but competitive system for extractive question answering. The paper posited that two simple ingredients are necessary for building a competitive QA system: (i) awareness of the question words while processing the context, and (ii) a composition function (such as a recurrent neural network) that goes beyond simple bag-of-words modeling. In follow-on work, these authors applied FastQA to the biomedical domain (Neural Domain Adaptation for Biomedical Question Answering;  [code]). Their system – which did not rely on domain-specific ontologies, parsers or entity taggers – achieved state of the art results on factoid questions, and competitive results on list questions.
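Ingredient (i) can be as simple as a binary "word-in-question" feature appended to each context-token embedding (FastQA also uses a weighted variant; this sketch shows only the binary form):

```python
def word_in_question_features(context_tokens, question_tokens):
    """Binary wiq feature: 1.0 if the context token also appears in the
    question, else 0.0 -- appended to each token's embedding so the encoder
    is 'aware' of question words while reading the context."""
    qset = {t.lower() for t in question_tokens}
    return [1.0 if t.lower() in qset else 0.0 for t in context_tokens]

ctx = "The mitochondrion is the powerhouse of the cell".split()
q   = "What is the powerhouse of the cell ?".split()
print(word_in_question_features(ctx, q))
# [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```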

arxiv1703.04816-fig2.png

[Image source. Click image to open in new window.]


arxiv1703.04816-fig1.png

[Image source. Click image to open in new window.]


A recent review, Comparative Analysis of Neural QA Models on SQuAD (Jun 2018), reported that models based on a gated attention mechanism (R-Net) or a GRU (DocQA) performed well across a variety of tasks.

Microsoft Research recently published S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension (Jun 2017; updated Jan 2018), a novel approach to machine reading comprehension for the MS-MARCO dataset that aimed to answer a question from multiple passages via an extraction-then-synthesis framework, synthesizing answers from extraction results. Unlike the SQuAD dataset, in which an answer is an exact text span in a passage, the MS-MARCO dataset defined the task as answering a question from multiple passages, where the words in the answer are not necessarily present in the passages. The Microsoft Research approach employed bidirectional gated recurrent units (BiGRU) rather than Bi-LSTM. The answer extraction model was first employed to predict the most important sub-spans from the passage as evidence, which the answer synthesis model took as additional features – along with the question and passage – to further elaborate the final answers. They built the answer extraction model for single-passage reading comprehension, and proposed an additional task of ranking the single passages to help answer extraction from multiple passages.

arxiv1706.04815-fig1.png

[Image source. Click image to open in new window.]


arxiv1706.04815-fig2.png

[Image source. Click image to open in new window.]


arxiv1706.04815-fig3.png

[Image source. Click image to open in new window.]


arxiv1706.04815-tables2+3.png

[Image source. Click image to open in new window.]


Likewise (regarding evidence-based answering), textual entailment with neural attention methods could also be applied; for example, as described in DeepMind’s Reasoning about Entailment with Neural Attention.

Robust and Scalable Differentiable Neural Computer for Question Answering (Jul 2018) [code] described an Advanced Differentiable Neural Computer (ADNC) designed as a general problem solver usable across a wide range of tasks; their GitHub repository contains an implementation intended for more robust and scalable use in question answering (differentiable neural computers are discussed elsewhere in this REVIEW). The ADNC was applied to the 20 bAbI QA tasks, with state of the art mean results, and to the CNN Reading Comprehension Task with passable results without any adaptation or hyperparameter tuning. Coauthor Jörg Franke’s Master’s Thesis contains additional detail.

arxiv-1807.02658c.png

[Image source. Click image to open in new window.]


In March 2018 Studio Ousia published a question answering model, Studio Ousia’s Quiz Bowl Question Answering System [slides; media]. The embedding approach described in that paper was very impressive, with the ability to “reason” over passages such as the one shown in Table 1 [presented in the summary images, below]. Trained on their Wikipedia2Vec (Wikipedia) pretrained word embeddings, this model very convincingly won the Human-Computer Question Answering Competition (HCQA) at NIPS 2017, scoring more than double the human team’s combined score (465 to 200 points). As Studio Ousia is a commercial entity, there was no code release.

arxiv1803.08652-table1.png

[Image source. Click image to open in new window.]


arxiv1803.08652-fig1+2.png

[Image source. Click image to open in new window.]


arxiv1803.08652-fig3.png

[Image source. Final: Human: 200 : Computer: 465 points. Click image to open in new window.]


In June 2018 Studio Ousia and colleagues at the Nara Institute of Science and Technology, RIKEN AIP, and Keio University published Representation Learning of Entities and Documents from Knowledge Base Descriptions  [code], which described TextEnt, a neural network model that learned distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, they trained their model to predict the entity that the document described, and to map the document and its target entity close to each other in a continuous vector space. Their model, which was trained using a large number of documents extracted from Wikipedia, was evaluated using two tasks: (i) fine-grained entity typing, and (ii) multiclass text classification. The results demonstrated that their model achieved state-of-the-art performance on both tasks.

arxiv1806.02960-fig1.png

[Image source. Click image to open in new window.]


arxiv1806.02960-fig2.png

[Image source. Click image to open in new window.]


arxiv1806.02960-table5.png

[Image source. Click image to open in new window.]


Based on the model architectures (above/below), it appears that the Studio Ousia Quiz Bowl system is based on their TextEnt work.

StudioOusia2018.png

[Image sourceImage source 2. Click image to open in new window.]


In Question Answering and Reading Comprehension (Sep 2018) [code], the authors (Tencent AI Lab) posited that there are three modalities in the reading comprehension setting: question, answer and context. The tasks of question answering and question generation aim to infer an answer or a question, respectively, when given its counterpart and the context. They presented a novel two-way neural sequence transduction model that connected the three modalities, allowing it to learn two tasks simultaneously such that they mutually benefitted one another. Their Dual Ask-Answer Network (DAANet) was a neural sequence transduction model that received string sequences as input and processed them through four layers: embedding, encoding, attention and output. During training, the model received question-context-answer triplets as input and captured the cross-modal interactions via a hierarchical attention process. Unlike previous joint learning paradigms that leveraged the duality of the question generation and question answering tasks at the data level, they addressed that duality at the architecture level by mirroring the network structure and partially sharing components at the different layers. This enabled knowledge to be transferred from one task to the other, helping the model find a general representation for each modality. Evaluation on four datasets showed that their dual-learning model outperformed its mono-learning counterparts – as well as state of the art joint baseline models – on both question answering and question generation tasks.

arxiv1809.01997-fig1.png

[Image source. Click image to open in new window.]


arxiv1809.01997-fig2.png

[Image source. Click image to open in new window.]


arxiv1809.01997-fig5.png

[Image source. Click image to open in new window.]


arxiv1809.01997-table3.png

[Image source. Click image to open in new window.]


arxiv1809.01997-table4.png

[Image source. Click image to open in new window.]


Interpreting phrases such as “Who did what to whom?” is a major focus in natural language understanding – specifically, semantic role labeling. I Know What You Want: Semantic Learning for Text Comprehension (Sep 2018) employed semantic role labeling to enhance text comprehension and natural language inference by specifying verbal arguments and their corresponding semantic roles. Embeddings were enhanced with semantic role labels, giving more fine-grained semantics: the salient labels could be conveniently added to existing models, significantly improving deep learning models on challenging text comprehension tasks. This work showed the effectiveness of semantic role labeling in text comprehension and natural language inference, and proposed an easy and feasible scheme to integrate semantic role labeling information into neural models. Experiments on benchmark machine reading comprehension and inference datasets verified that the proposed semantic learning helped their system attain a significant improvement over state of the art baseline models. [“We will make our code and source publicly available soon.” Not available, 2018-10-16.]

arxiv1809.02794-fig1.png

[Image source. Click image to open in new window.]


arxiv1809.02794-fig2.png

[Image source. Click image to open in new window.]


arxiv1809.02794-fig3.png

[Image source. Click image to open in new window.]


ELMo word embeddings were employed. The dimension of the semantic role label embedding was a critical hyperparameter that influenced performance: too high a dimension caused severe overfitting, while too low a dimension caused underfitting; a 5-dimensional semantic role label embedding gave the best performance on both the SNLI and SQuAD datasets.
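Operationally, the integration scheme amounts to concatenating a small trainable label embedding onto each token's contextual vector. A minimal numpy sketch (the label set, random embedding table and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

SRL_LABELS = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "O"]      # illustrative subset
label_emb = rng.standard_normal((len(SRL_LABELS), 5)) * 0.1  # 5-dim, per the paper

def enrich(token_vectors, srl_tags):
    """Concatenate a 5-dim semantic-role-label embedding onto each token's
    (e.g., ELMo) vector, giving the downstream model explicit
    'who-did-what-to-whom' structure."""
    ids = [SRL_LABELS.index(t) for t in srl_tags]
    return np.concatenate([token_vectors, label_emb[ids]], axis=-1)

tokens = rng.standard_normal((4, 1024))                      # 4 tokens of ELMo output
tags = ["B-ARG0", "B-V", "B-ARG1", "O"]
enriched = enrich(tokens, tags)                              # shape (4, 1029)
```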

As employed in I Know What You Want: Semantic Learning for Text Comprehension, ELMo embeddings were also used by Ouchi et al. in A Span Selection Model for Semantic Role Labeling (Oct 2018) [code], a Bi-LSTM, span-based model (in contrast to the common IOB/BIO tagging approach). Typically, span-based models first identify candidate argument spans (argument identification) and then classify each span into one of the semantic role labels (argument classification). In related recent work (Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling), He et al. (2018) also proposed a span-based SRL model, similar to Ouchi et al.’s A Span Selection Model for Semantic Role Labeling. While He et al. likewise used a Bi-LSTM to induce span representations in an end-to-end fashion, a main difference was that He et al. modeled $\small P(r | i,j)$, whereas Ouchi et al. modeled $\small P(i,j | r)$. In other words, while He et al.’s model sought to select an appropriate label for each span (label selection), Ouchi et al.’s model selected appropriate spans for each label (span selection).
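The contrast between $\small P(r | i,j)$ and $\small P(i,j | r)$ is just a question of which axis of a span-role score matrix the softmax normalizes over; a toy numpy illustration:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 4))   # score[s, r] for 6 candidate spans x 4 roles

# He et al. (label selection): for each span, which role fits best?  P(r | i,j)
p_label_given_span = softmax(scores, axis=1)   # each row sums to 1

# Ouchi et al. (span selection): for each role, which span realizes it?  P(i,j | r)
p_span_given_label = softmax(scores, axis=0)   # each column sums to 1
```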

Ouchi et al.:

arxiv-1810.02245.png

[Image source. Click image to open in new window.]


He et al.:

arxiv-1805.04787.png

[Image source. Click image to open in new window.]


Model/results – Ouchi et al.:

arxiv1810.02245-fig1.png

[Image source. Click image to open in new window.]


arxiv1810.02245-table3.png

[Image source. Click image to open in new window.]


Model/results – He et al.:

arxiv1805.04787-fig2+3.png

[Image source. Click image to open in new window.]


arxiv1805.04787-fig4+tables1+2.png

[Image source. Click image to open in new window.]


Discussed elsewhere in this REVIEW, Question Answering by Reasoning Across Documents with Graph Convolutional Networks (Aug 2018) introduced a method (Entity-GCN) that reasons over information spread within and across documents, framing question answering as an inference problem on a graph representing the document collection. Their approach differed from BiDAF and FastQA, which merely concatenate all documents into a single long text and train a standard reading comprehension model.

Machine reading comprehension with unanswerable questions is a new challenging task for natural language processing. A key subtask is to reliably predict whether the question is unanswerable. U-Net: Machine Reading Comprehension with Unanswerable Questions (Oct 2018) proposed a unified model (U-Net) with three important components: answer pointer, no-answer pointer, and answer verifier. They introduced a universal node and thus processed the question and its context passage as a single contiguous sequence of tokens. The universal node encoded the fused information from both the question and passage, and played an important role in predicting whether the question was answerable. Unlike other state of the art pipeline models, U-Net could be trained in an end-to-end fashion. Experimental results on the SQuAD2.0 dataset showed that U-Net could effectively predict the unanswerability of questions, achieving an $\small F_1$ score of 71.7 on SQuAD2.0.

arxiv1810.06638-t1.png

[Image source. Click image to open in new window.]


arxiv1810.06638-f1.png

[Image source. Click image to open in new window.]


arxiv1810.06638-t2.png

[Image source. Click image to open in new window.]


“Our model achieves an $\small F_1$ score of 74.0 and an EM score of 70.3 on the development set, and an $\small F_1$ score of 72.6 and an EM score of 69.2 on Test set 1, as shown in Table 2. Our model outperforms most of the previous approaches. Comparing to the best-performing systems, our model has a simple architecture and is an end-to-end model. In fact, among all the end-to-end models, we achieve the best $\small F_1$ scores. We believe that the performance of the U-Net can be boosted with an additional post-processing step to verify answers using approaches such as (Hu et al. 2018).”

Text embeddings, which represent natural language documents in a semantic vector space, can be used for document retrieval via nearest neighbor lookup. Text Embeddings for Retrieval From a Large Knowledge Base (Oct 2018; Christian Szegedy at Google Inc. and authors at the University of Arkansas) studied the feasibility of neural models specialized for retrieval in a semantically meaningful way, suggesting the use of SQuAD in an open-domain question answering context where the first task was to find paragraphs useful for answering a given question. They first compared the quality of various text-embedding methods on retrieval performance, giving empirical comparisons of various non-augmented base embeddings with/without IDF weighting. Training deep residual neural models specifically for retrieval purposes yielded significant gains when used to augment existing embeddings, and they established that deeper models were superior for this task. The best baseline embeddings, augmented by their learned neural approach, improved the top-1 paragraph recall of the system by 14%.
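As a baseline reference point, IDF-weighted averaging plus nearest-neighbor lookup can be sketched in a few lines of numpy (the vocabulary, IDF values and paragraph vectors below are toy placeholders):

```python
import numpy as np

def idf_weighted_embedding(tokens, word_vecs, idf):
    """Average the word vectors of a text, weighting each by its IDF so that
    frequent, uninformative words contribute less to the text embedding."""
    vecs = np.array([idf.get(t, 1.0) * word_vecs[t] for t in tokens if t in word_vecs])
    return vecs.mean(axis=0)

def top1_paragraph(question_vec, paragraph_vecs):
    """Nearest-neighbor retrieval by cosine similarity."""
    P = paragraph_vecs / np.linalg.norm(paragraph_vecs, axis=1, keepdims=True)
    q = question_vec / np.linalg.norm(question_vec)
    return int(np.argmax(P @ q))

rng = np.random.default_rng(0)
word_vecs = {w: rng.standard_normal(50) for w in "what is the cell nucleus membrane".split()}
idf = {"the": 0.1, "is": 0.2, "what": 0.3, "cell": 2.5, "nucleus": 3.1}
q = idf_weighted_embedding("what is the nucleus".split(), word_vecs, idf)
paragraphs = rng.standard_normal((100, 50))     # precomputed paragraph embeddings
best = top1_paragraph(q, paragraphs)            # index of the best-matching paragraph
```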

arxiv1810.10176-f1.png

[Image source. Click image to open in new window.]


arxiv1810.10176-f2.png

[Image source. Click image to open in new window.]


arxiv1810.10176-t9+t10.png

[Image source. Click image to open in new window.]


Improving Machine Reading Comprehension with General Reading Strategies (Oct 2018) proposed three simple domain-independent strategies aimed to improve non-extractive machine reading comprehension (MRC):

  • BACK AND FORTH READING, which considers both the original and reverse order of an input sequence,
  • HIGHLIGHTING, which adds a trainable embedding to the text embedding of tokens that are relevant to the question and candidate answers, and
  • SELF-ASSESSMENT, which generates practice questions and candidate answers directly from the text in an unsupervised manner.

“By fine-tuning a pre-trained language model (Radford et al., 2018) [OpenAI’s Finetuned Transformer LM] with our proposed strategies on the largest existing general domain multiple-choice MRC dataset RACE, we obtain a 5.8% absolute increase in accuracy over the previous best result achieved by the same pre-trained model fine-tuned on RACE without the use of strategies. We further fine-tune the resulting model on a target task, leading to new state-of-the-art results on six representative non-extractive MRC datasets from different domains (i.e., ARC, OpenBookQA, MCTest, MultiRC, SemEval-2018, and ROCStories). These results indicate the effectiveness of the proposed strategies and the versatility and general applicability of our fine-tuned models that incorporate these strategies.”

arxiv1810.13441-f1.png

[Image source. Click image to open in new window.]


arxiv1810.13441-t2.png

[Image source. Click image to open in new window.]


Question Answering and Reading Comprehension:

Additional Reading

  • Textbook Question Answering with Knowledge Graph Understanding and Unsupervised Open-set Text Comprehension (Nov 2018)

    “In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task which describes more realistic QA problems compared to other recent tasks. We mainly focus on two related issues with analysis of TQA dataset. First, it requires to comprehend long lessons to extract knowledge. To tackle this issue of extracting knowledge features from long lessons, we establish knowledge graph from texts and incorporate graph convolutional network (GCN). Second, scientific terms are not spread over the chapters and data splits in TQA dataset. To overcome this so called ‘out-of-domain’ issue, we add novel unsupervised text learning process without any annotations before learning QA problems. The experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both methods of incorporating GCN for extracting knowledge from long lessons and our newly proposed unsupervised learning process are meaningful to solve this problem.”

    arxiv1811.00232-f1+f4.png

    [Image source. Click image to open in new window.]


    arxiv1811.00232-f3.png

    [Image source. Click image to open in new window.]


    arxiv1811.00232-t1.png

    [Image source. Click image to open in new window.]


    arxiv1811.00232-f5.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Probing the Nature (Transparency) of Reasoning Architectures

Compositional Attention Networks for Machine Reasoning (Apr 2018) [code] by Drew Hudson and Christopher Manning presented the MAC network, a novel fully differentiable neural network architecture designed to facilitate explicit and expressive reasoning. MAC moved away from monolithic black-box neural architectures toward a design that encouraged both transparency and versatility. The model approached problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintained a separation between control and memory. By stringing the cells together and imposing structural constraints that regulated their interaction, MAC effectively learned to perform iterative reasoning processes that were directly inferred from the data in an end-to-end approach. They demonstrated the model’s strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, they showed that the model was computationally efficient and data efficient, in particular requiring 5x less data than existing models to achieve strong results.

arxiv1803.03067-f1.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f2.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f3.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f4.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f5.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f6.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f7.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f8.png

[Image source. Click image to open in new window.]


arxiv1803.03067-t1.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f11.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f13.png

[Image source. Click image to open in new window.]


arxiv1803.03067-f14.png

[Image source. Click image to open in new window.]


Manning’s MAC Net is a compositional attention network designed for visual question answering (VQA). In a very similar approach [Compositional Attention Networks for Interpretability in Natural Language Question Answering (Oct 2018)], Saama AI Research (India) proposed a modified MAC Net architecture for natural language question answering. Question Answering typically requires language understanding and multistep reasoning. MAC Net’s unique architecture – the separation between memory and control – facilitated data-driven iterative reasoning, making it an ideal candidate for solving tasks that involve logical reasoning. Experiments with the 20 bAbI tasks demonstrated the value of MAC Net as a data efficient and interpretable architecture for natural language question answering. The transparent nature of MAC Net provided a highly granular view of the reasoning steps taken by the network in answering a query.
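A schematic reduction of one MAC recurrence, to make the control/memory separation concrete (the projections and interactions here are heavily simplified relative to Hudson & Manning's cell, and all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mac_step(control, memory, question_words, knowledge, W):
    """One MAC cell update (a schematic reduction, not the published cell).
    - control unit: attends over question words to pick this step's operation
    - read unit:    attends over the knowledge base, conditioned on control + memory
    - write unit:   integrates what was read into the new memory state"""
    c_attn = softmax(question_words @ (W["c"] @ control))
    control = c_attn @ question_words                      # new control state
    interactions = knowledge * (W["m"] @ memory)           # memory-aware view
    r_attn = softmax(interactions @ (W["r"] @ control))
    read = r_attn @ knowledge                              # retrieved information
    memory = np.tanh(W["w"] @ np.concatenate([memory, read]))
    return control, memory

W = {k: rng.standard_normal(s) * 0.1
     for k, s in [("c", (d, d)), ("m", (d, d)), ("r", (d, d)), ("w", (d, 2 * d))]}
question_words = rng.standard_normal((8, d))               # 8 question tokens
knowledge = rng.standard_normal((49, d))                   # e.g., image regions or facts
control, memory = np.zeros(d), np.zeros(d)
for _ in range(4):                                         # 4 reasoning steps
    control, memory = mac_step(control, memory, question_words, knowledge, W)
```

Because the attention maps produced by the control and read units can be inspected at every step, the reasoning process remains transparent in exactly the sense described above.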

arxiv1810.12698-f1+f2+f3+f5.png

[Image source. Click image to open in new window.]


arxiv1810.12698-f4+f9+f10.png

[Image source. Click image to open in new window.]


[Table of Contents]

Probing the Shortcomings of Shallow Trained Language Models

While on the surface LSTM based approaches generally appear to perform well for memory and recall, upon deeper inspection they can also display significant limitations. For example, around mid-2018 I conducted a cursory examination of the BiDAF/SQuAD question answering model online demo  [alternate site], in which I found that their BiDAF model performed well on some queries but failed on other, semantically and syntactically identical questions (e.g. with changes in character case, or punctuation), as well as on queries about entities not present in the text. While BiDAF employed a hierarchical multi-stage process consisting of six layers (character embedding, word embedding, contextual embedding, attention flow, modeling and output layers), it employed GloVe pretrained word vectors for the word embedding layer to map each word to a high-dimensional vector space (a fixed embedding of each word). This led me to suspect that the shallow embeddings encoded in the GloVe pretrained word vectors failed to capture the nuances of the processed text.

[excerpted/paraphrased from NLP’s ImageNet moment has arrived]:

“Pretrained word vectors have brought NLP a long way. Proposed in 2013 as an approximation to language modeling, word2vec found adoption through its efficiency and ease of use … word embeddings pretrained on large amounts of unlabeled data via algorithms such as word2vec and GloVe are used to initialize the first layer of a neural network, the rest of which is then trained on data of a particular task. … Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model—the rest of the network still needs to be trained from scratch.

“Word2vec and related methods are shallow approaches that trade expressivity for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words. This is the core aspect of language understanding, and it requires modeling complex language phenomena such as compositionality, polysemy, anaphora, long-term dependencies, agreement, negation, and many more. It should thus come as no surprise that NLP models initialized with these shallow representations still require a huge number of examples to achieve good performance.

“At the core of the recent advances of ULMFiT, ELMo, and the Finetuned Transformer LM is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.”

Recent discussions by Stanford University researchers (Adversarial Examples for Evaluating Reading Comprehension Systems (Jul 2017) [codediscussion]) are also highly apropos to this issue, motivating related research jointly by investigators at the University of Chicago, Google, and Google Brain (Did the Model Understand the Question? (May 2018) [code]). Adversarial challenges to SQuAD1.1 (e.g., by adding adversarially inserted sentences to text passages without changing the correct answer or misleading humans) easily distracted recurrent neural network/attentional-based algorithms like BiDAF and LSTM, leading to incorrect answers. Additionally, although deep learning networks were quite successful overall, they often ignored important question terms and were easily perturbed by adversarial-modified content – again giving incorrect answers.

Stanford (2017):

arxiv1707.07328-fig1.png

[Image source. Click image to open in new window.]


arxiv1707.07328-fig2.png

[Image source. Click image to open in new window.]


Google (2018):

arxiv1805.05492-table4.png

[Image source. Click image to open in new window.]


Unquestionably, LSTM based language models have been important drivers of progress in NLP.

LSTM are commonly employed for textual summarization, question answering, natural language understanding, natural language inference, and commonsense reasoning tasks. Increasingly, however, NLP researchers and practitioners have questioned both the relevance and the performance of RNN/LSTM as models for learning natural language. In this regard, Sebastian Ruder included these comments in his recent post, ACL 2018 highlights:

Another way to gain a better understanding of a [NLP] model is to analyze its inductive bias. The “Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP” (RELNLP) sought to explore how useful it is to incorporate linguistic structure into our models. One of the key points of Chris Dyer’s talk during the workshop was whether RNNs have a useful inductive bias for NLP. In particular, he argued that there are several pieces of evidence indicating that RNNs prefer sequential recency, namely:

  • Gradients become attenuated across time. LSTMs or GRUs may help with this, but they also forget.
  • People have used training regimes like reversing the input sequence for machine translation.
  • People have used enhancements like attention to have direct connections back in time.
  • For modeling subject-verb agreement, the error rate increases with the number of attractors [Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies (Nov 2016)]

According to Chomsky, sequential recency is not the right bias for learning human language. RNNs thus don’t seem to have the right bias for modeling language, which in practice can lead to statistical inefficiency and poor generalization behaviour. Recurrent neural network grammars, a class of models that generates both a tree and a sequence sequentially by compressing a sentence into its constituents, instead have a bias for syntactic (rather than sequential) recency [Recurrent Neural Network Grammars (Oct 2016)]. However, it can often be hard to identify whether a model has a useful inductive bias. For identifying subject-verb agreement, Chris hypothesizes that LSTM language models learn a non-structural “first noun” heuristic that relies on matching the verb to the first noun in the sentence. In general, perplexity (and other aggregate metrics) are correlated with syntactic/structural competence, but are not particularly sensitive at distinguishing structurally sensitive models from models that use a simpler heuristic.

Understanding the failure modes of LSTMs

Better understanding representations was also a theme at the Representation Learning for NLP workshop. During his talk, Yoav Goldberg detailed some of the efforts of his group to better understand representations of RNNs. In particular, he discussed his recent work on extracting a finite state automaton from an RNN in order to better understand what the model has learned [Weiss, Goldberg & Yahav: Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples (Jun 2018) … “In this work, however, we will focus on GRUs (Cho et al., 2014; Chung et al., 2014) and LSTMs (Hochreiter & Schmidhuber, 1997), as they are more widely used in practice.”]. He also reminded the audience that LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data. Even when a model has been trained using a domain-adversarial loss to produce representations that are invariant of a certain aspect, the representations will be still slightly predictive of said attribute. It can thus be a challenge to completely remove unwanted information from encoded language data and even seemingly perfect LSTM models may have hidden failure modes. On the topic of failure modes of LSTMs, a statement that also fits well in this theme was uttered by this year’s recipient of the ACL lifetime achievement award, Mark Steedman. He asked ‘LSTMs work in practice, but can they work in theory?’

A UC-Berkeley paper by John Miller and Moritz Hardt, When Recurrent Models Don’t Need To Be Recurrent (May 2018) [author’s discussion; discussion], studied the gap between recurrent and feedforward models trained using gradient descent. They proved that stable RNN – those whose gradients cannot explode – are well approximated by feedforward networks for the purposes of both inference and training by gradient descent: feedforward and stable recurrent models trained by gradient descent are equivalent in the sense of making identical predictions at test-time. [Of course, not all models trained in practice are stable: they also gave empirical evidence that the stability condition could be imposed on certain recurrent models without loss in performance.]

Autoregressive, feed-forward model: Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts $\small y_t$ using only the $\small k$ most recent inputs, $\small x_{t-k+1}, \ldots, x_t$. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the $\small k$ most recent inputs. Google’s WaveNet nicely illustrates this general principle. [Source: When Recurrent Models Don’t Need to be Recurrent]

WaveNet.gif

[Image source. Click image to open in new window.]
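To make the conditional independence assumption concrete, here is a minimal numpy sketch (all names and dimensions are illustrative) of an autoregressive feedforward predictor that sees only the $\small k$ most recent inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def feedforward_ar_predict(x, k, W1, w2):
    """Autoregressive feedforward prediction: y_t is computed from only the
    k most recent inputs x_{t-k+1..t}, instead of from a recurrent state
    that summarizes the entire history."""
    preds = []
    for t in range(k - 1, len(x)):
        window = x[t - k + 1 : t + 1]              # the only context used
        preds.append(w2 @ np.tanh(W1 @ window))
    return np.array(preds)

k = 5
x = rng.standard_normal(100)                       # a toy input sequence
W1, w2 = rng.standard_normal((8, k)) * 0.5, rng.standard_normal(8) * 0.5
y_hat = feedforward_ar_predict(x, k, W1, w2)       # 96 predictions, each from 5 inputs
```

The truncation length $\small k$ is the whole story: everything older than $\small k$ steps is invisible to the model.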


Recurrent models feature flexibility and expressivity that come at a cost. Empirical experience shows that RNNs are often more delicate to tune and more brittle to train than standard feedforward architectures; recurrent architectures can also introduce significant computational burden compared with feedforward implementations. In response to these shortcomings, a growing line of empirical research demonstrates that replacing recurrent models with feedforward models is effective in important applications including translation, speech synthesis, and language modeling (When Recurrent Models Don't Need To Be Recurrent).

In contrast to an RNN, the limited context of a feedforward model means that it cannot capture patterns that extend more than $\small k$ steps. Although it appears that the trainability and parallelization of feedforward models come at the price of reduced accuracy, there have been several recent examples showing that feedforward networks can actually achieve the same accuracies as their recurrent counterparts on benchmark tasks, including language modeling, machine translation, and speech synthesis.

With regard to language modeling – in which the goal is to predict the next word in a document given all of the previous words – feedforward models make predictions using only the $\small k$ most recent words, whereas recurrent models can potentially use the entire document. The gated-convolutional language model is a feedforward autoregressive model that is competitive with large LSTM baseline models. Despite using a truncation length of $\small k = 25$, the model outperforms a large LSTM on the Wikitext-103 benchmark, which is designed to reward models that capture long-term dependencies. On the Billion Word Benchmark, the model is slightly worse than the largest LSTM, but is faster to train and uses fewer resources. This is perplexing, since recurrent models seem to be more powerful a priori.

When Recurrent Models Don't Need To Be Recurrent coauthor John Miller continues this discussion in his excellent blog post:

  • One explanation for this phenomenon is given by Dauphin et al. in Language Modeling with Gated Convolutional Networks (Sep 2017) (a minimal sketch of their gated linear unit follows this list):

    arxiv1612.08083-fig1.png

    [Image source. Click image to open in new window.]


    • From that paper:

      Gating has been shown to be essential for recurrent neural networks to reach state-of-the-art performance. Our gated linear units reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities (Section 5.2). We show that gated convolutional networks outperform other recently published language models such as LSTMs trained in a similar setting on the Google Billion Word Benchmark (Chelba et al., 2013). …

      “Gating mechanisms control the path through which information flows in the network and have proven to be useful for recurrent neural networks. LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep. In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers. We show this mechanism to be useful for language modeling as it allows the model to select which words or features are relevant for predicting the next word. …

      “Gated linear units are a simplified gating mechanism based on the work of Dauphin & Grangier [Predicting distributions with Linearizing Belief Networks (Nov 2015; updated May 2016)] for non-deterministic gates that reduce the vanishing gradient problem by having linear units coupled to the gates. This retains the non-linear capabilities of the layer while allowing the gradient to propagate through the linear unit without scaling. … We compare the different gating schemes experimentally in Section 5.2 and we find gated linear units allow for faster convergence to better perplexities.”

  • Another explanation is given by Bai et al. (Apr 2018): “The unlimited context offered by recurrent models is not strictly necessary for language modeling.”

    In other words, it’s possible you don’t need a large amount of context to do well on the prediction task on average. Recent theoretical work offers some evidence in favor of this view (Prediction with a Short Memory). A further explanation is given by Bai et al. (An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling):

  • “The ‘infinite memory’ advantage of RNNs is largely absent in practice.”

    As Bai et al. report, even in experiments explicitly requiring long-term context, RNN variants were unable to learn long sequences. On the Billion Word Benchmark, an intriguing Google Technical Report suggests an LSTM $\small n$-gram model with $\small n=13$ words of memory is as good as an LSTM with arbitrary context (N-gram Language Modeling using Recurrent Neural Network Estimation). This evidence leads us to conjecture:

  • “Recurrent models trained in practice are effectively feedforward.”

    This could happen either because truncated backpropagation through time cannot learn patterns significantly longer than $\small k$ steps, or, more provocatively, because models trainable by gradient descent cannot have long-term memory.
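Returning to Dauphin et al.’s gated linear units (quoted above): the computation is $\small h = (XW + b) \otimes \sigma(XV + c)$, a linear path element-wise gated by a sigmoid. A minimal numpy sketch (dimensions illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(X, W, b, V, c):
    """GLU: (XW + b) * sigmoid(XV + c).  The left factor is linear, so
    gradients can flow through it unscaled (mitigating vanishing gradients),
    while the sigmoid gate keeps the layer non-linear and lets the model
    choose which features to propagate."""
    return (X @ W + b) * sigmoid(X @ V + c)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 32))                 # 10 positions, 32 input channels
W, V = rng.standard_normal((32, 64)), rng.standard_normal((32, 64))
b, c = np.zeros(64), np.zeros(64)
H = gated_linear_unit(X, W, b, V, c)              # (10, 64) gated features
```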

We know very little about how neural language models (LM) use prior linguistic context. A recent paper by Dan Jurafsky and colleagues at Stanford University, Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context (May 2018), investigated the role of context in an LSTM based LM through ablation studies. On two standard datasets (Penn Treebank and WikiText-2) they found that the model was capable of using about 200 tokens of context on average, but sharply distinguished nearby context (the most recent 50 tokens) from the distant history. The model was highly sensitive to the order of words within the most recent sentence, but ignored word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. They further found that the neural caching model (Improving Neural Language Models with a Continuous Cache) especially helped the LSTM copy words from within this distant context. Paraphrased from that paper:

  • “In this analytic study, we have empirically shown that a standard LSTM language model can effectively use about 200 tokens of context on two benchmark datasets, regardless of hyperparameter settings such as model size. It is sensitive to word order in the nearby context, but less so in the long-range context. In addition, the model is able to regenerate words from nearby context, but heavily relies on caches to copy words from far away.”

  • The neural cache model (Improving Neural Language Models with a Continuous Cache (Dec 2016)) augments neural language models with a longer-term memory that dynamically updates the word probabilities based on the long-term context. The neural cache stores the previous hidden states in memory cells for use as keys to retrieve their corresponding (next) word. A neural cache can be added on top of a pretrained language model at negligible cost.
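The cache mechanism itself fits in a few lines. A toy numpy sketch (the flatness parameter `theta` and interpolation weight `lam` are illustrative hyperparameter choices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cache_distribution(h_t, past_hiddens, past_next_words, vocab_size, theta=0.3):
    """Neural cache: score each stored hidden state against the current one;
    words that followed similar past contexts get their probability boosted."""
    sims = softmax(theta * past_hiddens @ h_t)
    p = np.zeros(vocab_size)
    np.add.at(p, past_next_words, sims)
    return p

def cached_lm_distribution(p_model, h_t, past_hiddens, past_next_words, lam=0.2):
    """Interpolate the base LM with the cache -- this can be bolted onto a
    pretrained model at negligible cost, with no retraining."""
    p_cache = cache_distribution(h_t, past_hiddens, past_next_words, len(p_model))
    return (1.0 - lam) * p_model + lam * p_cache

rng = np.random.default_rng(0)
V, d, T = 50, 16, 30
p_model = softmax(rng.standard_normal(V))          # base LM prediction
h_t = rng.standard_normal(d)                       # current hidden state
past_h = rng.standard_normal((T, d))               # cached hidden states (keys)
past_w = rng.integers(0, V, size=T)                # words that followed them (values)
p = cached_lm_distribution(p_model, h_t, past_h, past_w)
assert abs(p.sum() - 1.0) < 1e-6
```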

While LSTM has been successfully used to model sequential data of variable length, LSTM can experience difficulty in capturing long-term dependencies. Long Short-Term Memory with Dynamic Skip Connections (Nov 2018) tried to alleviate this problem by introducing a dynamic skip connection, which could learn to directly connect two dependent words. Since there was no dependency information in the training data, they proposed a novel reinforcement learning-based method to model the dependency relationship and connect dependent words. The proposed model computed the recurrent transition functions based on the skip connections, which provided a dynamic skipping advantage over RNNs that always tackle entire sentences sequentially. Experimental results on three NLP tasks demonstrated that the proposed method could achieve better performance than existing methods, and in a number prediction experiment the proposed model outperformed LSTM with respect to accuracy by nearly 20%.

arxiv1811.03873-f2.png

[Image source. Click image to open in new window.]


arxiv1811.03873-t1+t2+t3+t5+f6.png

[Image source. Click image to open in new window.]


Question Answering and Reading Comprehension:

Additional Reading

  • A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC (Sep 2018)

    “Across all of the datasets, there exists at least one other dataset that significantly improves performance on a target dataset. These experiments do not support that direct transfer is possible, but that pretraining is at least somewhat effective. QuAC appears to transfer the least to any of other datasets, likely because questioners were not allowed to see underlying context documents while formulating questions. Since transfer is effective between these related tasks, we recommend that future work indicate any pretraining.”

  • Stochastic Answer Networks for SQuAD 2.0 (Sep 2018) [code]

    arxiv1809.09194-fig1.png

    [Image source. Click image to open in new window.]


    arxiv1809.09194-fig2.png

    [Image source. Click image to open in new window.]


    “To sum up, we proposed a simple yet efficient model based on SAN [Stochastic Answer Network]. It showed that the joint learning algorithm boosted the performance on SQuAD2.0. We also would like to incorporate ELMo into our model in future.”

  • AUEB at BioASQ 6: Document and Snippet Retrieval (Sep 2018) [code]

    arxiv1809.06366-fig6.png

    [Image source. Click image to open in new window.]


    “We presented the models, experimental set-up, and results of AUEB’s submissions to the document and snippet retrieval tasks of the sixth year of the BioASQ challenge. Our results show that deep learning models are not only competitive in both tasks, but in aggregate were the top scoring systems. This is in contrast to previous years where traditional IR systems tended to dominate. In future years, as deep ranking models improve and training data sets get larger, we expect to see bigger gains from deep learning models.”

  • A Knowledge Hunting Framework for Common Sense Reasoning (Oct 2018) [MILA/McGill University; Microsoft Research Montreal] [code]

    “We developed a knowledge-hunting framework to tackle the Winograd Schema Challenge (WSC), a task that requires common-sense knowledge and reasoning. Our system involves a semantic representation schema and an antecedent selection process that acts on web-search results. We evaluated the performance of our framework on the original set of WSC instances, achieving F1-performance that significantly exceeded the previous state of the art. A simple port of our approach to COPA [Choice of Plausible Alternatives] suggests that it has the potential to generalize. In the future we will study how this commonsense reasoning technique can contribute to solving ‘edge cases’ and difficult examples in more general coreference tasks.”

    arxiv-1810.01375a.png

    [Image source. Click image to open in new window.]


    arxiv-1810.01375b.png

    [Image source. Click image to open in new window.]


  • A Fully Attention-Based Information Retriever (Oct 2018) [code]

    • “Recurrent neural networks are now the state-of-the-art in natural language processing because they can build rich contextual representations and process texts of arbitrary length. However, recent developments on attention mechanisms have equipped feedforward networks with similar capabilities, hence enabling faster computations due to the increase in the number of operations that can be parallelized. We explore this new type of architecture in the domain of question-answering and propose a novel approach that we call Fully Attention Based Information Retriever (FABIR). We show that FABIR achieves competitive results in the Stanford Question Answering Dataset (SQuAD) while having fewer parameters and being faster at both learning and inference than rival methods.”

      “The experiments validate that attention mechanisms alone are enough to power an effective question-answering model. Above all, FABIR proved roughly five times faster at both training and inference than BiDAF, a competing RNN-based model with similar performance. … Although FABIR is still far from surpassing the models at the top of the SQuAD leaderboard (Table III), we believe that its faster and lighter architecture already make it an attractive alternative to RNN-based models, especially for applications with limited processing power or that require low-latency.”

    • Critique.

      Like FABIR (which was also evaluated with the attention module only, minus convolution – giving satisfactory results), QANet (Apr 2018) is a QA architecture that consists entirely of convolution and self-attention; on the SQuAD dataset it was 3x to 13x faster in training and 4x to 9x faster in inference than the state of the art at that time, and it places highly on the SQuAD1.1 Leaderboard (2018-10-23). However, the FABIR paper [A Fully Attention-Based Information Retriever (Oct 2018)] fails to cite the earlier, more performant QANet work [QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension].

    arxiv1810.09580-f1.png

    [Image source. Click image to open in new window.]


    arxiv1810.09580-f2.png

    [Image source. Click image to open in new window.]


    arxiv1810.09580-t1.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Natural Language Inference

Natural language inference (NLI), also known as “recognizing textual entailment” (RTE), is the task of identifying the relationship (entailment, contradiction, or neutral) that holds between a premise $\small p$ (e.g. a piece of text) and a hypothesis $\small h$. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI) Corpus, contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of NLI. A newer Multi-Genre Natural Language Inference (MultiNLI) corpus is also available: a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The MultiNLI corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.

NLI was one of the 10 tasks proposed in The Natural Language Decathlon: Multitask Learning as Question Answering, a NLP challenge spanning 10 tasks introduced by Richard Socher and colleagues at Salesforce.

Google’s A Decomposable Attention Model for Natural Language Inference (Parikh et al., Sep 2016) likewise proposed a simple neural architecture for natural language inference that used attention to decompose the problem into subproblems that could be solved separately, thus making it trivially parallelizable. Their use of attention was based purely on word embeddings, essentially consisting of feedforward networks that operated largely independently of word order. On the Stanford Natural Language Inference (SNLI) dataset, they obtained state of the art results with almost an order of magnitude fewer parameters than previous work, without relying on any word-order information. The approach outperformed considerably more complex neural methods aiming for text understanding, suggesting that – at least for that task – pairwise comparisons are relatively more important than global sentence-level representations.
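The “attend” step that makes this decomposition possible can be sketched directly from word embeddings (the feedforward scoring networks of the paper are replaced here by raw dot products for brevity):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(premise, hypothesis):
    """'Attend' step of the decomposable attention model: soft-align each
    word of one sentence with the other using only embedding dot products --
    no recurrence, no word-order information, trivially parallelizable."""
    E = premise @ hypothesis.T                  # (len_p, len_h) alignment scores
    beta = softmax(E, axis=1) @ hypothesis      # hypothesis summary per premise word
    alpha = softmax(E, axis=0).T @ premise      # premise summary per hypothesis word
    return beta, alpha

rng = np.random.default_rng(0)
premise, hypothesis = rng.standard_normal((7, 300)), rng.standard_normal((5, 300))
beta, alpha = attend(premise, hypothesis)       # per-word subproblems, solved separately
```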

arxiv1606.0193-fig1.png

[Image source. Click image to open in new window.]


arxiv1606.0193-tables1+2.png

[Image source. Click image to open in new window.]


However,

  • that same model (Parikh et al. 2016; see Table 3 in the image below),
  • and also one based on a Bi-LSTM-based single sentence-encoding model without attention (ibid.),
  • and a hybrid TreeLSTM-based and Bi-LSTM-based model with an inter-sentence attention mechanism to align words across sentences (ibid.)

… all performed poorly on the newer “Breaking NLI” NLI test set, indicating the difficulty of the task (and reiterating the need for ever more challenging datasets). The new examples were simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set was substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences. That finding recalls my earlier discussion on adversarial challenges to BiDAF/SQuAD-based QA.

arxiv1805.02266-table3b.png

[Image source. Click image to open in new window.]


arxiv1808.03894-fig1.png

[Image source. Click image to open in new window.]


Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. However, efforts to obtain embeddings for larger chunks of text, such as sentences, have not been as successful: several attempts at learning unsupervised representations of sentences have not reached sufficient performance to be widely adopted. For a long time, supervised learning of sentence embeddings was thought to give lower-quality embeddings than unsupervised approaches, but this assumption has recently been overturned, in part following the publication of the InferSent model by Facebook AI Research (Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (May 2017; updated Jul 2018) [code;  discussion: A Walkthrough of InferSent – Supervised Learning of Sentence Embeddings, and reddit]). The authors showed that universal sentence representations trained using the supervised data of the Stanford Natural Language Inference (SNLI) dataset could consistently outperform unsupervised methods, like SkipThought vectors, on a wide range of transfer tasks. Much as computer vision uses ImageNet-derived features that can be transferred to other tasks, this work indicated the suitability of natural language inference for transfer learning to other NLP tasks.

arxiv1705.02364-f1.png

[Image source. Click image to open in new window.]


arxiv1705.02364-f4.png

[Image source. Click image to open in new window.]


  • InferSent was an interesting approach owing to the simplicity of its architecture: a bi-directional LSTM complete with a max-pooling operator as sentence encoder. InferSent used the SNLI dataset (a set of 570k pairs of sentences labeled with 3 categories: neutral, contradiction and entailment) to train the classifier on top of the sentence encoder. Both sentences were encoded using the same encoder, while the classifier was trained on a pair representation constructed from the two sentence embeddings.
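The pair representation InferSent feeds to its classifier is the concatenation $\small [u; v; |u - v|; u \ast v]$ of the two sentence embeddings. A minimal numpy sketch (the max-pooled "encoder" here is a stand-in for the Bi-LSTM):

```python
import numpy as np

def encode(token_vectors):
    """Sentence encoder sketch: element-wise max pooling over (stand-in)
    BiLSTM hidden states -- here, over the raw token vectors."""
    return token_vectors.max(axis=0)

def pair_representation(u, v):
    """InferSent-style pair features fed to the 3-way (entailment /
    contradiction / neutral) classifier: concatenation, absolute
    difference, and element-wise product."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

rng = np.random.default_rng(0)
u = encode(rng.standard_normal((9, 128)))      # premise embedding
v = encode(rng.standard_normal((6, 128)))      # hypothesis embedding
features = pair_representation(u, v)           # (512,) classifier input
```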

In a very similar architecture to InferSent (compare the images above/below), Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture (Aug 2018) [code] from the University of Helsinki yielded state of the art results for SNLI sentence encoding-based models and the SciTail dataset, and provided strong results for the MultiNLI dataset. [The SciTail dataset is an NLI dataset created from multiple-choice science exams consisting of 27k sentence pairs. Each question and the correct answer choice have been converted into an assertive statement to form the hypothesis.] The sentence embeddings could be utilized in a wide variety of transfer learning tasks, outperforming InferSent on 7/10 and SkipThought on 8/9 SentEval sentence embedding evaluation tasks. Furthermore, their model beat InferSent in 8/10 recently published SentEval probing tasks designed to evaluate the ability of sentence embeddings to capture some of the important linguistic properties of sentences.

arxiv1808.08762-f1.png

[Image source. Click image to open in new window.]


arxiv1808.08762-f2.png

[Image source. Click image to open in new window.]


arxiv1808.08762-t9.png

[Image source. Click image to open in new window.]


“The success of the proposed hierarchical architecture raises a number of additional interesting questions. First, it would be important to understand what kind of semantic information the different layers are able to capture. Second, a detailed and systematic comparison of different hierarchical architecture configurations, combining BiLSTM and max pooling in different ways, could lead to even stronger results, as indicated by the results we obtained on the SciTail dataset with the modified 4-layered model. Also, as the sentence embedding approaches for NLI focus mostly on the sentence encoder, we think that more should be done to study the classifier part of the overall NLI architecture. There is not enough research on classifiers for NLI and we hypothesize that further improvements can be achieved by a systematic study of different classifier architectures, starting from the way the two sentence embeddings are combined before passing on to the classifier.”

Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. Bridging Knowledge Gaps in Neural Entailment via Symbolic Models (Sep 2018) focused on filling these knowledge gaps in the Science Entailment task by leveraging an external structured knowledge base (KB) of science facts. Their architecture (NSnet) combined standard neural entailment models with a knowledge lookup module. To facilitate this lookup, they proposed a fact-level decomposition of the hypothesis, verifying the resulting sub-facts against both the textual premise and the structured KB. NSnet learned to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperformed a simpler combination of the two predictions by 3%, and the base entailment model by 5%.

arxiv1808.09333-f1.png

[Image source. Click image to open in new window.]


arxiv1808.09333-f2.png

[Image source. Click image to open in new window.]


arxiv1808.09333-t3.png

[Image source. Click image to open in new window.]


arxiv1808.09333-t1.png

[Image source. Click image to open in new window.]


[Table of Contents]

ADDRESSING INFORMATION OVERLOAD

The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time (Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning). Even within the more focused health, technical and scientific domains we face a continuous onslaught of new information and knowledge from which we must filter out the non-relevant information, seeking to retain (or hoping to find again) knowledge that is relevant to us.

Information overload is characterized by the difficulty of understanding an issue and effectively making decisions when one has too much information about that issue. In our infocentric world, we have an increasing dependency on relevant, accurate information that is buried in the avalanche of continuously generated information. Coincident with information overload is the phenomenon of attention overload: we have limited attention and we’re not always sure where to direct it. It can be difficult to limit how much information we consume when there’s always something new waiting for a click; before we know it, an abundance of messy and complex information has infiltrated our minds. If our processing strategies don’t keep pace, our online explorations create strained confusion instead of informed clarity. Hence, more information is not necessarily better.

  • When Choice is Demotivating: Can One Desire Too Much of a Good Thing? discussed findings from 3 experimental studies that starkly challenged the implicit assumption that having more choices is more intrinsically motivating than having fewer choices. Those experiments, which were conducted in both field and laboratory settings, showed that people are more likely to make purchases or undertake optional classroom essay assignments when offered a limited array of 6 choices, rather than a more extensive array of 24 or 30 choices. Moreover, participants reported greater subsequent satisfaction with their selections and wrote better essays when their original set of options had been limited.

  • Information overload is a long-standing issue: in her 2010 book Too Much To Know: Managing Scholarly Information before the Modern Age, Harvard Department of History Professor Ann Blair argued that the early modern methods of selecting, summarizing, sorting, and storing text (the 4S’s) are at the root of the techniques we use today in information management.

  • For more discussion, see the first part of the blog post Information Overload, Fake News, and Invisible Gorillas  [local copy].

The construction of a well-crafted biomedical textual knowledge store (TKS) – with a focus on high quality, high impact material – partly addresses the issue of information overload. TKS provide a curated source of preselected and stored textual material, upon which programmatic approaches such as text mining and data visualization provide a more focused, deeper understanding. The application of advanced NLP and ML methods applied to TKS will assist the processing (e.g. clustering; dimensionality reduction; ranking; summarization; etc.) and understanding of PubMed/PubMed Central (PM/PMC) and other textual sources based on user-defined interests. As well, techniques such as active learning and analyses of user patterns and preferences could assist refined queries to external sources (PubMed and other online search engines), and the processing of those new data, in an increasingly focused iterative approach. The incorporation of vector space approaches and other “fuzzy” search paradigms could incidentally assist knowledge discovery. Algorithms and software acting as intelligent agents could automatically, autonomously and adaptively scan PM/PMC and other sources for new knowledge in multiple subject/topic areas; for example, monitoring the biomedical literature for new information relevant to genomic variants.

The new information that is retrieved from TKS may also be cross-queried against knowledge graphs to recover additional knowledge and discover new relations. The increasing availability of personalized genomic information and other personalized medical information will drive a demand for access to high quality TKS and methods to efficiently query and process those knowledge stores. Intelligently designed graphical user interfaces will allow the querying and viewing of those data (text; graphs; pathways and networks; images; etc.), per the user’s needs.

[Table of Contents]

Text Classification

There is an increasing need for semi-supervised and unsupervised tools that can preprocess, analyse and classify raw text to extract interpretable content; for example, identifying topics and content-driven groupings of articles. One approach to information overload is to classify documents, grouping related information together. Accurate document classification is a key component of ensuring the quality of any digital library: leaving documents unclassified impedes systems – and hence users – from finding useful information.

The Word Mover's Distance (WMD) – introduced in From Word Embeddings To Document Distances (2015) [code;  tutorial;  tutorial: Python implementation with WMD paper coauthor Matt Kusner] – is a novel distance function between text documents, based on results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences.

The WMD distance measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of another document. Although two documents may share no words in common, WMD can still measure their semantic similarity by considering their word embeddings, whereas bag-of-words and term frequency-inverse document frequency (TF-IDF) methods measure similarity only through the literal appearance of words.

The WMD metric has no hyperparameters and is straightforward to implement. Furthermore, on eight real-world document classification data sets, compared against seven state of the art baselines, the WMD metric demonstrated unprecedentedly low $\small k$-nearest neighbor document classification error rates.
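WMD is also available off the shelf: Gensim, for example, exposes it on any set of word vectors. A minimal sketch (assuming pretrained GoogleNews vectors on disk and Gensim’s optional earth-mover solver dependency installed; the token lists are illustrative):

```python
from gensim.models import KeyedVectors

# Load pretrained word vectors (path and format are illustrative).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                            binary=True)

# Two documents with no words in common, as pre-tokenized, stopword-free lists.
doc1 = ["obama", "speaks", "media", "illinois"]
doc2 = ["president", "greets", "press", "chicago"]

# WMD: minimum cumulative distance the embedded words of doc1 must
# "travel" to reach the embedded words of doc2 (lower = more similar).
print(vectors.wmdistance(doc1, doc2))
```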

wmd-f1+f2.png

[Image source. Click image to open in new window.]


wmd-f3+f4+t2.png

[Image source. Click image to open in new window.]


In the biomedical domain, Bridging the Gap: Incorporating a Semantic Similarity Measure for Effectively Mapping PubMed Queries to Documents (published Nov 2017) presented a query-document similarity measure motivated by WMD. Their method relied on neural word embeddings to compute the distance between words, which – unlike other similarity measures – helped identify related words when no direct matches were found between a query and a document (e.g., as shown in Fig. 1 in that paper).

arxiv1608.01972-f1.png

[Image source. Click image to open in new window.]


In Representation Learning of Entities and Documents from Knowledge Base Descriptions – jointly by Studio Ousia and collaborators – the authors described TextEnt, a neural network model that learned distributed representations of entities and documents directly from a knowledge base. Given a document in a knowledge base consisting of words and entity annotations, they trained their model to predict the entity that the document described, and mapped the document and its target entity close to each other in a continuous vector space. Their model (Fig. 2 in that paper) was trained using a large number of documents extracted from Wikipedia. TextEnt (which performed somewhat better than their Wikipedia2Vec baseline model) used the last, fully-connected layer to classify documents into a set of pretrained classes (their Table 3).

This is similar to how the fully-connected layer in various ImageNet image classification models classifies images into predefined categories – and for which removing that last fully-connected layer enables the use of those models for transfer learning, as described in the Stanford cs231n course page Transfer Learning  [local copy].

Recent word embedding methods such as word2vec [Efficient Estimation of Word Representations in Vector Space (Sep 2013)], introduced by Tomas Mikolov et al. at Google in 2013, are capable of learning semantic meaning and similarity between words in an entirely unsupervised manner using a contextual window, doing so much faster than previous methods. [For an interesting discussion, see Word2Vec and FastText Word Embedding with Gensim.]
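A minimal sketch of training skip-gram embeddings with Gensim (parameter names follow the Gensim 4 API; the toy corpus stands in for something like tokenized PubMed abstracts):

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens. A real corpus would be
# millions of sentences, e.g. tokenized PubMed abstracts.
sentences = [["breast", "cancer", "tumor", "growth"],
             ["tumor", "suppressor", "gene", "mutation"],
             ["gene", "expression", "in", "cancer", "cells"]]

# sg=1 selects the skip-gram architecture (sg=0 is CBOW); window is the
# contextual window discussed above; min_count=1 only because the corpus is tiny.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv.most_similar("cancer", topn=3))  # nearest words in the vector space
print(model.wv["tumor"][:5])                    # first 5 dimensions of one embedding
```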

arxiv1301.3781-f1.png

[Image source. Click image to open in new window.]


While vector space representations of words succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, the origins of that success remained opaque. GloVe: Global Vectors for Word Representation (2014) [project] – by Jeffrey Pennington, Richard Socher and Christopher D. Manning at Stanford University – analyzed and made explicit the model properties needed for such regularities to emerge in word vectors. The result was a new global log-bilinear regression model (GloVe ) that combined the advantages of the two major model families in the literature: global matrix factorization, and local context window methods. GloVe efficiently leveraged statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produced a vector space with meaningful substructure, as evidenced by its performance of 75% on a word analogy task. GloVe also outperformed related models on similarity tasks and named entity recognition.

FastText  [GitHub] – a library for efficient learning of word representations and sentence classification – is an extension to word2vec that was introduced by Tomas Mikolov and colleagues at Facebook AI Research in a series of papers in 2016.

fastText-1.png

[Image source. Click image to open in new window.]


cbo_vs_skipgram.png

[Image source. Click image to open in new window.]


arxiv1607.04606-f2.png

[Image source. Click image to open in new window.]


arxiv1607.04606-t7.png

[Image source. Click image to open in new window.]


Like word2vec, fastText is an unsupervised learning algorithm for obtaining vector representations for words; unlike word2vec, however, fastText can also classify text. FastText learns word embeddings in a manner very similar to word2vec, except that fastText enriches word vectors with subword information, using character $\small n$-grams of variable length. For example, the $\small tri$-grams for the word “apple” are “app”, “ppl”, and “ple” (ignoring the word-boundary markers); the embedding vector for “apple” is the sum of the vectors of all of its $\small n$-grams. These character $\small n$-grams allow the algorithm to identify prefixes, suffixes, stems, and other phonological, morphological and syntactic structure in a manner that does not rely on words being used in similar contexts (and thus being represented in similar regions of the vector space). After training, rare words can be properly represented, since it is highly likely that some of their $\small n$-grams also appear in other words; likewise, fastText can represent an out-of-vocabulary medical term as the normalized sum of the vector representations of its $\small n$-grams.
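The subword scheme is easy to sketch in plain Python (illustrative only: real fastText adds “<” and “>” boundary markers, uses $\small n$-gram lengths of 3-6 by default, and hashes the $\small n$-grams into a fixed-size table):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with fastText-style boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def subword_vector(word, ngram_vectors, dim=100):
    """OOV-friendly embedding: normalized sum of the word's n-gram vectors."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return np.zeros(dim)
    v = np.sum([ngram_vectors[g] for g in grams], axis=0)
    return v / np.linalg.norm(v)

print(char_ngrams("apple", 3, 3))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
```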

  • Regarding fastText and out-of-vocabulary (OOV) words, note that text input to fastText is lowercased (affecting the embeddings).

  • The effectiveness of word embeddings for downstream NLP tasks is limited by OOV words, for which embeddings do not exist. In 2017 Pinter et al. (Mimicking Word Embeddings using Subword RNNs  [code]) presented MIMICK, an approach to generating OOV word embeddings compositionally by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK did not require retraining on the original word embedding corpus; instead, learning was performed at the type level.

    MIMICK-f1.png

    [Image source. Click image to open in new window.]


    MIMICK-t1.png

    [Image source. Click image to open in new window.]


  • Sebastian Ruder also discussed OOV. Regarding Mimicking Word Embeddings using Subword RNNs, he stated:

    “Another interesting approach to generating OOV word embeddings is to train a character-based model to explicitly re-create pretrained embeddings (Pinter et al., 2017). This is particularly useful in low-resource scenarios, where a large corpus is inaccessible and only pretrained embeddings are available.”

FastText was compared to other approaches on the classification of PubMed abstracts in Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research (Apr 2017), where it performed very well. Interestingly, the embeddings learned by fastText on the entire English Wikipedia also worked well on that task, indicating that the diverse topics covered by Wikipedia provide a rich corpus from which to learn text semantics. In addition, Wikipedia contains documents related to biomedical research, such that its vocabulary is not as limited with regard to that domain as models trained on corpora from Freebase and GoogleNews. Performance using GoogleNews embeddings was comparable to PubMed and PubMed+Wiki embeddings. These results suggest that learning embeddings on a domain-specific corpus is not a requirement for success in these tasks – a conclusion echoed in A Comparison of Word Embeddings for the Biomedical Natural Language Processing (Jul 2018), which among its conclusions found that word embeddings trained on biomedical domain corpora do not necessarily outperform those trained on general domain corpora for downstream biomedical NLP tasks.

arxiv1705.06262-t2+t3.png

[Image source. Click image to open in new window.]


arxiv1802.00400-f1.png

[Image source. Click image to open in new window.]


arxiv1802.00400-t3.png

[Image source. Click image to open in new window.]


A recent, probabilistic extension of fastText, Probabilistic FastText for Multi-Sense Word Embeddings (Jun 2018) [code], produced accurate representations of rare, mis-spelled, and unseen words. Probabilistic FastText outperformed both fastText (which has no probabilistic model) and dictionary-level probabilistic embeddings (which do not incorporate subword structures) on several word-similarity benchmarks, including English rare word and foreign language datasets. It also achieved state of the art performance on benchmarks that measure ability to discern different meanings. The proposed model was the first to achieve multi-sense representations while having enriched semantics on rare words.

arxiv-1806.02901.png

[Image source. Click image to open in new window.]


arxiv1806.02901-f3.png

[Image source. Click image to open in new window.]


arxiv1806.02901-t1.png

[Image source. Click image to open in new window.]


Work extending the concept of word embeddings to sentence, paragraph and document embeddings was introduced in 2014 by Quoc V. Le and Tomas Mikolov at Google as Paragraph Vectors in Distributed Representations of Sentences and Documents (May 2014) [media: A gentle introduction to Doc2Vec], commonly known as doc2vec. However, there was some controversy as to whether doc2vec could outperform centroid methods, and others struggled to reproduce those results, leading Lau and Baldwin in An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation to perform an extensive comparison between various document embedding methods across different domains. They found that doc2vec performed robustly when using models trained on large external corpora, and could be further improved by using pretrained word embeddings. The general consensus was that different methods are best suited for different tasks; for example, centroids performed well on tweets, but are outperformed on longer documents. [For good discussions of various approaches, see Document Embedding, and The Current Best of Universal Word Embeddings and Sentence Embeddings.]
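A minimal doc2vec sketch using Gensim’s implementation (corpus and settings are toy-sized and illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document is a token list with a unique tag (toy-sized corpus).
corpus = [TaggedDocument(["acute", "myeloid", "leukemia", "therapy"], tags=["doc0"]),
          TaggedDocument(["breast", "cancer", "screening", "trial"], tags=["doc1"])]

# Paragraph Vectors: every document gets its own learned vector,
# trained jointly with the word vectors.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Unseen documents are embedded at inference time.
vec = model.infer_vector(["leukemia", "clinical", "trial"])
print(vec[:5])
```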

arxiv1405.4053 - f1+f2+f3.png

[Image source. Click image to open in new window.]



More recent work on document embedding was presented in Word Mover’s Embedding: From Word2Vec to Document Embedding (Oct 2018) [data, code]. While the celebrated word2vec technique yields semantically rich representations for individual words, there has been relatively little success in extending it to unsupervised sentence or document embeddings. Recent work demonstrated that the Word Mover's Distance (WMD), a document distance measure that aligns semantically similar words, yields unprecedented $\small k$NN classification accuracy; however, WMD is expensive to compute, and it is hard to extend its use beyond a $\small k$NN classifier. This paper proposed Word Mover’s Embedding (WME), a novel approach to building an unsupervised document (or sentence) embedding from pretrained word embeddings. In experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matched or outperformed state of the art techniques, with significantly higher accuracy on problems of short length.

arxiv1811.01713-f1.png

[Image source. Click image to open in new window.]


arxiv1811.01713-t1.png

[Image source. Click image to open in new window.]


arxiv1811.01713-t2+t3.png

[Image source. Click image to open in new window.]


arxiv1811.01713-t5+t6.png

[Image source. Click image to open in new window.]


Google’s Semi-Supervised Sequence Learning (Nov 2015) [code] used two approaches to improve sequence learning with long short-term memory (LSTM) recurrent networks: first, predicting what comes next in a sequence (a conventional language model); second, a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms could be used as an unsupervised pretraining step for a later supervised sequence learning algorithm. An important result from their experiments was that using more unlabeled data from related tasks in the pretraining improved the generalization (e.g. classification accuracy) of a subsequent supervised model. This was equivalent to adding substantially more labeled data, supporting the thesis that it is possible to use unsupervised learning with more unlabeled data to improve supervised learning. They also found that after being pretrained with the two approaches, LSTM were more stable and generalized better. Thus, this paper showed that it is possible to use LSTM for NLP tasks such as document classification, and that a language model or a sequence autoencoder can help stabilize learning in LSTM. On five benchmarks, the LSTM reached or surpassed the performance levels of all previous baselines.

arxiv1511.01432-f1.png

[Image source. Click image to open in new window.]


arxiv1511.01432-t2.png

[Image source. Click image to open in new window.]


arxiv1511.01432-t4.png

[Image source. Click image to open in new window.]


arxiv1511.01432-t6.png

[Image source. Click image to open in new window.]


arxiv1511.01432-t7.png

[Image source. Click image to open in new window.]


In addition to the Google-provided code, the text-classification-models-tf repository also provides Tensorflow implementations of various text classification models. Another repository by that GitHub user, Transfer Learning for Text Classification with Tensorflow, provides a TensorFlow implementation of semi-supervised learning for text classification – an implementation of Google’s Semi-supervised Sequence Learning paper.

An independent implementation for vector representation of documents, Doc2VecC (note the extra “C” at the end; Efficient Vector Representation for Documents through Corruption (Jul 2017) [code]), represented each document as a simple average of word embeddings, with a training objective that ensured the representation captured the semantic meaning of the document. Doc2VecC produced significantly better word embeddings than word2vec, and its simple model architecture matched or outperformed the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enabled training on billions of words per hour on a single machine; at the same time, the model was very efficient in generating representations of unseen documents at test time.
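A sketch of the representation step only (not Doc2VecC’s training objective): the document vector is an average of word vectors, with random word dropout (“corruption”) applied during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def doc2vecc_represent(tokens, word_vectors, dim=100, corrupt_p=0.0):
    """Document vector = average of its word vectors. During training,
    Doc2VecC randomly drops ("corrupts") a fraction of the words, which
    acts as data-dependent regularization; at test time corrupt_p=0."""
    kept = [t for t in tokens if t in word_vectors and rng.random() >= corrupt_p]
    if not kept:
        return np.zeros(dim)
    return np.mean([word_vectors[t] for t in kept], axis=0)
```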

arxiv1707.02377-f1.png

[Image source. Click image to open in new window.]


Le and Mikolov’s doc2vec model was used in a 2018 paper from Imperial College London, Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning (Aug 2018) [code here and here], which described a methodology that brought together powerful vector embeddings from NLP with tools from graph theory (that exploited diffusive dynamics on graphs) to reveal natural partitions across scales. Their framework used doc2vec to represent text in vector form, then applied a multi-scale community detection method (Markov Stability) to partition a similarity graph of document vectors. The method allowed them to obtain clusters of documents with similar content, at different levels of resolution, in an unsupervised manner. An analysis of a corpus of 9,000 news articles showed consistent groupings of documents according to content without a priori assumptions about the number or type of clusters to be found.

arxiv1808.01175-f1.png

[Image source. Click image to open in new window.]


arxiv1808.01175-f2.png

[Image source. Click image to open in new window.]


arxiv1808.01175-f3.png

[Image source. Click image to open in new window.]


arxiv1808.01175-f4.png

[Image source. Click image to open in new window.]


An Analysis of Hierarchical Text Classification Using Word Embeddings (Sep 2018) trained and evaluated the machine learning classification models fastText, XGBoost, SVM, and Keras’ CNN, together with the word embedding generation methods GloVe, word2vec and fastText, on publicly available data for the hierarchical classification task. FastText was the best-performing classifier, and also provided very good results as a word embedding generator despite the relatively small amount of data provided.

arxiv1809.01771-t1.png

[Image source. Click image to open in new window.]


In other work related to sentence embeddings, Unsupervised Learning of Sentence Representations Using Sequence Consistency (Sep 2018) from IBM Research proposed a simple yet powerful unsupervised method to learn universal distributed representations of sentences by enforcing consistency constraints on sequences of tokens, applicable to the classification of text and transfer learning. Their ConsSent model was compared to unsupervised methods (including GloVe, fastText, ELMo, etc.) and supervised methods (including InferSent, etc.) on a classification transfer task in their Table 1, where ConsSent performed very well, overall.

Sentence embedding is an important research topic in NLP: it is essential to generate a good embedding vector that fully reflects the semantic meaning of a sentence in order to achieve an enhanced performance for various NLP tasks. Although two sentences may employ different words or different structures, people will recognize them as the same sentence as long as the implied semantic meanings are highly similar. Hence, a good sentence embedding approach should satisfy the property that if two sentences have different structures but convey the same meaning (i.e., paraphrase sentences), then they should have the same (or at least similar) embedding vectors.

In 2018 Myeongjun Jang and Pilsung Kang at Korea University presented Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition (Oct 2018), which introduced their P-thought  model. Inspired by human language recognition, they proposed the concept of “semantic coherence,” which should be satisfied for good sentence embedding methods: similar sentences should be located close to each other in the embedding space. P-thought was designed as a dual generation model, which received a single sentence as input and generated both the input sentence and its paraphrase sentence, simultaneously. Given a (sentence, paraphrase) sentence tuple, it should be possible to generate both the sentence itself and its paraphrase sentence from the representation vector of an input sentence. For the P-thought model, they employed a seq2seq structure with a gated recurrent unit (GRU) cell. The encoder transformed the sequence of words from an input sentence into a fixed-sized representation vector, whereas the decoder generated the target sentence based on the given sentence representation vector. The P-thought model had two decoders: when the input sentence was given, the first decoder, named “auto-decoder,” generated the input sentence as-is. The second decoder, named “paraphrase-decoder,” generated the paraphrase sentence of the input sentence.
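A minimal PyTorch sketch of that dual-decoder idea (dimensions are illustrative; attention, masking, and the training loop – which would sum the two token-level cross entropy losses – are omitted):

```python
import torch
import torch.nn as nn

class PThoughtSketch(nn.Module):
    """Sketch of the dual-decoder idea: one GRU encoder, two GRU decoders
    (reconstruct the input; generate its paraphrase). Dimensions illustrative."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.auto_decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.para_decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)   # shared projection, for brevity

    def forward(self, src_ids, auto_ids, para_ids):
        _, h = self.encoder(self.embed(src_ids))    # h: the sentence embedding
        auto_h, _ = self.auto_decoder(self.embed(auto_ids), h)   # teacher-forced
        para_h, _ = self.para_decoder(self.embed(para_ids), h)   # teacher-forced
        return self.out(auto_h), self.out(para_h)   # two token-level losses
```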

arxiv1808.05505-f1+f2+t1.png

[Image source. Click image to open in new window.]


arxiv1808.05505-f3.png

[Image source. Click image to open in new window.]


arxiv1808.05505-t3.png

[Image source. Click image to open in new window.]


P-thought pursued maximal semantic coherence during training. Compared to a number of baselines (bag of words, Sent2Vec, etc. in their Table 3) on the MS-COCO dataset, InferSent and P-thought far surpassed the other models, with P-thought slightly outperforming InferSent. In the case of P-thought with a one-layer Bi-RNN, the P-coherence value was comparable to that of InferSent (0.7454 and 0.7432, respectively); P-thought with a two-layer forward RNN gave a score of 0.7899.

  • Whereas P-thought with a two-layer Bi-RNN gave a much higher P-coherence score (0.9725), this was an over-training artefact. The main limitation of that work was that there were insufficient paraphrase sentences for training the models: P-thought models with more complex encoder structures tended to overfit the MS-COCO datasets. Although this problem could be resolved by acquiring more paraphrase sentences, it was not easy for the authors to obtain a large number of paraphrase sentences. [In that regard, note my comments, above, on the DuoRC dataset, over which P-thought could be trained.]

In Google Brain/OpenAI’s Adversarial Training Methods for Semi-Supervised Text Classification (May 2017) [code | non-author code;  media], adversarial training provided a means of regularizing supervised learning algorithms while virtual adversarial training was able to extend supervised learning algorithms to the semi-supervised setting. However, both methods required making small perturbations to numerous entries of the input vector, which was inappropriate for sparse high-dimensional inputs such as one-hot word representations. The authors extended adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieved state of the art results on multiple benchmark semi-supervised and purely supervised tasks: the learned word embeddings were of higher quality, and the model was less prone to overfitting while training.
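The core trick is compact: perturb the embedding output, rather than the discrete input, in the direction of the loss gradient, and train on the perturbed loss. A minimal PyTorch sketch, in which `model.classify` is a hypothetical head mapping embedded sequences to logits, and a single global norm replaces the paper’s per-example normalization:

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, embedded, labels, epsilon=0.02):
    """Perturb the word embeddings (not the one-hot inputs) along the loss
    gradient, then compute the training loss on the perturbed embeddings."""
    embedded = embedded.detach().requires_grad_(True)
    loss = F.cross_entropy(model.classify(embedded), labels)  # `classify` is hypothetical
    grad, = torch.autograd.grad(loss, embedded)
    perturbation = epsilon * grad / (grad.norm() + 1e-12)     # L2-normalized direction
    adv_logits = model.classify(embedded + perturbation)
    return F.cross_entropy(adv_logits, labels)
```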

arxiv1605.07725-f1.png

[Image source. Click image to open in new window.]


arxiv1605.07725-t2+t3.png

[Image source. Click image to open in new window.]


Hierarchical approaches to document classification include HDLTex: Hierarchical Deep Learning for Text Classification (Oct 2017) [code]. The performance of traditional supervised classifiers has degraded as the number of documents has increased, because the growth in the number of documents has been accompanied by an increase in the number of categories. This paper approached the problem differently from current document classification methods that view the problem as multiclass classification, instead performing “hierarchical classification.” Traditional multi-class classification techniques work well for a limited number of classes, but performance drops with an increasing number of classes, as is present in hierarchically organized documents. Hierarchical deep learning solves this problem by creating neural architectures that specialize deep learning approaches for their level of the document hierarchy. HDLTex employed stacks of deep learning architectures (RNN, CNN) to provide specialized understanding at each level of the document hierarchy. Testing on a data set of documents obtained from the Web of Science showed that combinations of RNN at the higher level and DNN or CNN at the lower level produced accuracies consistently higher than those obtainable by conventional approaches using naïve Bayes or a support vector machine (SVM).

arxiv1709.08267-f1.png

[Image source. Click image to open in new window.]


arxiv1709.08267-f2.png

[Image source. Click image to open in new window.]


arxiv1709.08267-f3.png

[Image source. Click image to open in new window.]


arxiv1709.08267-t3.png

[Image source. Click image to open in new window.]


Hierarchical Attention Networks for Document Classification (2016; alternate link) [non-author code here, here, here and here] – by authors at Carnegie Mellon University and Microsoft Research – described a hierarchical attention network for document classification. Their model had two distinctive characteristics: a hierarchical structure that mirrored the hierarchical structure of documents, and two levels of attention mechanisms (word-level and sentence-level), enabling it to attend differentially to more and less important content when constructing the document representation. Experiments on six large scale text classification tasks demonstrated that the proposed architecture outperformed previous methods by a substantial margin. Visualization of the attention layers illustrated that the model selected qualitatively informative words and sentences.
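Both attention levels share the same form, which is compact enough to sketch (a minimal PyTorch version, not the authors’ code; the learned context vector acts as a query for “which positions are informative”):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """One HAN-style attention level: score each position against a learned
    context vector, softmax, and return the weighted sum."""
    def __init__(self, hid_dim=100):
        super().__init__()
        self.proj = nn.Linear(hid_dim, hid_dim)
        self.context = nn.Parameter(torch.randn(hid_dim))

    def forward(self, h):                                # (batch, steps, hid_dim)
        u = torch.tanh(self.proj(h))                     # (batch, steps, hid_dim)
        alpha = torch.softmax(u @ self.context, dim=1)   # (batch, steps)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)      # (batch, hid_dim)
```

Stacking two of these – over words within each sentence, then over the resulting sentence vectors – yields the hierarchical document representation described above.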

Yang2016-fig2.png

[Image source. Click image to open in new window.]


Yang2016-table2.png

[Image source. Click image to open in new window.]


Yang2016-fig5+6.png

[Image source. Note also Fig. 1 in that paper. Click image to open in new window.]


Recent work (2018) from Stanford University (Training Classifiers with Natural Language Explanations (Aug 2018) [code;  worksheet;  author’s blog post;  demo video;  discussion]) proposed BabbleLabble, a framework for training classifiers in which human annotators provide natural language explanations for each labeling decision. A semantic parser converted those explanations into programmatic labeling functions that generated noisy labels for an arbitrary amount of unlabeled data, which were then used to train a classifier. On three relation extraction tasks, users were able to train classifiers with comparable $\small F_1$ scores 5-100x faster by providing explanations instead of just labels: on the spouse task, 30 explanations were worth around 5,000 labels; on the disease task, 30 explanations were worth around 1,500 labels; and on the protein task, 30 explanations were worth around 175 labels.

arxiv1805.03818-f1.png

[Image source. Click image to open in new window.]


arxiv1805.03818-f2.png

[Image source. Click image to open in new window.]


arxiv1805.03818-f3.png

[Image source. Click image to open in new window.]


arxiv1805.03818-f4.png

[Image source. Click image to open in new window.]


arxiv1805.03818-f5.png

[Image source. Click image to open in new window.]


arxiv1805.03818-t4+t5.png

[Image source. Click image to open in new window.]


  • This project is part of the Snorkel project (a training data creation and management system focused on information extraction), the successor to the now deprecated DeepDive project. Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

Document relevance ranking, also known as ad hoc retrieval (Harman, 2005), is the task of ranking documents from a large collection using only the query and the text of each document. This contrasts with standard information retrieval (IR) systems, which rely on text-based signals in conjunction with network structure and/or user feedback. Text-based ranking is particularly important when click logs do not exist or are small, and when the network structure of the collection is non-existent or uninformative for query-focused relevance. Examples include various domains in digital libraries, e.g. patents or scientific literature (Wu et al., 2015;  Tsatsaronis et al., 2015), enterprise search, and personal search.

Deep Relevance Ranking Using Enhanced Document-Query Interactions (Sep 2018) [code] – by researchers at the Athens University of Economics and Business and Google AI – explored several new models for document relevance ranking, building upon the Deep Relevance Matching Model (DRMM) of Guo et al. (2016). Unlike DRMM, which used context-insensitive encodings of terms and of query-document term interactions, they injected rich context-sensitive encodings throughout their models, extended in several ways including multiple views of the query and document inputs. The enhanced models outperformed BM25-based baselines on data from the BIOASQ question answering challenge and TREC ROBUST 2004.

arxiv1809.01682-f1.png

[Image source. Click image to open in new window.]


arxiv1809.01682-f2.png

[Image source. Click image to open in new window.]


arxiv1809.01682-f3+f4.png

[Image source. Click image to open in new window.]


arxiv1809.01682-f5+f6.png

[Image source. Click image to open in new window.]


Graph-based Deep-Tree Recursive Neural Network (DTRNN) for Text Classification (Sep 2018) employed a graph representation learning approach to text classification, in which the texts are nodes in a graph. First, their novel graph-to-tree conversion mechanism, deep-tree generation (DTG), converted the graph of textual data into a deep tree, generating a richer and more accurate representation for the nodes (vertices). DTG added flexibility in exploring the vertex neighborhood information, better reflecting second-order proximity and homophily [the tendency of similar people/objects to group together] equivalence in a graph. Then, a deep-tree recursive neural network (DTRNN) was used to classify vertices containing text data in graphs. The model captured the neighborhood information of a node better than the traditional breadth-first search tree generation method. Experimental results on three citation datasets demonstrated the effectiveness of the proposed DTRNN method, giving state of the art classification accuracy for graph-structured text. They also tried adding attention models when training graph data in the DTRNN; however, the attention mechanism did not improve accuracy, because the DTRNN algorithm alone already captured more features of each node.

arxiv1809.01219-f1.png

[Image source. Click image to open in new window.]


arxiv1809.01219-f2.png

[Image source. Click image to open in new window.]


Graph Convolutional Networks for Text Classification (Oct 2018) proposed the use of graph convolutional networks (GCN) for text classification. They built a single text graph for a corpus based on word co-occurrence and document word relations, then learned a Text GCN for the corpus. Their Text GCN was initialized with one-hot representation for words and documents; it then jointly learned the embeddings for both words and documents, supervised by the known class labels for the documents. Experimental results on multiple benchmark datasets demonstrated that a vanilla Text GCN without any external word embeddings or knowledge outperformed state of the art methods for text classification. Text GCN also learned predictive word and document embeddings. Additionally, experimental results showed that the improvement of Text GCN over state of the art comparison methods became more prominent as the percentage of training data was lowered, suggesting the robustness of Text GCN to less training data in text classification.
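A minimal numpy sketch of the two-layer GCN propagation rule used by Text GCN (the adjacency matrix – built offline in the paper from PMI word-word edges and TF-IDF document-word edges – is taken as given here):

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used by GCN."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def text_gcn_forward(A, X, W0, W1):
    """Two-layer GCN: softmax(A' relu(A' X W0) W1).
    For Text GCN, X is the identity (one-hot nodes = words + documents)."""
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W0, 0)           # ReLU
    logits = A_norm @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)      # class probabilities per node
```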

arxiv1809.05679-f1.png

[Image source. Click image to open in new window.]


arxiv1809.05679-t2.png

[Image source. Click image to open in new window.]


arxiv1809.05679-f5+f6+t3.png

[Image source. Click image to open in new window.]


Text Classification:

Additional Reading

  • Towards Explainable NLP: A Generative Explanation Framework for Text Classification (Nov 2018)

    “Building explainable systems is a critical problem in the field of Natural Language Processing (NLP), since most machine learning models provide no explanations for the predictions. Existing approaches for explainable machine learning systems tend to focus on interpreting the outputs or the connections between inputs and outputs. However, the fine-grained information is often ignored, and the systems do not explicitly generate the human-readable explanations. To better alleviate this problem, we propose a novel generative explanation framework that learns to make classification decisions and generate fine-grained explanations at the same time. More specifically, we introduce the explainable factor and the minimum risk training approach that learn to generate more reasonable explanations. We construct two new datasets that contain summaries, rating scores, and fine-grained reasons. We conduct experiments on both datasets, comparing with several strong neural network baseline systems. Experimental results show that our method surpasses all baselines on both datasets, and is able to generate concise explanations at the same time.”

    arxiv1811.00196-f1+f2.png

    [Image source. Click image to open in new window.]


    arxiv1811.00196-t8.png

    [Image source. Click image to open in new window.]
  • AttentionXML: Extreme Multi-Label Text Classification with Multi-Label Attention Based Recurrent Neural Networks (Nov 2018)

    “Extreme multi-label text classification (XMTC) is a task for tagging each given text with the most relevant multiple labels from an extremely large-scale label set. This task can be found in many applications, such as product categorization,web page tagging, news annotation and so on. Many methods have been proposed so far for solving XMTC, while most of the existing methods use traditional bag-of-words (BOW) representation, ignoring word context as well as deep semantic information. XML-CNN, a state-of-the-art deep learning-based method, uses convolutional neural network (CNN) with dynamic pooling to process the text, going beyond the BOW-based approaches but failing to capture 1) the long-distance dependency among words and 2) different levels of importance of a word for each label. We propose a new deep learning-based method, AttentionXML, which uses bidirectional long short-term memory (LSTM) and a multi-label attention mechanism for solving the above 1st and 2nd problems, respectively. We empirically compared AttentionXML with other six state-of-the-art methods over five benchmark datasets. AttentionXML outperformed all competing methods under all experimental settings except only a couple of cases. In addition, a consensus ensemble of AttentionXML with the second best method, Parabel, could further improve the performance over all five benchmark datasets.”

    arxiv1811.01727-f1.png

    [Image source. Click image to open in new window.]


    arxiv1811.01727-t1.png

    [Image source. Click image to open in new window.]


    arxiv1811.01727-t2+t3.png

    [Image source. Click image to open in new window.]


    arxiv1811.01727-t5.png

    [Image source. Click image to open in new window.]



[Table of Contents]

Text Summarization

Approximately 1.28 million articles were added to PubMed in 2017, including ~0.36 million full-text articles added to PubMed Central, at the rate of ~3,485 new articles per day (queried 2018-06-29; see also my blog post).

  • Of those, ~122,381 included the word “cancer” in the title or abstract, i.e. ~335 papers/day (PubMed query 2017[dp] AND cancer[tiab] executed 2018-06-29; note the capitalized Boolean).

  • Narrowing the search to 2017[dp] AND 'breast cancer'[tiab] or 2017[dp] AND 'acute myeloid leukemia'[tiab] returned 16,706 and 2,030 articles (45.77 and 5.56 articles/day), respectively.

The following command-line query shows the numbers of PubMed publications per indicated year (queried on the indicated date: PubMed continually adds older, previously non-indexed articles):
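For example, a Python sketch of one way to gather those counts via NCBI’s E-utilities esearch endpoint (whose rettype=count option returns just the hit count; the exact query and year range shown are assumptions, and API keys/precise rate limits are glossed over):

```python
import re
import time
import urllib.parse
import urllib.request

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

with open("pm.dat", "w") as out:
    for year in range(2000, 2018):
        params = urllib.parse.urlencode({"db": "pubmed", "rettype": "count",
                                         "term": f"{year}[dp]"})
        xml = urllib.request.urlopen(f"{BASE}?{params}").read().decode()
        count = re.search(r"<Count>(\d+)</Count>", xml).group(1)
        out.write(f"{year} {count}\n")   # gnuplot-friendly: "year count"
        time.sleep(0.5)                  # stay under NCBI's request-rate limits
```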

With those data in ~/pm.dat, executing gnuplot -p -e 'plot "~/pm.dat" notitle' gives this plot:

pubmed_2000-2017.png

Those data show an approximately linear increase in annual output over a ~13-year span (ca. 2002-2015), tapering recently. Contrary to numerous assertions in various research papers and the media, there is no exponential growth in this literature – nevertheless, the output is staggering.

Accurate text summarization is needed to address, in part, the information overload arising from the enormous volume and overall growth of the PM/PMC biomedical literature. Text summarization generally falls into one of two categories:

  • extractive summarization, which summarizes text by copying parts of the input, and

  • abstractive summarization, which generates new phrases (possibly rephrasing or using words that were not in the original text).

Abstractive summarization tends to be more concise than extractive summarization (which tends to be more repetitive and burdened with non-relevant text). However, extractive summarization is much easier to implement, and can provide unaltered evidentiary snippets of text.
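The extractive side is simple enough to sketch; a crude TF-IDF sentence-scoring extractor (illustrative only, not any particular paper’s method):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(sentences, n=2):
    """Copy the n highest-scoring sentences verbatim, in original order.
    Score = total TF-IDF mass of the sentence (crude: this favors longer
    sentences; real systems normalize and penalize redundancy)."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    top = sorted(np.argsort(scores)[-n:])
    return " ".join(sentences[i] for i in top)

doc = ["Deep learning has transformed natural language processing.",
       "The weather was pleasant that day.",
       "Summarization models compress documents into short texts.",
       "He ordered a coffee."]
print(extractive_summary(doc, n=2))
```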

Extractive Summarization

The Word Mover's Distance (WMD) was applied to the extractive summarization task in Efficient and Effective Single-Document Summarizations and A Word-Embedding Measurement of Quality (Oct 2017). WMD uses word2vec as the word embedding representation, and measures the dissimilarity between two documents as the minimum cumulative distance needed to “travel” from the embedded words of one document to those of the other.

arxiv1710.00284-f2.png

[Image source. Click image to open in new window.]


WMD has also been used in other summarization work.

Likewise, Data-driven Summarization of Scientific Articles (Apr 2018) [datasets;  slides] applied WMD to the biomedical domain, comparing that approach to others in a very interesting and revealing study on extractive and abstractive summarization. The examples presented in their Tables 4 and 5 demonstrate very clearly the differences in extractive summarization over full length articles, for title and abstract generation from the full length texts. While the results for title generation were promising, the models struggled with generating the abstract, highlighting the necessity of developing novel models capable of efficiently dealing with long input and output sequences while preserving the quality of the generated sentences.

arxiv1804.08875-t4.png

[Image source. Click image to open in new window.]


arxiv1804.08875-t5.png

[Note: partial image (continues over two PDF pp.). Image source. Click image to open in new window.]


Ranking Sentences for Extractive Summarization with Reinforcement Learning (Apr 2018) [code;  live demo] conceptualized extractive summarization as a sentence ranking task. While many extractive summarization systems are trained using cross entropy loss in order to maximize the likelihood of the ground-truth labels, they do not necessarily learn to rank sentences based on their importance, due to the absence of a ranking-based objective. [Cross entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1.]

  • In this paper the authors argued that models trained with cross entropy loss are prone to generating verbose summaries with unnecessarily long sentences and redundant information. They proposed overcoming these difficulties by globally optimizing the ROUGE evaluation metric, and learning to rank sentences for summary generation through a reinforcement learning objective.

    Their neural summarization model, REFRESH (REinFoRcement Learning-based Extractive Summarization), consisted of a hierarchical document encoder and a hierarchical sentence extractor. During training, it combined the maximum-likelihood cross entropy loss with rewards from policy gradient reinforcement learning to directly optimize the evaluation metric relevant to the summarization task. The model was applied to the CNN and DailyMail datasets, on which it outperformed baseline state of the art extractive and abstractive systems when evaluated automatically and by humans. They showed that their global optimization framework rendered extractive models better at discriminating among sentences for the final summary, and that the state of the art abstractive systems evaluated lagged behind the extractive ones when the latter were globally trained.

    arxiv1802.08636-f1.png

    [Image source. Click image to open in new window.]


    arxiv1802.08636-t1.png

    [Image source. Click image to open in new window.]


    arxiv1802.08636-f2.png

    [Image source. Click image to open in new window.]


    arxiv1802.08636-t2+t3.png

    [Image source. Click image to open in new window.]


Iterative Document Representation Learning Towards Summarization with Polishing (Sep 2018) [code] introduced Iterative Text Summarization (ITS), an iteration based model for supervised extractive text summarization, inspired by the observation that it is often necessary for a human to read an article multiple times in order to fully understand and summarize its contents. Current summarization approaches read through a document only once to generate a document representation, resulting in a sub-optimal representation. To address this issue they introduced a model which iteratively polished the document representation on many passes through the document. As part of their model, they also introduced a selective reading mechanism that decided more accurately the extent to which each sentence in the model should be updated. Experimental results on the CNN/DailyMail and DUC2002 datasets demonstrated that their model significantly outperformed state of the art extractive systems when evaluated by machines and by humans.

arxiv1809.10324-f1.png

[Image source. Click image to open in new window.]


arxiv1809.10324-t1+t2+t3.png

[Image source. Click image to open in new window.]


Comparing tables in those respective papers (above), note that ITS (Iterative Document Representation Learning Towards Summarization with Polishing) outperformed REFRESH (Ranking Sentences for Extractive Summarization with Reinforcement Learning).

REFRESH_v_ITS.png

[Click image to open in new window.]


Extractive Summarization:

Additional Reading

  • Semantic Sentence Embeddings for Paraphrasing and Text Summarization (Sep 2018)

    “We showed the use of a deep LSTM based model in a sequence learning problem to encode sentences with common semantic information to similar vector representations. The presented latent representation of sentences has been shown useful for sentence paraphrasing and document summarization. We believe that reversing the encoder sentences helped the model learn long dependencies over long sentences. One of the advantages of our simple and straightforward representation is the applicability into a variety of tasks. Further research in this area can lead into higher quality vector representations that can be used for more challenging sequence learning tasks.”

    arxiv1809.10267-f1+f2.png

    [Image source. Click image to open in new window.]


    arxiv1809.10267-f3+f4.png

    [Image source. Click image to open in new window.]
  • The limits of automatic summarisation according to ROUGE (2017)

    “This paper discusses some central caveats of summarisation, incurred in the use of the ROUGE metric for evaluation, with respect to optimal solutions. The task is NP-hard, of which we give the first proof. Still, as we show empirically for three central benchmark datasets for the task, greedy algorithms empirically seem to perform optimally according to the metric. Additionally, overall quality assurance is problematic: there is no natural upper bound on the quality of summarisation systems, and even humans are excluded from performing optimal summarisation.”

[Table of Contents]

Probing the Effectiveness of Extractive Summarization

Content Selection in Deep Learning Models of Summarization (Oct 2018) [code] experimented with deep learning models of summarization across the domains of news, personal stories, meetings, and medical articles in order to understand how content selection is performed. The authors found that many sophisticated features of state of the art extractive summarizers did not improve performance over simpler models, suggesting that it is easier to create a summarizer for a new domain than previous work suggests, and bringing into question the benefit of deep learning models for summarization in those domains that do have massive datasets (i.e., news). At the same time, they raised important questions for new research in summarization; namely, that new forms of sentence representations, or external knowledge sources better suited to the summarization task, are needed.

arxiv1810.12343-t1.png

[Image source. Click image to open in new window.]


“PubMed. We created a corpus of 25,000 randomly sampled medical journal articles from the PubMed Open Access Subset. We only included articles if they were at least 1000 words long and had an abstract of at least 50 words in length. We used the article abstracts as the ground truth human summaries.”

arxiv1810.12343-t2.png

[Image source. Click image to open in new window.]


arxiv1810.12343-t3+t4.png

[Image source. Click image to open in new window.]


arxiv1810.12343-t5+t6.png

[Image source. Click image to open in new window.]



Abstractive Summarization

In Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond (Aug 2016) [non-author code], researchers at IBM Watson (Ramesh Nallapati et al.) and the Université de Montréal (Caglar Gulcehre) modeled abstractive text summarization using attentional encoder-decoder recurrent neural networks.

arxiv1602.06023-f3.png

[Image source. Click image to open in new window.]


That approach was extended by Richard Socher and colleagues at SalesForce in A Deep Reinforced Model for Abstractive Summarization (Nov 2017), which described a sophisticated, highly performant reinforcement learning-based system for abstractive text summarization that set the state of the art in this domain, circa mid-2017.

Socher’s work also used an attention mechanism and a new machine learning objective to address the “repeating phrase” problem, via:

  • an intra-temporal attention mechanism in the bidirectional long short-term memory (Bi-LSTM) encoder that recorded previous attention weights for each of the input tokens (words), while a sequential intra-attention model in the LSTM decoder took into account which words had already been generated by the decoder – i.e., an encoder-decoder network; and,

  • a new objective function that combined the maximum-likelihood cross entropy loss used in prior work with rewards from policy gradient reinforcement learning, to reduce exposure bias.

    arxiv1705.04304-f1.png

    [Image source. Click image to open in new window.]


    arxiv1705.04304-t1+t2+t3.png

    [Image source. Click image to open in new window.]


The encoder-decoder employed in Socher's work allowed the model to generate new words that were not part of the input article, while the copy-mechanism allowed the model to copy over important details from the input even if these symbols were rare in the training corpus. At each decoding step the intra-temporal attention function attended over specific parts of the encoded input sequence in addition to the decoder’s own hidden state and the previously-generated word. This kind of attention prevented the model from attending over the same parts of the input on different decoding steps. Intra-temporal attention could also reduce the amount of repetition when attending over long documents.

  • While this intra-temporal attention function ensured that different parts of the encoded input sequence were used, the decoder could still generate repeated phrases based on its own hidden states, especially when generating long sequences. To prevent that, the authors incorporated more information about the previously decoded sequence into the decoder. To generate a token (i.e., a word), the decoder used either a token-generation softmax layer or a pointer mechanism to copy rare or unseen tokens from the input sequence; a switch function decided at each decoding step whether to generate or to point (a minimal sketch of this switch appears after this list). [In this regard, note that the probabilistic fastText algorithm could also deal with rare and out-of-vocabulary (OOV) words.]

    • This is a proprietary system and code for this work is unavailable, but there are four Python implementations on GitHub (keyphrase search “Deep Reinforced Model for Abstractive Summarization”), as well as an OpenNMT implementation that also links to GitHub.
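A minimal PyTorch sketch of such a generate-vs-copy switch (shapes noted in comments; this is an illustration of the mechanism, not Socher et al.’s implementation):

```python
import torch
import torch.nn as nn

class CopySwitch(nn.Module):
    """Generate-vs-copy gating: p_gen mixes a vocabulary softmax with the
    attention distribution scattered onto the source tokens' vocabulary ids."""
    def __init__(self, hid_dim, emb_dim, vocab_size):
        super().__init__()
        self.p_gen_layer = nn.Linear(2 * hid_dim + emb_dim, 1)
        self.vocab_layer = nn.Linear(hid_dim, vocab_size)

    def forward(self, context, dec_state, dec_input, attn, src_ids):
        # context, dec_state: (batch, hid); dec_input: (batch, emb)
        # attn: (batch, src_len); src_ids: (batch, src_len), dtype long
        p_gen = torch.sigmoid(self.p_gen_layer(
            torch.cat([context, dec_state, dec_input], dim=1)))    # (batch, 1)
        vocab_dist = torch.softmax(self.vocab_layer(dec_state), dim=1)
        generated = p_gen * vocab_dist
        # Copy: add the remaining (1 - p_gen) probability mass onto the
        # vocabulary ids of the source tokens, weighted by attention.
        return generated.scatter_add(1, src_ids, (1 - p_gen) * attn)
```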

Follow-on work (2018) by Socher and colleagues included Improving Abstraction in Text Summarization (Aug 2018), which proposed two techniques to improve the level of abstraction of generated summaries. First, they decomposed the decoder into a contextual network that retrieved relevant parts of the source document, and a pretrained language model that incorporated prior knowledge about language generation: the contextual network had the sole responsibility of extracting and compacting the source document, whereas the language model was responsible for generating concise paraphrases. Second, they proposed a novelty metric, optimized directly through policy learning (a reinforcement learning reward), to encourage the generation of novel phrases (summary abstraction).

arxiv1808.07913-f1.png

[Image source. Click image to open in new window.]


arxiv1808.07913-t1.png

[Image source. Click image to open in new window.]


arxiv1808.07913-t2.png

[Image source. Click image to open in new window.]


In related work – described by Junyang Lin et al. in Global Encoding for Abstractive Summarization (Jun 2018) [code] – researchers at Peking University developed a model with an encoder similar to that employed in the Socher/Salesforce approach (above), employing a Bi-LSTM decoder that generated summary words. Their approach differed from Socher’s method [not cited] in that Lin et al. fed their encoder output at each time step into a convolutional gated unit, which – with a self-attention mechanism – allowed the encoder output at each time step to become a new representation vector, with further connection to the global source-side information. Self-attention encouraged the model to learn long-term dependencies without adding much computational complexity.

Since the convolutional module could extract n-gram features of the whole source text and self-attention learned the long-term dependencies among the components of the input source text, the gate (based on the output of the CNN and self-attention modules over the source representations from the RNN encoder) could perform global encoding on the encoder outputs. Based on the output of the CNN and self-attention, the logistic sigmoid function output a vector with a value between 0 and 1 at each dimension: if the value was close to 0, the gate removed most of the information at the corresponding dimension of the source representation; if it was close to 1, it retained most of the information. The model thus performed neural abstractive summarization through a global encoding framework, which controlled the information flow from the encoder to the decoder based on the global information of the source context, generating summaries of higher quality while reducing repetition.
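
The dimension-wise gating at the heart of this global encoding framework reduces, in essence, to an element-wise sigmoid filter over the encoder states. A minimal sketch (assuming the CNN/self-attention features have already been projected to the encoder's hidden size; names are mine, not the authors'):

```python
import torch

def global_encoding_gate(enc_outputs, global_features):
    """Filter Bi-LSTM encoder states with global source-side information:
        g = sigmoid(global_features);  h_filtered = g * h.
    A gate value near 0 removes that dimension of the source representation;
    a value near 1 retains it.
    enc_outputs:     [src_len, hidden] RNN encoder states
    global_features: [src_len, hidden] output of the CNN + self-attention module
    """
    gate = torch.sigmoid(global_features)  # per-dimension values in (0, 1)
    return gate * enc_outputs              # element-wise global encoding
```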

arxiv1805.03989-t1+f1.png

[Image source. Click image to open in new window.]


arxiv1805.03989-t2+t3.png

[Image source. Click image to open in new window.]


arxiv1805.03989-t4+f2.png

[Image source. Click image to open in new window.]


Christopher Manning’s group at Stanford University, in collaboration with Google Brain, also employed pointer-generator networks (used by Socher/Salesforce, above) in their well-cited abstractive summarization method, Get to the Point: Summarization with Pointer-Generator Networks. Coauthor Abigail See discussed this work in her excellent post Taming Recurrent Neural Networks for Better Summarization. This approach first used a hybrid pointer-generator network which could copy words from the source text via pointing, aiding accurate reproduction of information while retaining the ability to produce novel words through the generator. The approach then used “coverage” to keep track of what had been summarized, which discouraged repetition. [“Coverage” refers to the coverage vector [see Tu et al., Modeling Coverage for Neural Machine Translation (Aug 2016)], which keeps track of the attention history.]
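
Coverage amounts to one running sum and one penalty term per decoding step. A minimal sketch, assuming the attention distribution for the current step has already been computed (in See et al.'s model the coverage vector is additionally fed back into the attention computation, which is omitted here):

```python
import torch

def coverage_step(coverage, attn_weights):
    """One decoding step of the coverage mechanism.
    coverage:     [src_len] running sum of all previous attention distributions
    attn_weights: [src_len] attention distribution at the current step
    The coverage loss, sum_i min(a_i, c_i), penalizes re-attending to source
    positions that are already well covered, discouraging repetition.
    """
    cov_loss = torch.minimum(attn_weights, coverage).sum()  # repetition penalty
    new_coverage = coverage + attn_weights                  # update the history
    return new_coverage, cov_loss
```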

Although it is a commercial (closed source) project, Primer.ai’s August 2018 blog article Machine-Generated Knowledge Bases introduced an abstractive summarization approach that was applied to create “missing” biographies that should exist in Wikipedia, including an interesting product demonstration. That approach could assist with addressing the information overload associated with the volume of the PubMed/PubMed Central literature; the tools they used (TLDR: Re-Imagining Automatic Text Summarization; Building seq-to-seq Models in TensorFlow (and Training Them Quickly)) could be implemented relatively easily via the approaches described in this REVIEW. Much of that work, for example, is based on See and Manning’s Get To The Point: Summarization with Pointer-Generator Networks approach, discussed in the preceding paragraph; consequently, approaches to reimplement/extend Primer.ai’s Quicksilver abstractive summarization project via seq2seq models with attention are well in hand.

A very interesting and promising project from Google Brain, Generating Wikipedia by Summarizing Long Sequences (Jan 2018) [code  |  OpenReview  |  media], considered English Wikipedia as a supervised machine learning task for multi-document summarization, where the input comprised a Wikipedia topic (title of article) and a collection of non-Wikipedia reference documents, and the target was the Wikipedia article text. They described the first attempt to abstractively generate the first section (lead) of Wikipedia articles conditioned on reference text. They used extractive summarization to coarsely identify salient information, and a neural abstractive model to generate the article. In addition to running strong baseline models on the task, they modified the Transformer architecture to consist only of a decoder, which performed better than RNN and Transformer encoder-decoder models in the case of longer input sequences. They showed that their modeling improvements allowed them to generate entire Wikipedia articles.

arxiv1801.10198-f1.png

[Image source. Click image to open in new window.]


arxiv1801.10198-f4.png

[Image source. Click image to open in new window.]


arxiv1801.10198-f6.png

[Image source. Click image to open in new window.]


Another very interesting paper, Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization (August 2018) [dataset, code, and demo] introduced extreme summarization (XSum), a new single-document summarization task which did not favor extractive strategies and called for an abstractive modeling approach. The idea was to create a short, one-sentence news summary answering the question “What is the article about?”. Their novel abstractive model, conditioned on the article’s topics, was based entirely on CNN. They demonstrated that this architecture captured long-range dependencies in a document and recognized pertinent content, outperforming an oracle extractive system and state of the art abstractive approaches when evaluated automatically, and by humans. The example illustrated in their Fig. 1 is very impressive, indeed (note, e.g., the substitution of “a small recreational plane” with “light aircraft”):

arxiv1808.08745-f2-edited.png

[Image source. Click image to open in new window.]


arxiv-1808.08745

[Image source. Click image to open in new window.]


As can be seen in the image above, the summary is very different from a headline: it draws on information interspersed in various parts of the document, and displays multiple levels of abstraction including paraphrasing, fusion, synthesis, and inference. That work introduced a dataset for the proposed task, built by harvesting online news articles that often included a first-sentence summary. They further proposed a novel deep learning model for the extreme summarization task: unlike most existing abstractive approaches, which relied on RNN-based encoder-decoder architectures, they presented a topic-conditioned neural model which was based entirely on CNN. Convolution layers captured long-range dependencies between words in the document more effectively than RNN, allowing the model to perform document-level inference, abstraction, and paraphrasing. The convolutional encoder associated each word with a topic vector (capturing whether it was representative of the document’s content), while the convolutional decoder conditioned each word prediction on a document topic vector.

Abstractive Summarization:

Additional Reading

  • The Rule of Three: Abstractive Text Summarization in Three Bullet Points (Sep 2018)  [dataset]

    “In this study, we constructed a dataset focused on summaries with three sentences. We annotated and analyzed the structure of the summaries in the considered dataset. In particular, we proposed a structure-aware summarization model combining the summary structure classification model and summary-specific summarization sub-models. Through our experiment, we demonstrated that our proposed model improves summarization performance over the baseline model.”

  • Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization (Sep 2018)

    “In this work, we used a bidirectional encoder-decoder architecture; each of which is a bidirectional recurrent neural network consists of two recurrent layers, one for learning history textual context and the other for learning future textual context. The output of the forward encoder was fed as input into the backward decoder while the output of the backward encoder was fed into the forward decoder. Then, a bidirectional beam search mechanism is used to generate tokens for the final summary one at a time. The experimental results have shown the effectiveness and the superiority of the proposed model compared to the state of the art models. Even though the pointer-generator network has alleviated the OOV problem, finding a way to tackle the problem while encouraging the model to generate summaries with more novelty and high level of abstraction is an exciting research problem. Furthermore, we believe that there is a real need to propose an evaluation metric besides ROUGE to optimize on summarization models, especially for long sequences.”

  • Abstractive Summarization Using Attentive Neural Techniques (Oct 2018) modified and optimized a translation model with self-attention for generating abstractive sentence summaries. The effectiveness of this base model along with attention variants was compared and analyzed in the context of standardized evaluation sets and test metrics. However, those metrics were found to be limited in their ability to effectively score abstractive summaries, and the authors proposed a new approach, based on the intuition that an abstractive model requires an abstractive evaluation.

    • “To improve the quality of summary evaluation, we introduce the “VERT” metric [GitHub], an evaluation tool that scores the quality of a generated hypothesis summary as compared to a reference target summary. …

    • “The effect of modern attention mechanisms as applied to sentence summarization has been tested and analyzed. We have shown that a self-attentive encoder-decoder can perform the sentence summarization task without the use of recurrence or convolutions, which are the primary mechanisms in state of the art summarization approaches today. An inherent limitation of these existing systems is the computational cost of training associated with recurrence. The models presented can be trained on the full Gigaword dataset in just 4 hours on a single GPU. Our relative dot-product self-attention model generated the highest quality summaries among our tested models and displayed the ability of abstracting and reducing complex dependencies. We also have shown that n-gram evaluation using ROUGE metrics falls short in judging the quality of abstractive summaries. The VERT metric has been proposed as an alternative to evaluate future automatic summarization based on the premise that an abstractive summary should be judged in an abstractive manner.”

  • Abstractive Summarization of Reddit Posts with Multi-level Memory Networks (Nov 2018)

    “We address the problem of abstractive summarization in two directions: proposing a novel dataset and a new model. First, we collect Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit. We use such informal crowd-generated posts as text source, because we empirically observe that existing datasets mostly use formal documents as source text such as news articles; thus, they could suffer from some biases that key sentences usually located at the beginning of the text and favorable summary candidates are already inside the text in nearly exact forms. Such biases can not only be structural clues of which extractive methods better take advantage, but also be obstacles that hinder abstractive methods from learning their text abstraction capability. Second, we propose a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of text from different levels of abstraction. With quantitative evaluation and user studies via Amazon Mechanical Turk, we show the Reddit TIFU dataset is highly abstractive and the MMN outperforms the state-of-the-art summarization models.”

    • “Most abstractive summarization methods employ sequence-to-sequence (seq2seq ) models where an RNN encoder embeds an input document and another RNN decodes a summary sentence. Our MMN has two major advantages over seq2seq-based models. First, RNNs accumulate information in a few fixed-length memories at every step regardless of the length of an input sequence, and thus may fail to utilize far-distant information due to vanishing gradient. … Second, RNNs cannot build representations of different ranges, since hidden states are sequentially connected over the whole sequence. This still holds even with hierarchical RNNs that can learn multiple levels of representation. In contrast, our model exploits a set of convolution operations with different receptive fields; hence, it can build representations of not only multiple levels but also multiple ranges (e.g. sentences, paragraphs, and the whole document). …”

    arxiv1811.00783-f3+f4.png

    [Image source. Click image to open in new window.]


    arxiv1811.00783-f1+f5.png

    [Image source. Click image to open in new window.]



[Table of Contents]

KNOWLEDGE GRAPHS

[Table of Contents]

Knowledge Graph Construction and Completion

Knowledge graphs (relational property graphs) model information in the form of entities (nodes/vertices) and the relationships between them (edges). For a formal definition see the “Problem Definition” subsection in Discriminative Predicate Path Mining for Fact Checking in Knowledge Graphs, and for a general overview see Towards a Definition of Knowledge Graphs.

Knowledge graphs (KG) provide semantically structured information that is interpretable by computers – an important property for building more intelligent machines, as well as an important step toward transforming text-based knowledge stores and search engines into semantically-aware question answering services [A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015)].

Knowledge graphs are applicable to a broad spectrum of problems ranging from biochemistry to recommender systems; for example, question answering, structured search, exploratory search, and digital assistants. Examples of the use of KG in the biomedical domain include:

A noteworthy feature of knowledge graphs is the excellent performance of traversal-type queries across entities of diverse types within them. Such queries can be challenging to realize in relational databases because of the cost of computing statements (joins) across different tables.

Cypher queries in Neo4j are easier to write and understand than complex SQL queries in relational database management systems (RDBMS), especially those involving multiple join statements (for example, see pp. 22-23 in The Definitive Guide to Graph Databases for the RDBMS Developer).
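
As a concrete – and entirely hypothetical; the schema, credentials, and property names below are illustrative – example, a multi-hop biomedical traversal that would require several joins in SQL is a single MATCH pattern in Cypher, run here via the official Neo4j Python driver:

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Three-hop traversal: genes sharing a pathway with TP53, and the diseases
# those genes are associated with. The SQL equivalent would join the gene,
# pathway, and disease tables (plus their link tables) four or more times.
CYPHER = """
MATCH (g:Gene {symbol: $symbol})-[:PARTICIPATES_IN]->(p:Pathway)
      <-[:PARTICIPATES_IN]-(g2:Gene)-[:ASSOCIATED_WITH]->(d:Disease)
RETURN DISTINCT d.name AS disease
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(CYPHER, symbol="TP53"):
        print(record["disease"])
```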

Learning on graphs represents a recent and exciting extension of machine learning; for a good review, see The Knowledge Graph as the Default Data Model for Learning on Heterogeneous Knowledge. Although a typical knowledge graph may contain millions of entities and billions of relational facts, it is usually far from complete. Determining the validity of information in a knowledge graph and filling in its missing parts are crucial tasks for many researchers and practitioners.

Traditional fact checking by experts and analysts cannot keep pace with the volume of newly created information. It is therefore important and necessary to enhance our ability to computationally determine whether statements of fact are true or false. This problem may be viewed as a link prediction task. Knowledge graph completion also aims at predicting relations between entities under supervision of the existing knowledge graph, which is an important supplement to relation extraction from plain texts. To address those challenges, a number of knowledge graph completion methods have been developed using link prediction and low-dimensional graph embeddings.

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Aug 2018) [data/code] introduced a joint multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. They created “SciERC,” a dataset that included annotations for all three tasks, and developed a unified framework called Scientific Information Extractor (SciIE) with shared span representations. The multi-task setup reduced cascading errors between tasks and leveraged cross-sentence relations through coreference links. Their multi-task model was better at predicting span boundaries and outperformed previous state of the art scientific IE systems on entity and relation extraction, without using any hand-engineered features or pipeline processing.

arxiv1808.09602-f2.png

[Image source. Click image to open in new window.]


arxiv1808.09602-f1+f3+f4.png

[Image source. Click image to open in new window.]


  • SciIE was able to automatically organize the extracted information from a large collection of scientific articles into a knowledge graph. SciIE supported the construction of a scientific knowledge graph, which they used to analyze information in scientific literature. Their analysis showed the importance of coreference links in making a dense, useful graph. Experiments showed that their multi-task model outperformed previous models in scientific information extraction without using any domain-specific features.

Knowledge Graph Construction and Completion:

Additional Reading

  • Augmenting Compositional Models for Knowledge Base Completion Using Gradient Representations (Nov 2018)

    “Neural models of Knowledge Base data have typically employed compositional representations of graph objects: entity and relation embeddings are systematically combined to evaluate the truth of a candidate Knowledge Base entry. Using a model inspired by Harmonic Grammar, we propose to tokenize triplet embeddings by subjecting them to a process of optimization with respect to learned well-formedness conditions on Knowledge Base triplets. The resulting model, known as Gradient Graphs, leads to sizable improvements when implemented as a companion to compositional models. Also, we show that the “supracompositional” triplet token embeddings it produces have interpretable properties that prove helpful in performing inference on the resulting triplet representations.”

    arxiv1811.01062-f1.png

    [Image source. Click image to open in new window.]


    arxiv1811.01062-t1.png

    [Image source. Click image to open in new window.]

[Table of Contents]

Statistical Relational Learning

A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015) – by Maximilian Nickel, Kevin Murphy, Volker Tresp and Evgeniy Gabrilovich – provides an excellent introduction to machine learning and knowledge graphs, with a focus on statistical relational learning. The authors demonstrated how statistical relational learning can be used in conjunction with machine reading and information extraction methods to automatically build KG.

arxiv1503.00759-f1+f2.png

[Image source. Click image to open in new window.]


arxiv1503.00759-f4.png

[Image source. See Section IV (Latent Feature Models), pp. 6-10 for details. [Essentially, the relationships may be modeled/represented as tensors (vectors), amenable to matrix factorization methods.] Click image to open in new window.]


  • “knowledge graphs have found important applications in question answering, structured search, exploratory search, and digital assistants. We provided a review of state of the art statistical relational learning methods applied to very large knowledge graphs. We also demonstrated how statistical relational learning can be used in conjunction with machine reading and information extraction methods to automatically build such knowledge repositories. As a result, we showed how to create a truly massive, machine-interpretable ‘semantic memory’ of facts, which is already empowering numerous practical applications. …”

In statistical relational learning the representation of an object can contain its relationships to other objects. The data is in the form of a graph, consisting of nodes (entities) and labelled edges (relationships between entities). The main goals of statistical relational learning include prediction of missing edges, prediction of properties of nodes, and clustering nodes based on their connectivity patterns. These tasks arise in many settings such as analysis of social networks and biological pathways.

  • Statistical relational learning (sometimes called relational machine learning, RML) is a subdiscipline of machine learning that is concerned with domain models that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in statistical relational learning use (a subset of) first-order logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming.

    The field of statistical relational learning is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and first-order probabilistic languages (emphasizing the key properties of the languages with which models are represented).

    • Statistical relational learning should not be confused with Semantic Role Labeling (sometimes also called shallow semantic parsing); unfortunately, both are abbreviated as SRL. Semantic role labeling is a process in NLP that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles (i.e., automatically finding the semantic roles of each argument of each predicate in a sentence).

Ontology reasoning is the name given to reasoning over ontologies and knowledge bases: deriving facts that are not explicitly expressed in an ontology or knowledge base. Ontology reasoning describes problem settings where the rules for conducting reasoning are specified alongside the actual information. In 2017 Hohenecker and Lukasiewicz (University of Oxford) presented a novel approach to ontology reasoning that was based on deep learning rather than logic-based formal reasoning [Deep Learning for Ontology Reasoning (May 2017)]. They introduced a new model for statistical relational learning built upon deep recursive neural networks, which easily competed with existing logic-based reasoners on the task of ontology reasoning on “RDFox” (an in-memory RDF triple store, i.e. graph, being developed as a commercial product).

In their 2018 follow-on work, Ontology Reasoning with Deep Neural Networks (Sep 2018), Hohenecker and Lukasiewicz devised a novel model that was closely coupled to symbolic reasoning methods and thus able to learn how to effectively perform basic ontology reasoning. Tests showed that their model learned to perform precise reasoning on a number of diverse inference tasks that required comprehensive deductive proficiencies; furthermore, the suggested model suffered far less from the obstacles that hinder symbolic reasoning.

arxiv1808.07980-f1.png

[Image source. Click image to open in new window.]


arxiv1808.07980-f2.png

[Image source. Click image to open in new window.]


arxiv1808.07980-f3.png

[Image source. Click image to open in new window.]


arxiv1808.07980-f4.png

[Image source. Click image to open in new window.]


[Table of Contents]

Knowledge Graph Embedding

In a radically different approach from the probabilistic methods employed in statistical relational learning, knowledge graph embedding (KGE) aims to represent entities and relations in a knowledge graph as points or vectors in a continuous vector space (Euclidean space) – simplifying manipulation, while preserving the inherent structure of the KG. For recent reviews see:

  • A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications (Feb 2018)

    arxiv1709.07604-f1.png

    [Image source. Click image to open in new window.]


  • Knowledge Graph Embedding: A Survey of Approaches and Applications (Sep 2017; pdf).

    Wang2017Knowledge-f1.png

    [Image source. Click image to open in new window.]


    Wang2017Knowledge-f2+t1.png

    [Image source. Click image to open in new window.]


    Wang2017Knowledge-f4.png

    [Image source. Click image to open in new window.]


A typical KG is represented as a graph built from symbolic triples – (source, relation, target), i.e. (subject, predicate, object), or (head, relation, tail) – whereas KGE methods attempt to represent those symbols (nodes and edges) with their corresponding source, relation, and target vectors – amenable to mathematical processing.

rdbms_vs_graph.png

[Image source (slide 7). Click image to open in new window.]


arxiv1611.05425-f1.png

[Image source. Click image to open in new window.]


arxiv1710.05980-f2.png

[Image source. Click image to open in new window.]


The underlying concept of KGE is that in a knowledge graph each entity can be regarded as a point in a continuous vector space, while relations can be modelled as translation vectors [Expeditious Generation of Knowledge Graph Embeddings (Mar 2018)]. The generated vector representations can be used by machine learning algorithms to accomplish specific tasks.

Restated, KGE aims to represent instances and their relationships as vectors and/or matrices in Euclidean space (On Embeddings as Alternative Paradigm for Relational Learning). The hope is that the geometry of the embedding space will resemble the structure of the data by keeping the instances that participate in the same relationships close to one another in Euclidean space. This in turn allows one to apply standard propositional (logic-based) machine learning tools and retain their scalability, while at the same time preserving certain properties of structured relational data.

  • Euclidean space is a mathematical construct that encompasses the line, the plane, and three-dimensional space as special cases. Its elements are called vectors. Vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars.

    Vectors can be understood in various ways: as arrows, as quantities with magnitude and direction, as displacements, or as points. Euclidean vectors are an example of a vector space. They represent physical quantities such as forces: any two forces (of the same type) can be added to yield a third, and the multiplication of a force vector by a real multiplier is another force vector. In the same vein, but in a more geometric sense, vectors representing displacements in the plane or in three-dimensional space also form vector spaces.

    Vectors are abstract mathematical objects with particular properties, which in some cases can be visualized as arrows. However, vectors in vector spaces do not necessarily have to be arrow-like objects. Vector spaces are the subject of linear algebra and are well characterized by their dimension, which, roughly speaking, specifies the number of independent directions in the space.

  • A recent Microsoft/Google paper, Link Prediction using Embedded Knowledge Graphs (Apr 2018), discussed the differences between symbolic and vector space in KG completion; instead of sampling branching paths in the symbolic space of the original knowledge graph, they performed short sequences of interactive lookup operations in the vector space of an embedded knowledge graph.

KGE [reviewed in An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion (Mar 2017; updated Feb 2018)] has proven to be very effective for the task of knowledge graph completion, where the goal is to identify missing links in the existing knowledge graph [A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015)]. Accordingly, KGE approaches are the current (2018) dominant methodology for knowledge graph link prediction [On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols (2018)]. KGE – fundamentally based on distributed representations – has not only proved to be effective for KG link prediction, but has also helped to improve our understanding and engineering of knowledge representation. A strong advantage of KGE is its scalability (at the expense of a black-box nature and limited reasoning capabilities): KGE has proven to scale to very large knowledge graphs.

A well-known KGE method is TransE – by Antoine Bordes et al. at the Université de Technologie de Compiègne – CNRS, and Jason Weston and Oksana Yakhnenko at Google – which embedded entities and relations into the same space, where the difference between the head and the tail was approximately the relation [Translating Embeddings for Modeling Multi-relational Data (NIPS 2013)].

TransE adopted the principle of geometric translation, formally $\small h + r ≈ t$. While this embedding permits very simple translation-based relational inference, it is too restrictive in dealing with $\small \text{1-to-N}$, $\small \text{N-to-1}$ and $\small \text{N-to-N}$ relations. Furthermore, the representations learned by these models are still not semantically interpretable, which may be a major flaw that limits potential applications (as stated in KSR: A Semantic Representation of Knowledge Graph within a Novel Unsupervised Paradigm).
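
Since TransE reduces to one vector equation, its scoring function and training loss fit in a few lines. A minimal PyTorch sketch (variable names are mine), using the L1 energy $\small \lVert h + r - t \rVert_1$ together with the margin-based ranking loss over corrupted triples commonly used to train translation models:

```python
import torch

def transe_energy(h, r, t):
    """TransE energy ||h + r - t||_1: lower energy = more plausible triple."""
    return torch.norm(h + r - t, p=1, dim=-1)

def margin_ranking_loss(pos_energy, neg_energy, margin=1.0):
    """Push true triples at least `margin` below corrupted (negative) triples,
    where negatives are built by replacing the head or tail with a random entity."""
    return torch.clamp(margin + pos_energy - neg_energy, min=0.0).mean()
```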

With the traditional methods it is almost impossible to extract specific semantics from the geometric points; in TransE, for example, a representation of $\small \text{Table}$ as $\small (0.82, 0.51, …)$ hardly tells us anything meaningful – such as a $\small \text{Table}$ being $\small \text{furniture}$, not being an $\small \text{animal}$, etc. Without semantics, the gap between knowledge and language remains, limiting knowledge applications and natural language understanding (NLU).

One effective solution is to consider two separate embedding spaces for entities and relations. Entities are then mapped into the relation space using relation-specific projections, such as those in TransR [Learning Entity and Relation Embeddings for Knowledge Graph Completion (Jan 2018)]. This mapping strategy, however, causes critical drawbacks. First, when the number of relations is large, the full set of projection matrices is expensive to model. Second, treating each relation separately does not account for the latent structure in the relation space, leading to a waste of resources. An example of such a latent structure is the correlation between the relations “nationality” and “place-of-birth”, as the latter can provide evidence about the former.

While the raw representation of a KG as (head, relation, tail) triples is adequate for storing known knowledge, relating distant entities requires expensive graph traversal, possibly through multiple paths. Knowledge graph completion therefore calls for learning a new representation that supports scalable reasoning; the most successful approach thus far is to embed entities and relations into a continuous vector space, which naturally lends itself to simple algebraic manipulations.

Knowledge Graph Embedding with Multiple Relation Projections (Jan 2018) proposed a new translation-based KGE method called TransF, which was inspired by TransR but did not suffer from those issues. Under TransF, projection matrices are members of a matrix space spanned by a fixed number of matrix bases; a relation-specific projection matrix is characterized by a relation-specific coordinate in that space. Put another way, the relation projection tensor is factorized into the product of a relation coordinate matrix and a basis tensor. Hence, TransF is much more efficient and robust than TransR.
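
The factorization is easy to see in code. A minimal sketch (names are mine) of a TransF-style projection, $\small M_r = \sum_v \alpha_{r,v} B_v$, where the $\small B_v$ are shared basis matrices and $\small \alpha_r$ is the relation's coordinate vector:

```python
import torch

def relation_projection(alpha_r, bases):
    """Build a relation-specific projection matrix as a coordinate-weighted
    sum of shared basis matrices (rather than a free matrix per relation,
    as in TransR):  M_r = sum_v alpha_{r,v} * B_v.
    alpha_r: [num_bases] relation-specific coordinates
    bases:   [num_bases, dim, dim] shared basis tensor
    """
    return torch.einsum('v,vij->ij', alpha_r, bases)

# Entities are then mapped into the relation space before translation:
#   h_r = M_r @ h,  t_r = M_r @ t,  energy = ||h_r + r - t_r||
```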

arxiv1801.08641-f1.png

[Image source. Click image to open in new window.]


arxiv1801.08641-t2+t3.png

[Image source. Click image to open in new window.]


arxiv1801.08641-f2.png

[Image source. Click image to open in new window.]


While the KG structures used for representation learning and other purposes generally represent (subject, relation, object) relational triples, knowledge bases often contain a wide variety of data types beyond these direct links. Apart from relations to a fixed set of entities, KB often include not only numerical attributes (e.g. ages, dates, financial figures) but also textual attributes (e.g. names, descriptions, and titles/designations) and images (profile photos, flags, posters, etc.). These different types of data can play a crucial role as extra pieces of evidence for knowledge base completion. For example, the textual descriptions and images might provide evidence for a person’s age, profession, and designation.

arxiv1809.01341-f1.png

[Image source. Click image to open in new window.]


Embedding Multimodal Relational Data for Knowledge Base Completion (Sep 2018) [code] proposed multimodal knowledge base embeddings (MKBE) that used different neural encoders for this variety of observed data, combining them with existing relational models to learn embeddings of the entities and multimodal data. Furthermore, those learned embeddings and different neural decoders were used to develop a novel multimodal imputation model to generate missing multimodal values (like text and images) from information in the knowledge base.
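
A rough sketch of the MKBE idea follows (illustrative only – this is not the authors' code, and the encoder/scorer choices here, such as a GRU text encoder and a DistMult-style scorer, are assumptions for readability): each attribute type gets its own neural encoder, and the resulting object embedding is scored against the (subject, relation) pair by a standard relational model.

```python
import torch
import torch.nn as nn

class MultimodalKBESketch(nn.Module):
    """Per-modality encoders feeding a DistMult-style relational scorer."""
    def __init__(self, num_entities, num_relations, dim):
        super().__init__()
        self.entity   = nn.Embedding(num_entities, dim)      # ordinary KB entities
        self.relation = nn.Embedding(num_relations, dim)
        self.numeric  = nn.Linear(1, dim)                    # e.g. ages, dates
        self.text     = nn.GRU(dim, dim, batch_first=True)   # e.g. descriptions

    def encode_object(self, kind, value):
        """Encode the triple's object according to its modality."""
        if kind == "entity":
            return self.entity(value)
        if kind == "numeric":
            return self.numeric(value)     # value: [batch, 1] float tensor
        if kind == "text":
            _, h = self.text(value)        # value: [batch, len, dim] token embeddings
            return h.squeeze(0)
        raise ValueError(f"unknown modality: {kind}")

    def score(self, subj, rel, obj_embedding):
        # Trilinear (DistMult-style) score: higher = more plausible triple
        return (self.entity(subj) * self.relation(rel) * obj_embedding).sum(-1)
```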

arxiv1809.01341-f2.png

[Image source. Click image to open in new window.]


Most previous research in KG completion has focused on the problem of inferring missing entities and missing relation types between entities. However, in addition to these, many KG also suffer from missing entity types (i.e. the category labels for entities, such as “/music/artist”). Learning Entity Type Embeddings for Knowledge Graph Completion (Nov 2017; pdf) [code] addressed this issue, proposing a novel approach to entity type prediction.

Moon-2017-entity_embeddings.png

[Image source. Click image to open in new window.]


DOLORES: Deep Contextualized Knowledge Graph Embeddings (Nov 2018) introduced a new method, DOLORES, for learning knowledge graph embeddings that effectively captured contextual cues and dependencies among entities and relations. Short paths on knowledge graphs comprising chains of entities and relations can encode valuable information regarding their contextual usage. The authors operationalized this notion by representing knowledge graphs not as a collection of triples but as a collection of entity-relation chains, and learned embeddings for entities and relations using deep neural models that captured such contextual usage. Their model was based on bidirectional LSTMs, and learned deep representations of entities and relations from constructed entity-relation chains. They showed that these representations could very easily be incorporated into existing models, significantly advancing the state of the art on several knowledge graph prediction tasks like link prediction, triple classification, and missing relation type prediction (in some cases by at least 9.5%).

For a related paper by the same authors contemporaneously released on arXiv, see MOHONE: Modeling Higher Order Network Effects in Knowledge Graphs via Network Infused Embeddings.

arxiv1811.00147-f1.png

[Image source. Click image to open in new window.]


arxiv1811.00147-f2.png

[Image source. Click image to open in new window.]


arxiv1811.00147-t1+t2.png

[Image source. Click image to open in new window.]


arxiv1811.00147-f3+t3+t4.png

[Image source. Click image to open in new window.]


Knowledge graph embedding aims at modeling entities and relations with low-dimensional vectors. Most previous methods require that all entities be seen during training, which is impractical for real-world knowledge graphs, in which new entities emerge on a daily basis. Recent efforts on this issue suggest training a neighborhood aggregator in conjunction with the conventional entity and relation embeddings, which may help embed new entities inductively via their existing neighbors. However, those neighborhood aggregators neglect the unordered and unequal natures of an entity’s neighbors. To this end, Logic Attention Based Neighborhood Aggregation for Inductive Knowledge Graph Embedding (Nov 2018) summarized the desired properties that may lead to effective neighborhood aggregators, and introduced a novel aggregator, the Logic Attention Network (LAN), which addressed those properties by aggregating neighbors with both rules-based and network-based attention weights. By comparing with conventional aggregators on two knowledge graph completion tasks, they experimentally validated LAN’s superiority in terms of the desired properties.

arxiv1811.01399-f1+f2.png

[Image source. Click image to open in new window.]


arxiv1811.01399-t3+t4+t5.png

[Image source. Click image to open in new window.]


arxiv1811.01399-t6.png

[Image source. Click image to open in new window.]


[Table of Contents]

Probing the Effectiveness of Embedding Models for Knowledge Graph Completion

Knowledge bases (graphs) contribute to many artificial intelligence tasks, yet they are often incomplete. To add missing facts to a given knowledge base, various embedding models have been proposed. Perhaps surprisingly, relatively simple models with limited expressiveness often performed remarkably well under the most commonly used evaluation protocols. On Evaluating Embedding Models for Knowledge Base Completion (Oct 2018) explored whether recent embedding models work well for knowledge base completion tasks, and argued that the current evaluation protocols are more suited to question answering than to knowledge base completion. They showed that, when using an alternative evaluation protocol more suitable for knowledge base completion, the performance of all models was unsatisfactory – indicating the need for more research into embedding models and evaluation protocols for knowledge base completion.

arxiv1810.07180-f1.png

[Image source. Click image to open in new window.]


arxiv1810.07180-t2+t3.png

[Image source. Click image to open in new window.]


arxiv1810.07180-t4.png

[Image source. Click image to open in new window.]


“Current Evaluation Protocols. Most studies use the triple classification (TC) and the entity ranking (ER) protocols to assess model performance, where ER is arguably the most widely adopted protocol. We assume throughout that only true but no false triples are available (as is commonly the case), and that these triples are divided into training, validation, and test triples. The union of these three sets acts as a proxy of the entire KB, which is unknown due to incompleteness.”

“Triple classification (TC). The goal of TC is to test the model’s ability to discriminate between true and false triples [Socher et al., Reasoning With Neural Tensor Networks for Knowledge Base Completion (2013)]. Since only true triples are available in practice, pseudo-negative triples are generated by randomly replacing either the subject or the object of each test triple by another random entity that appears as a subject or object, respectively. Each resulting triple is then classified as positive or negative. In particular, triple $\small (i,k,j)$ is classified as positive if its score $\small s(i,k,j)$ exceeds a relation-specific decision threshold $\small \sigma_k$ (learned on validation data using the same procedure). Model performance is assessed by its accuracy, i.e., how many triples are classified correctly.”
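
In code, the TC protocol is a thresholded accuracy computation. A minimal NumPy sketch (names are mine), assuming the relation-specific thresholds $\small \sigma_k$ have already been tuned on validation data:

```python
import numpy as np

def triple_classification_accuracy(scores, labels, relations, thresholds):
    """Classify triple (i, k, j) as positive iff s(i, k, j) > sigma_k.
    scores:     [n] model scores for the test triples and their pseudo-negatives
    labels:     [n] 1 for true test triples, 0 for pseudo-negative triples
    relations:  [n] relation id k of each triple
    thresholds: dict mapping relation id k -> decision threshold sigma_k
    """
    sigma = np.array([thresholds[k] for k in relations])
    predictions = (scores > sigma).astype(int)
    return (predictions == labels).mean()  # TC accuracy
```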

Socher2013reasoning-f1.png

[Image source. Click image to open in new window.]


Socher2013reasoning-f2.png

[Image source. Click image to open in new window.]


Socher2013reasoning-f5.png

[Image source. Click image to open in new window.]


“Entity ranking (ER). The goal of ER is to assess model performance in terms of ranking answers to certain questions. In particular, for each test triple $\small t = (i,k,j)$, two questions $\small q_s = (?,k,j)$ and $\small q_o = (i,k,?)$ are generated. For question $\small q_s$, all entities $\small i’ \in \mathcal{E}$ are ranked based on the score $\small s(i’,k,j)$. To avoid misleading results, entities $\small i’$ that correspond to observed triples in the dataset (i.e., $\small (i’,k,j)$ in train/validate) are discarded to obtain a filtered ranking. The same process is applied for question $\small q_o$. Model performance is evaluated based on the recorded positions of the test triples in the filtered ranking. The intuition is that models that rank test triples (which are known to be true) higher are expected to be superior. Usually, the micro-average of $\small \text{filtered Hits@K}$ (i.e., the proportion of test triples ranking in the top-$\small K$) and $\small \text{filtered MRR}$ (i.e., the mean reciprocal rank of the test triples) are reported. Figure 1a provides a pictorial view of ER for a single relation. Given the score matrix of a relation $\small k$, where $\small s_{ij}$ is the score of triple $\small (i,k,j)$, a single test triple is shown in green, all candidate triples considered during the evaluation are shown in blue, and all triples observed in the training, validation and testing sets (not considered during evaluation) are shown in grey.”
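
Filtered ranking is the mechanical core of this protocol. A minimal NumPy sketch (names are mine) of the filtered rank for a single question such as $\small (?,k,j)$, from which $\small \text{filtered MRR}$ (mean of 1/rank) and $\small \text{Hits@K}$ (fraction of ranks $\small \leq K$) follow directly:

```python
import numpy as np

def filtered_rank(scores, true_idx, known_idx):
    """Filtered rank of the test answer for one question, e.g. (?, k, j).
    scores:    [num_entities] model scores for every candidate entity
    true_idx:  index of the test triple's answer entity
    known_idx: entities answering the question via triples already observed
               in train/validation; these are filtered out of the ranking
    Returns the 1-based filtered rank (ties broken optimistically here).
    """
    filtered = np.zeros_like(scores, dtype=bool)
    filtered[list(known_idx)] = True
    filtered[true_idx] = False                       # never filter the test answer
    better = (scores > scores[true_idx]) & ~filtered
    return 1 + better.sum()

# filtered MRR    = mean over questions of 1 / filtered_rank(...)
# filtered Hits@K = fraction of questions with filtered_rank(...) <= K
```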

“Discussion. Regarding triple classification, Wang, Ye, and Gupta (Apr 2018) found that most models achieve an accuracy of at least 93%. This is due to the fact that negative triples with high score are rarely sampled as pseudo-negative triples because of the large number of entities from which the single replacement entity is picked for a given test triple. This means that most classification tasks are ‘easy.’ Consequently, the accuracy of triple classification overestimates model performance for KBC tasks. This protocol is less adopted in recent work. We argue that ER also overestimates model performance for KBC. In particular, the protocol is more appropriate to evaluate question answering tasks. Since ER generates questions from true test triples, it only asks questions that are known to have an answer. The question itself leaks this information from the test data into the evaluation. …”

“Entity-pair ranking (PR). To study the overestimation effect of current evaluation protocols, we used an evaluation protocol for KBC termed entity-pair ranking (PR). PR is simple: for each relation $\small k$, we ask question $\small (?,k,?)$. As before, we use the model to rank all answers, i.e., pairs of entities, and filter out training and validation data in the ranking so as to rank only triples not used during model training. In this way, any negative triples with a high score will appear at a top position, making it harder for true triples to rank high. Figure 1b shows the contrast between the number of negative triples considered for entity-pair ranking and those considered for ER. Again, test triples are shown in green, candidate triples are shown in blue, and triples observed during training and validation are shown in grey. The number of candidates is much higher than those considered for ER. However, when answering the question $\small (?,k,?)$ with all possible entity pairs, all test triples for relation $\small k$ will be ranked simultaneously. Let $\small |\mathcal{T}_k|$ be the number of test triples in $\small k$. ER needs to consider in total $\small 2 \cdot |\mathcal{T}_k| \cdot |\mathcal{E}|$ candidates for $\small k$, while PR needs to consider $\small |\mathcal{E}|^2$ candidates. Since all test triples in relation $\small k$ are considered at once, we do not rely on MRR for PR, but consider weighted $\small \text{MAP@K}$, i.e., weighted mean average precision in the top-$\small K$ filtered results, and weighted $\small \text{Hits@K}$, i.e., weighted percentage of test triples in the top-$\small K$ filtered results. …”

Knowledge Graph Embedding:

Additional Reading

[Table of Contents]

Knowledge Graph Edge Semantics

Here I collate and summarize/paraphrase “edge-related” discussion from elsewhere in this REVIEW.

  • A major problem in current graph neural network models such as graph attention networks (GAT) and graph convolutional networks (GCN) is that edge features – which contain important information about the graph – are not incorporated. Adaptive Edge Features Guided Graph Attention Networks (Sep 2018) proposed a graph learning model, edge features guided graph attention networks (EGAT). Guided by the edge features, the attention mechanism on pairs of graph nodes not only depended on node contents but also adjusted automatically with respect to the properties of the edge connecting those nodes. Moreover, the edge features were adjusted by the attention function and fed to the next layer, which meant that the edge features were adaptive across network layers.

  • Exploring Graph-structured Passage Representation for Multi-hop Reading Comprehension with Graph Neural Networks (Sep 2018) introduced a new method for better connecting the global evidence present in knowledge graphs, to form more complex graphs as compared to directed acyclic graphs (DAG). After obtaining representation vectors for question and entity mentions in passages, an additive attention model treated all entity mention representations and the question representation as the memory and the query, respectively. Footnote 2 was of interest: “The concurrent unpublished work (Cao et al., 2018) also investigates the usage of graph convolution networks on WikiHop. Our work proposes a different model architecture, and focus more on the exploration and comparison of multiple edge types for building the graph-structured passage representation.”

  • While representation learning on networks generates reliable results with regard to node embeddings, it is limited to homogeneous networks in which all nodes and edges are of the same type. Addressing this challenge, edge2vec: Learning Node Representation using Edge Semantics (Sep 2018) proposed a model that incorporated edge semantics to represent different edge types in heterogeneous networks. Their model generated an edge-type transition matrix as an extra criterion of a biased node random walk on networks, and a biased skip-gram model was then leveraged to learn node embeddings based on the random walks. During the generation of the node random walk corpus, the heterogeneity of the network was taken into account. By considering edge semantics, edge2vec significantly outperformed other state of the art models on all three evaluation tasks [note that in their tables, edge2vec is listed as heterogeneous node2vec].

  • mvn2vec: Preservation and Collaboration in Multi-View Network Embedding (Jan 2018) focused on characteristics that were specific and important in embedding multi-view networks (a multi-view network consists of multiple network views, where each view corresponds to a type of edge, and all views share the same set of nodes). With respect to edges, two comments are relevant, here:

    • [re: Collaboration:] In some datasets, edges between the same pair of nodes may be observed in different views due to shared latent reasons. For instance, if nodes in two views may complement (interact with) each other in various social media contexts, embedding them jointly may potentially yield better results than embedding them independently.

    • [re: Preservation:] On the other hand, it is possible for different network views to have different semantic meanings; it is also possible that a portion of nodes have completely disagreeing edges in different views since edges in different views are formed due to distinct latent reasons.

[Table of Contents]

Multi-View / Multi-Layer Graph Embedding

It is intuitive to represent – and relatively easy to model – metabolic pathways, experiments, etc. as graph structures. Since metabolism is a dynamic, temporal process, the graph should also reflect/represent various states – e.g. metabolic states (enzyme isoforms; metabolite concentrations; etc.), or health status (healthy; diabetic; breast cancer; etc.) – perhaps through different graph representations. Likewise, metadata encoded in a graph representing an experiment could reflect changes in experimental variables over time, or in response to different experimental conditions (controls vs. experiments). Thus, depending on the state, we could have graphs that at different times differ in the data represented in the nodes and/or the connections between them (different paths or altered strengths of those connections).

Relevant to the embedding of temporal relations in graphs are the following:

  • Learning Sequence Encoders for Temporal Knowledge Graph Completion (Sep 2018) [code] [Summary]“Research on link prediction in knowledge graphs has mainly focused on static multi-relational data. In this work we consider temporal knowledge graphs where relations between entities may only hold for a time interval or a specific point in time. In line with previous work on static knowledge graphs, we propose to address this problem by learning latent entity and relation type representations. To incorporate temporal information, we utilize recurrent neural networks to learn time-aware representations of relation types which can be used in conjunction with existing latent factorization methods. The proposed approach is shown to be robust to common challenges in real-world KGs: the sparsity and heterogeneity of temporal expressions. Experiments show the benefits of our approach on four temporal KGs.”

  • dyngraph2vec: Capturing Network Dynamics using Dynamic Graph Representation Learning (Sep 2018) [Summary]Learning graph representations is a fundamental task aimed at capturing various properties of graphs in vector space. The most recent methods learn such representations for static networks. However, real world networks evolve over time and have varying dynamics. Capturing such evolution is key to predicting the properties of unseen networks. To understand how the network dynamics affect the prediction performance, we propose an embedding approach which learns the structure of evolution in dynamic graphs and can predict unseen links with higher precision. Our model, dyngraph2vec, learns the temporal transitions in the network using a deep architecture composed of dense and recurrent layers. We motivate the need of capturing dynamics for prediction on a toy data set created using stochastic block models. We then demonstrate the efficacy of dyngraph2vec over existing state-of-the-art methods on two real world data sets. We observe that learning dynamics can improve the quality of embedding and yield better performance in link prediction.

  • [In the NLP domain:] Word-Level Loss Extensions for Neural Temporal Relation Classification

  • Dynamic Graph Neural Networks (Oct 2018)

    “… graph neural networks, which extend the neural network models to graph data, have attracted increasing attention. Graph neural networks have been applied to advance many different graph related tasks such as reasoning dynamics of the physical system, graph classification, and node classification. Most of the existing graph neural network models have been designed for static graphs, while many real-world graphs are inherently dynamic. For example, social networks are naturally evolving as new users joining and new relations being created. Current graph neural network models cannot utilize the dynamic information in dynamic graphs. However, the dynamic information has been proven to enhance the performance of many graph analytical tasks such as community detection and link prediction. Hence, it is necessary to design dedicated graph neural networks for dynamic graphs. In this paper, we propose DGNN (Dynamic Graph Neural Network) model, which can model the dynamic information as the graph evolving. In particular, the proposed framework can keep updating node information by capturing the sequential information of edges, the time intervals between edges and information propagation coherently.”

    arxiv1810.10627-f1.png

    [Image source. Click image to open in new window.]


    arxiv1810.10627-f2+f3.png

    [Image source. Click image to open in new window.]


Cellular growth and development provides an excellent example: a totipotent fertilized egg or a pluripotent stem cell can give rise to any other cell type – blood, liver, heart, brain, etc. – each arising from differential expression of the identical genome encoded within these cells throughout the lineage of those cell types.

This cellular differentiation arises largely due to selective epigenetic modification of the genome during growth and development.

It would be exceptionally useful to be able to access multiple “views” of those types of graphs, depending on the data, metadata and embeddings.

Recent work in knowledge graph embedding, such as TransE, TransH and TransR, has shown competitive and promising results in relational learning. Non-Parametric Estimation of Multiple Embeddings for Link Prediction on Dynamic Knowledge Graphs (2017) [local copy] proposed a novel extension [puTransE: Parallel Universe TransE] of the translational embedding model to solve three main problems of the current models:

  • translational models are highly sensitive to hyperparameters such as margin and learning rate;

  • the translation principle only allows one spot in vector space for each “golden” triplet; thus, congestion of entities and relations in vector space may reduce precision;

  • current models are not able to handle dynamic data, especially the introduction of new unseen entities/relations or removal of triplets.

puTransE explicitly generated multiple embedding spaces via semantically and structurally aware triplet selection schemes, and non-parametrically estimated the energy score of a triplet. The intuition for the approach was that in every “parallel universe” embedding space, a constraint on triplets was imposed in terms of count and diversity, such that each embedding space observed the original knowledge graph from a different view. The proposed puTransE approach was simple, robust and parallelizable. Experimental results showed that puTransE outperformed TransE and many other embedding methods for link prediction on knowledge graphs, on both a public benchmark dataset and a real-world dynamic dataset.
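
The following minimal sketch (names are my own, and the aggregation is an assumption: per-space energies are combined by taking the best score, which matches the non-parametric spirit of the method; consult the paper for the exact estimator) shows how scoring across “parallel universe” embedding spaces might look:

```python
import torch

def putranse_score(h_id, r_id, t_id, spaces):
    """Score a triple across independently trained embedding spaces.
    Each 'universe' was trained on a different, diversity-constrained sample
    of triples, so each views the original KG from a different perspective.
    spaces: list of (entity_emb, relation_emb) tensor pairs, one per universe
    """
    scores = []
    for ent, rel in spaces:
        energy = torch.norm(ent[h_id] + rel[r_id] - ent[t_id], p=1)
        scores.append(-energy.item())  # negate: higher score = more plausible
    return max(scores)                 # best-scoring universe wins
```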

Tay2017nonparametric-f1+f2+t2+t4.png

[Image source. Click image to open in new window.]


KSR: A Semantic Representation of Knowledge Graph within a Novel Unsupervised Paradigm (May 2018) proposed a semantic representation method for knowledge graphs (KSR), which imposed a two-level hierarchical generative process that globally extracted many aspects and then locally assigned a specific category in each aspect for every triple. Notably, they introduced “knowledge features” (i.e. views for clustering) to describe knowledge semantic aspects, such as being a university or not (University), geographical position (Location), etc. By gathering the information of the clusters in each view/feature, the semantic representations were formed. The first-level process generated knowledge features (views) with different semantics such as University and Location, while the second-level process grouped the entities/relations/triples according to the corresponding semantic features (views). Finally, by summarizing the cluster identifications within each feature/view, KSR constructed the semantic representation of the knowledge elements. For example, for “Tsinghua University” a Yes category (cluster identification) is assigned to the University feature, while the Beijing category/cluster identification is assigned to the Location feature. By exploiting multi-view clustering, this knowledge was semantically organized as $\small \color{Brown}{\text{Tsinghua University = (University:Yes, Location:Beijing)}}$.

arxiv1608.07685-f1+f2.png

arxiv1608.07685-f3.png

arxiv1608.07685-t1+t2.png

[Image source. Click image to open in new window.]


Multi-view Clustering with Graph Embedding for Connectome Analysis (2017) from the University of Illinois at Chicago addressed multi-view clustering on graph instances with their model, Multi-view Clustering framework on graph instances with Graph Embedding (MCGE), in which they modeled multi-view graph data as tensors and applied tensor factorization to learn the multi-view graph embeddings, thereby capturing the local structure of the graphs. They built an iterative framework that incorporated multi-view graph embedding into the multi-view clustering task on graph instances, performing the two jointly: the multi-view clustering results were used to refine the multi-view graph embedding, and the updated multi-view graph embedding in turn further improved the multi-view clustering.

MCGE.png

[Image source. Click image to open in new window.]


Extensive experiments on two brain network datasets (HIV and Bipolar) demonstrated the superior performance of the proposed MCGE approach in multi-view connectome analysis for clinical investigation and application. In a simple two-view example, given fMRI and DTI brain networks of five subjects, MCGE aimed to learn a multi-view graph embedding for each of them and to cluster these subjects into different groups based on the obtained multi-view graph embeddings. In the HIV and bipolar brain networks (their Figs. 4 and 5, respectively), differential brain region clustering maps of normal controls and patients could be compared. Each color-coded node represented a brain region, and each edge indicated the correlation between two brain regions. Nodes of the normal brain network were well grouped into several clusters, while nodes in the HIV brain network were less coherent. Additionally, for the normal control the edges within each cluster were much more intense than the edges across different clusters.

  • This work was cited in earlier work (also in 2017) by the same authors, Multi-view Graph Embedding with Hub Detection for Brain Network Analysis (Sep 2017). In that paper, they proposed incorporating the hub detection task into the multi-view graph embedding framework so that the two tasks could benefit each other, via an auto-weighted framework of Multi-view Graph Embedding with Hub Detection (MVGE-HD) for brain network analysis (again, HIV and bipolar brain networks). The MVGE-HD framework learned a unified graph embedding across all the views while reducing the potential influence of the hubs on blurring the boundaries between node clusters in the graph, thus leading to a clear and discriminative node clustering structure for the graph.

    arxiv1709.03659-f1+f2+f3.png

    [Image source. Click image to open in new window.]


  • Follow-on work (2018) by those authors, Multi-View Multi-Graph Embedding for Brain Network Clustering Analysis (Jun 2018), proposed Multi-view Multi-graph Embedding (M2E), stacking multi-graphs into multiple partially-symmetric tensors and using tensor techniques to simultaneously leverage the dependencies and correlations among multi-view and multi-graph brain networks. Although there had been work on single-graph embedding and multi-view learning, there had been no embedding method which enabled preserving multi-graph structures (i.e. taking multiple graphs as input) on multiple views. The goal of M2E was to find low-dimensional representations from multi-view multi-graph data which reveal patterns and structures among the brain networks. Experiments (again) on real HIV and bipolar disorder brain network datasets demonstrated the superior performance of M2E on clustering brain networks, by leveraging the multi-view multi-graph interactions.

    arxiv-1806.07703.png

    [Image source. Click image to open in new window.]


    arxiv1806.07703-f2.png

    [Image source. Click image to open in new window.]


    • Critique.

      While these authors compare the clustering accuracy of a number of baselines against M2E (“We compare the proposed M2E with eight other methods for multi-view clustering on brain networks.”), in a rather suspect omission they do not include their own earlier MCGE or MVGE-HD models (cited in this paper as Ma et al. 2017a and Ma et al. 2017b, respectively) among those baselines.

For many real-world systems, multiple types of relations are naturally represented by multiple networks. However, existing network embedding methods mainly focus on single-network embedding and neglect the information shared among different networks. For example, in social networks relationships between people may include friendships, money transfers, colleagues, etc. One simple solution for multi-network embedding is to summarize multiple networks into a single network and apply a single-network embedding method on the integrated network. Deep Feature Learning of Multi-Network Topology for Node Classification (Sep 2018) proposed a novel multiple network embedding method based on a semi-supervised autoencoder, named DeepMNE, which captured the complex topological structures of multi-networks while taking the correlation among multi-networks into account.

The DeepMNE algorithm mainly comprised two parts: learning topological information from each individual network, and learning multi-network-based features. After obtaining the integrated representations of the multi-networks, they trained a machine learning model on the outputs of DeepMNE to classify the nodes. Experimental results on two real-world datasets (yeast and human gene networks) demonstrated the superior performance of the method over four state of the art algorithms on tasks such as accurate prediction of gene function.

arxiv-1809.02394.png

[Image source. Click image to open in new window.]


Existing approaches to learning node representations usually study networks with a single type of proximity between nodes, which defines a single view of a network. However, in reality multiple types of proximities between nodes usually exist, yielding networks with multiple views. An Attention-based Collaboration Framework for Multi-View Network Representation Learning (Sep 2017) [code] studied learning node representations for networks with multiple views, which aimed to infer robust node representations across different views. They proposed a multi-view representation learning approach (MVE), which promoted the collaboration of different views, letting them vote for robust representations. During the voting process, an attention mechanism was introduced, which enabled each node to focus on the most informative views. Experimental results on real-world networks showed that the proposed approach outperformed state of the art approaches for network representation learning (node classification; link prediction) with a single or multiple views.

arxiv-1709.06636c.png

[Image source. Click image to open in new window.]
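
The voting-with-attention idea can be pictured as an attention-weighted combination of per-view embeddings of the same node. A minimal sketch, with shapes assumed for illustration and random attention weights standing in for the weights that MVE learns:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_views, d = 3, 64                          # number of views, embedding size (assumed)
view_embs = rng.normal(size=(n_views, d))   # one embedding of the same node per view

# Per-node attention over views (learned in MVE; random here), letting the
# node focus on its most informative views during the "voting" process.
weights = softmax(rng.normal(size=n_views))

# Robust ("voted") representation: attention-weighted sum of view embeddings.
robust_emb = weights @ view_embs            # shape: (d,)
```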


mvn2vec: Preservation and Collaboration in Multi-View Network Embedding (Jan 2018) focused on characteristics that are specific and important in embedding multi-view networks. They identified two such characteristics – “preservation” and “collaboration” – and explored the feasibility of achieving better embedding quality by simultaneously modeling both. As shown in their Fig. 1a, a multi-view network consists of multiple network views, where each view corresponds to a type of edge, and all views share the same set of nodes.

arxiv-1801.06597.png

[Image source. Click image to open in new window.]


  • This paper and An Attention-based Collaboration Framework for Multi-View Network Representation Learning (Sep 2017) are conceptually similar (compare Fig. 1 in each paper: arXiv:1709.06636 and arXiv:1801.06597). Dr. Jiawei Han, University of Illinois at Urbana-Champaign, is a coauthor on both papers.

  • Re: “Collaboration”: in some datasets, edges between the same pair of nodes may be observed in different views due to shared latent reasons. For instance, nodes in two views may complement (interact with) each other in various social media contexts; embedding them jointly may therefore yield better results than embedding them independently.

  • Re: “Preservation”: on the other hand, it is possible for different network views to have different semantic meanings; it is also possible that a portion of nodes have completely disagreeing edges in different views, since edges in different views are formed for distinct latent reasons. For example, professional relationships may not align well with friendships. If we embed the professional and the friendship views into the same embedding space, the embedding fails to preserve the unique information carried by the different network views. The authors refer to the need for preserving unique information carried by different views as “preservation.”

  • It is also possible for preservation and collaboration to co-exist in the same multi-view network.

Real-world social networks and digital platforms are comprised of individuals (nodes) that are linked to other individuals or entities through multiple types of relationships (links). Sub-networks of such a network based on each type of link correspond to distinct views of the underlying network. In real-world applications each node is typically linked to only a small subset of other nodes; hence, practical approaches to problems such as node labeling have to cope with the resulting sparse networks. While low-dimensional network embeddings offer a promising approach to this problem, most current network embedding methods focus primarily on single view networks. Multi-View Network Embedding Via Graph Factorization Clustering and Co-Regularized Multi-View Agreement (Nov 2018) introduced a novel multi-view network embedding (MVNE) algorithm for constructing low-dimensional node embeddings from multi-view networks. MVNE adapted and extended an approach to single view network embedding (SVNE) using graph factorization clustering (GFC) to the multi-view setting, using an objective function that maximized the agreement between views based on both the local and global structure of the underlying multi-view graph. Experiments with several benchmark real-world single view networks showed that GFC-based SVNE yielded network embeddings that were competitive with or superior to those produced by state of the art single view network embedding methods when the embeddings were used for labeling unlabeled nodes in the networks. Experiments with several multi-view networks showed that MVNE substantially outperformed both the single view methods on the integrated view and the state of the art multi-view methods. Even when the goal was to predict labels of nodes within a single target view, MVNE outperformed its single-view counterpart, suggesting that MVNE was able to extract information useful for labeling nodes in the target view from all of the views.

arxiv1811.02616-t1.png

[Image source. Click image to open in new window.]


arxiv1811.02616-t2+t3.png

[Image source. Click image to open in new window.]


In a multi-view graph embedding, each node is assigned a vector that incorporates data from all views of the graph. Simple methods to create multi-view embeddings include combining multiple views of the graph into one graph using an $\small \text{AND/OR}$ aggregation of the edge sets and embedding the resulting single graph, or embedding each view independently and concatenating the different embeddings obtained for each node. More sophisticated algorithms have been developed based on matrix factorization, tensor factorization, and spectral embedding. Many of these algorithms focus on clustering multi-view graphs, a specific application thereof. High clustering accuracy indicates a good embedding since relative similarity between nodes should be correctly reflected in the embedding. … Ideally, graph embeddings should preserve the distances between the nodes in their respective node embeddings. [Discussion in this paragraph was drawn from the Introduction in Multi-View Graph Embedding Using Randomized Shortest Paths, and the references cited therein.]
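
The simple AND/OR aggregation mentioned above amounts to elementwise boolean operations over the views' adjacency matrices; a minimal sketch (the `embed()` helper named in the final comment is hypothetical):

```python
import numpy as np

# Two views of the same 4-node graph as boolean adjacency matrices.
A1 = np.array([[0,1,0,0],
               [1,0,1,0],
               [0,1,0,1],
               [0,0,1,0]], dtype=bool)
A2 = np.array([[0,1,1,0],
               [1,0,0,0],
               [1,0,0,1],
               [0,0,1,0]], dtype=bool)

A_and = A1 & A2   # edge kept only if present in every view
A_or  = A1 | A2   # edge kept if present in any view

# Either aggregate can then be fed to any single-view embedding method; the
# other simple baseline embeds each view separately and concatenates:
# emb = np.concatenate([embed(A1), embed(A2)], axis=1)   # embed() is hypothetical
```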

Multi-View Graph Embedding Using Randomized Shortest Paths (Aug 2018) [datasets, experimental results] proposed a generalized distance on multi-view graphs called the Common Randomized Shortest Path Dissimilarity (C-RSP), based on the randomized shortest path (RSP) dissimilarity measure on single-view graphs. This algorithm generated a dissimilarity measure between nodes by minimizing the expected cost of a random walk between any two nodes across all views of a multi-view graph, in doing so encoding both the local and global structure of the graph. This led to more accurate graph embeddings, better visualization, and high clustering accuracy.

arxiv1808.06560-f3.png

[Image source. Click image to open in new window.]


arxiv1808.06560-t1.png

[Image source. Click image to open in new window.]


Unsupervised Multi-view Nonlinear Graph Embedding (2018) addressed “multi-view” graph embedding: given a graph with node features, they aimed to learn a network embedding (from each node's network features) and a content embedding (from its content features) simultaneously, in an unsupervised manner. Generally, network structure and node features are considered as two different “views” of a node in the graph; in this work the authors allowed the network and content “views” to reinforce one another, to obtain better graph embeddings. They proposed a simple, effective unsupervised Multi-viEw nonlineaR Graph Embedding (MERGE) model. Inspired by structural deep network embedding, MERGE encoded the nonlinearity of the network/content by taking the network/content features as input, then applying a deep autoencoder to learn a nonlinear network/content embedding for each node. MERGE also preserved second-order proximity by extending DeepWalk to the multi-view setting: MERGE extended DeepWalk to enable one node’s embedding to interpret its “neighbor” node’s content embedding, enforcing cross-instance-cross-view consistency. MERGE consistently outperformed the state of the art baselines over five public datasets including PubMed (their Table 2) with $\small \Theta(\vert V \vert)$ complexity, using one-third the training data.

MERGE-1.png

[Image source. Click image to open in new window.]


MERGE-2.png

[Image source. Note that Table is truncated, here. Click image to open in new window.]


End-to-End Multi-View Networks for Text Classification (Apr 2017) [non-author code here, here and here;  discussion here and here: the OP later deleted their question but the comments are apropos] from the National Research Council Canada proposed a “multi-view network” for text classification. Their MVN model was outperformed by Richard Socher/Salesforce’s BCN+Char+CoVe algorithm (see Table 4 in Learned in Translation: Contextualized Word Vectors).

arxiv-704.05907.png

[Image source. Click image to open in new window.]


Hypergraph Neural Networks (Sep 2018) presented a hypergraph neural network (HGNN) framework for data representation learning, which could encode correlated, high-order data in a hypergraph structure. Confronting the challenge of representation learning for complex data, they proposed incorporating those data in a hypergraph, which is more flexible for modeling complex data. A hyperedge convolution operation handled the data correlation during representation learning, permitting traditional hypergraph learning procedures to be carried out efficiently via hyperedge convolution operations. HGNN was thus able to learn the hidden layer representation (the high-order data structure).

While graphs are effective in modeling complex relationships, in many scenarios a single graph is rarely sufficient to succinctly represent all interactions, and hence multi-layered graphs have become popular. Though this leads to richer representations, extending solutions from the single-graph case is not straightforward. Consequently, there is a strong need for novel solutions to classical problems, such as node classification, in the multi-layered case. Attention Models with Random Features for Multi-layered Graph Embeddings (Oct 2018) considered the problem of semi-supervised learning with multi-layered graphs. Though deep network embeddings such as DeepWalk are widely adopted for community discovery, these authors argued that feature learning with random node attributes – using graph neural networks – can be more effective. They proposed the use of attention models for effective feature learning, and developed two novel architectures, GrAMME-SG and GrAMME-Fusion, that exploited inter-layer dependencies for building multi-layered graph embeddings. Empirical studies on several benchmark datasets demonstrated significant performance improvements in comparison to state of the art network embedding strategies. The results also showed that using simple random features is an effective choice, even in cases where explicit node attributes are not available.

arxiv-1810.01405a.png

[Image source. Click image to open in new window.]


arxiv-1810.01405b.png

[Image source. Click image to open in new window.]


In very interesting work, Harada et al. [Dual Convolutional Neural Network for Graph of Graphs Link Prediction (Oct 2018)] proposed the use of graphs of graphs (GoG) for feature extraction from graphs. A GoG consists of an external graph and internal graphs, where each node in the external graph has an internal graph structure. They proposed a dual CNN that (i) extracted node representations by combining the external and internal graph structures in an end-to-end manner, and (ii) efficiently learned low-dimensional representations of the GoG nodes. Experiments on link prediction tasks using several chemical network datasets demonstrated the effectiveness of the proposed method.

arxiv-1810.02080.png

[Image source. Click image to open in new window.]


A Recurrent Graph Neural Network for Multi-Relational Data (Nov 2018) [code] introduced a graph recurrent neural network (GRNN) for scalable semi-supervised learning from multi-relational data. Key aspects of the novel GRNN architecture were the use of multi-relational graphs, the dynamic adaptation to the different relations via learnable weights, and the consideration of graph-based regularizers to promote smoothness and alleviate over-parametrization. The goal was to design a powerful learning architecture able to (i) discover complex and highly non-linear data associations, (ii) combine (and select) multiple types of relations, and (iii) scale gracefully with respect to the size of the graph. Numerical tests with real data sets corroborated the design goals and illustrated the performance gains relative to competing alternatives.

arxiv1811.02061-f1+f2.png

[Image source. Click image to open in new window.]


arxiv1811.02061-t1+t2.png

[Image source. Click image to open in new window.]


Multi-View / Multi-Layer Graph Embedding:

Additional Reading

  • Graph Signal Processing
  • Hyperbolic Embeddings
  • Multi-Multi-View Learning: Multilingual and Multi-Representation Entity Typing (Oct 2018) [Summary] “Knowledge bases (KBs) are paramount in NLP. We employ multiview learning for increasing accuracy and coverage of entity type information in KBs. We rely on two metaviews: language and representation. For language, we consider high-resource and low-resource languages from Wikipedia. For representation, we consider representations based on the context distribution of the entity (i.e., on its embedding), on the entity’s name (i.e., on its surface form) and on its description in Wikipedia. The two metaviews language and representation can be freely combined: each pair of language and representation (e.g., German embedding, English description, Spanish name) is a distinct view. Our experiments on entity typing with fine-grained classes demonstrate the effectiveness of multiview learning.”

    [Click image to enlarge.]
    Source: Multi-Multi-View Learning: Multilingual and Multi-Representation Entity Typing

[Table of Contents]

Hypergraphs

Here I collate and summarize/paraphrase hypergraph-related discussion from elsewhere in this REVIEW.



Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) [code | supplementary material] proposed a novel segmental hypergraph representation to model overlapping entity mentions, which are prevalent in many practical datasets. Their model was robust in handling both overlapping and non-overlapping mentions, and was thus able to capture features and interactions that could not be captured by previous models, while maintaining a low time complexity for inference. They also presented a theoretical analysis to formally assess how their representation is better than alternative representations reported in the literature in terms of representational power. Coupled with neural networks for feature learning, their model achieved state of the art performance on three benchmark datasets [including GENIA] annotated with overlapping mentions.

arxiv-1810.01817a.png

[Image source. Click image to open in new window.]


arxiv-1810.01817b.png

[Image source. Click image to open in new window.]


arxiv-1810.01817c.png

[Image source. Click image to open in new window.]


In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018; above), Learning to Recognize Discontiguous Entities (Oct 2018) [project page/code] focused on recognizing discontiguous entities, which may also overlap. They proposed a novel hypergraph representation to jointly encode discontiguous entities of unbounded length, which could overlap with one another. Empirical results showed that their model achieved significantly better results when evaluated on standard data containing many discontiguous entities.

arxiv1810.08579-f1+f2.png

[Image source. Click image to open in new window.]


arxiv1810.08579-f3.png

[Image source. Click image to open in new window.]




Hypergraph Neural Networks (Sep 2018) presented a hypergraph neural network (HGNN) framework for data representation learning, which could encode correlated, high-order data in a hypergraph structure. Confronting the challenge of representation learning for complex data, they proposed incorporating those data in a hypergraph, which is more flexible for modeling complex data. A hyperedge convolution operation handled the data correlation during representation learning, permitting traditional hypergraph learning procedures to be carried out efficiently via hyperedge convolution operations. The convolution in the spectral domain was conducted with the hypergraph Laplacian, and further approximated by truncated Chebyshev polynomials. HGNN was thus able to learn the hidden layer representation (the high-order data structure). HGNN, graph convolutional networks (GCN) and other methods were applied to citation network classification and visual object recognition tasks, demonstrating that the proposed HGNN method could outperform recent state of the art methods. HGNN was also superior to other methods when dealing with multi-modal data.

arxiv-1809.09401a.png

[Image source. Click image to open in new window.]


arxiv-1809.09401b.png

[Image source. Click image to open in new window.]


arxiv-1809.09401c.png

[Image source. Click image to open in new window.]
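
In incidence-matrix notation, a hyperedge convolution layer of the kind HGNN describes is often written as $\small \sigma(D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X \Theta)$, where $\small H$ is the node-hyperedge incidence matrix, $\small W$ the hyperedge weights, and $\small D_v$, $\small D_e$ the node and hyperedge degree matrices. A numpy sketch under that formulation – an illustration, not the authors' code:

```python
import numpy as np

def hgnn_layer(X, H, Theta, w=None):
    """One hyperedge-convolution layer.
    X: (n, f) node features; H: (n, m) incidence matrix; Theta: (f, f') weights.
    Computes sigma(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta) with ReLU as sigma."""
    n, m = H.shape
    w = np.ones(m) if w is None else w             # hyperedge weights
    Dv = (H * w).sum(axis=1)                       # weighted node degrees
    De = H.sum(axis=0)                             # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(Dv))
    G = Dv_inv_sqrt @ H @ np.diag(w / De) @ H.T @ Dv_inv_sqrt
    return np.maximum(G @ X @ Theta, 0.0)

# Toy hypergraph: 4 nodes, 2 hyperedges ({0,1,2} and {2,3}).
H = np.array([[1,0],[1,0],[1,1],[0,1]], dtype=float)
X = np.eye(4)                                      # one-hot node features
Theta = np.random.default_rng(0).normal(size=(4, 8))
X_out = hgnn_layer(X, H, Theta)                    # shape: (4, 8)
```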




While GCN inherently assume pairwise relationships in graph-structured data, in many real-world problems relationships extend beyond pairwise connections – hypergraphs naturally capture these complex relationships. HyperGCN: Hypergraph Convolutional Networks for Semi-Supervised Classification (Sep 2018) explored the use of GCN for hypergraph-based semi-supervised learning (SSL). They proposed HyperGCN, an SSL method which used a layer-wise propagation rule for convolutional networks operating directly on hypergraphs (the first principled adaptation of GCN to hypergraphs). HyperGCN was able to encode both the hypergraph structure and hypernode features in an effective manner.

“In conventional graph-based SSL problems, the loss function is defined as a weighted sum of the supervised loss over labeled data and a regulariser for the graph structure: $\small \mathcal{L} = \underbrace{\mathcal{L}_0}_\text{labeled data} + \underbrace{\lambda \mathcal{L}_{reg}}_\text{graph data}$. Here, $\small \mathcal{L}_0$ denotes the supervised loss w.r.t. the labeled part of the graph and $\small \mathcal{L}_{reg}$ is an explicit graph-based regulariser that smooths the label information over the graph (with $\small \lambda$ being the weighting factor). A popularly used graph-based regulariser is the graph Laplacian regulariser, which relies on the prior assumption that connected nodes in the graph are likely to share the same label (the cluster assumption; Chapelle, Weston, and Schölkopf 2003). This assumption might restrict modeling capacity, as graph edges need not encode node similarity, but could instead contain other information such as pointwise mutual semantic information (Zhuang and Ma 2018), or knowledge graph relationship information (Wang, Ye, and Gupta 2018).

“To avoid this restriction, recent works have encoded both the labeled and graph data using a convolutional neural network (Atwood and Towsley 2016; Kipf and Welling 2017). This allows the network to be trained on a supervised loss function $\small \mathcal{L} = \mathcal{L}_0$ for all graph nodes, thereby avoiding an explicit graph-based regulariser in the loss function. Specifically, graph convolutional networks (GCNs) (Kipf and Welling 2017) naturally integrate the graph structure (e.g., citation networks) and the feature attributes of nodes (e.g., bag-of-words features). GCNs have achieved state-of-the-art performance on benchmark graph-based semi-supervised node classification datasets. Even though GCNs are able to incorporate arbitrary relationships between nodes (and not just similarity), they are still limited by pairwise relationships in the graph.

“However, in many real-world problems, relationships go beyond pairwise associations. In many naturally occurring graphs, we observe complex relationships involving more than two nodes, e.g., co-authorship, co-citation, email communication, etc. Hypergraphs provide a flexible and natural modeling tool to model such complex relationships. A hypergraph is a generalisation of a simple graph in which an edge (a.k.a., hyperedge) can connect any number of nodes. While simple graph edges connect pairs of nodes (pairwise relationships), hyperedges can connect an arbitrary number of nodes (and thus capture complex relationships). For example, in a co-authorship network modeled as a hypergraph, each node represents a paper and each hyperedge represents an author and connects all papers coauthored by the author. Because of the obvious existence of such complex relationships in many real-world networks, the problem of learning with hypergraphs assumes significance (Zhou, Huang, and Schölkopf 2007; Hein et al. 2013; Zhang et al. 2017).

“In this work, we consider the problem of hypergraph-based semi supervised classification. …”

arxiv1809.02589-f1.png

[Image source. Click image to open in new window.]


arxiv1809.02589-f2.png

[Image source. Click image to open in new window.]


arxiv1809.02589-t3.png

[Image source. Click image to open in new window.]


arxiv1809.02589-t4+t5.png

[Image source. Click image to open in new window.]
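
The graph Laplacian regulariser $\small \mathcal{L}_{reg}$ in the quoted loss has a compact closed form: with $\small L = D - W$, the quadratic form $\small f^T L f = \frac{1}{2}\sum_{ij} W_{ij}(f_i - f_j)^2$ penalises predictions that differ across strongly connected nodes. A minimal numpy sketch:

```python
import numpy as np

def laplacian_regularizer(f, W):
    """f^T L f with L = D - W; equals 0.5 * sum_ij W_ij (f_i - f_j)^2."""
    L = np.diag(W.sum(axis=1)) - W
    return f @ L @ f

W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # toy 3-node graph
f = np.array([1.0, 0.9, -0.5])           # per-node predictions

# Total SSL objective from the quoted passage: L = L0 + lambda * L_reg,
# where L0 is the supervised loss over the labeled nodes only.
reg = laplacian_regularizer(f, W)
```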




Most network-based machine learning methods assume that the labels of two adjacent samples in the network are likely to be the same. However, this pairwise-relationship assumption is incomplete: it misses the information that a group of samples showing very similar patterns tends to share the same labels. A natural way to overcome this information loss is to represent the feature dataset of the samples as a hypergraph. Un-normalized Hypergraph p-Laplacian Based Semi-Supervised Learning Methods (Nov 2018) applied such methods to the zoo dataset and a tiny version of the 20 newsgroups dataset. Experimental results showed that the accuracy of these un-normalized hypergraph p-Laplacian based semi-supervised learning methods was significantly greater than that of the non-p-Laplacian method (the current state of the art hypergraph Laplacian based semi-supervised learning method for classification problems).

Zhou2006learning-f1.png

[Image source. [Original source.] Click image to open in new window.]


[Table of Contents]

Graph Signal Processing

This is a fascinating domain. For additional background, see my comprehensive Resources pages:



Recent work mentioned but not discussed on that page includes work by Zou and Lerman [Graph Generation via Scattering (Sep 2018)] at the University of Minnesota, which employed the graph wavelet approach of Wavelets on Graphs via Spectral Graph Theory (2011). While generative models like generative adversarial networks (GAN) and variational autoencoders (VAE) have recently been applied to graphs, they are difficult to train. This work proposed a graph generation model that used an adaptation of a scattering transform to graphs. The proposed model was composed of an encoder (a Gaussianized graph scattering transform) and a decoder (a simple fully connected network that is adapted to specific tasks, such as link prediction, signal generation on graphs and full graph and signal generation). Results demonstrated state of the art performance of the proposed system for both link prediction (Cora, Citeseer and PubMed citation data) and graph and signal generation.

arxiv-1809.10851.png

[Click image to open in new window.]


The application of CNN to structured signal classification (e.g. image classification) inspired the development of deep filter banks, referred to as scattering transforms. These transforms apply a cascade of wavelet transforms and complex modulus operators to extract features that are invariant to group operations and stable to deformations. Furthermore, ConvNets inspired recent advances in geometric deep learning, which aim to generalize these networks to graph data by applying notions from graph signal processing to learn deep graph filter cascades. Graph Classification with Geometric Scattering (Oct 2018) further advanced those lines of research by proposing a geometric scattering transform using graph wavelets defined in terms of random walks on the graph. They demonstrated the utility of features extracted with this designed deep filter bank in graph classification, and showed its competitive performance relative to other methods, including graph kernel methods and geometric deep learning ones, on both social and biochemistry data.

“In supervised graph classification problems one is given a training database of graph/label pairs $\small \{ (G_i, y_i) \}_{i=1}^{N} \subset \mathcal{G} \times \mathcal{Y}$ sampled from a set of potential graphs $\small \mathcal{G}$ and potential labels $\small \mathcal{Y}$. The goal is to use the training data to learn a model $\small f : \mathcal{G} \rightarrow \mathcal{Y}$ that associates to any graph $\small G \in \mathcal{G}$ a label $\small y = f(G) \in \mathcal{Y}$. These types of databases arise in biochemistry, in which the graphs may be molecules and the labels some property of the molecule (e.g., its toxicity), as well as in various types of social network databases. Until recently, most approaches were kernel based methods, in which the model $\small f$ was selected from the reproducing kernel Hilbert space generated by a kernel that measures the similarity between two graphs [ … snip … ]

“In many of these algorithms, task based (i.e., dependent upon the labels $\small \mathcal{Y}$) graph filters are learned from the training data as part of the larger network architecture. These filters act on a characteristic signal $\small \mathbf{x}_G$ that is defined on the vertices of any graph $\small G$, e.g., $\small \mathbf{x}_G$ is the vector of degrees of each vertex (we remark there are also edge based algorithms, such as Gilmer et al. and references within, but these have largely been developed for and tested on databases not considered in Sec. 4).

“Here, we propose an alternative to these methods in the form of a Geometric Scattering Classifier (GSC) that leverages graph-dependent (but not label dependent) scattering transforms to map each graph $\small G$ to the scattering features extracted from $\small \mathbf{x}_G$. Furthermore, inspired by transfer learning approaches, we apply the scattering cascade as frozen network layers on $\small \mathbf{x}_G$, followed by several fully connected classification layers (see Fig. 2). We note that while the formulation in Sec. 3 is phrased for a single signal $\small \mathbf{x}_G$, it naturally extends to multiple signals by concatenating their scattering features.

  • “… our evaluation results on graph classification show the potential of the produced scattering features to serve as universal representations of graphs. Indeed, classification with these features with relatively simple classifier models reaches high accuracy results on most commonly used graph classification datasets, and outperforms both traditional and recent deep learning feed forward methods in terms of average classification accuracy over multiple datasets. …

    Finally, the geometric scattering features provide a new way for computing and considering global graph representations, independent of specific learning tasks. Therefore, they raise the possibility of embedding entire graphs in Euclidean space (albeit high dimensional) and computing meaningful distances between graphs with them, which can be used for both supervised and unsupervised learning, as well as exploratory analysis of graph-structured data.”

arxiv-1810.03068a.png

[Image source. Click image to open in new window.]


arxiv-1810.03068b.png

[Image source. Click image to open in new window.]
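
Although the exact construction is given in the paper, graph wavelets “defined in terms of random walks” are commonly built from the lazy random-walk matrix $\small P = \frac{1}{2}(I + WD^{-1})$ as dyadic differences $\small \Psi_j = P^{2^{j-1}} - P^{2^j}$, with scattering features taken as moments of $\small \vert \Psi_j \mathbf{x} \vert$. A rough sketch under those assumptions, using the degree vector as the characteristic signal (as in the quoted passage):

```python
import numpy as np

def lazy_walk(W):
    """Lazy random-walk matrix P = 0.5 * (I + W D^-1)."""
    D_inv = np.diag(1.0 / W.sum(axis=0))
    return 0.5 * (np.eye(len(W)) + W @ D_inv)

def scattering_features(W, x, J=3):
    """First-order scattering: low-order moments of |Psi_j x| per scale j."""
    P = lazy_walk(W)
    powers = {0: np.eye(len(W))}
    for k in range(1, 2 ** J + 1):                 # precompute P^1 .. P^(2^J)
        powers[k] = powers[k - 1] @ P
    feats = []
    for j in range(1, J + 1):
        Psi_j = powers[2 ** (j - 1)] - powers[2 ** j]   # dyadic wavelet at scale j
        u = np.abs(Psi_j @ x)
        feats.extend(np.sum(u ** q) for q in (1, 2, 3, 4))
    return np.array(feats)                         # graph-level feature vector

W = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=float)   # toy path graph
x = W.sum(axis=0)                                      # degree signal x_G
features = scattering_features(W, x)
```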


Graph classification has recently received a lot of attention from various fields of machine learning, e.g. kernel methods, sequential modeling, and graph embedding. All these approaches offer promising results, with different respective strengths and weaknesses. However, most of them rely on complex mathematics and require heavy computational power to achieve their best performance. A Simple Baseline Algorithm for Graph Classification (Oct 2018) proposed a simple and fast algorithm, based on the spectral decomposition of the graph Laplacian, to perform graph classification and obtain a first reference score for a dataset. This method obtained competitive results compared to state of the art algorithms.

arxiv1810.09155-f1.png

[Image source. Click image to open in new window.]


arxiv1810.09155-t1+t2.png

[Image source. Click image to open in new window.]


arxiv1810.09155-f2.png

[Image source. Click image to open in new window.]
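
A hedged sketch of the general recipe – reduce each graph to a fixed-length spectral signature of its normalized Laplacian, then feed those vectors to any off-the-shelf classifier; the exact feature construction in the paper may differ:

```python
import numpy as np

def spectral_signature(W, k=5):
    """Feature vector for one graph: the k smallest eigenvalues of its
    normalized Laplacian (zero-padded for graphs with fewer than k nodes)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(L))[:k]
    return np.pad(eigvals, (0, max(0, k - len(eigvals))))

# Each graph in a dataset becomes a fixed-length vector, ready for any
# standard classifier (e.g. a random forest or SVM).
W = np.array([[0,1,1],[1,0,1],[1,1,0]], dtype=float)   # toy triangle graph
print(spectral_signature(W))
```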




Many networks exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. Higher-Order Organization of Complex Networks (Jul 2016) [project;  slides here and here], by Jure Leskovec and colleagues at Stanford University, developed a generalized framework for clustering networks on the basis of higher-order connectivity patterns, at the level of small network subgraphs. That project focused on finding higher-order organization of complex networks at the level of small network subgraphs (motifs), by clustering the nodes of a graph based on motifs instead of edges via a motif-based spectral clustering method.

  • “Graphs are a pervasive tool for modeling and analyzing network data throughout the sciences. Benson et al. developed an algorithmic framework for studying how complex networks are organized by higher-order connectivity patterns (see the Perspective by Pržulj and Malod-Dognin). Motifs in transportation networks reveal hubs and geographical elements not readily achievable by other methods. A motif previously suggested as important for neuronal networks is part of a “rich club” of subnetworks.”

    Benson2016higher-order-f1.png

    [Image source. Click image to open in new window.]


    Benson2016higher-order-f1c.png

    [Image source;  (see also). Click image to open in new window.]


  • The algorithm illustrated in Fig. 1C, above, efficiently identifies a cluster of nodes $\small S$ as follows:

    • Step 1: Given a network and a motif $\small M$ of interest, form the motif adjacency matrix $\small W_M$, whose entries are the co-occurrence counts of nodes in the motif: $\small (W_M)_{ij} =$ number of instances of $\small M$ that contain nodes $\small i$ and $\small j$.

    • Step 2: Compute the spectral ordering $\small \sigma$ of the nodes from the normalized motif graph Laplacian matrix constructed via $\small W_M$.

      • The normalized motif Laplacian matrix is $\small L_M = D^{-1/2}(D - W_M)D^{-1/2}$, where $\small D$ is a diagonal matrix with the row-sums of $\small W_M$ on the diagonal, $\small D_{ii} = \sum_j (W_M)_{ij}$ , and $\small D^{-1/2}$ is the same matrix with the inverse square roots on the diagonal $\small D_{ii}^{-1/2} = 1 / \sqrt{\sum_j (W_M)_{ij}}$. The spectral ordering $\small \sigma$ is the by-value ordering of $\small D^{-1/2}z$, where $\small z$ is the eigenvector corresponding to the second smallest eigenvalue of $\small L_M$, i.e., $\small \sigma_i$ is the index of $\small D^{-1/2}z$ with the $\small i^{th}$ smallest value.

    • Step 3: Find the prefix set of $\small \sigma$ with the smallest motif conductance (the argument of the minimum), formally $\small S := \text{argmin}_r\ \phi_M(S_r)$, where $\small S_r = \{\sigma_1, \ldots, \sigma_r \}$  (see the sketch below).

      Benson2016higher-order-f1c-annotated.png

      [Image source;  (see also). Click image to open in new window.]
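
    A compact numpy rendering of Steps 1-3 for the triangle motif. This is illustrative only: the brute-force motif counting below would not scale to the networks analyzed in the paper, and `motif_spectral_cluster` is a hypothetical helper name.

    ```python
    import numpy as np
    from itertools import combinations

    def motif_spectral_cluster(A):
        """Steps 1-3 of motif-based spectral clustering, for the triangle motif."""
        n = len(A)
        # Step 1: motif adjacency matrix W_M -- co-occurrence counts in triangles.
        W = np.zeros((n, n))
        for i, j, k in combinations(range(n), 3):
            if A[i, j] and A[j, k] and A[i, k]:
                for a, b in ((i, j), (j, k), (i, k)):
                    W[a, b] += 1
                    W[b, a] += 1
        # Step 2: spectral ordering sigma from the normalized motif Laplacian.
        d = W.sum(axis=1)
        Dis = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))   # D^{-1/2}
        L = Dis @ (np.diag(d) - W) @ Dis
        _, vecs = np.linalg.eigh(L)                          # ascending eigenvalues
        sigma = np.argsort(Dis @ vecs[:, 1])                 # by-value order of D^{-1/2} z
        # Step 3: sweep prefix sets S_r, keeping the smallest motif conductance.
        total, best, best_phi = W.sum(), None, np.inf
        for r in range(1, n):
            S = sigma[:r]
            rest = np.setdiff1d(np.arange(n), S)
            cut = W[np.ix_(S, rest)].sum()
            vol = min(W[S].sum(), total - W[S].sum())
            phi = cut / vol if vol > 0 else np.inf
            if phi < best_phi:
                best, best_phi = set(S.tolist()), phi
        return best
    ```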


  • [Perspective]  Network Analytics in the Age of Big Data (2016):

    “We live in a complex world of interconnected entities. In all areas of human endeavor, from biology to medicine, economics, and climate science, we are flooded with large-scale data sets. These data sets describe intricate real-world systems from different and complementary viewpoints, with entities being modeled as nodes and their connections as edges, comprising large networks. These networked data are a new and rich source of domain-specific information, but that information is currently largely hidden within the complicated wiring patterns. Deciphering these patterns is paramount, because computational analyses of large networks are often intractable, so that many questions we ask about the world cannot be answered exactly, even with unlimited computer power and time. Hence, the only hope is to answer these questions approximately (that is, heuristically) and prove how far the approximate answer is from the exact, unknown one, in the worst case. On page 163 of this issue, Benson et al. take an important step in that direction by providing a scalable heuristic framework for grouping entities based on their wiring patterns and using the discovered patterns for revealing the higher-order organizational principles of several real-world networked systems.”  [Source]

    Prulj2016network-f1.png

    [Image source. Click image to open in new window.]


  • Response.  “Hypergraph-Based Spectral Clustering of Higher-Order Network Structures” [Tom Michoel & Bruno Nachtergaele] (July 2016):

    • “The authors refer to our work T Michoel et al, Mol. Bio. Syst. 7, 2769 (2011)  [local copy] where we introduced an algorithm for clustering networks on the basis of 3-node network motifs, but appear to have missed our subsequent work where this algorithm was extended into a general spectral clustering algorithm for hypergraphs T Michoel and B Nachtergaele, Phys Rev E 86, 05611 (2012). As a special case, and similar to SNAP, this algorithm can be (and was) used to cluster signed, colored or weighted networks based on network motifs or subgraph patterns of arbitrary size and shape, including patterns of unequal size such as shortest paths. An implementation of the algorithm is available on GitHub.”

    Michoel et al. (2011):


    Michoel2011enrichment-f2.png

    [Image source. Click image to open in new window.]


    Michoel2011enrichment-f3.png

    [Image source. Click image to open in new window.]


    Michoel2011enrichment-f4.png

    [Image source. Click image to open in new window.]


    Michoel & Nachtergaele, 2012:


    arxiv1205.3630-f1.png

    [Image source. Click image to open in new window.]


    arxiv1205.3630-f3.png

    [Image source. Click image to open in new window.]


    arxiv1205.3630-f4.png

    [Image source. Click image to open in new window.]


    arxiv1205.3630-f5.png

    [Image source. Click image to open in new window.]




Graph Signal Processing:

Additional Reading

  • Multilayer Graph Signal Clustering (Nov 2018)

    “Multilayer graphs are commonly used to model relationships of different types between data points. In this paper, we propose a method for multilayer graph data clustering, which combines the different graph layers in the Riemann manifold of Semi-Positive Definite (SPD) graph Laplacian matrices. The resulting combination can be seen as a low-dimensional representation of the original data points. In addition, we consider that data can also carry signal values and not only graph information. We thus propose new clustering solution for such hybrid data by training a neural network such that the transformed data points are orthonormal, and their distance on the aggregated graph is minimized. Experiments on synthetic and real data show that our method leads to a significant improvement with respect to state-of-the-art clustering algorithms for graph data.”

    arxiv1811.00821-t1+t2+t3.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Link Prediction

Link prediction, the discovery of relations such as (subject, relation, object) triples, is tremendously useful for knowledge graph {construction | embedding | completion}, fact checking, and knowledge discovery.

Knowledge graph completion deals with automatically understanding the structure of large knowledge graphs (labeled directed graphs), and predicting missing relationships (labeled edges). In statistical relational learning, the link prediction problem is key to automatically understanding the structure of large knowledge bases. Complex Embeddings for Simple Link Prediction (Jun 2016) [code] – by Théo Trouillon (Xerox Research Centre Europe) and colleagues – proposed to solve this problem through latent factorization, using complex-valued embeddings. The composition of complex embeddings could handle a large variety of binary relations, among them symmetric and antisymmetric relations. Compared to state of the art models such as Neural Tensor Networks and Holographic Embeddings, their approach – based on complex embeddings – was arguably simpler, as it only used the Hermitian dot product (the complex counterpart of the standard dot product between real vectors). Their approach, ComplEx, was scalable to large datasets as it remained linear in time and space, consistently outperforming alternative approaches on standard link prediction benchmarks.

arxiv1606.06357-t2+t4.png

[Image source. Click image to open in new window.]
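
ComplEx's Hermitian dot product scoring is short enough to state directly; a minimal sketch with arbitrary dimensions (higher score = more plausible triple):

```python
import numpy as np

def complex_score(w_r, e_s, e_o):
    """ComplEx triple score: Re(<w_r, e_s, conj(e_o)>).
    The conjugation makes the score asymmetric in (s, o), which is what lets
    a single embedding model both symmetric and antisymmetric relations."""
    return np.real(np.sum(w_r * e_s * np.conj(e_o)))

rng = np.random.default_rng(0)
d = 100                                              # embedding dimension (arbitrary)
w_r = rng.normal(size=d) + 1j * rng.normal(size=d)   # relation embedding
e_s = rng.normal(size=d) + 1j * rng.normal(size=d)   # subject entity embedding
e_o = rng.normal(size=d) + 1j * rng.normal(size=d)   # object entity embedding

score = complex_score(w_r, e_s, e_o)
```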


State of the art (statistical relational learning / knowledge graph completion) embedding models propose different trade-offs between modeling expressiveness, and time and space complexity. Trouillon et al. followed up their ComplEx paper (Jun 2016) with Knowledge Graph Completion via Complex Tensor Factorization (Feb 2017; updated Nov 2017) [code], in which they reconciled expressiveness and complexity through the use of complex-valued embeddings, and explored the link between such complex-valued embeddings and unitary diagonalization. They corroborated their approach theoretically, showing that all real square matrices – thus all possible relation/adjacency matrices – are the real part of some unitarily diagonalizable matrix, opening the door to other applications of square matrix factorization. Their approach, based on complex embeddings, was arguably simple (as it only involves a Hermitian dot product, the complex counterpart of the standard dot product between real vectors), whereas other methods resorted to increasingly complicated composition functions to increase their expressiveness. The proposed complex embeddings were scalable to large datasets, as they remained linear in both space and time, while consistently outperforming alternative approaches on standard link prediction benchmarks.

“This extended version adds proofs of existence of the proposed model in both single and multi-relational settings, as well as proofs of the non-uniqueness of the complex embeddings for a given relation. Bounds on the rank of the proposed decomposition are also demonstrated and discussed. The learning algorithm is provided in more details, and more experiments are provided, especially regarding the training time of the models.”

Improving Knowledge Graph Embedding Using Simple Constraints (May 2018) [code] investigated the potential of using very simple constraints to improve KG embedding. They examined non-negativity constraints on entity representations and approximate entailment constraints on relation representations (hence ComplEx-NNE+AER, an extension of the original ComplEx).

arxiv1805.02408-t3+t4.png

[Image source. Click image to open in new window.]


Non-negativity constraints on entity representations in that work helped learn compact and interpretable representations for entities, whereas approximate entailment constraints on relation representations further encoded regularities of logical entailment between relations into their distributed representations. The constraints imposed prior beliefs upon the structure of the embedding space, without negative impacts on efficiency or scalability. For each relation, it assumed a score matrix whose sign matrix was partially observed. The entries corresponding to factual and nonfactual triples had the sign $\small 1$ and $\small -1$, respectively.

Notes:

  1. Regarding non-negativity constraints: for technical reasons the variables of linear programs must always take non-negative values; for example, the linear inequalities x ≥ 0 and y ≥ 0 specify that you cannot produce a negative number of items.

  2. Regarding approximate entailment constraints: the authors mean an ordered pair of relations such that the former approximately entails the latter – e.g., BornInCountry and Nationality – stating that a person born in a country is very likely, but not necessarily, to have the nationality of that country.
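
Box constraints such as non-negativity are typically enforced by projecting the embeddings back into the feasible region after each gradient update. A generic sketch of that pattern – not necessarily the authors' exact optimization procedure:

```python
import numpy as np

def project_nonnegative(E, upper=1.0):
    """Project entity embeddings back into the box [0, upper] after an
    SGD step, enforcing the non-negativity constraint elementwise."""
    return np.clip(E, 0.0, upper)

E = np.random.default_rng(0).normal(size=(1000, 50))  # entity embedding matrix
# ... gradient update on E would happen here ...
E = project_nonnegative(E)                            # projection step
```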

Holographic embeddings are an approach to generating entity embeddings from a list of (head, tail, relation) triples like (‘Jeff’, ‘Amazon’, ‘employer’) and (‘Zuck’, ‘Palo Alto’, ‘location’). Embeddings can be used as lossy (but memory-efficient) inputs to other machine learning models, used directly in triple inference (i.e. knowledge base completion, or link prediction) by evaluating the likelihood of candidate triples, or to search for associated entities using $\small k$-nearest neighbors. For example, a search from the embedding representing the entity ‘University of California, Berkeley’ yields the associated entities ‘UC Irvine’, ‘Stanford University’, ‘USC’, ‘UCLA’, and ‘UCSD’ (see the figure in this GitHub repo).

Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. Holographic Embeddings of Knowledge Graphs (Dec 2017) [author’s code] – by Maximilian Nickel, Lorenzo Rosasco and Tomaso Poggio at MIT – proposed holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method was related to holographic models of associative memory in that it employed circular correlation to create compositional representations. By using correlation as the compositional operator HolE could capture rich interactions but simultaneously remained efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments they showed that holographic embeddings were able to outperform state of the art methods for link prediction in knowledge graphs and relational learning benchmark datasets.

arxiv-1510.04935.png

[Image source. Click image to open in new window.]


arxiv1510.04935-t2.png

[Image source. Click image to open in new window.]


  • What is a “holographic model?” Holographic Embeddings of Knowledge Graphs states “The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets.”

    arxiv-1510.04935b.png

    [Image source. Click image to open in new window.]


    The Wikipedia Cross-correlation page and this Quora post, What is the need for circular correlation and its advantages over a linear correlation? provide a basic description of the concept of circular correlation and its Fourier transformation, which is used in that paper, and discussed in the ComplEx / HolE  “comparison” paper.

  • These slides (starting at slide 7) describe holographic embeddings, Holographic Embeddings of Knowledge Graphs (2016; Maximilian Nickel, Lorenzo Rosasco and Tomaso Poggio)  [local copy].

  • Excellent non-author code, discussion and examples for HolE are found in this GitHub repository.

Knowledge graph embeddings have received significant attention due to their excellent performance for tasks like link prediction and entity resolution. In Complex and Holographic Embeddings of Knowledge Graphs: A Comparison (Jul 2017) Théo Trouillon (Université Grenoble Alpes) and Maximilian Nickel (Facebook AI Research | MIT) provided a comparison of two state of the art knowledge graph embeddings: ComplEx and HolE. They briefly reviewed both models, discussing how their scoring functions were equivalent, and then analyzed the discrepancy of results reported in the original articles – showing experimentally that they are likely due to the use of different loss functions. They also discussed advantages and disadvantages of both models and under which conditions one would be preferable to the other.

Critique: although they do not label the models in their Table 1, from their discussion of the loss functions it is apparent that “Margin” refers to Nickel et al.’s HolE model, whereas “Neg-LL” refers to Trouillon et al.’s ComplEx model.

arxiv1707.01475-t1-labeled.png

[Image source. Click image to open in new window.]


  • Though similar, the filtered MRR value (0.541) that they report for HolE on the FB15k dataset in this 2017 paper differs from the value (0.524) that they reported in their earlier 2016 and 2017 papers.

  • Likewise, the filtered MRR value (0.639) that they report for ComplEx on the FB15k dataset in this 2017 paper is similar but differs from the value (0.692) that they reported in their earlier 2016 and 2017 papers.

Analogical Inference for Multi-Relational Embeddings (Jul 2017) [code] imposed analogical properties on embeddings. Each relation was represented as a normal matrix, and these matrices were constrained to satisfy commutativity properties. ANALOGY represented a novel framework for optimizing the latent representations with respect to the “analogical” properties of the embedded entities and relations, for knowledge base completion. Analogical inference posits that if subsets of entities and relations are analogous in systems A and B, then the unobserved triples in B can be inferred by mirroring their counterparts in A. The proposed approach obtained state of the art results on two popular benchmark datasets, outperforming a large number of strong baselines in most cases.

Related to analogical reasoning, note also that Beyond Word Embeddings: Learning Entity and Concept Representations from Large Scale Knowledge Bases described work that jointly learned concept vectors from a textual knowledge base (Wikipedia) and a graphical knowledge base (Probase), demonstrating superior results on analogical reasoning and concept categorization.

arxiv-1705.02426d.png

[Image source. Click image to open in new window.]


arxiv1705.02426-t3.png

[Image source. Click image to open in new window.]


Knowledge Graph Embedding with Iterative Guidance from Soft Rules (Nov 2017) [code] presented a novel approach to knowledge graph embedding (KGE) combined with guidance from soft logic rules, called Rule-Guided Embedding (RUGE), which provided state of the art results in KG link prediction (compared e.g. to ComplEx: their Table 3). This work built on previous work by these authors (which provides additional background): Jointly Embedding Knowledge Graphs and Logical Rules (which introduced KALE, an approach that learned entity and relation embeddings by jointly modeling knowledge and logic), and Knowledge Base Completion Using Embeddings and Rules (which employed an integer linear programming approach plus logic rules that leveraged prior knowledge in the KG to greatly reduce the solution space during inference, i.e. link prediction).

arxiv-1711.11231.png

[Image source. Click image to open in new window.]


arxiv1711.11231-t3.png

[Image source. Click image to open in new window.]


PredPath, described in Discriminative Predicate Path Mining for Fact Checking in Knowledge Graphs (Shi and Weninger, Apr 2016) [code], presented a discriminative path-based method for fact checking in KG that incorporated connectivity, type information, and predicate interactions. Given a statement in the form of a (subject, predicate, object) triple – for example, (Chicago, capitalOf, Illinois) – their approach mined discriminative paths that alternatively defined the generalized statement (U.S. city, predicate, U.S. state), and used the mined rules to evaluate the veracity of that statement.

arxiv1510.05911-f1.png

[Image source. Click image to open in new window.]


arxiv-1510.05911.png

[Image source. Click image to open in new window.]


arxiv1510.05911-t1.png

[Image source. Click image to open in new window.]


Follow-on work by the PredPath authors, ProjE: Embedding Projection for Knowledge Graph Completion (Shi and Weninger, Nov 2016) [code], presented a shared-variable neural network model that filled in missing information in a KG by learning joint embeddings of the KG entities and edges (collectively calculating the scores of all candidate triples). ProjE had a parameter size smaller than 11 out of 15 existing methods, while performing 37% better than the then-best method on standard datasets. They also showed, via a fact checking task, that ProjE was capable of accurately determining the veracity of many declarative statements.

arxiv-1611.05425.png

[Image source. Click image to open in new window.]


arxiv1611.05425-t2+t3.png

[Image source. Click image to open in new window.]


arxiv1611.05425-t4.png

[Image source. Click image to open in new window.]


The ProjE authors (Shi and Weninger) also proposed a new KGC task, Open-World Knowledge Graph Completion (Nov 2017) [code]. As a first attempt to solve this task, they introduced an open-world KGC model called ConMask, which learned embeddings of an entity’s name and parts of its text description to connect unseen entities to the KG. To mitigate the presence of noisy text descriptions, ConMask used relationship-dependent content masking to extract relevant snippets, and then trained a CNN to fuse the extracted snippets with entities in the KG. Experiments on large datasets showed that ConMask performed well on the open-world KGC task, even outperforming existing KGC models on the standard closed-world KGC task.

arxiv1711.03438-f1.png

[Image source. Click image to open in new window.]


arxiv1711.03438-f2.png

[Image source. Click image to open in new window.]


arxiv1711.03438-f3.png

[Image source. Click image to open in new window.]


arxiv1711.03438-t3+t4.png

[Image source. Click image to open in new window.]


Convolutional 2D Knowledge Graph Embeddings (Jul 2018) [code] introduced ConvE, a deeper, multi-layer convolutional network model for link prediction. ConvE was highly parameter efficient, yielding the same performance as DistMult and R-GCN [R-GCN is discussed below] with 8x and 17x fewer parameters, respectively.

“Analysis of our model suggests that it is particularly effective at modelling nodes with high indegree – which are common in highly-connected, complex knowledge graphs such as Freebase and YAGO3. In addition, it has been noted that the WN18 and FB15k datasets suffer from test set leakage, due to inverse relations from the training set being present in the test set – however, the extent of this issue has so far not been quantified. We find this problem to be severe: a simple rule-based model can achieve state-of-the-art results on both WN18 and FB15k.”
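The ConvE scoring pipeline – reshape, stack, convolve, project, then score against all entities – fits in a few lines of PyTorch. A minimal sketch with illustrative dimensions (not the authors’ code, which adds batch normalization, dropout and per-entity biases):

    # Minimal ConvE-style scorer (dimensions and hyperparameters illustrative).
    import torch
    import torch.nn as nn

    class ConvEScorer(nn.Module):
        def __init__(self, n_entities, n_relations, dim=200, h=10, w=20):
            super().__init__()
            assert h * w == dim
            self.h, self.w = h, w
            self.ent = nn.Embedding(n_entities, dim)
            self.rel = nn.Embedding(n_relations, dim)
            self.conv = nn.Conv2d(1, 32, kernel_size=3)           # 2D conv over the stacked "image"
            self.fc = nn.Linear(32 * (2 * h - 2) * (w - 2), dim)  # flatten conv features back to dim

        def forward(self, s, r):
            # Reshape subject and relation embeddings to 2D and stack them vertically.
            es = self.ent(s).view(-1, 1, self.h, self.w)
            er = self.rel(r).view(-1, 1, self.h, self.w)
            x = torch.cat([es, er], dim=2)                        # (batch, 1, 2h, w)
            x = torch.relu(self.conv(x)).flatten(1)
            x = torch.relu(self.fc(x))
            return x @ self.ent.weight.t()                        # logits over all candidate objects

    scorer = ConvEScorer(n_entities=1000, n_relations=30)
    logits = scorer(torch.tensor([7]), torch.tensor([3]))         # scores for every entity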

arxiv-1707.01476.png

[Image source. Click image to open in new window.]


arxiv1707.01476-t3+t4.png

[Image source. Click image to open in new window.]


arxiv1707.01476-t5.png

[Image source. Click image to open in new window.]


ConvE was the inspiration – by different authors – for HypER (Hypernetwork Knowledge Graph Embeddings) (Aug 2018) [code], which used a hypernetwork architecture to generate convolutional layer filters specific to each relation, and applied those filters to the subject entity embeddings.

arxiv-1609.09106.png

[Image source. Click image to open in new window.]


arxiv-1609.09106b.png

[Image source  (local copy). Click image to open in new window.]


Their model (HypER) simplified the entity and relation embedding interactions introduced by ConvE while outperforming all previous approaches to link prediction across all standard link prediction datasets. [A hypernetwork is a network that generates the weights of another network, which can be used to provide weight-sharing across layers and to dynamically synthesize weights based on an input.] Here, the hypernetwork generated relation-specific filter weights to process the input entities, also yielding multi-task knowledge sharing across the different relations in the knowledge graph. Whereas ConvE applied a global, common set of 2-dimensional convolutional filters (suggesting the presence of a 2D structure in word embeddings) to the reshaped and concatenated subject entity and relation embeddings, HypER used its hypernetwork to generate 1-dimensional relation-specific filters to process the unadjusted subject entity embeddings, thereby simplifying the interaction between subject entity and relation embeddings. Thus, instead of extracting a limited number of entity-related and relation-related features from the dimensions around the concatenation point, HypER covered all subject entity embedding dimensions by sliding the relation-dependent convolutional filters over the whole entity embedding.
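The contrast with ConvE is easiest to see in code. A minimal PyTorch sketch in which a hypernetwork maps each relation embedding to a bank of 1-dimensional filters that are slid over the (unreshaped) subject entity embedding; for clarity I use a per-example loop, whereas the paper batches this via grouped convolution, and all dimensions are illustrative:

    # Sketch of HypER's relation-conditioned 1D convolution (shapes illustrative).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HypERScorer(nn.Module):
        def __init__(self, n_entities, n_relations, dim=200, n_filters=32, flen=9):
            super().__init__()
            self.n_filters, self.flen = n_filters, flen
            self.ent = nn.Embedding(n_entities, dim)
            self.rel = nn.Embedding(n_relations, dim)
            self.hypernet = nn.Linear(dim, n_filters * flen)  # generates the filters from r
            self.fc = nn.Linear(n_filters * dim, dim)

        def forward(self, s, r):
            es = self.ent(s)                                  # (B, dim)
            filters = self.hypernet(self.rel(r))              # relation-specific filter bank
            filters = filters.view(-1, self.n_filters, 1, self.flen)
            outs = []
            for i in range(es.size(0)):                       # per-example conv, own filters
                x = es[i].view(1, 1, -1)                      # (1, 1, dim)
                y = F.conv1d(x, filters[i], padding=self.flen // 2)
                outs.append(y.flatten())
            x = torch.relu(self.fc(torch.stack(outs)))
            return x @ self.ent.weight.t()                    # logits over all candidate objects

    scorer = HypERScorer(n_entities=1000, n_relations=30)
    logits = scorer(torch.tensor([7, 8]), torch.tensor([3, 3]))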

arxiv-1808.07018.png

[Image source  (see also). Click image to open in new window.]


arxiv1808.07018-t4+t5+t6.png

[Image source. Click image to open in new window.]


arxiv1808.07018-t7+t8+t9+t10.png

[Image source. Click image to open in new window.]


Link Prediction using Embedded Knowledge Graphs (Nov 2016; updated Apr 2018) by Microsoft Research and Google Research addressed the task of knowledge base completion by performing a single, short sequence of interactive lookup operations on an embedded knowledge graph which had been trained through end-to-end backpropagation to be an optimized and compressed version of the initial knowledge base. Their proposed model, Embedded Knowledge Graph Network (EKGN), achieved state of the art results on popular knowledge base completion benchmarks.

arxiv1611.04642-f1.png

[Image source. Click image to open in new window.]


arxiv1611.04642-f2.png

[Image source. Click image to open in new window.]


arxiv1611.04642-f3.png

[Image source. Click image to open in new window.]


arxiv1611.04642-t1.png

[Image source. Click image to open in new window.]


arxiv1611.04642-t2.png

[Image source. Click image to open in new window.]


In August 2018, A Capsule Network-based Embedding Model for Knowledge Graph Completion and Search Personalization introduced CapsE, a new model for knowledge graph completion. CapsE employed a capsule network to model (subject, relation, object) relationship triples, and obtained state of the art link prediction results for knowledge graph completion on two benchmark datasets: WN18RR and FB15k-237. Comparing Table 2 in that paper to Table 4 in On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols appears to confirm CapsE as a state of the art model, and the first to apply capsule networks to knowledge graph completion and search personalization.

arxiv-1808.04122.png

[Image source. Click image to open in new window.]


arxiv1808.04122-t2.png

[Image source. Click image to open in new window.]


Predicting Semantic Relations using Global Graph Properties (Aug 2018) [code] combined global and local properties of semantic graphs through the framework of Max-Margin Markov Graph Models (M3GM), a novel extension of Exponential Random Graph Model (ERGM) that scales to large multi-relational graphs. They demonstrated how such global modeling improves performance on the local task of predicting semantic relations between synsets, yielding new state of the art results on the WN18RR dataset, a challenging version of WordNet link prediction in which “easy” reciprocal cases were removed. In addition, the M3GM model identified multirelational motifs that were characteristic of well-formed lexical semantic ontologies.

arxiv1808.08644-f1.png

[Image source. Click image to open in new window.]


arxiv1808.08644-t1+t2.png

[Image source. Click image to open in new window.]


One-Shot Relational Learning for Knowledge Graphs (Aug 2018) [code], by the University of California (Santa Barbara) and IBM Research, observed that long-tail relations are common in KGs (i.e., many relations have very few instances), and that newly added relations often do not have many known triples for training. In this work they aimed at predicting new facts under a challenging setting where only one training instance was available. They proposed a one-shot relational learning framework which utilized the knowledge extracted by embedding models, and learned a matching metric that considered both the learned embeddings and one-hop graph structures. Empirically, their model yielded considerable performance improvements over existing embedding models, and also eliminated the need to retrain the embedding models when dealing with newly added relations. The authors also prepared two new datasets for this work.

[paraphrased:] “Existing benchmarks for knowledge graph completion, such as YAGO3-10, are small subsets of real-world KGs. These datasets consider the same set of relations during training and testing, and often include sufficient training triples for every relation. To construct datasets for one-shot learning, we go back to the original KGs and select those relations that do not have too many triples as one-shot task relations. We refer to the rest of the relations as background relations, since their triples provide important background knowledge for us to match entity pairs. Our first dataset is based on NELL, a system that continuously collects structured knowledge by reading the web. We take the latest dump and remove the inverse relations. We select the relations with less than 500 but more than 50 triples as one-shot tasks. To show that our model is able to operate on large-scale KGs, we follow a similar process to build another, larger dataset based on Wikidata. The dataset statistics are shown in Table 1. Note that the Wiki-One dataset is an order of magnitude larger than any other benchmark dataset in terms of the numbers of entities and triples.”

arxiv1808.09040-f1.png

[Image source. Click image to open in new window.]


arxiv1808.09040-f2.png

[Image source. Click image to open in new window.]


arxiv1808.09040-t2.png

[Image source. Click image to open in new window.]


Most existing knowledge graph completion methods either focus on the positional relationship between an entity pair and a single relation (a 1-hop path) in semantic space, or concentrate on the joint probability of random walks over multi-hop paths among entities; they do not fully consider the intrinsic relationships of all the links among entities. Observing that the single relation and the multi-hop paths between the same entity pair generally carry shared/similar semantic information, Hierarchical Attention Networks for Knowledge Base Completion via Joint Adversarial Training (Oct 2018) proposed a novel method for KB completion that captured the features shared by different data sources, utilizing hierarchical attention networks (HAN) and adversarial training (AT). The joint adversarial training used a gradient reversal layer (GRL), which reversed the backpropagated gradient so that the feature extractor learned features shared between the different data sources. The HANs automatically encoded the inputs into low-dimensional vectors and exploited two partial parameter-shared components: one for feature source discrimination, and the other for determining missing relations. By adversarially training the entire model end-to-end, their method minimized the classification error for missing relations, while the AT mechanism encouraged the model to extract features that were both discriminative for missing relation prediction and shareable between single relations and multi-hop paths.
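The gradient reversal layer itself is a small, standard construct (popularized by domain-adversarial training); the following sketch shows the mechanism, not the authors’ code:

    # A standard gradient reversal layer (GRL): identity forward, reversed gradient backward.
    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)                     # identity in the forward pass
        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lam * grad_output, None     # reversed (and scaled) gradient

    def grad_reverse(x, lam=1.0):
        return GradReverse.apply(x, lam)

    # The discriminator sees the features unchanged, but the upstream feature extractor
    # receives reversed gradients, pushing it toward source-indistinguishable features.
    feats = torch.randn(4, 16, requires_grad=True)
    grad_reverse(feats).sum().backward()
    print(feats.grad[0, 0])                         # tensor(-1.): the gradient arrives reversed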

arxiv1810.06033-fig1.png

[Image source. Click image to open in new window.]


arxiv1810.06033-fig2.png

[Image source. Click image to open in new window.]


arxiv1810.06033-table2.png

[Image source. Click image to open in new window.]





My Table 1, below, summarizes MRR scores for link prediction for selected, best-performing embedding models ca. mid-2018. Around that time, ComplEx was employed by Facebook AI Research as the state of the art for comparison of various models in Canonical Tensor Decomposition for Knowledge Base Completion  [code], which framed knowledge base completion as a 3rd-order binary tensor completion problem.

Table 1. Link prediction results: comparison of various embedding models (filtered MRR metric¹)
General notes.
1. This list is current to approximately late August, 2018.
2. I did not exhaustively search the literature, only the papers mentioned in this subsection.
3. From those papers I selected (“cherry picked”) results for some of the better-performing models, restricting the comparison to models discussed in this subsection. I also collected/reported only the MRR scores, as I wanted an uncluttered comparison of those models.
4. References cited may not be the primary source (refer to the references cited therein).
Table footnotes.
¹ Filtered MRR: mean reciprocal rank computed under the “filtered” protocol: when ranking the candidates for a test triple, all other triples known to be true (i.e., appearing in the training, validation or test sets) are removed from the candidate list, so that a model is not penalized for ranking other correct answers highly.
² Datasets:

    WN18 is a subset of WordNet which consists of 18 relations and 40,943 entities ("generic facts"). Most of the 151,442 triples consist of hyponym and hypernym relations and, for such a reason, WN18 tends to follow a strictly hierarchical structure.

    WN18RR corrects flaws in WN18:

      "WN18RR reclaims WN18 as a dataset, which cannot easily be completed using a single rule, but requires modeling of the complete knowledge graph. WN18RR contains 93,003 triples with 40,943 entities and 11 relations. For future research, we recommend against using FB15k and WN18 and instead recommend FB15k-237, WN18RR, and YAGO3-10."

      "A popular relation prediction dataset for WordNet is the subset curated as WN18, containing 18 relations for about 41,000 synsets extracted from WordNet 3.0. It has been noted that this dataset suffers from considerable leakage: edges from reciprocal relations such as hypernym/hyponym appear in one direction in the training set and in the opposite direction in dev/test. This allows trivial rule-based baselines to achieve high performance. To alleviate this concern, Dettmers et al. (2018) released the WN18RR set, removing seven relations altogether. However, even this dataset retains four symmetric relation types: 'also see', 'derivationally related form', 'similar to', and 'verb group'. These symmetric relations can be exploited by defaulting to a simple rule-based predictor." [Source: Section 4.1 in Predicting Semantic Relations using Global Graph Properties; references therein.]
    FB15k is a subset of Freebase which contains about 14,951 entities with 1,345 different relations. A large fraction of content in this knowledge graph describes facts about movies, actors, awards, sports, and sport teams.

    FB15k-237  (see also), which corrects errors in FB15k, contains about 14,541 entities with 237 different relations. "This dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. The knowledge base triples are a subset of the FB15k set. The textual mentions are derived from 200 million sentences from the ClueWeb12 corpus coupled with FACC1 Freebase entity mention annotations."

    YAGO3-10 is a subset of YAGO3 which consists of entities which have a minimum of 10 relations each. It has 123,182 entities and 37 relations. Most of the triples deal with descriptive attributes of people, such as citizenship, gender, and profession.

    YAGO37 is extracted from the core facts of YAGO3, containing 37 relations and 123,189 entities. This dataset was created by the RUGE authors: "The FB15k data set consists of 1,345 relations and 14,951 entities among them. The training set contains 483,142 triples, the validation set 50,000 triples, and the test set 59,071 triples. 454 rules are created for FB15k. The YAGO37 data set consists of 37 relations and 123,189 entities among them. The training set contains 989,132 triples, the validation set 50,000 triples, and the test set 50,000 triples. 16 rules are created for YAGO37. All triples are unique and we made sure that all entities/relations appearing in the validation or test sets were occurring in the training set."
Datasets²
Model [ref] FB15k FB15k-237 WN18 WN18RR YAGO3-10 YAGO37
ANALOGY [1] 0.723 0.211 0.942 0.391 0.257
ANALOGY [7] 0.725 0.942
CapsE [8] 0.538 0.391
ComplEx [1] 0.716 0.206 0.942 0.390 0.266
ComplEx [10] 0.692 0.247 0.941 0.440 0.340
ComplEx [11] 0.69 0.240 0.941 0.444
ComplEx [5] 0.692 0.941
ComplEx [6] 0.639 0.941
ComplEx-NNE+AER [3] 0.803 0.943
ConvE [10] 0.657 0.325 0.943 0.430 0.440
ConvE [4] 0.745 0.942
ConvE [8] 0.316 0.460
ConvE [9] 0.301 0.342
ConvE [11] 0.745 0.301 0.942 0.342
DistMult [10] 0.654 0.241 0.822 0.430 0.340
DistMult [11] 0.35 0.241 0.83 0.425
DistMult [3] 0.644 0.365
DistMult [5] 0.654 0.822
HolE [6] 0.541 0.938
HolE [5] 0.524 0.938
HypER [2] 0.762 0.341 0.951 0.463
M3GM [12] 0.4983
ProjE [1] 0.588 0.249 0.820 0.367 0.470
R-GCN [10] 0.696 0.248 0.814
R-GCN+ [9] 0.696 0.249 0.819
RUGE [3] 0.768 0.431
TransE [1] 0.456 0.219 0.584 0.191 0.151
TransE [3] 0.400 0.303
TransE [5] 0.380 0.454
TransE [8] 0.294 0.226
TransE [9] 0.463 0.294 0.495 0.226
TransE [12] 0.4659
TransF [11] 0.564 0.286 0.856 0.505

From the table above we can see that there is considerable variation among the filtered MRR scores for link prediction across the various models and datasets. Per the footnotes in that table – although there are fewer comparisons at present – we should probably assign greater weight to models that score well on the more challenging datasets (FB15k-237, WN18RR, and YAGO3-10).
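For concreteness, the filtered MRR reported throughout Table 1 can be computed as follows. This is a toy sketch: score_fn stands in for any trained model’s scoring function, and only object-side ranking is shown, whereas the benchmarks average subject- and object-side ranks.

    # Toy sketch of the filtered MRR metric used in Table 1.
    import numpy as np

    def filtered_mrr(test_triples, score_fn, all_true):
        rr = []
        for s, r, o in test_triples:
            scores = score_fn(s, r)                       # scores for every candidate object
            for o2 in all_true.get((s, r), set()) - {o}:  # the "filter": drop competing
                scores[o2] = -np.inf                      #   triples known to be true
            rank = 1 + np.sum(scores > scores[o])         # 1-based rank of the gold object
            rr.append(1.0 / rank)
        return float(np.mean(rr))

    rng = np.random.default_rng(1)
    score_fn = lambda s, r: rng.normal(size=5)            # 5 entities, random "model"
    print(filtered_mrr([(0, 0, 2)], score_fn, {(0, 0): {2, 4}}))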

Knowledge graph embedding has been an active research topic for knowledge base completion, with progressive improvement from the initial TransE, TransH and DistMult to the current state of the art, ConvE. ConvE uses 2D convolution over embeddings and multiple layers of nonlinear features to model knowledge graphs; the model can be trained efficiently and scales to large knowledge graphs. However, there is no structure enforcement in the embedding space of ConvE. The recent graph convolutional network (GCN) provides another way of learning graph node embeddings by successfully utilizing a graph’s connectivity structure.

End-to-end Structure-Aware Convolutional Networks for Knowledge Base Completion (Nov 2018) proposed a novel end-to-end Structure-Aware Convolutional Network (SACN) that combined the benefits of GCN and ConvE. SACN consisted of an encoder, a weighted graph convolutional network (WGCN), and a decoder, a convolutional network called Conv-TransE. The WGCN utilized knowledge graph node structure, node attributes and relation types; its learnable weights collected adaptive amounts of information from neighboring graph nodes, resulting in more accurate embeddings of graph nodes. In addition, node attributes were represented as additional nodes, allowing them to be easily integrated into the WGCN. The decoder, Conv-TransE, extended ConvE to be translational between entities and relations while retaining ConvE’s state of the art performance. They demonstrated the effectiveness of the proposed SACN on the standard FB15k-237 and WN18RR datasets, presenting about 10% relative improvement over ConvE in terms of HITS@1, HITS@3 and HITS@10.

arxiv1811.04441-f1.png

[Image source. Click image to open in new window.]


arxiv1811.04441-f2.png

[Image source. Click image to open in new window.]


arxiv1811.04441-t3.png

[Image source. Click image to open in new window.]


In an interesting approach, Finding Streams in Knowledge Graphs to Support Fact Checking (Aug 2017) [code] viewed a knowledge graph as a “flow network” and knowledge as a fluid, abstract commodity. They showed that computational fact checking of a (subject, predicate, object) triple then amounts to finding a “knowledge stream” that emanates from the subject node and flows toward the object node through paths connecting them. Evaluations of their models (KS: Knowledge Stream and KL-REL: Relational Knowledge Linker) revealed that this network-flow model was very effective in discerning true statements from false ones, outperforming existing algorithms on many test cases. Moreover, the model was expressive in its ability to automatically discover useful path patterns and relevant facts that may help human fact checkers corroborate or refute a claim.
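The flow view is easy to illustrate with an off-the-shelf max-flow computation. This is a toy sketch only: KS derives its edge capacities from triple specificity and extracts successive shortest “streams,” none of which is modeled here, and the graph and capacities below are arbitrary.

    # Toy illustration of the "knowledge as flow" view of fact checking.
    import networkx as nx

    G = nx.DiGraph()
    G.add_edge("Chicago", "Illinois", capacity=3.0)   # e.g., a locatedIn edge
    G.add_edge("Chicago", "USA", capacity=1.0)
    G.add_edge("USA", "Illinois", capacity=2.0)

    flow_value, flow_dict = nx.maximum_flow(G, "Chicago", "Illinois")
    print(flow_value)   # 4.0: total "knowledge" supporting a (Chicago, ?, Illinois) claim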

arxiv1708.07239-f5.png

[Image source. Click image to open in new window.]


arxiv1708.07239-t2+t3.png

[Image source. Click image to open in new window.]


arxiv1708.07239-t4.png

[Image source. Click image to open in new window.]


Predictive Network Representation Learning for Link Prediction (2017) [pdf] addressed structural link prediction, which infers missing links in a static network. Their proposed model, Predictive Network Representation Learning (PNRL), defined two learning objectives: observed structure preservation, and hidden link prediction. [Network representation learning models learn latent representations of nodes, which embed the rich structural information into the latent space; most of these models are learned in an unsupervised manner.] To integrate the two objectives in a unified model, they developed an effective sampling strategy that selected certain edges in a given network as assumed hidden links and regarded the rest of the network structure as observed when training the model. By jointly optimizing the two objectives, the model not only enhanced the predictive ability of the node representations but also learned additional link prediction knowledge in the representation space.

Wang2017predictive-f1.png

[Image source. Click image to open in new window.]


Wang2017predictive-t2.png

[Image source. Click image to open in new window.]


The use of drug combinations, termed polypharmacy, is common for treating patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects may emerge because of drug-drug interactions, in which activity of one drug may change favorably or unfavorably if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity.

Modeling Polypharmacy Side effects with Graph Convolutional Networks (Jul 2018) [project  (code/datasets);  discussion] – by Jure Leskovec and colleagues at Stanford University – presented Decagon, an approach for modeling polypharmacy side effects. The approach constructed a multimodal graph of protein-protein interactions, drug-protein target interactions and the polypharmacy side effects, which were represented as drug-drug interactions, where each side effect was an edge of a different type. Decagon was developed specifically to handle multimodal graphs with a large number of edge types. Their approach developed a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon could predict the exact side effect, if any, through which a given drug combination manifests clinically.

Decagon accurately predicted polypharmacy side effects, outperforming baselines by up to 69%. The authors found that it automatically learned representations of side effects indicative of the co-occurrence of polypharmacy in patients. Furthermore, Decagon performed particularly well on polypharmacy side effects with a strong molecular basis, while on predominantly non-molecular side effects it achieved good performance through the effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies.
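Decagon’s decoder scores a candidate (drug u, side effect r, drug v) edge with a tensor factorization, σ(z_uᵀ D_r R D_r z_v), where R is a global drug-drug interaction matrix and D_r is a learned diagonal matrix specific to side effect r. A numpy sketch, with random placeholders standing in for the graph convolutional encoder’s outputs:

    # Sketch of Decagon's per-side-effect decoder (random placeholder parameters).
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 32
    z_u, z_v = rng.normal(size=dim), rng.normal(size=dim)  # drug embeddings from the encoder
    R = rng.normal(size=(dim, dim))                        # global drug-drug interaction matrix
    D_r = np.diag(rng.normal(size=dim))                    # importance weights for side effect r

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    p_edge = sigmoid(z_u @ D_r @ R @ D_r @ z_v)            # P(drugs u, v cause side effect r)
    print(p_edge)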

Decagon2018-f1.png

[Image source. Click image to open in new window.]


Decagon2018-GCN.png

[Image source. Click image to open in new window.]


Decagon2018-encoder.png

[Image source. Click image to open in new window.]


Decagon2018-f3.png

[Image source. Click image to open in new window.]


Decagon2018-f4.png

[Image source. Click image to open in new window.]


Current KG completion models require that two-thirds of a triple be provided (e.g., the subject and relation) in order to predict the remaining element. DSKG: A Deep Sequential Model for Knowledge Graph Completion (Oct 2018) [code] proposed a new model that used a KG-specific multi-layer recurrent neural network (RNN) to model triples in a KG as sequences. DSKG outperformed several state of the art KG completion models on the conventional entity prediction task across many evaluation metrics, based on two benchmark datasets (FB15k and WN18) and a more difficult dataset (FB15k-237). Furthermore, their model was capable of predicting entire triples given only one entity.
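The sequential view of a triple can be sketched as a two-step RNN: feed in the subject, read out a relation; feed in the relation, read out the object. The toy PyTorch below conveys that idea only; DSKG’s KG-specific multi-layer architecture and training details are simplified away.

    # Toy two-step RNN over a triple: subject -> relation -> object.
    import torch
    import torch.nn as nn

    n_entities, n_relations, dim = 1000, 30, 64
    ent = nn.Embedding(n_entities, dim)
    rel = nn.Embedding(n_relations, dim)
    rnn = nn.GRU(dim, dim, batch_first=True)
    to_rel = nn.Linear(dim, n_relations)
    to_ent = nn.Linear(dim, n_entities)

    s = torch.tensor([7])
    h1, state = rnn(ent(s).unsqueeze(1))       # step 1: subject in, relation logits out
    rel_logits = to_rel(h1[:, -1])
    r = rel_logits.argmax(-1)                  # (or a gold relation during training)
    h2, _ = rnn(rel(r).unsqueeze(1), state)    # step 2: relation in, object logits out
    obj_logits = to_ent(h2[:, -1])
    print(rel_logits.shape, obj_logits.shape)  # per-step predictions from one entity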

arxiv1810.12582-f1.png

[Image source. Click image to open in new window.]


arxiv1810.12582-t1.png

[Image source. Click image to open in new window.]


arxiv1810.12582-t2+t3.png

[Image source. Click image to open in new window.]


Many knowledge graph embedding methods operate on triples and are therefore implicitly limited by a very local view of the entire knowledge graph. MOHONE: Modeling Higher Order Network Effects in KnowledgeGraphs via Network Infused Embeddings (Nov 2018) presented a new framework, MOHONE, to effectively model higher order network effects in knowledge graphs, enabling one to capture varying degrees of network connectivity (from the local to the global). Their framework was generic, explicitly modeled the network scale, and captured two different aspects of similarity in networks: (a) shared local neighborhood, and (b) structural role-based similarity. They first introduced methods that learned network representations of entities in the knowledge graph, capturing these varied aspects of similarity. They then proposed a fast, efficient method to incorporate the information captured by these network representations into existing knowledge graph embeddings. Their method consistently and significantly improved link prediction performance for several different knowledge graph embedding methods, including TransE, TransD, DistMult and ComplEx (by at least 4 points, or 17%, in some cases).

For a related paper by the same authors contemporaneously released on arXiv, see DOLORES: Deep Contextualized Knowledge Graph Embeddings.

arxiv1811.00198-f1.png

[Image source. Click image to open in new window.]


arxiv1811.00198-f2+f3.png

[Image source. Click image to open in new window.]


arxiv1811.00198-t2+t3.png

[Image source. Click image to open in new window.]


An interesting approach to KG construction involves the Biological Expression Language (BEL), which represents findings in the life sciences in a computable form. Biological statements in BEL are represented as (subject, predicate, object) triples, where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object (which can be a BEL term or a BEL statement). Hence, the knowledge captured by BEL statements can be represented as a graph. A BEL term represents the abundance of a biological entity, or a biological process such as a Gene Ontology entry or a disease. Each entity is described in an existing namespace (identifying, for example, its source database, e.g. ChEBI) or a user-defined namespace. A fixed set of causal, correlative and other relationships link the entities, and each statement can be associated with metadata that contextualizes it, for example qualifying it to be true only in specific tissues.
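As a concrete illustration, a BEL-style statement can be constructed programmatically with PyBEL (listed below). This is only a sketch: the statement, citation, evidence and annotation values are invented, and the DSL/keyword names reflect my reading of the PyBEL API (pip install pybel) rather than any source cited here.

    # A small BEL-style graph built with PyBEL (values invented for illustration).
    from pybel import BELGraph
    from pybel.dsl import Protein, BiologicalProcess

    graph = BELGraph(name="toy BEL graph", version="0.0.1")
    tp53 = Protein(namespace="HGNC", name="TP53")
    apoptosis = BiologicalProcess(namespace="GO", name="apoptotic process")

    # Encodes the BEL statement:  p(HGNC:TP53) -> bp(GO:"apoptotic process")
    graph.add_increases(
        tp53,
        apoptosis,
        citation="12345678",                           # PubMed identifier (invented)
        evidence="Sentence supporting the relation.",  # the contextualizing metadata
        annotations={"Tissue": "lung"},                #   mentioned above
    )
    print(graph.number_of_edges())  # 1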

While BEL is an active research area, with Tracks in the BioCreative VI Community Challenges (see also) and preceding BioCreative Tracks, the approach appears to have gained little traction outside that community. For example, PubMed contains only a handful of BEL-related papers that are not focused on the development of BEL itself. For more information on BEL, see:

  • Training and Evaluation Corpora for the Extraction of Causal Relationships Encoded in Biological Expression Language (BEL) (2016), which described the corpus prepared for the BioCreative V BEL track and also provides an excellent review of the Biological Expression Language:

    PMC4995071-f1.png

    [Image source. Click image to open in new window.]


  • Saved by the BEL: Ringing in a Common Language for the Life Sciences (2012):

    BEL2012-f2.png

    [Image source. Click image to open in new window.]


    BEL2012-f3.png

    [Image source. Click image to open in new window.]


  • BEL 2.0 Specification, home of the BEL Language Documentation v2.0;
  • OpenBEL;
  • PyBEL Python ecosystem;

  • Navigating the Disease Landscape: Knowledge Representations for Contextualizing Molecular Signatures (Apr 2018): the “Contextualization by Knowledge Graph Representations” subsection, pp. 8-9, summarizes some biomedical applications of BEL:

    Saqi2018-f1.png

    [Image source. Click image to open in new window.]


    Saqi2018-f2.png

    [Image source. Click image to open in new window.]


    Saqi2018-f3.png

    [Image source. Click image to open in new window.]


    Saqi2018-f4.png

    [Image source. Click image to open in new window.]


  • The Causal Biological Network database (Causal Biological Network Database: A Comprehensive Platform of Causal Biological Network Models Focused on the Pulmonary and Vascular Systems (2015)), “… a set of biological network models scripted in BEL that reflect causal signaling pathways across a wide range of biological processes, including cell fate, cell stress, cell proliferation, inflammation, tissue repair and angiogenesis in the pulmonary and cardiovascular context. This comprehensive collection of networks is now freely available to the scientific community in a centralized web-based repository, composed of over 120 manually curated and well annotated biological network models supported by over 80,000 unique pieces of evidence from the scientific literature.”

    PMC4401337-f2.png

    [Image source. Click image to open in new window.]


    PMC4401337-f3.png

    [Image source. Click image to open in new window.]


[Table of Contents]

Knowledge Discovery

KG are ideally suited for knowledge discovery. Notable examples in the biomedical domain include the following papers.

In Exploiting Semantic Patterns Over Biomedical Knowledge Graphs for Predicting Treatment and Causative Relations (Jun 2018) [pdf], the authors first built a large knowledge graph of biomedical relations obtained from the National Library of Medicine (NLM)’s Unified Medical Language System (UMLS) Metathesaurus. They then took a different approach to predicting potential relations between arbitrary pairs of biomedical entities: rather than using NLP approaches that look at individual sentences to extract a potential relation, they exploited semantic path patterns over this graph to build models for specific predicates. That is, instead of modeling what a particular sentence conveys, they modeled their prediction problem at a global level, outputting probability estimates of whether a pair of entities participates in a particular relation; a different binary classification model was trained for each predicate (see the sketch below). While the approach was demonstrated using the “TREATS” and “CAUSES” predicates drawn from SemMedDB (the Semantic Medline Database), their method also generalized to other predicates (such as “DISRUPTS” and “PREVENTS”), and could complement other lexical and syntactic pattern-based distant supervision approaches for relation extraction.
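Operationally, this global approach reduces to featurizing each entity pair by the semantic path patterns connecting them in the KG and training one binary classifier per predicate. A toy scikit-learn sketch (the path-pattern counts, pair labels and feature names below are invented):

    # Toy per-predicate classifier over path-pattern counts (here, for TREATS).
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Path-pattern counts between entity pairs, e.g. a two-hop "ISA->TREATS" pattern.
    pairs = [
        {"TREATS": 3, "ISA->TREATS": 1},   # e.g., a (drug, disease) positive pair
        {"COEXISTS_WITH": 2},              # a negative pair
        {"ISA->TREATS": 2, "PREVENTS": 1},
    ]
    labels = [1, 0, 1]                     # does the TREATS relation hold?

    vec = DictVectorizer()
    X = vec.fit_transform(pairs)
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict_proba(vec.transform([{"ISA->TREATS": 1}]))[:, 1])  # P(TREATS)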

PMC6070294-excerpt.png

[Image source. Click image to open in new window.]


PMC6070294-f1+f2.png

[Image source. Click image to open in new window.]


PMC6070294-f3.png

[Image source. Click image to open in new window.]


PMC6070294-f4.png

[Image source. Click image to open in new window.]


MOLIERE: Automatic Biomedical Hypothesis Generation System (May 2017) [project; code] is a system that can identify connections within the biomedical literature. MOLIERE utilized a multi-modal/multi-relational network of biomedical objects (papers, keywords, genes, proteins, diseases, diagnoses) extracted from PubMed. MOLIERE finds the shortest path between two query keywords in the KG, then extends this path to identify a significant set of related abstracts (which, due to the network construction process, share common topics). Topic modeling, performed on these documents using PLDA+, returns a set of plain text topics representing concepts that likely connect the queried keywords, supporting hypothesis generation (for example, on historical findings MOLIERE showed the implicit link between Venlafaxine and HTR1A, and the involvement of DDX3 in Wnt signaling).
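A toy version of that query pipeline – shortest path between keyword nodes, collect the abstracts hanging off the path, then topic-model them – with gensim’s LDA standing in for the paper’s PLDA+ (the graph, abstracts and tokens are invented):

    # Toy MOLIERE-style query: shortest path, gather abstracts, topic-model them.
    import networkx as nx
    from gensim import corpora, models

    G = nx.Graph()
    G.add_edges_from([("venlafaxine", "serotonin"), ("serotonin", "HTR1A"),
                      ("venlafaxine", "abstract_1"), ("serotonin", "abstract_2")])
    abstract_tokens = {"abstract_1": ["serotonin", "reuptake", "inhibitor"],
                       "abstract_2": ["serotonin", "receptor", "binding"]}

    path = nx.shortest_path(G, "venlafaxine", "HTR1A")
    # Abstracts adjacent to the path nodes form the query-specific corpus.
    docs = [n for p in path for n in G.neighbors(p) if n.startswith("abstract")]

    texts = [abstract_tokens[d] for d in docs]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus, num_topics=1, id2word=dictionary, random_state=0)
    print(lda.print_topics())   # plain text topic linking the two query keywords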

arxiv1702.06176-f3descr.png

[Image source. Click image to open in new window.]


arxiv1702.06176-excerpt.png

[Image source. Click image to open in new window.]


arxiv1702.06176-f4.png

[Image source. Click image to open in new window.]


  • A: abstract layer; K: keyword layer:

    “In order to create edges between A and K, we used a simple metric of term frequency-inverse document frequency (TF-IDF). UMLS provides not only a list of keywords, but all known synonyms for each keyword. For example, the keyword Color C0009393 has the American spelling, the British spelling, and the pluralization of both defined as synonyms. Therefore we used the raw text abstracts and titles (before running the SPECIALIST NLP tools) to calculate tf-idf. In order to quickly count all occurrences of UMLS keywords across all synonyms, we implemented a simple parser. This was especially important because many keywords in UMLS are actually multi-word phrases such as ‘Clustered Regularly Interspaced Short Palindromic Repeats’ (a.k.a. CRISPR) C3658200 . …“

arxiv1702.06176-f5+f6.png

[Image source. Click image to open in new window.]
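The A-K edge weighting quoted above reduces to computing TF-IDF weights of UMLS keywords over abstracts; a scikit-learn sketch (synonym expansion and the SPECIALIST preprocessing are omitted, and the abstracts and keyword list are invented):

    # Abstract-to-keyword (A-K) edge weights via TF-IDF (toy data).
    from sklearn.feature_extraction.text import TfidfVectorizer

    abstracts = ["CRISPR repeats enable genome editing",
                 "color vision deficits and retinal genes"]
    keywords = ["crispr", "color"]             # stand-ins for UMLS keywords

    vec = TfidfVectorizer(vocabulary=keywords, lowercase=True)
    W = vec.fit_transform(abstracts)           # abstract-keyword edge weights
    print(W.toarray())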


Distilled Wasserstein Learning for Word Embedding and Topic Modeling (Sep 2018) proposed a novel Wasserstein method with a distillation mechanism, yielding joint learning of word embeddings and topics. The proposed method – distilled Wasserstein learning (DWL) – was based on the fact that the Euclidean distance between word embeddings may be employed as the underlying distance in the Wasserstein topic model. The word distributions of topics, their optimal transports to the word distributions of documents, and the embeddings of words were learned in a unified framework. Evaluated on patient admission records, the proposed method embedded disease codes and procedures and learned the admission topics, obtaining superior performance on clinically-meaningful disease network construction, mortality prediction as a function of admission codes, and procedure recommendation.
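The observation DWL builds on – that with word embeddings as the ground metric, document (dis)similarity becomes a Wasserstein distance – can be sketched with the POT library (pip install pot); the embeddings and uniform word distributions below are random stand-ins:

    # Wasserstein distance between two "documents" with a Euclidean ground metric.
    import numpy as np
    import ot  # Python Optimal Transport

    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(4, 16))   # embeddings of the 4 words in document A
    emb_b = rng.normal(size=(3, 16))   # embeddings of the 3 words in document B
    wa = np.full(4, 1 / 4)             # word distribution of document A (uniform)
    wb = np.full(3, 1 / 3)             # word distribution of document B

    M = ot.dist(emb_a, emb_b, metric="euclidean")  # Euclidean ground cost
    print(ot.emd2(wa, wb, M))                      # Wasserstein distance between A and B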

arXiv-1809.04705.png

[Image source. Click image to open in new window.]


arxiv1809.04705-t1+t2.png

[Image source. Click image to open in new window.]


arxiv1809.04705-f2+t3.png

[Image source. Click image to open in new window.]


arxiv1809.04705-f4.png

[Image source. Click image to open in new window.]


Beyond Word Embeddings: Learning Entity and Concept Representations from Large Scale Knowledge Bases (Aug 2018) [code/data (the URL for the code is broken)] described work that jointly learned concept vectors from a textual knowledge base (Wikipedia) and a graphical knowledge base (Probase), demonstrating superior results (their Table 1) on analogical reasoning and concept categorization. The authors employed the skip-gram model to seamlessly learn from the knowledge in Wikipedia text and the Probase concept graph, in an unsupervised approach to argument-type identification for neural semantic parsing.

Recall that:

  • the QA4IE question answering framework likewise processed input documents, along with a knowledge base (the Wikipedia Ontology), to produce high quality relation triples; and

  • ANALOGY also employed analogical inference (for link prediction).

arxiv1801.00388-f1.png

[Image source. Click image to open in new window.]


arxiv1801.00388-t1+t2.png

[Image source. Click image to open in new window.]


In Learning a Health Knowledge Graph from Electronic Medical Records (Jul 2017) [data], maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, a naive Bayes classifier, and a Bayesian network using noisy-OR gates. Logistic regression was chosen as an example of a well-established machine learning classifier with interpretable parameters that is frequently used for modeling binary outcomes. Naive Bayes was chosen as it provides a baseline of what can be inferred from simple pairwise co-occurrences. Noisy-OR was chosen as an example of a probabilistic model that jointly models diseases and symptoms; similar models have successfully been used in previous medical diagnosis applications. A graph of disease-symptom relationships was elicited from the learned parameters, and the constructed knowledge graphs were evaluated and validated, with permission, against Google’s manually-constructed knowledge graph and against expert physician opinions. The noisy-OR model significantly outperformed the other tested models, producing a high quality knowledge graph reaching a precision of 0.85 at a recall of 0.6 in the clinical evaluation.
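Since the noisy-OR gate is central to the comparison below, here is a minimal sketch of its likelihood (my own toy illustration, not the paper’s code): each present disease independently fails to cause the symptom with probability (1 - p_i), and a leak term accounts for unexplained causes.

    # Noisy-OR: P(symptom = 1 | diseases) with per-disease activation probs and a leak.
    import numpy as np

    def noisy_or(present, p, leak=0.01):
        fail = (1 - leak) * np.prod(1 - p[present])  # all present causes (and the leak) fail
        return 1 - fail

    p = np.array([0.8, 0.3, 0.5])                    # P(disease i alone causes the symptom)
    print(noisy_or(present=[0], p=p))                # one disease:  ~0.802
    print(noisy_or(present=[0, 2], p=p))             # two diseases: ~0.901 (support accumulates)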

PMC5519723-f1.png

[Image source. Click image to open in new window.]


PMC5519723-f2.png

[Image source. Click image to open in new window.]


PMC5519723-t1.png

[Image source. Click image to open in new window.]


PMC5519723-t3.png

[Image source. Click image to open in new window.]


  • Careful reading of that paper and its Appendix, which describes the assumptions associated with each of those three models, clarifies the approach. Diseases (and their effects: symptoms) were separately modeled by logistic regression and naive Bayes; additionally, in naive Bayes, all symptoms were assumed to be conditionally independent from one another. However, the noisy-OR model jointly models all diseases and symptoms [wherein the central assumption of independence of effects (symptoms) is maintained]. Noisy-OR is a conditional probability distribution that describes the causal mechanisms by which parent nodes (e.g. diseases) affect the states of child nodes (e.g. symptoms). For example, given that patients tend to present with few diseases, the presence of one disease typically lowers the probability of others. In the logistic regression and naive Bayes models, however, since each disease is modeled separately (resulting in an assumption of independence between diseases), the presence of one disease does not (for example) rule out or lessen the probability of another disease (i.e. diagnosis).

  • Author David Sontag also evaluated noisy-OR maximum likelihood estimation in Clinical Tagging with Joint Probabilistic Models (Sep 2016). See also Sontag’s paper, Unsupervised Learning of Noisy-Or Bayesian Networks (Sep 2013).

    arxiv1309.6834-f1+f2.png

    [Image source. Click image to open in new window.]


  • Code for Learning a Health Knowledge Graph from Electronic Medical Records is not available.

The noisy-OR model was also employed in a 2009 paper [different authors], Biomedical Discovery Acceleration, with Applications to Craniofacial Development:

“… the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data.

“Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data.

“An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.”

  • They also stated (paraphrased):

    “One of the most popular methods to combine individual reliabilities is to assume independence of experts (naive Bayes assumption) and compute the integrated likelihood P for each relationship using the Noisy-OR function … The Noisy-OR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support. This property is especially relevant in biology, where it is often difficult to identify false negatives; a given assertion is strengthened by additional information but unlike the case for estimating the reliability of an expert on the whole, an individual assertion is not penalized for lack of additional evidence. Moreover, since the experts are assumed to be independent, experts can be removed from or added to the analysis without excessive re-computation.”

PMC2653649-f1.png

[Image source. Click image to open in new window.]


PMC2653649-f2.png

[Image source. Click image to open in new window.]


PMC2653649-f4.png

[Image source. Click image to open in new window.]


PMC2653649-f7.png

[Image source. Click image to open in new window.]


PMC2653649-f9.png

[Image source. Click image to open in new window.]


PMC2653649-f11.png