Technical Review

Natural Language Processing

Last modified: 2018-12-11

Copyright notice, citation: Copyright
© 2018-present, Victoria A. Stuart

These Contents

[Table of Contents]


Natural language processing (NLP), a branch of machine learning (ML), is foundational to all information extraction and natural language tasks. Recent reviews of NLP relevant to this TECHNICAL REVIEW include:

Regarding the latter review, note my comments in the reddit thread Recent Trends in Deep Learning Based Natural Language Processing, which point to an “issue” affecting any review (or proposed work) in the NLP and machine learning domains: the extraordinarily rapid rate of progress. During the course of preparing this REVIEW, highly relevant literature and developments appeared almost daily in my RSS feeds and other sources. I firmly believe that this rapid progress represents an outstanding research opportunity rather than a barrier (e.g., proposing ML research that may quickly become “dated”).

Lastly, high-profile Ph.D. student/blogger Sebastian Ruder actively tracks progress in numerous subdomains of NLP at NLP Progress.

Basic steps associated with NLP include text retrieval and preprocessing steps, including:

Additional NLP preprocessing steps may be included (or some of the steps above may be omitted), and the order of some of those steps may vary slightly.
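A few steps of this kind can be sketched as follows. This is a minimal illustration, assuming a pipeline of sentence splitting, tokenization, lowercasing and stop-word removal; the choice of steps, the regular expressions, the tiny stop-word set and the function names are all assumptions made for the example, and production systems typically rely on a dedicated NLP library instead.

```python
import re

# Tiny illustrative stop-word set (an assumption; real pipelines use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "was", "were", "and", "to"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Keep runs of letters/digits as tokens, allowing internal hyphens (e.g. 'exon-19')."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)

def preprocess(text):
    """Sentence-split, tokenize, lowercase and drop stop words."""
    result = []
    for sentence in split_sentences(text):
        tokens = [t.lower() for t in tokenize(sentence)]
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result

doc = "All patients were treated with gefitinib. They showed a partial response."
print(preprocess(doc))
```

As the surrounding text notes, steps can be added, dropped or reordered; for instance, lowercasing is often skipped for named entity recognition, where capitalization carries signal.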

Some recent ML approaches to NLP tasks include:

  • Classification:
    • A Deep Neural Network Sentence Level Classification Method with Context Information
    • A Comparative Study of Neural Network Models for Sentence Classification
    • ML-Net: Multi-label Classification of Biomedical Texts with Deep Neural Networks [Summary] In multi-label text classification, each textual document can be assigned with one or more labels. Due to this nature, the multi-label text classification task is often considered to be more challenging compared to the binary or multi-class text classification problems. As an important task with broad applications in biomedicine such as assigning diagnosis codes, a number of different computational methods (e.g. training and combining binary classifiers for each label) have been proposed in recent years. … We propose ML-Net, a novel deep learning framework, for multi-label classification of biomedical texts. As an end-to-end system, ML-Net combines a label prediction network with an automated label count prediction mechanism to output an optimal set of labels by leveraging both predicted confidence score of each label and the contextual information in the target document. We evaluate ML-Net on three independent, publicly-available corpora in two kinds of text genres: biomedical literature and clinical notes. …
    • Inline Detection of Domain Generation Algorithms with Context-Sensitive Word Embeddings [Summary] … We propose a novel approach that combines context-sensitive word embeddings with a simple fully-connected classifier to perform classification of domains based on word-level information. The word embeddings were pretrained on a large unrelated corpus and left frozen during the training on domain data. … We show that this architecture reliably outperformed existing techniques on wordlist-based DGA families with just 30 DGA training examples and achieved state-of-the-art performance with around 100 DGA training examples, all while requiring an order of magnitude less time to train compared to current techniques. Of special note is the technique’s performance on the matsnu DGA: the classifier attained an 89.5% detection rate with a 1:1,000 false positive rate (FPR) after training on only 30 examples of the DGA domains, and a 91.2% detection rate with a 1:10,000 FPR after 90 examples. Considering that some of these DGAs have wordlists of several hundred words, our results demonstrate that this technique does not rely on the classifier learning the DGA wordlists. Instead, the classifier is able to learn the semantic signatures of the wordlist-based DGA families.
    • Comparative Document Summarisation via Classification [Summary] [Dec 2018; code] This paper considers extractive summarisation in a comparative setting: given two or more document groups (e.g., separated by publication time), the goal is to select a small number of documents that are representative of each group, and also maximally distinguishable from other groups. We formulate a set of new objective functions for this problem that connect recent literature on document summarisation, interpretable machine learning, and data subset selection. In particular, by casting the problem as a binary classification amongst different groups, we derive objectives based on the notion of maximum mean discrepancy, as well as a simple yet effective gradient-based optimisation strategy. Our new formulation allows scalable evaluations of comparative summarisation as a classification task, both automatically and via crowd-sourcing. To this end, we evaluate comparative summarisation methods on a newly curated collection of controversial news topics over 13 months. We observe that gradient-based optimisation outperforms discrete and baseline approaches in 15 out of 24 different automatic evaluation settings. In crowd-sourced evaluations, summaries from gradient optimisation elicit 7% more accurate classification from human workers than discrete optimisation. Our result contrasts with recent literature on submodular data subset selection that favours discrete optimisation. We posit that our formulation of comparative summarisation will prove useful in a diverse range of use cases such as comparing content sources, authors, related topics, or distinct view points.
  • Clustering:
  • Constituency parsing:
  • Coreference resolution:
  • Embeddings:

  • Event detection/extraction:
    • Joint Extraction of Events and Entities within a Document Context [Summary] Events and entities are closely related; entities are often actors or participants in events and events without entities are uncommon. The interpretation of events and entities is highly contextually dependent. Existing work in information extraction typically models events separately from entities, and performs inference at the sentence level, ignoring the rest of the document. In this paper, we propose a novel approach that models the dependencies among variables of events, entities, and their relations, and performs joint inference of these variables across a document. The goal is to enable access to document-level contextual information and facilitate context-aware predictions. We demonstrate that our approach substantially outperforms the state-of-the-art methods for event extraction as well as a strong baseline for entity extraction. [Code]
    • Semi-Supervised Event Extraction with Paraphrase Clusters
    • Event Detection with Neural Networks: A Rigorous Empirical Evaluation
    • Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation
    • Biomedical Event Extraction Based on GRU Integrating Attention Mechanism
    • One for All: Neural Joint Modeling of Entities and Events [Summary] “The previous work for event extraction has mainly focused on the predictions for event triggers and argument roles, treating entity mentions as being provided by human annotators. This is unrealistic as entity mentions are usually predicted by some existing toolkits whose errors might be propagated to the event trigger and argument role recognition. Few of the recent work has addressed this problem by jointly predicting entity mentions, event triggers and arguments. However, such work is limited to using discrete engineering features to represent contextual information for the individual tasks and their interactions. In this work, we propose a novel model to jointly perform predictions for entity mentions, event triggers and arguments based on the shared hidden representations from deep learning. The experiments demonstrate the benefits of the proposed method, leading to the state-of-the-art performance for event extraction.”

      “… In the future, we plan to improve the end-to-end model so EE can be solved from just the raw sentences and the word embeddings.”

Again, that is not an exhaustive list – merely some articles that I have recently encountered that are relevant to my interests.
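As one concrete illustration from the list above, the ML-Net summary describes combining per-label confidence scores with a separately predicted label count. The sketch below contrasts that idea with a fixed confidence threshold. The scores, label names and function names are hypothetical, and the count is simply passed in as a number; nothing here reproduces ML-Net's actual networks.

```python
def select_by_threshold(scores, threshold=0.5):
    """Baseline: keep every label whose confidence reaches a fixed threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)

def select_by_count(scores, predicted_count):
    """Keep the `predicted_count` highest-scoring labels.

    `predicted_count` stands in for the output of a separate label-count
    predictor (an assumption for this sketch).
    """
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return sorted(ranked[:max(predicted_count, 0)])

# Hypothetical confidence scores for one document.
scores = {"neoplasms": 0.91, "humans": 0.48, "mutation": 0.44, "mice": 0.03}

print(select_by_threshold(scores))   # only labels scoring >= 0.5
print(select_by_count(scores, 3))    # the three most confident labels
```

With a fixed threshold, the two labels scoring just under 0.5 are lost; a per-document count lets the cut-off adapt to how many labels the document appears to need.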


NLP: Selected Papers

Cross-sentence $\small n$-ary relation extraction detects relations among $\small n$ entities across multiple sentences. Typical methods formulate an input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The current state of the art method splits the input graph into two DAGs [directed acyclic graphs], adopting a DAG-structured LSTM for each. Although this approach can model rich linguistic knowledge by leveraging graph edges, important information can be lost in the splitting procedure. Song et al. (August 2018: N-ary Relation Extraction using Graph State LSTM [code]) proposed a graph-state LSTM model, which used a parallel state to model each word, recurrently enriching state values via message passing. Compared with the DAG LSTM, their graph LSTM kept the original graph structure and sped up computation by allowing more parallelization. For example, given

“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.”

… their model conveyed the fact that cancers caused by the L858E mutation in the EGFR gene can respond to the anticancer drug gefitinib: the three entity mentions appeared in separate sentences yet formed a ternary relation. On a standard benchmark, their model outperformed a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state of the art system of Peng et al. (2017) by 1.2%.
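To make the document-graph formulation concrete, the sketch below builds a toy graph over shortened versions of the two example sentences and checks that the drug mention is reachable from the gene and mutation mentions only through an inter-sentential edge. The edge types used here (links between adjacent words, plus one link between consecutive sentences) are a deliberate simplification assumed for illustration; the papers discussed use richer dependency and discourse edges, and nothing here reproduces the graph-state LSTM itself.

```python
from collections import deque

# Shortened, pre-tokenized versions of the two example sentences (illustrative).
sentences = [
    ["L858E", "mutation", "in", "EGFR", "was", "noted"],
    ["patients", "treated", "with", "gefitinib", "showed", "response"],
]

def build_document_graph(sentences, cross_sentence=True):
    """Return an adjacency dict; nodes are (sentence_idx, token_idx) pairs."""
    adjacency = {}

    def link(a, b):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    for s, tokens in enumerate(sentences):
        for t in range(len(tokens) - 1):
            link((s, t), (s, t + 1))                # intra-sentential edge
        if cross_sentence and s + 1 < len(sentences):
            link((s, len(tokens) - 1), (s + 1, 0))  # inter-sentential edge
    return adjacency

def shortest_path_length(adjacency, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

egfr, drug = (0, 3), (1, 3)
full = build_document_graph(sentences)
intra_only = build_document_graph(sentences, cross_sentence=False)
print(shortest_path_length(full, egfr, drug))        # reachable via the cross-sentence edge
print(shortest_path_length(intra_only, egfr, drug))  # None: no path within one sentence
```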

Song et al.’s code was an implementation of Peng et al.’s Cross-Sentence N-ary Relation Extraction with Graph LSTMs (different authors/project; project/code), modified with regard to the edge labels (discussed by Song et al. in their paper).



Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction (Nov 2018) proposed a combined model of Long Short Term Memory and Convolutional Neural Networks (LSTM-CNN) that exploited word embeddings and positional embeddings for cross-sentence $\small n$-ary relation extraction. The proposed model brought together the properties of both LSTMs and CNNs to simultaneously exploit long-range sequential information and capture the most informative features, both of which are essential for cross-sentence $\small n$-ary relation extraction. Their LSTM-CNN model was evaluated on standard datasets for cross-sentence $\small n$-ary relation extraction, where it significantly outperformed baselines such as CNNs, LSTMs and a combined CNN-LSTM model. The paper also showed that the LSTM-CNN model outperformed the then state-of-the-art methods on cross-sentence $\small n$-ary relation extraction.

  • “However, relations can exist between more than two entities that appear across consecutive sentences. For example, in the text span comprising the two consecutive sentences in $\small \text{Listing 1}$, there exists a ternary relation response across three entities: $\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitinib}$ appearing across sentences. This relation extraction task, focusing on identifying relations between more than two entities – either appearing in a single sentence or across sentences, is known as cross-sentence $\small n$-ary relation extraction.

      $\small \text{Listing 1: Text span of two consecutive sentences}$

      'The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.'

    “This paper focuses on the cross-sentence $\small n$-ary relation extraction task. Formally, let $\small \{e_1, \ldots ,e_n\}$ be the set of entities in a text span $\small S$ containing $\small t$ number of consecutive sentences. For example, in the text span comprising 2 sentences ($\small t = 2$) in $\small \text{Listing 1}$ above, given cancer patients with mutation $\small v$ (L858E) in gene $\small g$ (EGFR), the patients showed a partial response to drug $\small d$ (gefitinib). Thus, a ternary relation response ($\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitinib}$) exists among the three entities spanning across the two sentences in $\small \text{Listing 1}$.”
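The formal setup quoted above (entities $\small e_1, \ldots, e_n$ in a span $\small S$ of $\small t$ consecutive sentences) maps naturally onto a small data structure. The sketch below assumes a hypothetical `RelationInstance` class and plain substring matching to locate mentions; both are illustrative choices rather than anything taken from the paper, and the drug name is spelled "gefitinib" as in the earlier quotation of the same passage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationInstance:
    label: str          # relation name, e.g. "response"
    entities: tuple     # (e_1, ..., e_n)
    sentences: tuple    # the t consecutive sentences of span S

    @property
    def arity(self):
        """n, the number of entities taking part in the relation."""
        return len(self.entities)

    def sentences_mentioning(self, entity):
        """Indices of sentences that mention the entity (plain substring match)."""
        return [i for i, s in enumerate(self.sentences) if entity in s]

    def is_cross_sentence(self):
        """True when no single sentence mentions every entity."""
        return not any(all(e in s for e in self.entities) for s in self.sentences)

listing_1 = (
    "The deletion mutation on exon-19 of EGFR gene was present in 16 patients, "
    "while the L858E point mutation on exon-21 was noted in 10.",
    "All patients were treated with gefitinib and showed a partial response.",
)
instance = RelationInstance("response", ("EGFR", "L858E", "gefitinib"), listing_1)
print(instance.arity, len(instance.sentences), instance.is_cross_sentence())
```

Here $\small n = 3$ and $\small t = 2$: the gene and mutation are mentioned in the first sentence and the drug in the second, so the relation can only be recovered by reading across the sentence boundary.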



Likewise, Neural Relation Extraction Within and Across Sentence Boundaries (Oct 2018) proposed an architecture for relation extraction between entity pairs spanning multiple sentences: inter-sentential dependency-based neural networks (iDepNN). iDepNN modeled the shortest and augmented dependency paths via recurrent and recursive neural networks to extract relationships within (intra-) and across (inter-) sentence boundaries. Compared to SVM and neural network baselines, iDepNN was more robust to false positives in relationships spanning sentences. The authors evaluated their model on four datasets from the newswire (MUC6) and medical (BioNLP shared tasks) domains, where it achieved state of the art performance and showed a better balance of precision and recall for inter-sentential relationships – performing better than the 11 teams participating in the BioNLP Shared Task 2016, with a relative gain of 5.2% (0.587 vs 0.558) in $\small F_1$ over the winning team. They also released the cross-sentence annotations for MUC6.
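The 5.2% figure quoted above is a relative gain; in absolute terms the difference is 2.9 $\small F_1$ points. The short check below reproduces both numbers from the two reported scores and includes the standard definition of $\small F_1$ as the harmonic mean of precision and recall. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a percentage."""
    return 100.0 * (new - old) / old

idepnn_f1, winner_f1 = 0.587, 0.558  # F1 values reported in the text

print(round(relative_gain(idepnn_f1, winner_f1), 1))  # relative gain, in percent
print(round(100 * (idepnn_f1 - winner_f1), 1))        # absolute gain, in F1 points
```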



Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) proposed a novel segmental hypergraph representation to model overlapping entity mentions that are prevalent in many practical datasets. They showed that their model was able to capture features and interactions (the model was robust in handling both overlapping and non-overlapping mentions) that could not be captured by previous models, while maintaining a low time complexity for inference.

In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (above), Learning to Recognize Discontiguous Entities (Oct 2018) studied the recognition of discontiguous entities, which may also overlap with one another. They proposed a novel hypergraph representation to jointly encode discontiguous entities of unbounded length. Empirical results showed that their model achieved significantly better results when evaluated on standard data containing many discontiguous entities.
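To see why overlapping and discontiguous mentions are awkward for a one-tag-per-token scheme, it helps to write the mentions down explicitly. The sketch below represents each mention as a set of token segments; the `Mention` class, the example phrase and its labels are hypothetical, and this is only a representation of the target output, not the hypergraph models proposed in the two papers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    label: str
    segments: tuple  # (start, end) token offsets, end exclusive; assumed sorted and non-adjacent

    def token_indices(self):
        return {i for start, end in self.segments for i in range(start, end)}

    def is_discontiguous(self):
        return len(self.segments) > 1

def overlaps(a, b):
    """Two mentions overlap when they share at least one token."""
    return bool(a.token_indices() & b.token_indices())

def surface(tokens, mention):
    """Text of the mention, skipping any tokens in its gaps."""
    return " ".join(tokens[i] for i in sorted(mention.token_indices()))

tokens = ["severe", "muscle", "pain", "and", "fatigue"]
pain = Mention("Symptom", ((1, 3),))             # "muscle pain"
fatigue = Mention("Symptom", ((1, 2), (4, 5)))   # "muscle ... fatigue"

print(surface(tokens, pain), "|", surface(tokens, fatigue))
print(overlaps(pain, fatigue), fatigue.is_discontiguous())
```

Both mentions claim the token "muscle", so a single tag on that token cannot record that it begins two different entities, one of which resumes after a gap.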

Most modern Information Extraction (IE) systems are implemented as sequential taggers and focus on modelling local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. GraphIE: A Graph-Based Framework for Information Extraction (Oct 2018) is a framework that operates over a graph representing both local and non-local dependencies between textual units (i.e. words or sentences). The algorithm propagated information between connected nodes through graph convolutions and exploited the richer representation to improve word level predictions. Results on three different tasks – social media, textual and visual information extraction – showed that GraphIE outperformed a competitive baseline (BiLSTM+CRF) in all tasks by a significant margin.
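The propagation step can be illustrated with a stripped-down, parameter-free variant of a graph convolution, in which each node's feature vector is replaced by the mean of its own vector and those of its neighbours. This is a sketch under strong assumptions (no learned weights, no non-linearity, a toy three-node graph with made-up features) intended only to show how information reaches a node from text units it is not directly connected to; it is not GraphIE's actual layer.

```python
def propagate(features, edges, steps=1):
    """Mean-aggregate each node with its neighbours (self included), `steps` times."""
    neighbours = {node: {node} for node in features}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    current = dict(features)
    for _ in range(steps):
        updated = {}
        for node, group in neighbours.items():
            vectors = [current[n] for n in group]
            # Column-wise mean over the node's own vector and its neighbours' vectors.
            updated[node] = [sum(col) / len(vectors) for col in zip(*vectors)]
        current = updated
    return current

# Toy chain graph w0 - w1 - w2 with two-dimensional features (illustrative values).
features = {"w0": [1.0, 0.0], "w1": [0.0, 0.0], "w2": [0.0, 1.0]}
edges = [("w0", "w1"), ("w1", "w2")]

one = propagate(features, edges, steps=1)
two = propagate(features, edges, steps=2)
print(one["w0"], two["w0"])
```

After one step, `w0` has seen only `w1`, so its second component is still zero; after two steps, part of the signal from `w2` has arrived, which is the sense in which stacked propagation steps expose non-local context.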



While character-based neural models have proven useful for many NLP tasks, there is a gap of sophistication between methods for learning representations of sentences and words. While most character models for learning representations of sentences are deep and complex, models for learning representations of words are shallow and simple. Also, it is still not clear which kind of architecture is the best for capturing character-to-word representations. To address these questions, Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018) first investigated the gaps between methods for learning word and sentence representations. Furthermore, they proposed IntNet, a funnel-shaped wide convolutional neural architecture with no down-sampling for learning representations of the internal structure of words by composing their characters from limited, supervised training corpora. They evaluated their model on six sequence labeling datasets, including named entity recognition, part-of-speech tagging, and syntactic chunking. IntNet significantly outperformed other character embedding models and obtained new state of the art performance without relying on any external knowledge or resources.



Natural Language Processing:

Additional Reading

  • Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (2016) [code; discussion]

    “Relation classification is an important semantic processing task in the field of natural language processing (NLP). State of the art systems still rely on lexical resources such as WordNet or NLP systems like dependency parser and named entity recognizers (NER) to get high-level features. Another challenge is that important information can appear at any position in the sentence. To tackle these problems, we propose Attention-Based Bidirectional Long Short-Term Memory Networks (Att-BLSTM) to capture the most important semantic information in a sentence. The experimental results on the SemEval-2010 relation classification task show that our method outperforms most of the existing methods, with only word vectors.”



  • Gradual Machine Learning for Entity Resolution (Oct 2018)

    “Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning [GML], which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.”




    [Figure legend. GML: gradual machine learning (this paper); UR: unsupervised rule-based; UC: unsupervised clustering; SVM: support vector machine; DNN: deep neural network.]

    Our evaluation is conducted on three real datasets, which are described as follows:

    • DS (DBLP-Scholar): The DS dataset contains the publication entities from DBLP and the publication entities from Google Scholar. The experiments match the DBLP entries with the Scholar entries.
    • AB (Abt-Buy): The AB dataset contains the product entities from both Abt and Buy. The experiments match the Abt entries with the Buy entries.
    • SG (Songs): The SG dataset contains song entities, some of which refer to the same songs. The experiments match the song entries in the same table.
  • Neural CRF Transducers for Sequence Labeling (Nov 2018)

    “Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) are developed to implement the non-linear node potentials in CRFs, but still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consists of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experiment results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all the four tasks, and can improve state-of-the-art results.”



  • Comparison of Named Entity Recognition Methodologies in Biomedical Documents (Nov 2018)

    “Background. Biomedical named entity recognition (Bio-NER) is a fundamental task in handling biomedical text terms, such as RNA, protein, cell type, cell line, and DNA. Bio-NER is one of the most elementary and core tasks in biomedical knowledge discovery from texts. The system described here is developed by using the BioNLP/NLPBA 2004 shared task. Experiments are conducted on a training and evaluation set provided by the task organizers.
    Results. Our results show that, compared with a baseline having a 70.09% F1 score, the RNN Jordan-type and Elman-type algorithms have $\small F_1$ scores of approximately 60.53% and 58.80%, respectively. When we use CRF as a machine learning algorithm, CCA, GloVe, and Word2Vec have $\small F_1$ of 72.73%, 72.74%, and 72.82%, respectively.”

    “In this paper, we use five categories (protein, DNA, RNA, cell type, and cell line) instead of the categories used in the ordinary NER process. An example of the NER tagged sentence is as follows: ‘IL-2 [ B-protein ] responsiveness requires three distinct elements [ B-DNA ] within the enhancer [ B-DNA ].’”


  • Quantifying Uncertainties in Natural Language Processing Tasks (Nov 2018)

    “Reliable uncertainty quantification is a first step towards building explainable, transparent, and accountable artificial intelligent systems. Recent progress in Bayesian deep learning has made such quantification realizable. In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful at enhancing model performances in various NLP tasks.”

    “In this work, we evaluate the benefits of quantifying uncertainties in modern neural network models applied in the context of three different natural language processing tasks. We conduct experiments on sentiment analysis, named entity recognition, and language modeling tasks with convolutional and recurrent neural network models. We show that by quantifying both uncertainties, model performances are improved across the three tasks. We further investigate the characteristics of inputs with high and low data uncertainty measures in Yelp 2013 and CoNLL 2003 datasets. For both datasets, our model estimates higher data uncertainties for more difficult predictions.”


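The two kinds of uncertainty mentioned in the last item are often estimated with a decomposition used in Bayesian deep learning: run several stochastic forward passes (for example, with dropout left on at test time), have each pass return a predicted mean and a predicted variance, then take the variance of the means as the model uncertainty and the average predicted variance as the data uncertainty. Whether this matches the paper's exact estimators is an assumption, and the numbers below are made up for illustration.

```python
from statistics import mean, pvariance

def decompose_uncertainty(passes):
    """Split predictive uncertainty into model and data parts.

    passes: list of (predicted_mean, predicted_variance) pairs, one per
    stochastic forward pass over the same input.
    """
    means = [m for m, _ in passes]
    variances = [v for _, v in passes]
    model_uncertainty = pvariance(means)  # disagreement between passes
    data_uncertainty = mean(variances)    # noise level the model itself predicts
    return model_uncertainty, data_uncertainty

# Hypothetical outputs for two inputs, three passes each.
stable = [(0.90, 0.01), (0.90, 0.01), (0.90, 0.01)]
unstable = [(0.20, 0.30), (0.80, 0.30), (0.50, 0.30)]

print(decompose_uncertainty(stable))
print(decompose_uncertainty(unstable))
```

In this toy case the first input has no disagreement between passes and low predicted noise, while the second is uncertain on both counts, in line with the quoted observation that harder predictions receive higher estimated data uncertainty.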