Technical Review

Natural Language Processing

Last modified: 2019-08-16

Copyright notice, citation: Copyright
© 2018-present, Victoria A. Stuart


Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. NLP is foundational to all information extraction and natural language tasks.


Recent reviews of NLP relevant to this TECHNICAL REVIEW include:

Regarding the latter review, note my comments in the reddit thread Recent Trends in Deep Learning Based Natural Language Processing, which point to an “issue” affecting any review (or proposed work) in the NLP and machine learning domains: the extraordinarily rapid rate of progress. During the course of preparing this REVIEW, highly relevant literature and developments appeared almost daily in my RSS feeds and other sources. I firmly believe that this rapid progress represents outstanding research opportunities rather than barriers (e.g., proposing ML research that may quickly become “dated”).

Lastly, high-profile Ph.D. student/blogger Sebastian Ruder actively tracks progress in numerous subdomains in the NLP domain at NLP Progress  (alternate link).

Basic steps associated with NLP include text retrieval and preprocessing steps, including:

Additional NLP preprocessing steps may be included (or some of the steps above may be omitted), and the order of some of those steps may vary slightly.
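As a minimal illustration of what such preprocessing can look like, the sketch below chains three common steps: sentence splitting, tokenization with lowercasing, and stopword removal. The choice of steps, the regular expressions, and the tiny `STOPWORDS` set are my assumptions for illustration only; production pipelines typically rely on a dedicated NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "was", "were", "to"}


def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase, then keep runs of letters/digits, allowing internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def preprocess(text):
    """Sentence-split, tokenize, and drop stopwords; returns one token list per sentence."""
    return [
        [tok for tok in tokenize(sentence) if tok not in STOPWORDS]
        for sentence in split_sentences(text)
    ]


text = ("All patients were treated with gefitinib. "
        "The deletion mutation on exon-19 was present in 16 patients.")
tokens = preprocess(text)
```

Reordering or omitting a step (for example, keeping stopwords for a parser) only requires changing `preprocess`.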

Some recent ML approaches to NLP tasks include:

Again, that is not an exhaustive list – merely some articles that I have recently encountered that are relevant to my interests.

NLP: Selected Papers

Cross-sentence $\small n$-ary relation extraction detects relations among $\small n$ entities across multiple sentences. Typical methods formulate an input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The then-current state of the art method split the input graph into two DAGs [directed acyclic graphs], adopting a DAG-structured LSTM for each. Although that approach can model rich linguistic knowledge by leveraging graph edges, important information can be lost in the splitting procedure. Song et al. (August 2018: N-ary Relation Extraction using Graph State LSTM  [code]) proposed a graph-state LSTM model, which used a parallel state to model each word, recurrently enriching state values via message passing. Compared with DAG LSTM, their graph LSTM kept the original graph structure, and sped up computation by allowing more parallelization. For example, given

“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.”

… their model conveyed the fact that cancers caused by the L858E mutation in the EGFR gene can respond to the anticancer drug gefitinib: the three entity mentions appeared in separate sentences yet formed a ternary relation. On a standard benchmark, their model outperformed a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state of the art system of Peng et al. (2017) by 1.2%.

Song et al.’s code was an implementation of Peng et al.’s Cross-Sentence N-ary Relation Extraction with Graph LSTMs (different authors/project; [project/code]), modified in regard to the edge labels (discussed by Song et al. in their paper).
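To make the "recurrent message passing" idea concrete, here is a heavily simplified sketch. It is not Song et al.'s model: the LSTM gating is replaced by a plain average, each node carries a single float instead of a vector, and the four-node chain graph is hypothetical. It shows only that, with synchronous updates, information from one node reaches a node $\small k$ edges away after $\small k$ steps.

```python
def message_passing_step(states, neighbours):
    """One synchronous update: each node mixes its own state with the mean of its neighbours' states."""
    new_states = {}
    for node, value in states.items():
        adjacent = neighbours.get(node, [])
        incoming = sum(states[n] for n in adjacent) / len(adjacent) if adjacent else 0.0
        new_states[node] = 0.5 * value + 0.5 * incoming
    return new_states


def run(states, neighbours, steps):
    """Apply the update `steps` times; all nodes are updated in parallel at each step."""
    for _ in range(steps):
        states = message_passing_step(states, neighbours)
    return states


# Hypothetical chain graph linking mentions across two sentences.
neighbours = {
    "EGFR": ["L858E"],
    "L858E": ["EGFR", "patients"],
    "patients": ["L858E", "gefitinib"],
    "gefitinib": ["patients"],
}
initial = {"EGFR": 1.0, "L858E": 0.0, "patients": 0.0, "gefitinib": 0.0}

after_two = run(initial, neighbours, 2)    # signal has not yet reached "gefitinib"
after_three = run(initial, neighbours, 3)  # three edges away, so three steps suffice
```

Because every node is updated from the previous step's states, the per-step work is independent across nodes, which is the property that permits parallelization.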


Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction (Nov 2018) proposed a combined model of Long Short Term Memory and Convolutional Neural Networks (LSTM-CNN) that exploited word embeddings and positional embeddings for cross-sentence $\small n$-ary relation extraction. The proposed model brings together the properties of both LSTMs and CNNs, to simultaneously exploit long-range sequential information and capture the most informative features, which are essential for cross-sentence $\small n$-ary relation extraction. Their LSTM-CNN model was evaluated on standard datasets for cross-sentence $\small n$-ary relation extraction, where it significantly outperformed baselines such as CNNs, LSTMs and also a combined CNN-LSTM model. The paper also showed that the LSTM-CNN model outperforms the current state-of-the-art methods on cross-sentence $\small n$-ary relation extraction.

  • “However, relations can exist between more than two entities that appear across consecutive sentences. For example, in the text span comprising the two consecutive sentences in $\small \text{Listing 1}$, there exists a ternary relation response across three entities: $\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitinib}$ appearing across sentences. This relation extraction task, focusing on identifying relations between more than two entities – either appearing in a single sentence or across sentences, is known as cross-sentence $\small n$-ary relation extraction.

      $\small \text{Listing 1: Text span of two consecutive sentences}$

      'The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.'

    “This paper focuses on the cross-sentence $\small n$-ary relation extraction task. Formally, let $\small \{e_1, \ldots ,e_n\}$ be the set of entities in a text span $\small S$ containing $\small t$ number of consecutive sentences. For example, in the text span comprising 2 sentences ($\small t = 2$) in $\small \text{Listing 1}$ above, given cancer patients with mutation $\small v$ (L858E) in gene $\small g$ (EGFR), the patients showed a partial response to drug $\small d$ (gefitinib). Thus, a ternary relation response ($\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitinib}$) exists among the three entities spanning across the two sentences in $\small \text{Listing 1}$.”
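The formal setup quoted above (entities $\small e_1, \ldots, e_n$ in a span of $\small t$ sentences) maps naturally onto a small data structure. The sketch below is my own illustration; the class and field names are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    sentence_index: int  # position of the containing sentence within the span S


@dataclass(frozen=True)
class RelationInstance:
    label: str
    entities: Tuple[Entity, ...]

    @property
    def arity(self) -> int:
        """Number of participating entities (the n in n-ary)."""
        return len(self.entities)

    @property
    def is_cross_sentence(self) -> bool:
        """True when the entities are spread over more than one sentence."""
        return len({e.sentence_index for e in self.entities}) > 1


# The ternary "response" relation from Listing 1 (t = 2 sentences).
response = RelationInstance(
    label="response",
    entities=(Entity("EGFR", 0), Entity("L858E", 0), Entity("gefitinib", 1)),
)
```

Here `response.arity` is 3 and `response.is_cross_sentence` is true, because the drug mention sits in the second sentence while the gene and mutation sit in the first.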


Likewise, Neural Relation Extraction Within and Across Sentence Boundaries (Oct 2018) proposed an architecture for relation extraction in entity pairs spanning multiple sentences: inter-sentential dependency-based neural networks (iDepNN). iDepNN modeled the shortest and augmented dependency paths via recurrent and recursive neural networks to extract relationships within (intra-) and across (inter-) sentence boundaries. Compared to SVM and neural network baselines, iDepNN was more robust to false positives in relationships spanning sentences. The authors evaluated their model on four datasets from the newswire (MUC6) and medical (BioNLP shared tasks) domains, achieving state of the art performance and a better balance of precision and recall for inter-sentential relationships. It performed better than the 11 teams participating in the BioNLP Shared Task 2016, with a relative gain of 5.2% (0.587 vs 0.558) in $\small F_1$ over the winning team. They also released the cross-sentence annotations for MUC6.
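A quick check of the arithmetic behind that comparison: the 5.2% figure is a relative gain, whereas the absolute difference is 2.9 $\small F_1$ points. The helper names below are mine, and `f1_score` is the standard harmonic mean of precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new, old):
    """Fractional improvement of `new` over `old`."""
    return (new - old) / old


absolute_points = 100 * (0.587 - 0.558)        # about 2.9 F1 points
gain_pct = 100 * relative_gain(0.587, 0.558)   # about 5.2 percent, as reported
```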


Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) proposed a novel segmental hypergraph representation to model overlapping entity mentions that are prevalent in many practical datasets. They showed that their model was able to capture features and interactions (the model was robust in handling both overlapping and non-overlapping mentions) that could not be captured by previous models, while maintaining a low time complexity for inference.

In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (above), Learning to Recognize Discontiguous Entities (Oct 2018) focused on recognizing discontiguous entities, which can also overlap one another. They proposed a novel hypergraph representation to jointly encode discontiguous entities of unbounded length, which could overlap with one another. Empirical results showed that their model achieved significantly better results when evaluated on standard data with many discontiguous entities.

Most modern Information Extraction (IE) systems are implemented as sequential taggers and focus on modelling local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. GraphIE: A Graph-Based Framework for Information Extraction (Oct 2018) [code] is a framework that operates over a graph representing both local and non-local dependencies between textual units (i.e. words or sentences). The algorithm propagated information between connected nodes through graph convolutions and exploited the richer representation to improve word level predictions. Results on three different tasks – social media, textual and visual information extraction – showed that GraphIE outperformed a competitive baseline (BiLSTM+CRF) in all tasks by a significant margin.
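The sketch below illustrates the kind of graph such a framework could operate over. It is an assumption-laden toy, not GraphIE's construction: I take word nodes, add "local" edges between adjacent words in a sentence, and add "non-local" edges between repeated occurrences of the same (lowercased) token. The function name and the edge rule are hypothetical.

```python
def build_edges(sentences):
    """Return (local_edges, non_local_edges); nodes are (sentence_idx, word_idx) pairs."""
    local, non_local = set(), set()
    seen = {}  # lowercased token -> nodes where it has already occurred
    for s_idx, words in enumerate(sentences):
        for w_idx, word in enumerate(words):
            node = (s_idx, w_idx)
            if w_idx > 0:
                # Local edge: link each word to its predecessor in the same sentence.
                local.add(((s_idx, w_idx - 1), node))
            key = word.lower()
            for earlier in seen.get(key, []):
                # Non-local edge: link repeated mentions of the same token.
                non_local.add((earlier, node))
            seen.setdefault(key, []).append(node)
    return local, non_local


sentences = [["Gefitinib", "was", "given"], ["Patients", "tolerated", "gefitinib"]]
local, non_local = build_edges(sentences)
```

A graph convolution would then propagate features along both edge sets, so that a tagging decision for one mention can draw on the context of another mention elsewhere in the document.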


While character-based neural models have proven useful for many NLP tasks, there is a gap of sophistication between methods for learning representations of sentences and words. While most character models for learning representations of sentences are deep and complex, models for learning representations of words are shallow and simple. Also, it is still not clear which kind of architecture is the best for capturing character-to-word representations. To address these questions, Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018) first investigated the gaps between methods for learning word and sentence representations. Furthermore, they proposed IntNet, a funnel-shaped wide convolutional neural architecture with no down-sampling for learning representations of the internal structure of words by composing their characters from limited, supervised training corpora. They evaluated their model on six sequence labeling datasets, including named entity recognition, part-of-speech tagging, and syntactic chunking. IntNet significantly outperformed other character embedding models and obtained new state of the art performance without relying on any external knowledge or resources.


State-of-the-art studies have demonstrated the superiority of joint modelling over pipeline implementation for medical named entity recognition and normalization due to the mutual benefits between the two processes. A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization (Dec 2018) [code  (empty repo, 2018-12-17)] proposed a novel deep neural multitask learning framework with explicit feedback strategies to jointly model recognition and normalization. Their method benefitted from the general representations of both tasks provided by multitask learning, and successfully converted hierarchical tasks into a parallel multitask setting while maintaining the mutual support between tasks. Their method performed significantly better than state of the art approaches on two publicly available medical literature datasets.


Natural Language Processing: Additional Reading

  • Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (2016) [code  |  discussion]

    “Relation classification is an important semantic processing task in the field of natural language processing (NLP). State of the art systems still rely on lexical resources such as WordNet or NLP systems like dependency parser and named entity recognizers (NER) to get high-level features. Another challenge is that important information can appear at any position in the sentence. To tackle these problems, we propose Attention-Based Bidirectional Long Short-Term Memory Networks (Att-BLSTM) to capture the most important semantic information in a sentence. The experimental results on the SemEval-2010 relation classification task show that our method outperforms most of the existing methods, with only word vectors.”


  • Gradual Machine Learning for Entity Resolution (Oct 2018)

    “Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning [GML], which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.”


    [Figure legend. GML: gradual machine learning (this paper); UR: unsupervised rule-based; UC: unsupervised clustering; SVM: support vector machine; DNN: deep neural network (deep learning).]

    Our evaluation is conducted on three real datasets, which are described as follows:

    • DS (DBLP-Scholar): The DS dataset contains the publication entities from DBLP and the publication entities from Google Scholar. The experiments match the DBLP entries with the Scholar entries.
    • AB (Abt-Buy): The AB dataset contains the product entities from both Abt and Buy. The experiments match the Abt entries with the Buy entries.
    • SG (Songs): The SG dataset contains song entities, some of which refer to the same songs. The experiments match the song entries in the same table.

  • Neural CRF Transducers for Sequence Labeling (Nov 2018)

    “Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) are developed to implement the non-linear node potentials in CRFs, but still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consists of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experiment results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all the four tasks, and can improve state-of-the-art results.”


  • Comparison of Named Entity Recognition Methodologies in Biomedical Documents (Nov 2018)

    “Background. Biomedical named entity recognition (Bio-NER) is a fundamental task in handling biomedical text terms, such as RNA, protein, cell type, cell line, and DNA. Bio-NER is one of the most elementary and core tasks in biomedical knowledge discovery from texts. The system described here is developed by using the BioNLP/NLPBA 2004 shared task. Experiments are conducted on a training and evaluation set provided by the task organizers.
    Results. Our results show that, compared with a baseline having a 70.09% F1 score, the RNN Jordan-type and Elman-type algorithms have $\small F_1$ scores of approximately 60.53% and 58.80%, respectively. When we use CRF as a machine learning algorithm, CCA, GloVe, and word2vec have $\small F_1$ of 72.73%, 72.74%, and 72.82%, respectively.”

    “In this paper, we use five categories (protein, DNA, RNA, cell type, and cell line) instead of the categories used in the ordinary NER process. An example of the NER tagged sentence is as follows: ‘IL-2 [ B-protein ] responsiveness requires three distinct elements [ B-DNA ] within the enhancer [ B-DNA ].’”


  • Quantifying Uncertainties in Natural Language Processing Tasks (Nov 2018)

    “Reliable uncertainty quantification is a first step towards building explainable, transparent, and accountable artificial intelligent systems. Recent progress in Bayesian deep learning has made such quantification realizable. In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful at enhancing model performances in various NLP tasks.”

    “In this work, we evaluate the benefits of quantifying uncertainties in modern neural network models applied in the context of three different natural language processing tasks. We conduct experiments on sentiment analysis, named entity recognition, and language modeling tasks with convolutional and recurrent neural network models. We show that by quantifying both uncertainties, model performances are improved across the three tasks. We further investigate the characteristics of inputs with high and low data uncertainty measures in Yelp 2013 and CoNLL 2003 datasets. For both datasets, our model estimates higher data uncertainties for more difficult predictions.”


  • Bayesian Compression for Natural Language Processing (Dec 2018) [code]

    “In natural language processing, a lot of the tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, which size grows proportionally to the vocabulary length. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN dozens or hundreds of times without time-consuming hyperparameters tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable”


  • An Attention-Based BiLSTM-CRF Approach To Document-Level Chemical Named Entity Recognition (Apr 2018) [code] proposed a neural network approach, an attention-based bidirectional long short-term memory with a conditional random field layer (Att-BiLSTM-CRF), to document-level chemical NER. The approach leveraged document-level global information obtained by attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. Their method used word and character embeddings as basic features. In addition, to investigate the effects of traditional features for deep learning methods, POS, chunking and dictionary features were added into the models as additional features. Att-BiLSTM-CRF achieved better performance with little feature engineering than other state of the art methods on the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus and the BioCreative V chemical-disease relation (CDR) task corpus (F-scores of 91.14 and 92.57%, respectively).


  • LSTMVoter: Chemical Named Entity Recognition Using a Conglomerate of Sequence Labeling Tools (Jan 2019) [code] introduced LSTMVoter, a bidirectional long short-term memory (LSTM) tagger that utilized a conditional random field (CRF) layer in conjunction with attention-based feature modeling. Their approach explored information about features that is modeled by means of an attention mechanism. LSTMVoter outperformed each individual extractor integrated into it. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieved an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieved an F1-score of 89.01%. [This model is very similar to but outperformed by the model described in Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition.]


  • Invariant Information Clustering for Unsupervised Image Classification and Segmentation (University of Oxford: Jul 2018, updated Mar 2019) [code  |  discussion (reddit)]

    “We present a novel clustering objective that learns a neural network classifier from scratch, given only unlabelled data samples. The model discovers clusters that accurately match semantic classes, achieving state-of-the-art results in eight unsupervised clustering benchmarks spanning image classification and segmentation. These include STL10, an unsupervised variant of ImageNet, and CIFAR10, where we significantly beat the accuracy of our closest competitors by 8 and 9.5 absolute percentage points respectively. The method is not specialised to computer vision and operates on any paired dataset samples; in our experiments we use random transforms to obtain a pair from each image. The trained network directly outputs semantic labels, rather than high dimensional representations that need external processing to be usable for semantic clustering. The objective is simply to maximise mutual information between the class assignments of each pair. It is easy to implement and rigorously grounded in information theory, meaning we effortlessly avoid degenerate solutions that other clustering methods are susceptible to. In addition to the fully unsupervised mode, we also test two semi-supervised settings. The first achieves 88.8% accuracy on STL10 classification, setting a new global state-of-the-art over all existing methods (whether supervised, semi supervised or unsupervised). The second shows robustness to 90% reductions in label coverage, of relevance to applications that wish to make use of small amounts of labels.  [GitHub]”
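The objective mentioned in that abstract, the mutual information between the class assignments of each pair, can be computed from a joint distribution over the two assignments. The sketch below uses the standard definition $\small I(z, z') = \sum_{c, c'} P_{cc'} \ln \frac{P_{cc'}}{P_c P_{c'}}$ on a hand-written joint matrix; how the paper estimates that matrix from network outputs is not shown, and the function name is mine.

```python
import math


def mutual_information(joint):
    """joint[c][c2] = P(z = c, z' = c2); entries are assumed to sum to 1."""
    row = [sum(r) for r in joint]  # marginal P(z = c)
    col = [sum(joint[i][j] for i in range(len(joint)))
           for j in range(len(joint[0]))]  # marginal P(z' = c2)
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (row[i] * col[j]))
    return mi


aligned = [[0.5, 0.0], [0.0, 0.5]]          # paired samples always get the same class
independent = [[0.25, 0.25], [0.25, 0.25]]  # assignments carry no shared information

mi_aligned = mutual_information(aligned)          # equals ln 2 for two balanced classes
mi_independent = mutual_information(independent)  # equals 0
```

The two extremes show why maximising this quantity discourages degenerate solutions: putting everything into one cluster, or assigning clusters at random, both give zero mutual information.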


RMDL: Random Multimodel Deep Learning for Classification (May 2018) [code  |  discussion (reddit)]

  • “This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RMDL can accept as input a variety data to include text, video, images, and symbolic. This paper describes RMDL and shows test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20newsgroup. These test results show that RMDL produces consistently better performance than standard methods over a broad range of data types and classification problems.”
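One simple way to combine an ensemble's outputs is a majority vote over the per-model predictions. The sketch below assumes that combination rule for illustration (the abstract does not spell out the rule), and the labels are made up.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """predictions_per_model: one list of labels per model, aligned by example index."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the top label; ties go to the label seen first.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three independently trained models on three documents.
model_outputs = [
    ["sports", "politics", "tech"],
    ["sports", "tech", "tech"],
    ["science", "politics", "tech"],
]
ensemble = majority_vote(model_outputs)
```

Each model is wrong on a different document here, yet the vote recovers the label that two of the three models agree on in every case.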


Neural Vector Conceptualization for Word Vector Space Interpretation (Apr 2019) [code]

“Distributed word vector spaces are considered hard to interpret which hinders the understanding of natural language processing (NLP) models. In this work, we introduce a new method to interpret arbitrary samples from a word vector space. To this end, we train a neural model to conceptualize word vectors, which means that it activates higher order concepts it recognizes in a given vector. Contrary to prior approaches, our model operates in the original vector space and is capable of learning non-linear relations between word vectors and concepts. Furthermore, we show that it produces considerably less entropic concept activation profiles than the popular cosine similarity.”


Simple BERT Models for Relation Extraction and Semantic Role Labeling (Apr 2019)

  • “We present simple BERT-based models for relation extraction and semantic role labeling. In recent years, state-of-the-art performance has been achieved using neural models by incorporating lexical and syntactic features such as part-of-speech tags and dependency trees. In this paper, extensive experiments on datasets for these two tasks show that without using any external features, a simple BERT-based model can achieve state-of-the-art performance. To our knowledge, we are the first to successfully apply BERT in this manner. Our models provide strong baselines for future research.”

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents (2018) [code  |  mention (reddit)]

  • “Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers [PubMed; arXiv] show that our model significantly outperforms state-of-the-art models.”

Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding (Apr 2019) established “strong baselines for event temporal relation extraction on two under-explored story narrative datasets… We demonstrate that neural network-based models can outperform some strong traditional linguistic feature-based models. We also conduct comparative studies to show the contribution of adopting contextualized word embeddings (BERT) for event temporal relation extraction from stories….”

X-BERT: eXtreme Multi-label Text Classification with BERT (May 2019) [“all the data split and source codes will be made publicly available”]

  • “Extreme multi-label text classification (XMC) aims to tag each input text with the most relevant labels from an extremely large label set, such as those that arise in product categorization and e-commerce recommendation. Recently, pretrained language representation models such as BERT achieve remarkable state-of-the-art performance across a wide range of NLP tasks including sentence classification among small label sets (typically fewer than thousands). Indeed, there are several challenges in applying BERT to the XMC problem. The main challenges are: (i) the difficulty of capturing dependencies and correlations among labels, whose features may come from heterogeneous sources, and (ii) the tractability to scale to the extreme label setting as the model size can be very large and scale linearly with the size of the output space. To overcome these challenges, we propose X-BERT, the first feasible attempt to finetune BERT models for a scalable solution to the XMC problem. Specifically, X-BERT leverages both the label and document text to build label representations, which induces semantic label clusters in order to better model label dependencies. At the heart of X-BERT is finetuning BERT models to capture the contextual relations between input text and the induced label clusters. Finally, an ensemble of the different BERT models trained on heterogeneous label clusters leads to our best final model. Empirically, on a Wiki dataset with around 0.5 million labels, X-BERT achieves new state-of-the-art results where the precision@1 reaches 67.80%, a substantial improvement over 32.58%/60.91% of deep learning baseline fastText and competing XMC approach Parabel, respectively. This amounts to a 11.31% relative improvement over Parabel, which is indeed significant since the recent approach SLICE only has 5.53% relative improvement.”
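For readers unfamiliar with the metric, precision@k is the fraction of the top-k ranked labels that are actually relevant; precision@1 therefore asks only whether the single highest-ranked label is correct. The sketch below uses made-up labels, and the last line reproduces the quoted 11.31% relative improvement from the two precision@1 values in the abstract.

```python
def precision_at_k(ranked_labels, relevant_labels, k):
    """Fraction of the k highest-ranked predicted labels that are in the relevant set."""
    top_k = ranked_labels[:k]
    return sum(1 for label in top_k if label in relevant_labels) / k


# Hypothetical ranking for one product description, with its true label set.
ranked = ["shoes", "boots", "hats"]
relevant = {"shoes", "hats"}

p_at_1 = precision_at_k(ranked, relevant, 1)  # top label is relevant
p_at_3 = precision_at_k(ranked, relevant, 3)  # two of the top three are relevant

# Relative improvement of 67.80 over Parabel's 60.91, in percent.
gain_over_parabel = 100 * (67.80 - 60.91) / 60.91
```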


Enhancing Clinical Concept Extraction with Contextual Embedding (May 2019)

  • “Neural network-based representations (“embeddings”) have dramatically advanced natural language processing (NLP) tasks, including clinical NLP tasks such as concept extraction. Recently, however, more advanced embedding methods and representations (e.g., ELMo, BERT) have further pushed the state-of-the-art in NLP, yet there are no common best practices for how to integrate these representations into clinical tasks. The purpose of this study, then, is to explore the space of possible options in utilizing these new models for clinical concept extraction, including comparing these to traditional word embedding methods (word2vec, GloVe, fastText). Both off-the-shelf, open-domain embeddings and pre-training clinical embeddings from MIMIC-III are evaluated. We explore a battery of embedding methods consisting of traditional word embeddings and contextual embeddings, and compare these on four concept extraction corpora: i2b2 2010, i2b2 2012, SemEval 2014, and SemEval 2015. We also analyze the impact of the pre-training time of a large language model like ELMo or BERT on the extraction performance. Finally, we present an intuitive way to understand the semantic information encoded by contextual embeddings. Contextual embeddings pre-trained on a large clinical corpus achieves new state-of-the-art performances across all concept extraction tasks. The best-performing model outperforms all state-of-the-art methods with respective $\small F_1$-measures of 90.25, 93.18 (partial), 80.74, and 81.65. We demonstrate the potential of contextual embeddings through the state-of-the-art performance these methods achieve on clinical concept extraction. Additionally, we demonstrate contextual embeddings encode valuable semantic information not accounted for in traditional word representations.”

CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor (May 2019)

  • “Extracting key information from documents, such as receipts or invoices, and preserving the interested texts to structured data is crucial in the document-intensive streamline processes of office automation in areas that includes but not limited to accounting, financial, and taxation areas. To avoid designing expert rules for each specific type of document, some published works attempt to tackle the problem by learning a model to explore the semantic context in text sequences based on the Named Entity Recognition (NER) method in the NLP field. In this paper, we propose to harness the effective information from both semantic meaning and spatial distribution of texts in documents. Specifically, our proposed model, Convolutional Universal Text Information Extractor (CUTIE), applies convolutional neural networks on gridded texts where texts are embedded as features with semantical connotations. We further explore the effect of employing different structures of convolutional neural network and propose a fast and portable structure. We demonstrate the effectiveness of the proposed method on a dataset with up to 4,484 labelled receipts, without any pre-training or post-processing, achieving state of the art performance that is much higher than BERT but with only 1/10 parameters and without requiring the 3,300M word dataset for pre-training. Experimental results also demonstrate that the CUTIE being able to achieve state of the art performance with much smaller amount of training data.”

Using Ontologies To Improve Performance In Massively Multi-label Prediction Models (Stanford | Google Brain: May 2019)

  • “Massively multi-label prediction/classification problems arise in environments like health-care or biology where very precise predictions are useful. One challenge with massively multi-label problems is that there is often a long-tailed frequency distribution for the labels, which results in few positive examples for the rare labels. We propose a solution to this problem by modifying the output layer of a neural network to create a Bayesian network of sigmoids which takes advantage of ontology relationships between the labels to help share information between the rare and the more common labels. We apply this method to the two massively multi-label tasks of disease prediction (ICD-9 codes) and protein function prediction (Gene Ontology terms) and obtain significant improvements in per-label AUROC and average precision for less common labels.”
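
The "Bayesian network of sigmoids" can be illustrated with a small sketch. I assume here a tree-shaped ontology (one parent per label) in which each sigmoid output is read as the probability of a label given its parent, so that a label's marginal probability is the product of the conditionals along its path to the root. The function names, the toy ontology and the logits are hypothetical, and the paper's actual formulation may differ (for example, in how it handles labels with several parents).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def marginal_probabilities(logits, parent):
    """Turn per-label logits into marginals using the ontology.

    logits: dict label -> raw network output for that label.
    parent: dict label -> parent label, or None for a root.
    Each sigmoid is treated as P(label | parent); the marginal is the
    product of these conditionals from the label up to its root.
    """
    cache = {}

    def prob(label):
        if label not in cache:
            conditional = sigmoid(logits[label])
            up = parent[label]
            cache[label] = conditional if up is None else conditional * prob(up)
        return cache[label]

    return {label: prob(label) for label in logits}


# Hypothetical three-level fragment of a disease ontology.
parent = {"disease": None, "infection": "disease", "rare_infection": "infection"}
logits = {"disease": 2.0, "infection": 0.5, "rare_infection": -1.0}
probs = marginal_probabilities(logits, parent)
```

Under this reading a rare label can never be more probable than its ancestors, and a positive example for a rare label also acts as a positive example for every ancestor, which is one way information could be shared between rare and common labels.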


Transforming Complex Sentences into a Semantic Hierarchy (Jun 2019) [code | Graphene project code]

  • “We present an approach for recursively splitting and rephrasing complex English sentences into a novel semantic hierarchy of simplified sentences, with each of them presenting a more regular structure that may facilitate a wide variety of artificial intelligence tasks, such as machine translation (MT) or information extraction (IE). … the proposed syntactic simplification approach outperforms the state of the art in structural text simplification. Moreover, an extrinsic evaluation shows that when applying our framework as a preprocessing step the performance of state-of-the-art Open IE systems can be improved by up to 346% in precision and 52% in recall. …”


Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Aug 2018) [project, code]

  • “We introduce a multi-task setup of identifying entities, relations, and coreference clusters in scientific articles. We create SCIERC, a dataset that includes annotations for all three tasks and develop a unified framework called SCIIE with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.”


Document-Level $\small N$-ary Relation Extraction with Multiscale Representation Learning (Stanford University | Microsoft Research: Jun 2019) [code: “Our code and data will be available at”]  “Most information extraction methods focus on binary relations expressed within single sentences. In high-value domains, however, $\small n$-ary relations are of great demand (e.g., drug-gene-mutation interactions in precision oncology). Such relations often involve entity mentions that are far apart in the document, yet existing work on cross-sentence relation extraction is generally confined to small text spans (e.g., three consecutive sentences), which severely limits recall. In this paper, we propose a novel multiscale neural architecture for document-level $\small n$-ary relation extraction. Our system combines representations learned over various text spans throughout the document and across the subrelation hierarchy. Widening the system’s purview to the entire document maximizes potential recall. Moreover, by integrating weak signals across the document, multiscale modeling increases precision, even in the presence of noisy labels from distant supervision. Experiments on biomedical machine reading show that our approach substantially outperforms previous $\small n$-ary relation extraction methods.”


NSEEN: Neural Semantic Embedding for Entity Normalization (Jun 2019).  “Much of human knowledge is encoded in text, available in scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization, which consists of mapping noisy entity mentions in text to canonical entities in well-known reference sets. … we have developed a general, scalable solution based on a deep Siamese neural network model to embed the semantic information about the entities, as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and empirically show the effectiveness of our framework in challenging bio-entity normalization datasets.”
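
The lookup step of entity normalization (embed a noisy mention, then return the nearest canonical entry) can be sketched as follows. Note that this sketch swaps in character-trigram count vectors for the learned Siamese embeddings described in the abstract, purely so that it runs without a trained model; the function names and the tiny reference set are hypothetical.

```python
import math
from collections import Counter


def trigram_embedding(name):
    """Stand-in embedding: counts of character trigrams (NOT a Siamese network)."""
    text = f"  {name.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def normalize_entity(mention, reference_set):
    """Return the canonical entry whose embedding is closest to the mention."""
    query = trigram_embedding(mention)
    return max(reference_set, key=lambda entry: cosine(query, trigram_embedding(entry)))


# Hypothetical reference set of canonical names.
reference_set = ["epidermal growth factor receptor", "tumor protein p53", "insulin"]
match = normalize_entity("Epidermal growth-factor receptor", reference_set)
```

With learned embeddings the structure stays the same, but the similarity would also capture semantic variation (synonyms, abbreviations) that surface trigrams cannot, and a large reference set would normally be searched with an approximate nearest-neighbour index rather than a linear scan.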


MinIE: Minimizing Facts in Open Information Extraction (2017) [code].  “The goal of Open Information Extraction (OIE) is to extract surface relations and their arguments from natural-language text in an unsupervised, domain-independent manner. In this paper, we propose MinIE, an OIE system that aims to provide useful, compact extractions with high precision and recall. MinIE approaches these goals by (1) representing information about polarity, modality, attribution, and quantities with semantic annotations instead of in the actual extraction, and (2) identifying and removing parts that are considered overly specific. We conducted an experimental study with several real-world datasets and found that MinIE achieves competitive or higher precision and recall than most prior systems, while at the same time producing shorter, semantically enriched extractions.”

  • MinIE was the best performing IE system in: WiRe57: A Fine-Grained Benchmark for Open Information Extraction (University of Montreal: Aug 2019) [code].  “We build a reference for the task of Open Information Extraction, on five documents. We tentatively resolve a number of issues that arise, including inference and granularity. We seek to better pinpoint the requirements for the task. We produce our annotation guidelines specifying what is correct to extract and what is not. In turn, we use this reference to score existing Open IE systems. We address the non-trivial problem of evaluating the extractions produced by systems against the reference tuples, and share our evaluation script. Among seven compared extractors, we find the MinIE system to perform best.”


  • Hamming Sentence Embeddings for Information Retrieval (RheinMain University of Applied Sciences: Aug 2019) [code | results].  “In retrieval applications, binary hashes are known to offer significant improvements in terms of both memory and speed. We investigate the compression of sentence embeddings using a neural encoder-decoder architecture, which is trained by minimizing reconstruction error. Instead of employing the original real-valued embeddings, we use latent representations in Hamming space produced by the encoder for similarity calculations. In quantitative experiments on several benchmarks for semantic similarity tasks, we show that our compressed hamming embeddings yield a comparable performance to uncompressed embeddings (Sent2Vec, InferSent, Glove-BoW), at compression ratios of up to 256:1. We further demonstrate that our model strongly decorrelates input features, and that the compressor generalizes well when pre-trained on Wikipedia sentences. We publish the source code on Github and all experimental results.”
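
The memory and speed benefit of binary codes comes from comparing them with XOR and a bit count. The sketch below shows only that retrieval step, and it assumes a plain sign threshold to obtain the bits; in the paper the codes come from a trained encoder, which this example does not reproduce. The function names and the toy 4-dimensional vectors are hypothetical.

```python
def binarize(vector):
    """Pack one bit per dimension (1 if the value is positive) into an int.

    Stand-in for the learned encoder: a simple sign threshold.
    """
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, corpus_codes):
    """Return corpus indices ordered from nearest to farthest code."""
    return sorted(
        range(len(corpus_codes)),
        key=lambda i: hamming_distance(query_code, corpus_codes[i]),
    )


# Toy real-valued "sentence embeddings" (4 dimensions for readability).
query = [0.9, -0.2, 0.4, -0.7]
corpus = [
    [0.8, -0.1, 0.3, -0.5],   # same sign pattern as the query
    [-0.6, 0.2, 0.5, -0.1],   # differs in two dimensions
    [-0.9, 0.3, -0.4, 0.6],   # differs in all four dimensions
]
query_code = binarize(query)
corpus_codes = [binarize(v) for v in corpus]
ranking = rank_by_hamming(query_code, corpus_codes)
```

Each code here occupies one bit per dimension, compared with 32 bits per dimension for single-precision floats, which is where the storage saving of binary hashes originates.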