Technical Review

Natural Language Processing

Last modified: 2019-06-24


Copyright © 2018-present, Victoria A. Stuart





NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages and, in particular, with programming computers to fruitfully process large natural language corpora. NLP is foundational to all information extraction and natural language tasks.

[Image: nlp_terms.png (image source: slide 8)]


Recent reviews of NLP relevant to this TECHNICAL REVIEW include:

Regarding the latter review, note my comments in the reddit thread Recent Trends in Deep Learning Based Natural Language Processing, which highlight an “issue” affecting any review (or proposed work) in the NLP and machine learning domains: the extraordinarily rapid rate of progress. During the course of preparing this REVIEW, highly relevant literature and developments appeared almost daily on arXiv.org, in my RSS feeds, and elsewhere. I firmly believe that this rapid progress represents outstanding research opportunities rather than barriers (e.g., the risk that proposed ML research may quickly become “dated”).

Lastly, high-profile Ph.D. student/blogger Sebastian Ruder actively tracks progress across numerous NLP subdomains at NLP Progress  (alternate link).

Basic steps associated with NLP include text retrieval and preprocessing, such as:

Additional NLP preprocessing steps may be included (or some of the steps above may be omitted), and the order of some of those steps may vary slightly.

Some recent ML approaches to NLP tasks include:

Again, that is not an exhaustive list – merely some articles that I have recently encountered that are relevant to my interests.



NLP: Selected Papers

Cross-sentence $\small n$-ary relation extraction detects relations among $\small n$ entities across multiple sentences. Typical methods formulate the input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The previous state-of-the-art method split the input graph into two DAGs [directed acyclic graphs], adopting a DAG-structured LSTM for each; though able to model rich linguistic knowledge by leveraging graph edges, that splitting procedure can lose important information. Song et al. (August 2018: N-ary Relation Extraction using Graph State LSTM  [code]) proposed a graph-state LSTM model, which used a parallel state to model each word, recurrently enriching state values via message passing. Compared with the DAG LSTM, their graph LSTM kept the original graph structure, and sped up computation by allowing more parallelization. For example, given

“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.”

… their model conveyed the fact that cancers caused by the L858E mutation in the EGFR gene can respond to the anticancer drug gefitinib: the three entity mentions appeared in separate sentences yet formed a ternary relation. On a standard benchmark, their model outperformed a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state-of-the-art system of Peng et al. (2017) by 1.2%.

Song et al.’s code builds on the implementation accompanying Peng et al.’s Cross-Sentence N-ary Relation Extraction with Graph LSTMs (different authors/project; project / code), modified with regard to the edge labels (discussed by Song et al. in their paper).
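
A minimal PyTorch sketch of the graph-state message-passing idea follows (illustrative only, not Song et al.'s implementation; a GRU cell stands in for the paper's gated LSTM-style update): every word node keeps a state that is refreshed, in parallel, from the states of its neighbours in the document graph.

import torch
import torch.nn as nn

class GraphStateLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # a GRU cell stands in for the paper's LSTM-style gated update
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, states, adjacency):
        # states:    (num_nodes, dim) current hidden state of every word node
        # adjacency: (num_nodes, num_nodes) 0/1 matrix over document-graph edges
        degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        messages = adjacency @ states / degree      # mean of neighbour states
        return self.cell(messages, states)          # every node updated in parallel

dim, n = 32, 6
states = torch.randn(n, dim)                        # e.g. initial word representations
adj = torch.zeros(n, n)
adj[0, 1] = adj[1, 0] = 1                           # toy dependency edge between words 0 and 1
layer = GraphStateLayer(dim)
for _ in range(3):                                  # a few rounds of message passing
    states = layer(states, adj)
print(states.shape)                                 # torch.Size([6, 32])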

[Images: arxiv-1808.09101a.png, arxiv-1808.09101b.png, arxiv-1808.09101c.png, arxiv-1808.09101d.png]


Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction (Nov 2018) proposed a combined Long Short Term Memory and Convolutional Neural Network model (LSTM-CNN) that exploits word embeddings and positional embeddings for cross-sentence $\small n$-ary relation extraction. The model brings together the properties of both LSTMs and CNNs, simultaneously exploiting long-range sequential information and capturing the most informative features, both of which are essential for cross-sentence $\small n$-ary relation extraction. Evaluated on standard datasets for this task, the LSTM-CNN model significantly outperformed CNN, LSTM and combined CNN-LSTM baselines; the paper also showed that the LSTM-CNN model outperforms the current state-of-the-art methods on cross-sentence $\small n$-ary relation extraction.

  • “However, relations can exist between more than two entities that appear across consecutive sentences. For example, in the text span comprising the two consecutive sentences in $\small \text{Listing 1}$, there exists a ternary relation response across three entities: $\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitinib}$, appearing across sentences. This relation extraction task, focusing on identifying relations between more than two entities – either appearing in a single sentence or across sentences – is known as cross-sentence $\small n$-ary relation extraction.”

      $\small \text{Listing 1: Text span of two consecutive sentences}$

      'The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.'

    “This paper focuses on the cross-sentence $\small n$-ary relation extraction task. Formally, let $\small \{e_1, \ldots ,e_n\}$ be the set of entities in a text span $\small S$ containing $\small t$ consecutive sentences. For example, in the text span comprising 2 sentences ($\small t = 2$) in $\small \text{Listing 1}$ above, given cancer patients with mutation $\small v$ (L858E) in gene $\small g$ (EGFR), the patients showed a partial response to drug $\small d$ (gefitinib). Thus, a ternary relation response ($\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitinib}$) exists among the three entities spanning the two sentences in $\small \text{Listing 1}$.”
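
The following is a rough sketch of the LSTM-CNN combination described above (layer sizes, the positional-embedding scheme, and the classifier head are assumptions, not the paper's exact configuration): a BiLSTM captures long-range sequential context over word plus position embeddings, and a convolution with max-pooling then extracts the most informative local features.

import torch
import torch.nn as nn

class LSTMCNN(nn.Module):
    def __init__(self, vocab=1000, emb=50, pos_emb=10, hidden=64, classes=2, max_dist=100):
        super().__init__()
        self.word = nn.Embedding(vocab, emb)
        self.pos = nn.Embedding(2 * max_dist, pos_emb)      # distance to a target entity
        self.lstm = nn.LSTM(emb + 2 * pos_emb, hidden, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(hidden, classes)

    def forward(self, tokens, d1, d2):
        x = torch.cat([self.word(tokens), self.pos(d1), self.pos(d2)], dim=-1)
        h, _ = self.lstm(x)                                  # long-range sequential context
        h = self.conv(h.transpose(1, 2))                     # local n-gram features
        h = torch.relu(h).max(dim=2).values                  # most informative feature per filter
        return self.out(h)

model = LSTMCNN()
tokens = torch.randint(0, 1000, (2, 30))
d1 = torch.randint(0, 200, (2, 30))                          # positions relative to entity 1
d2 = torch.randint(0, 200, (2, 30))                          # positions relative to entity 2
print(model(tokens, d1, d2).shape)                           # torch.Size([2, 2])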

[Images: arxiv1811.00845-f1.png, arxiv1811.00845-t4+t5.png]


Likewise, Neural Relation Extraction Within and Across Sentence Boundaries (Oct 2018) proposed an architecture for relation extraction between entity pairs spanning multiple sentences: inter-sentential dependency-based neural networks (iDepNN). iDepNN modeled the shortest and augmented dependency paths via recurrent and recursive neural networks to extract relationships within (intra-) and across (inter-) sentence boundaries. Compared to SVM and neural network baselines, iDepNN was more robust to false positives in relationships spanning sentences. The authors evaluated their model on four datasets from the newswire (MUC6) and medical (BioNLP shared task) domains, achieving state-of-the-art performance and a better balance of precision and recall for inter-sentential relationships – performing better than the 11 teams participating in the BioNLP Shared Task 2016, with a gain of 5.2% (0.587 vs. 0.558) in $\small F_1$ over the winning team. They also released their cross-sentence annotations for MUC6.
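
To make the inter-sentential dependency path concrete, here is a toy illustration using networkx (the tokens and edges are hand-written assumptions rather than parser output): the dependency trees of the two sentences from the running example are joined through their roots, and the shortest path between the entity tokens is what an iDepNN-SDP-style recurrent network would encode.

import networkx as nx

g = nx.Graph()
# sentence 1 dependency edges (head, dependent), rooted at "present"
g.add_edges_from([("present", "mutation"), ("mutation", "deletion"),
                  ("mutation", "EGFR"), ("present", "patients")])
# sentence 2 edges, rooted at "treated"
g.add_edges_from([("treated", "patients2"), ("treated", "gefitinib"),
                  ("treated", "response")])
# connect adjacent sentences through their roots
g.add_edge("present", "treated")

print(nx.shortest_path(g, "EGFR", "gefitinib"))
# ['EGFR', 'mutation', 'present', 'treated', 'gefitinib']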

[Images: arxiv1810.05102a.png, arxiv1810.05102b.png, arxiv1810.05102c.png]


Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) proposed a novel segmental hypergraph representation to model overlapping entity mentions that are prevalent in many practical datasets. They showed that their model was able to capture features and interactions (the model was robust in handling both overlapping and non-overlapping mentions) that could not be captured by previous models, while maintaining a low time complexity for inference.

In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (above), Learning to Recognize Discontiguous Entities (Oct 2018) focused on recognizing discontiguous entities, which can also overlap with one another. The authors proposed a novel hypergraph representation to jointly encode such discontiguous, potentially overlapping entities of unbounded length. Empirical results showed that their model achieved significantly better results when evaluated on standard data containing many discontiguous entities.

Most modern information extraction (IE) systems are implemented as sequential taggers and focus on modelling local dependencies; non-local and non-sequential context is, however, a valuable source of information for improving predictions. GraphIE: A Graph-Based Framework for Information Extraction (Oct 2018) [code] is a framework that operates over a graph representing both local and non-local dependencies between textual units (i.e., words or sentences). The algorithm propagates information between connected nodes through graph convolutions and exploits the richer representation to improve word-level predictions. Results on three different tasks – social media, textual and visual information extraction – showed that GraphIE outperformed a competitive baseline (BiLSTM+CRF) on all tasks by a significant margin.
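
A minimal sketch of the propagation step (simplified; in GraphIE the step is interleaved with BiLSTM encoder/decoder layers, and edge types carry their own weights):

import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One propagation step over the document graph (illustrative sizes)."""
    def __init__(self, dim):
        super().__init__()
        self.self_w = nn.Linear(dim, dim)     # transform of the node's own encoding
        self.neigh_w = nn.Linear(dim, dim)    # transform of aggregated neighbour context

    def forward(self, h, adj):
        # h: (nodes, dim) encodings of textual units; adj: (nodes, nodes) 0/1 edges
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ h / deg                 # aggregate non-local context
        return torch.relu(self.self_w(h) + self.neigh_w(neigh))

h = torch.randn(4, 16)                        # e.g. BiLSTM encodings of 4 textual units
adj = torch.tensor([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]])
print(GraphConv(16)(h, adj).shape)            # torch.Size([4, 16])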

[Images: arxiv1810.13083-f1.png, arxiv1810.13083-f2.png, arxiv1810.13083-t5+t6.png]


While character-based neural models have proven useful for many NLP tasks, there is a gap in sophistication between methods for learning representations of sentences and of words: most character-level models for learning sentence representations are deep and complex, while those for learning word representations are shallow and simple. It also remains unclear which kind of architecture best captures character-to-word representations. To address these questions, Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018) first investigated the gaps between methods for learning word and sentence representations. The authors then proposed IntNet, a funnel-shaped, wide convolutional neural architecture with no down-sampling that learns representations of the internal structure of words by composing their characters from limited, supervised training corpora. They evaluated their model on six sequence labeling datasets, covering named entity recognition, part-of-speech tagging, and syntactic chunking. IntNet significantly outperformed other character embedding models and obtained new state-of-the-art performance without relying on any external knowledge or resources.
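
A rough sketch of the character-to-word composition idea (filter widths and sizes are illustrative assumptions; the real IntNet concatenates feature maps in a more elaborate funnel-shaped arrangement): stacked convolutions over character embeddings with no down-sampling, whose per-layer feature maps are pooled and concatenated into the word representation.

import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    def __init__(self, n_chars=100, char_dim=16, filters=(32, 24, 16)):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        convs, in_ch = [], char_dim
        for f in filters:                        # "funnel": progressively narrower layers
            convs.append(nn.Conv1d(in_ch, f, kernel_size=3, padding=1))
            in_ch = f
        self.convs = nn.ModuleList(convs)

    def forward(self, char_ids):                 # (batch_of_words, max_chars)
        x = self.emb(char_ids).transpose(1, 2)   # (batch_of_words, char_dim, max_chars)
        feats = []
        for conv in self.convs:
            x = torch.relu(conv(x))              # no down-sampling between layers
            feats.append(x.max(dim=2).values)    # pool each layer's feature map over characters
        return torch.cat(feats, dim=1)           # final word representation

enc = CharWordEncoder()
print(enc(torch.randint(0, 100, (5, 12))).shape)  # torch.Size([5, 72])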

[Images: arxiv1810.12443-f1.png, arxiv1810.12443-t1.png, arxiv1810.12443-t2.png, arxiv1810.12443-t3+t4.png]


State-of-the-art studies have demonstrated the superiority of joint modelling over pipeline implementations for medical named entity recognition and normalization, due to the mutual benefits between the two processes. A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization (Dec 2018) [code  (empty repo, 2018-12-17)] proposed a novel deep neural multi-task learning framework with explicit feedback strategies to jointly model recognition and normalization. Their method benefitted from the general representations of both tasks provided by multi-task learning, and successfully converted the hierarchical tasks into a parallel multi-task setting while maintaining the mutual support between tasks. The method performed significantly better than state-of-the-art approaches on two publicly available medical literature datasets.
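
A hedged sketch of the parallel multi-task idea (greatly simplified: the paper's explicit feedback strategies are reduced here to feeding the recognizer's soft predictions into the normalizer, and all layer sizes are assumptions):

import torch
import torch.nn as nn

class JointNERNorm(nn.Module):
    def __init__(self, vocab=5000, dim=64, ner_tags=9, concepts=200):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)  # shared encoder
        self.ner_head = nn.Linear(2 * dim, ner_tags)                   # recognition
        self.norm_head = nn.Linear(2 * dim + ner_tags, concepts)       # normalization

    def forward(self, tokens):
        h, _ = self.encoder(self.emb(tokens))
        ner_logits = self.ner_head(h)
        # "feedback": the normalizer also sees the recognizer's soft predictions
        norm_logits = self.norm_head(torch.cat([h, ner_logits.softmax(dim=-1)], dim=-1))
        return ner_logits, norm_logits

model = JointNERNorm()
ner, norm = model(torch.randint(0, 5000, (2, 20)))
print(ner.shape, norm.shape)   # torch.Size([2, 20, 9]) torch.Size([2, 20, 200])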

[Images: arxiv1812.06081-f1+f2+t1+t2+t4.png, arxiv1812.06081-f3.png]


Natural Language Processing: Additional Reading

  • Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (2016) [code  |  discussion]

    “Relation classification is an important semantic processing task in the field of natural language processing (NLP). State of the art systems still rely on lexical resources such as WordNet or NLP systems like dependency parser and named entity recognizers (NER) to get high-level features. Another challenge is that important information can appear at any position in the sentence. To tackle these problems, we propose Attention-Based Bidirectional Long Short-Term Memory Networks (Att-BLSTM) to capture the most important semantic information in a sentence. The experimental results on the SemEval-2010 relation classification task show that our method outperforms most of the existing methods, with only word vectors.”

    [Images: Zhou2016attention-f1.png, Zhou2016attention-t1.png]
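
A minimal sketch of the attention pooling at the heart of Att-BLSTM-style relation classifiers (dimensions and the 19-way output are illustrative assumptions):

import torch
import torch.nn as nn

class AttBiLSTM(nn.Module):
    def __init__(self, vocab=1000, emb=50, hidden=50, classes=19):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.att = nn.Linear(2 * hidden, 1, bias=False)      # scores each time step
        self.out = nn.Linear(2 * hidden, classes)

    def forward(self, tokens):
        h, _ = self.lstm(self.emb(tokens))                   # (B, T, 2*hidden)
        weights = torch.softmax(self.att(h), dim=1)          # attention over time steps
        sent = (weights * h).sum(dim=1)                      # weighted sentence vector
        return self.out(torch.tanh(sent))

model = AttBiLSTM()
print(model(torch.randint(0, 1000, (4, 25))).shape)          # torch.Size([4, 19])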

  • Gradual Machine Learning for Entity Resolution (Oct 2018)

    “Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning [GML], which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.”

    [Images: arxiv1810.12125-f1+f4.png, arxiv1810.12125-t3.png. Legend: GML: gradual machine learning (this paper); UR: unsupervised rule-based; UC: unsupervised clustering; SVM: support vector machine; DNN: deep learning.]

    “Our evaluation is conducted on three real datasets, which are described as follows:

    • DS (DBLP-Scholar): The DS dataset contains the publication entities from DBLP and the publication entities from Google Scholar. The experiments match the DBLP entries with the Scholar entries.
    • AB (Abt-Buy): The AB dataset contains the product entities from both Abt.com and Buy.com. The experiments match the Abt entries with the Buy entries.
    • SG (Songs): The SG dataset contains song entities, some of which refer to the same songs. The experiments match the song entries in the same table.”
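
The easy-to-hard control flow behind gradual machine learning can be sketched in plain Python (a deliberately crude stand-in: the paper's evidential certainty comes from iterative factor-graph inference, which is replaced below by a caller-supplied infer function; the rule and toy data are hypothetical):

def gradual_label(instances, easy_rule, infer, stage_size=1):
    """Easy-to-hard labeling loop (illustration only).

    easy_rule(x) -> label or None             # cheap, high-precision rule
    infer(x, labeled) -> (label, certainty)   # stands in for factor-graph inference
    """
    labeled, unlabeled = {}, list(instances)
    # stage 1: machine-label the easy instances automatically
    for x in list(unlabeled):
        y = easy_rule(x)
        if y is not None:
            labeled[x] = y
            unlabeled.remove(x)
    # stage 2: label harder instances in small stages, most certain first
    while unlabeled:
        scored = sorted(((infer(x, labeled), x) for x in unlabeled),
                        key=lambda pair: -pair[0][1])
        for (y, _), x in scored[:stage_size]:
            labeled[x] = y
            unlabeled.remove(x)
    return labeled

# toy usage: decide whether two strings refer to the same entity
pairs = [("IBM", "IBM"), ("Intl. Business Machines", "IBM"), ("IBM", "Apple")]
rule = lambda p: True if p[0] == p[1] else None
infer = lambda p, labeled: (p[0][0] == p[1][0], 0.5 + 0.5 * (p[0][0] == p[1][0]))
print(gradual_label(pairs, rule, infer))
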
  • Neural CRF Transducers for Sequence Labeling (Nov 2018)

    “Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) are developed to implement the non-linear node potentials in CRFs, but still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consists of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experiment results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all the four tasks, and can improve state-of-the-art results.”

    [Images: arxiv1811.01382-t1+f1.png, arxiv1811.01382-t2+t3+t4+t5.png]
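
A rough sketch of the two-RNN transducer idea (greedy decoding only; the paper trains with CRF-style global normalization, and all sizes are assumptions): one RNN encodes the observations, a second runs over the growing label history, and their states jointly score the next label.

import torch
import torch.nn as nn

class TransducerTagger(nn.Module):
    def __init__(self, vocab=1000, tags=10, dim=32):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.tag_emb = nn.Embedding(tags + 1, dim)           # extra index = start-of-sequence
        self.enc = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)  # observations
        self.pred = nn.LSTMCell(dim, dim)                     # (unbounded) label history
        self.score = nn.Linear(3 * dim, tags)

    def greedy_decode(self, tokens):                          # tokens: (1, T)
        h_obs, _ = self.enc(self.word_emb(tokens))
        hx = cx = torch.zeros(1, self.pred.hidden_size)
        prev = torch.tensor([self.tag_emb.num_embeddings - 1])
        out = []
        for t in range(tokens.size(1)):
            hx, cx = self.pred(self.tag_emb(prev), (hx, cx))  # advance label-history RNN
            logits = self.score(torch.cat([h_obs[:, t], hx], dim=-1))
            prev = logits.argmax(dim=-1)
            out.append(prev.item())
        return out

print(TransducerTagger().greedy_decode(torch.randint(0, 1000, (1, 5))))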

  • Comparison of Named Entity Recognition Methodologies in Biomedical Documents (Nov 2018)

    “Background. Biomedical named entity recognition (Bio-NER) is a fundamental task in handling biomedical text terms, such as RNA, protein, cell type, cell line, and DNA. Bio-NER is one of the most elementary and core tasks in biomedical knowledge discovery from texts. The system described here is developed by using the BioNLP/NLPBA 2004 shared task. Experiments are conducted on a training and evaluation set provided by the task organizers.
    Results. Our results show that, compared with a baseline having a 70.09% $\small F_1$ score, the RNN Jordan-type and Elman-type algorithms have $\small F_1$ scores of approximately 60.53% and 58.80%, respectively. When we use CRF as a machine learning algorithm, CCA, GloVe, and word2vec have $\small F_1$ of 72.73%, 72.74%, and 72.82%, respectively.”

    “In this paper, we use five categories (protein, DNA, RNA, cell type, and cell line) instead of the categories used in the ordinary NER process. An example of the NER tagged sentence is as follows: ‘IL-2 [ B-protein ] responsiveness requires three distinct elements [ B-DNA ] within the enhancer [ B-DNA ].’”

    [Image: PMID30396340-f1+f4+f5+t2.png]
  • Quantifying Uncertainties in Natural Language Processing Tasks (Nov 2018)

    “Reliable uncertainty quantification is a first step towards building explainable, transparent, and accountable artificial intelligent systems. Recent progress in Bayesian deep learning has made such quantification realizable. In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful at enhancing model performances in various NLP tasks.”

    “In this work, we evaluate the benefits of quantifying uncertainties in modern neural network models applied in the context of three different natural language processing tasks. We conduct experiments on sentiment analysis, named entity recognition, and language modeling tasks with convolutional and recurrent neural network models. We show that by quantifying both uncertainties, model performances are improved across the three tasks. We further investigate the characteristics of inputs with high and low data uncertainty measures in Yelp 2013 and CoNLL 2003 datasets. For both datasets, our model estimates higher data uncertainties for more difficult predictions.”

    [Image: arxiv1811.07253-t3+t4.png]
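
A minimal way to surface model (epistemic) uncertainty at prediction time is Monte-Carlo dropout, sketched below (illustrative only; the paper's treatment is fully Bayesian and also models data/aleatoric uncertainty with dedicated outputs):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))
x = torch.randn(1, 10)

net.train()                                    # keep dropout active at test time
samples = torch.stack([net(x).softmax(-1) for _ in range(50)])
mean = samples.mean(0)                         # predictive distribution
model_uncertainty = samples.var(0)             # spread across stochastic forward passes
print(mean, model_uncertainty)
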
  • Bayesian Compression for Natural Language Processing (Dec 2018) [code]

    “In natural language processing, a lot of the tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, which size grows proportionally to the vocabulary length. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN dozens or hundreds of times without time-consuming hyperparameters tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable.”

    [Image: arxiv1810.10927-t1+t2.png]

  • An Attention-Based BiLSTM-CRF Approach To Document-Level Chemical Named Entity Recognition (Apr 2018) [code] proposed a neural network approach – an attention-based bidirectional long short-term memory network with a conditional random field layer (Att-BiLSTM-CRF) – for document-level chemical NER. The approach leveraged document-level global information, obtained via an attention mechanism, to enforce tagging consistency across multiple instances of the same token in a document. Their method used word and character embeddings as basic features; in addition, to investigate the effects of traditional features in deep learning methods, POS, chunking and dictionary features were added to the models as additional features. With little feature engineering, Att-BiLSTM-CRF achieved better performance than other state-of-the-art methods on the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus and the BioCreative V chemical-disease relation (CDR) task corpus ($\small F_1$ scores of 91.14% and 92.57%, respectively).

    [Images: PMID29186323-f1.png, PMID29186323-t4+t5.png]


  • LSTMVoter: Chemical Named Entity Recognition Using a Conglomerate of Sequence Labeling Tools (Jan 2019) [code] introduced LSTMVoter, a bidirectional long short-term memory (LSTM) tagger with a conditional random field (CRF) layer that combines the outputs of several individual sequence labeling tools, using an attention mechanism to model the contributing features. LSTMVoter outperformed each individual extractor integrated into it. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieved an $\small F_1$ score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieved an $\small F_1$ score of 89.01%. [This model is very similar to, but outperformed by, the model described in Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition.]

    [Images: PMID30631966-f1.png, PMID30631966-f3+t2.png]


  • Invariant Information Clustering for Unsupervised Image Classification and Segmentation (University of Oxford: Jul 2018, updated Mar 2019) [code  |  discussion (reddit)]

    “We present a novel clustering objective that learns a neural network classifier from scratch, given only unlabelled data samples. The model discovers clusters that accurately match semantic classes, achieving state-of-the-art results in eight unsupervised clustering benchmarks spanning image classification and segmentation. These include STL10, an unsupervised variant of ImageNet, and CIFAR10, where we significantly beat the accuracy of our closest competitors by 8 and 9.5 absolute percentage points respectively. The method is not specialised to computer vision and operates on any paired dataset samples; in our experiments we use random transforms to obtain a pair from each image. The trained network directly outputs semantic labels, rather than high dimensional representations that need external processing to be usable for semantic clustering. The objective is simply to maximise mutual information between the class assignments of each pair. It is easy to implement and rigorously grounded in information theory, meaning we effortlessly avoid degenerate solutions that other clustering methods are susceptible to. In addition to the fully unsupervised mode, we also test two semi-supervised settings. The first achieves 88.8% accuracy on STL10 classification, setting a new global state-of-the-art over all existing methods (whether supervised, semi supervised or unsupervised). The second shows robustness to 90% reductions in label coverage, of relevance to applications that wish to make use of small amounts of labels.  [GitHub]”

    [Images: arxiv1807.06653-f1+f2+f4.png, arxiv1807.06653-f3+f5.png]
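
The objective itself is compact enough to sketch: maximise the mutual information between the soft cluster assignments of two views of the same samples (hedged simplification; see the authors' repository for the full training code):

import torch

def iic_loss(p1, p2, eps=1e-8):
    # p1, p2: (batch, classes) softmax outputs for two transformed views of the same inputs
    joint = (p1.unsqueeze(2) * p2.unsqueeze(1)).mean(0)    # (classes, classes) joint assignment
    joint = ((joint + joint.t()) / 2).clamp(min=eps)       # symmetrise
    pi = joint.sum(1, keepdim=True)                        # marginal of view 1
    pj = joint.sum(0, keepdim=True)                        # marginal of view 2
    return -(joint * (joint.log() - pi.log() - pj.log())).sum()   # negative mutual information

p1 = torch.softmax(torch.randn(8, 10), dim=-1)
p2 = torch.softmax(torch.randn(8, 10), dim=-1)
print(iic_loss(p1, p2))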

RMDL: Random Multimodel Deep Learning for Classification (May 2018) [code  |  discussion (reddit)]

  • “This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RMDL can accept as input a variety data to include text, video, images, and symbolic. This paper describes RMDL and shows test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20newsgroup. These test results show that RMDL produces consistently better performance than standard methods over a broad range of data types and classification problems.”

[Image: arxiv1805.01890-f1+f2+t1+t3.png]

Neural Vector Conceptualization for Word Vector Space Interpretation (Apr 2019) [code]

“Distributed word vector spaces are considered hard to interpret which hinders the understanding of natural language processing (NLP) models. In this work, we introduce a new method to interpret arbitrary samples from a word vector space. To this end, we train a neural model to conceptualize word vectors, which means that it activates higher order concepts it recognizes in a given vector. Contrary to prior approaches, our model operates in the original vector space and is capable of learning non-linear relations between word vectors and concepts. Furthermore, we show that it produces considerably less entropic concept activation profiles than the popular cosine similarity.”

[Image: arxiv1904.01500-f1+f2.png]

Simple BERT Models for Relation Extraction and Semantic Role Labeling (Apr 2019)

  • “We present simple BERT-based models for relation extraction and semantic role labeling. In recent years, state-of-the-art performance has been achieved using neural models by incorporating lexical and syntactic features such as part-of-speech tags and dependency trees. In this paper, extensive experiments on datasets for these two tasks show that without using any external features, a simple BERT-based model can achieve state-of-the-art performance. To our knowledge, we are the first to successfully apply BERT in this manner. Our models provide strong baselines for future research.”
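
A hedged sketch of the “simple BERT” recipe for relation extraction (assuming the HuggingFace transformers package; the entity-span indices, pooling choice, and 5-way classifier are placeholders, not the authors' exact setup):

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer   # assumes HuggingFace `transformers`

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(2 * bert.config.hidden_size, 5)       # 5 relation types (illustrative)

sent = "The deletion mutation on exon-19 of EGFR gene was present in 16 patients."
enc = tok(sent, return_tensors="pt")
hidden = bert(**enc)[0].squeeze(0)                            # (num_wordpieces, hidden_size)

e1 = hidden[5:7].mean(dim=0)        # pooled span of the first entity mention
e2 = hidden[13:15].mean(dim=0)      # pooled span of the second (indices are placeholders)
print(classifier(torch.cat([e1, e2])).softmax(dim=-1))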

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents (2018) [code  |  mention (reddit)]

  • “Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers [PubMed; arXiv] show that our model significantly outperforms state-of-the-art models.”

Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding (Apr 2019) established “strong baselines for event temporal relation extraction on two under-explored story narrative datasets… We demonstrate that neural network-based models can outperform some strong traditional linguistic feature-based models. We also conduct comparative studies to show the contribution of adopting contextualized word embeddings (BERT) for event temporal relation extraction from stories….”

A Modular Deep Learning Approach for Extreme Multi-label Text Classification (May 2019) [“all the data split and source codes will be made publicly available”]

  • “Extreme multi-label classification (XMC) aims to assign to an instance the most relevant subset of labels from a colossal label set. Due to modern applications that lead to massive label sets, the scalability of XMC has attracted much recent attention from both academia and industry. In this paper, we establish a three-stage framework to solve XMC efficiently, which includes 1) indexing the labels, 2) matching the instance to the relevant indices, and 3) ranking the labels from the relevant indices. This framework unifies many existing XMC approaches. Based on this framework, we propose a modular deep learning approach SLINMER: Semantic Label Indexing, Neural Matching, and Efficient Ranking. The label indexing stage of SLINMER can adopt different semantic label representations leading to different configurations of SLINMER. Empirically, we demonstrate that several individual configurations of SLINMER achieve superior performance than the state-of-the-art XMC approaches on several benchmark datasets. Moreover, by ensembling those configurations, SLINMER can achieve even better results. In particular, on a Wiki dataset with around 0.5 millions of labels, the precision@1 is increased from 61% to 67%.”

[Image: 1905.02331-f1+f2+t2.png]
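
The three-stage framework is easy to mock up end to end (a toy sketch with random embeddings, scikit-learn k-means standing in for semantic label indexing, and dot-product scoring standing in for SLINMER's neural matcher and ranker):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
L, D, K = 1000, 64, 32                        # labels, embedding dim, label clusters
label_emb = rng.normal(size=(L, D))

# 1) semantic label indexing: cluster the label embeddings
index = KMeans(n_clusters=K, n_init=10, random_state=0).fit(label_emb)

def predict(instance_emb, top_clusters=3, top_labels=5):
    # 2) matching: score clusters by centroid similarity, keep a shortlist
    sims = index.cluster_centers_ @ instance_emb
    shortlist = np.argsort(-sims)[:top_clusters]
    cand = np.where(np.isin(index.labels_, shortlist))[0]
    # 3) ranking: score only the shortlisted labels
    scores = label_emb[cand] @ instance_emb
    return cand[np.argsort(-scores)[:top_labels]]

print(predict(rng.normal(size=D)))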

Enhancing Clinical Concept Extraction with Contextual Embedding (May 2019)

  • “Neural network-based representations (“embeddings”) have dramatically advanced natural language processing (NLP) tasks, including clinical NLP tasks such as concept extraction. Recently, however, more advanced embedding methods and representations (e.g., ELMo, BERT) have further pushed the state-of-the-art in NLP, yet there are no common best practices for how to integrate these representations into clinical tasks. The purpose of this study, then, is to explore the space of possible options in utilizing these new models for clinical concept extraction, including comparing these to traditional word embedding methods (word2vec, GloVe, fastText). Both off-the-shelf, open-domain embeddings and pre-training clinical embeddings from MIMIC-III are evaluated. We explore a battery of embedding methods consisting of traditional word embeddings and contextual embeddings, and compare these on four concept extraction corpora: i2b2 2010, i2b2 2012, SemEval 2014, and SemEval 2015. We also analyze the impact of the pre-training time of a large language model like ELMo or BERT on the extraction performance. Finally, we present an intuitive way to understand the semantic information encoded by contextual embeddings. Contextual embeddings pre-trained on a large clinical corpus achieves new state-of-the-art performances across all concept extraction tasks. The best-performing model outperforms all state-of-the-art methods with respective $\small F_1$-measures of 90.25, 93.18 (partial), 80.74, and 81.65. We demonstrate the potential of contextual embeddings through the state-of-the-art performance these methods achieve on clinical concept extraction. Additionally, we demonstrate contextual embeddings encode valuable semantic information not accounted for in traditional word representations.”

CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor (May 2019)

  • “Extracting key information from documents, such as receipts or invoices, and preserving the interested texts to structured data is crucial in the document-intensive streamline processes of office automation in areas that includes but not limited to accounting, financial, and taxation areas. To avoid designing expert rules for each specific type of document, some published works attempt to tackle the problem by learning a model to explore the semantic context in text sequences based on the Named Entity Recognition (NER) method in the NLP field. In this paper, we propose to harness the effective information from both semantic meaning and spatial distribution of texts in documents. Specifically, our proposed model, Convolutional Universal Text Information Extractors (CUTIE), applies convolutional neural networks on gridded texts where texts are embedded as features with semantical connotations. We further explore the effect of employing different structures of convolutional neural network and propose a fast and portable structure. We demonstrate the effectiveness of the proposed method on a dataset with up to 4,484 labelled receipts, without any pre-training or post-processing, achieving state of the art performance that is much higher than BERT but with only 1/10 parameters and without requiring the 3,300M word dataset for pre-training. Experimental results also demonstrate that the CUTIE being able to achieve state of the art performance with much smaller amount of training data.”
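
A minimal sketch of the gridded-text idea (grid resolution, embedding sizes, and the five output field types are assumptions): token embeddings are scattered onto a 2-D grid at their layout positions, and 2-D convolutions then predict a field type for every cell.

import torch
import torch.nn as nn

vocab, emb, H, W = 1000, 32, 16, 16
embed = nn.Embedding(vocab, emb)
conv = nn.Sequential(nn.Conv2d(emb, 64, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 5, 1))                     # 5 field types per grid cell

grid = torch.zeros(1, H, W, dtype=torch.long)                 # 0 = empty cell
grid[0, 2, 3] = 17                                            # token ids placed at their layout positions
grid[0, 2, 5] = 42
logits = conv(embed(grid).permute(0, 3, 1, 2))                # (1, 5, H, W) per-cell predictions
print(logits.shape)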

Using Ontologies To Improve Performance In Massively Multi-label Prediction Models (Stanford | Google Brain: May 2019)

  • “Massively multi-label prediction/classification problems arise in environments like health-care or biology where very precise predictions are useful. One challenge with massively multi-label problems is that there is often a long-tailed frequency distribution for the labels, which results in few positive examples for the rare labels. We propose a solution to this problem by modifying the output layer of a neural network to create a Bayesian network of sigmoids which takes advantage of ontology relationships between the labels to help share information between the rare and the more common labels. We apply this method to the two massively multi-label tasks of disease prediction (ICD-9 codes) and protein function prediction (Gene Ontology terms) and obtain significant improvements in per-label AUROC and average precision for less common labels.”

[Image: 1905.12126-f1.png]
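
The core mechanism can be sketched as an output layer in which each label's sigmoid is gated by its parent's probability, so rare leaf labels share statistics with their more common ancestors (illustrative; the paper's Bayesian-network-of-sigmoids formulation is richer):

import torch

def ontology_probs(logits, parent):
    """logits: (batch, L); parent[i] = index of label i's parent (must precede i), or -1 for roots."""
    local = torch.sigmoid(logits)
    probs = torch.zeros_like(local)
    for i, p in enumerate(parent):
        # a label can only be "on" if its parent is: P(child) = sigmoid(child) * P(parent)
        probs[:, i] = local[:, i] if p < 0 else local[:, i] * probs[:, p]
    return probs

# toy ontology: label 0 is the root; 1 and 2 are its children; 3 is a child of 1
print(ontology_probs(torch.randn(2, 4), [-1, 0, 0, 1]))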

Transforming Complex Sentences into a Semantic Hierarchy (Jun 2019) [code  |  Graphene project  |  code]

  • “We present an approach for recursively splitting and rephrasing complex English sentences into a novel semantic hierarchy of simplified sentences, with each of them presenting a more regular structure that may facilitate a wide variety of artificial intelligence tasks, such as machine translation (MT) or information extraction (IE). … the proposed syntactic simplification approach outperforms the state of the art in structural text simplification. Moreover, an extrinsic evaluation shows that when applying our framework as a preprocessing step the performance of state-of-the-art Open IE systems can be improved by up to 346% in precision and 52% in recall. …”

    [Image: 1906.01038-f4+t5.png]

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Aug 2018) [project, code]

  • “We introduce a multi-task setup of identifying entities, relations, and coreference clusters in scientific articles. We create SCIERC, a dataset that includes annotations for all three tasks and develop a unified framework called SCIIE with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.”

[Image: 1808.09602.png]