Technical Review
Last modified: 2019-08-16
Copyright © 2018-present, Victoria A. Stuart
NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages – in particular, with programming computers to fruitfully process large natural language corpora. NLP is foundational to all information extraction and natural language tasks.
Recent reviews of NLP relevant to this TECHNICAL REVIEW include:
- Natural Language Processing for Information Extraction (Jul 2018);
- Event Extraction for Systems Biology by Text Mining the Literature (Jul 2010);
- Natural Language Processing: An Introduction (Sep-Oct 2011);
- A Biomedical Information Extraction Primer for NLP Researchers (May 2017);
- Advances in Natural Language Processing (Jul 2017); and,
- Recent Trends in Deep Learning Based Natural Language Processing (Aug 2018).
Regarding the latter review, note my comments in the reddit thread Recent Trends in Deep Learning Based Natural Language Processing, which point out an “issue” affecting any review (or proposed work) in the NLP and machine learning domains: the extraordinarily rapid rate of progress. During the course of preparing this REVIEW, highly relevant literature and developments appeared almost daily on arXiv.org, my RSS feeds, and other sources. I firmly believe that this rapid progress represents outstanding research opportunities rather than barriers (e.g., proposing ML research that may quickly become “dated”).
Lastly, high-profile Ph.D. student/blogger Sebastian Ruder actively tracks progress in numerous subdomains in the NLP domain at NLP Progress (alternate link).
Basic steps associated with NLP include text retrieval and preprocessing steps, including:
- sentence splitting
- tokenization
- named entity recognition
- addressing polysemy and named entity disambiguation (e.g. “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”)
- word sense disambiguation
- event extraction
- part-of-speech tagging
- syntactic/dependency parsing
- for background on the preceding steps, see
- relation extraction
- for background on relation extraction, see
- basic clustering approaches
- see (e.g.) the support vector machine and conditional random field descriptions in
Additional NLP preprocessing steps may be included (or some of the steps above may be omitted), and the order of some of those steps may vary slightly.
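To make several of these steps concrete, here is a minimal sketch using spaCy (one of several general-purpose NLP toolkits; the model name and example sentence are illustrative choices, not drawn from any of the works cited here):

```python
# Minimal sketch of several basic NLP preprocessing steps using spaCy.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("The deletion mutation on exon-19 of EGFR gene was present in 16 patients. "
        "All patients were treated with gefitinib and showed a partial response.")

doc = nlp(text)

# Sentence splitting
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Tokenization, part-of-speech tagging, and dependency parsing
for token in doc:
    print(f"{token.text}\t{token.pos_}\t{token.dep_}\t{token.head.text}")

# Named entity recognition (a general-purpose model; biomedical NER would
# typically use a domain-specific model, e.g. one trained on CHEMDNER or BC5CDR)
for ent in doc.ents:
    print("ENTITY:", ent.text, ent.label_)
```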
Some recent ML approaches to NLP tasks include:
- Argumentation:
- Attention:
- Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (2016)
- An Attention-Based BiLSTM-CRF Approach To Document-Level Chemical Named Entity Recognition (Apr 2018) [code]
- LSTMVoter: Chemical Named Entity Recognition Using a Conglomerate of Sequence Labeling Tools (Jan 2019) (also an attention-based BiLSTM-CRF model) [code]
- Attention? Attention! (Jun 2018) [local copy]
- Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation (Oct 2018)
- Biomedical Event Extraction Based on GRU Integrating Attention Mechanism (2018)
- Linguistically-Informed Self-Attention for Semantic Role Labeling (2018) [code; discussion]
- Position-aware Self-attention with Relative Positional Encodings for Slot Filling (Jul 2018)
- Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (Aug 2018)
- Improving Distant Supervision with Maxpooled Attention and Sentence-Level Supervision (Oct 2018)
- [Karin Verspoor] End-To-End Neural Relation Extraction Using Deep Biaffine Attention (Dec 2018) [code: not present, 2019-01-01]
- Semantic Relation Classification via Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing (Jan 2019) [code]
- Hierarchical Attentional Hybrid Neural Networks for Document Classification (Jan 2019) [code]
- Attention, please! A Critical Review of Neural Attention Models in Natural Language Processing (Feb 2019)
- Classification:
- RMDL : Random Multimodel Deep Learning for Classification (May 2018) [code]
- A Deep Neural Network Sentence Level Classification Method with Context Information
- A Comparative Study of Neural Network Models for Sentence Classification
- Inline Detection of Domain Generation Algorithms with Context-Sensitive Word Embeddings
- Comparative Document Summarisation via Classification
- Hierarchical Attentional Hybrid Neural Networks for Document Classification (Jan 2019) [code]
- Invariant Information Clustering for Unsupervised Image Classification and Segmentation (University of Oxford: Jul 2018, updated Mar 2019) [code; discussion (reddit)]
- Neural Vector Conceptualization for Word Vector Space Interpretation (Apr 2019) [code]
- X-BERT : eXtreme Multi-label Text Classification with BERT (May 2019)
- Classification – label prediction:
- Clustering:
- Constituency parsing:
- Coreference resolution:
- Disambiguation:
- Embeddings:
- Entity linking and text grounding:
- Entity Linking for Biomedical Literature (May 2015) [code]
- REACH, described in Large-scale Automated Machine Reading Discovers New Cancer Driving-Mechanisms (2017)
- Unsupervised Medical Entity Recognition and Linking in Chinese Online Medical Text (Apr 2018)
- FamPlex: A Resource for Entity Recognition and Relationship Resolution of Human Protein Families and Complexes in Biomedical Text Mining (Jun 2018)
- EARL: Joint Entity and Relation Linking for Question Answering Over Knowledge Graphs (Jun 2018) [code]
- Event detection/extraction:
- Joint Extraction of Events and Entities within a Document Context
- Semi-Supervised Event Extraction with Paraphrase Clusters
- Event Detection with Neural Networks: A Rigorous Empirical Evaluation
- Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation (Oct 2018) [code]
- Biomedical Event Extraction Based on GRU Integrating Attention Mechanism (2018)
- One for All: Neural Joint Modeling of Entities and Events
- Bidirectional Long Short-Term Memory with CRF for Detecting Biomedical Event Trigger in FastText Semantic Space (Dec 2018)
- Context Awareness and Embedding for Biomedical Event Extraction (May 2019) [code]
- Named entity recognition:
- Joint Extraction of Events and Entities within a Document Context
- Deep Learning with Word Embeddings Improves Biomedical Named Entity Recognition (Jul 2017) [code; corpora]
- Label-aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition (Apr 2018) [code (n/a, Dec 2018)]
- GRAM-CNN: A Deep Learning Approach with Local Context for Named Entity Recognition in Biomedical Text (May 2018) [code]
- Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition (Aug 2018); see also:
- An Attention-Based BiLSTM-CRF Approach To Document-Level Chemical Named Entity Recognition (Apr 2018) [code]
- LSTMVoter: Chemical Named Entity Recognition Using a Conglomerate of Sequence Labeling Tools (Jan 2019) (also an attention-based BiLSTM-CRF model) [code]
- Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition
- Learning Named Entity Tagger using Domain-Specific Dictionary [code]
- Wide-scope Biomedical Named Entity Recognition and Normalization with CRFs, Fuzzy Matching and Character level Modeling
- Neural Entity Reasoner for Global Consistency in Named Entity Recognition
- A Byte-sized Approach to Named Entity Recognition [code]
- CollaboNet: Collaboration of Deep Neural Networks for Biomedical Named Entity Recognition [code]
- Neural Segmental Hypergraphs for Overlapping Mention Recognition
- Learning to Recognize Discontiguous Entities
- Neural Adaptation Layers for Cross-domain Named Entity Recognition
- Named Entity Analysis and Extraction with Uncommon Words
- An Instance Transfer based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition
- Clinical Concept Extraction with Contextual Word Embedding [code]
- GraphIE: A Graph-Based Framework for Information Extraction
- Learning Better Internal Structure of Words for Sequence Labeling
- Gradual Machine Learning for Entity Resolution
- Comparison of Named Entity Recognition Methodologies in Biomedical Documents
- Few-shot Learning for Named Entity Recognition in Medical Text
- Unnamed Entity Recognition of Sense Mentions
- MER: A Shell Script and Annotation Server for Minimal Named Entity Recognition and Linking (Dec 2018) [code | demo]
- End-to-end Joint Entity Extraction and Negation Detection for Clinical Text (Dec 2018)
- Dynamic Transfer Learning for Named Entity Recognition (Dec 2018)
- A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization (Dec 2018)
- A Neural Network Approach to Chemical and Gene/Protein Entity Recognition in Patents (Dec 2018)
- Chemlistem: Chemical Named Entity Recognition using Recurrent Neural Networks (Dec 2018) [code]
- Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning (Jan 2018; updated Oct 2018) [code; updated code]
- A Survey on Deep Learning for Named Entity Recognition (Dec 2018)
- Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model (May 2019)
- CUTIE : Learning to Understand Documents with Convolutional Universal Text Information Extractor
- Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches (Jun 2019)
- Low-resource Deep Entity Resolution with Transfer and Active Learning (Jun 2019)
- A Joint Named-Entity Recognizer for Heterogeneous Tag-sets Using a Tag Hierarchy (Google Research: Jun 2019)
- Multi-Grained Named Entity Recognition (Jun 2019)
- Next Generation Community Assessment of Biomedical Entity Recognition Web Servers: Metrics, Performance, Interoperability Aspects of Becalm (Jun 2019)
- Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings (Jul 2019)
- Polysemy, hypernymy:
- POS tagging:
- Improving part-of-speech tagging via multi-task learning and character-level word representations
- [Karin Verspoor] An Improved Neural Network Model for Joint POS Tagging and Dependency Parsing (Aug 2018) [code]
- (updated by the preceding Aug 2018 paper) [Karin Verspoor] A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing (Jun 2017)
- Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018)
- [Karin Verspoor] From POS Tagging to Dependency Parsing for Biomedical Event Extraction (Jan 2019) [code]
- Question answering:
- Relation classification:
- CNN-based relation classification (extraction):
- Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (2016)
- Semantic Relation Classification via Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing (Jan 2019) [code]
- Enriching Pre-trained Language Model with Entity Information for Relation Classification (May 2019)
- Knowledge-Guided Convolutional Networks for Chemical-Disease Relation Extraction (May 2019) [code]
- Relation extraction (Information Extraction):
- [GitHub; curated list] Awesome Relation Extraction [discussion]
- Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction (Jan 2018) [code]
- Kernelized Hashcode Representations for Biomedical Relation Extraction
- Scientific Relation Extraction with Selectively Incorporated Concept Embeddings
- Cross-Sentence N-ary Relation Extraction with Graph LSTMs
- Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction
- Neural Relation Extraction Within and Across Sentence Boundaries
- N-ary Relation Extraction using Graph State LSTM
- Graphene: A Context-Preserving Open Information Extraction System
- Position-aware Self-attention with Relative Positional Encodings for Slot Filling (Jul 2018)
- A document level neural model integrated domain knowledge for chemical-induced disease relations
- [Karin Verspoor] Convolutional Neural Networks for Chemical-Disease Relation Extraction Are Improved with Character-Based Word Embeddings (May 2018)
- Chemical-induced disease relation extraction with dependency information and prior knowledge
- An End-to-end Deep Learning Architecture for Extracting Protein-protein Interactions Affected by Genetic Mutations [code]
- Relation Extraction with Weakly Supervised Learning Based on Process-structure-property-performance Reciprocity
- Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Aug 2018) [project, code]
- Graph Convolution over Pruned Dependency Trees Improves Relation Extraction
- Document Triage for Identifying Protein-Protein Interactions Affected by Mutations: A Neural Network Ensemble Approach [code]
- LPTK: A Linguistic Pattern-Aware Dependency Tree Kernel Approach for the BioCreative VI CHEMPROT Task
- Improving Distant Supervision with Maxpooled Attention and Sentence-Level Supervision (Oct 2018)
- A Hierarchical Framework for Relation Extraction with Reinforcement Learning
- Parser Extraction of Triples in Unstructured Text
- RESIDE: Improving Distantly-Supervised Neural Relation Extraction Using Side Information (Dec 2018)
- [Karin Verspoor] End-To-End Neural Relation Extraction Using Deep Biaffine Attention (Dec 2018) [code: not present, 2019-01-01]
- Extracting Chemical-Protein Interactions from Literature Using Sentence Structure Analysis and Feature Engineering (Jan 2019) [code]
- Exploring Semi-supervised Variational Autoencoders for Biomedical Relation Extraction (Jan 2019)
- Overview of the BioCreative VI Precision Medicine Track: Mining Protein Interactions and Mutations for Precision Medicine (Jan 2019)
- Simple BERT Models for Relation Extraction and Semantic Role Labeling (Apr 2019)
- Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding (Apr 2019)
- Connecting Language and Knowledge with Heterogeneous Representations for Neural Relation Extraction (May 2019)
- Transforming Complex Sentences into a Semantic Hierarchy (Jun 2019) [code | Graphene project, code]
- Improving Relation Extraction by Pre-trained Language Representations (Jun 2019) [code]
- DocRED: A Large-Scale Document-Level Relation Extraction Dataset (Jun 2019) [code]
- Attention Guided Graph Convolutional Networks for Relation Extraction (Jun 2019) [code]
- Reflex : Flexible Framework for Relation Extraction in Multiple Domains (MIT CSAIL: Jun 2019) [code]
- Exploiting Entity BIO Tag Embeddings and Multi-task Learning for Relation Extraction with Imbalanced Data (Jun 2019)
- Document-Level $\small N$-ary Relation Extraction with Multiscale Representation Learning (Stanford University | Microsoft Research: Jun 2019)
- WiRe57 : A Fine-Grained Benchmark for Open Information Extraction
- Best performance: MinIE
- Semantic parsing:
- Semantic role labeling:
- I Know What You Want: Semantic Learning for Text Comprehension
- Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling [code]
- A Span Selection Model for Semantic Role Labeling [code]
- Linguistically-Informed Self-Attention for Semantic Role Labeling (2018) [code; discussion]
- Simple BERT Models for Relation Extraction and Semantic Role Labeling (Apr 2019)
- Summarization:
- Generating Wikipedia by Summarizing Long Sequences
- Unsupervised Neural Multi-document Abstractive Summarization
- Deep Transfer Reinforcement Learning for Text Summarization
- Content Selection in Deep Learning Models of Summarization
- Abstractive Summarization of Reddit Posts with Multi-level Memory Networks
- Extractive Summary as Discrete Latent Variables
- Abstractive Text Summarization by Incorporating Reader Comments (Dec 2018)
- Rotational Unit of Memory: A Novel Representation Unit for RNNs with Scalable Applications (MIT: 2019) [code]
- A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents (2018) [code]
- Sample Efficient Text Summarization Using a Single Pre-Trained Transformer (May 2019) [code]
- Tagging:
- Structured Multi-Label Biomedical Text Tagging via Attentive Neural Tree Decoding [code: not available, 2018-10-04]
- Word sense disambiguation:
- Knowledge-based Word Sense Disambiguation using Topic Models
- Learning Graph Embeddings from WordNet-based Similarity Measures
- Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs
- Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (Aug 2018)
- A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors [code; discussion]
Again, that is not an exhaustive list – merely some articles that I have recently encountered that are relevant to my interests.
NLP: Selected Papers
Cross-sentence $\small n$-ary relation extraction detects relations among $\small n$ entities across multiple sentences. Typical methods formulate an input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The current state of the art method splits the input graph into two DAGs [directed acyclic graphs], adopting a DAG-structured LSTM for each. Although this approach can model rich linguistic knowledge by leveraging graph edges, important information can be lost in the splitting procedure. Song et al. (August 2018: N-ary Relation Extraction using Graph State LSTM [code]) proposed a graph-state LSTM model, which used a parallel state to model each word, recurrently enriching state values via message passing. Compared with DAG LSTMs, their graph LSTM kept the original graph structure, and sped up computation by allowing more parallelization. For example, given
“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.”
… their model conveyed the fact that cancers caused by the L858E mutation in the EGFR gene can respond to the anticancer drug gefitinib: the three entity mentions appeared in separate sentences yet formed a ternary relation. On a standard benchmark, their model outperformed a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state of the art system of Peng et al. (2017) by 1.2%.
Song et al.’s code was an implementation of Peng et al.’s Cross-Sentence N-ary Relation Extraction with Graph LSTMs (different authors/project; project/code), modified in regard to the edge labels (discussed by Song et al. in their paper).
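To illustrate the general idea of per-word graph states enriched by message passing, here is a toy sketch (plain state averaging rather than the gated LSTM updates Song et al. actually use; dimensions and edges are illustrative):

```python
# Toy sketch of graph-state message passing over a document graph:
# every word keeps a state vector that is repeatedly enriched from its
# neighbours (dependency edges, adjacent-word edges, coreference edges, ...).
# The real model uses gated (LSTM-style) updates; plain averaging is used
# here only to show the recurrence structure.
import numpy as np

def message_passing(node_states, edges, steps=3):
    """node_states: (n_words, d) array; edges: list of (src, dst) pairs."""
    n, d = node_states.shape
    neighbours = [[] for _ in range(n)]
    for src, dst in edges:
        neighbours[dst].append(src)   # messages flow along graph edges
        neighbours[src].append(dst)   # treat the graph as undirected here
    states = node_states.copy()
    for _ in range(steps):
        new_states = states.copy()
        for i in range(n):
            if neighbours[i]:
                msg = np.mean(states[neighbours[i]], axis=0)
                new_states[i] = 0.5 * states[i] + 0.5 * msg  # crude "gate"
        states = new_states
    return states  # final per-word states feed a relation classifier

# Example: 5 words, 8-dimensional embeddings, a few dependency edges.
states = message_passing(np.random.randn(5, 8), [(0, 1), (1, 2), (2, 4), (3, 4)])
print(states.shape)  # (5, 8)
```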
Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction (Nov 2018) proposed a combined model of Long Short Term Memory and Convolutional Neural Networks (LSTM-CNN) that exploited word embeddings and positional embeddings for cross-sentence $\small n$-ary relation extraction. The proposed model brings together the properties of both LSTMs and CNNs, to simultaneously exploit long-range sequential information and capture the most informative features, which are essential for cross-sentence $\small n$-ary relation extraction. Their LSTM-CNN model was evaluated on standard datasets for cross-sentence $\small n$-ary relation extraction, where it significantly outperformed baselines such as CNNs, LSTMs and also a combined CNN-LSTM model. The paper also showed that the LSTM-CNN model outperforms the current state-of-the-art methods on cross-sentence $\small n$-ary relation extraction.
-
“However, relations can exist between more than two entities that appear across consecutive sentences. For example, in the text span comprising the two consecutive sentences in $\small \text{LISTING 1}$, there exists a ternary relation response across three entities: $\small \text{EGFR}$ , $\small \text{L858E}$ , $\small \text{gefitnib}$ appearing across sentences. This relation extraction task, focusing on identifying relations between more than two entities – either appearing in a single sentence or across sentences, is known as cross-sentence $\small n$-ary relation extraction.
-
$\small \text{Listing 1: Text span of two consecutive sentences}$
'The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitnib and showed a partial response. '
“This paper focuses on the cross-sentence $\small n$-ary relation extraction task. Formally, let $\small \{e_1, \ldots ,e_n\}$ be the set of entities in a text span $\small S$ containing $\small t$ number of consecutive sentences. For example, in the text span comprising 2 sentences ($\small t = 2$) in $\small \text{Listing 1}$ above, given cancer patients with mutation $\small v$ (EGFR) in gene $\small g$ (L858E), the patients showed a partial response to drug $\small d$ (gefitnib). Thus, a ternary relation response ( $\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitnib}$) exists among the three entities spanning across the two sentences in $\small \text{Listing 1}$.”
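A rough sketch of how such an LSTM-CNN hybrid over word and position embeddings can be wired together (layer sizes, vocabulary size and the number of relation classes are illustrative, not the authors' settings):

```python
# Rough sketch of an LSTM-CNN relation classifier over word + position
# embeddings; one positional embedding per target entity (here, three).
import torch
import torch.nn as nn

class LSTMCNN(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=100, pos_dim=10,
                 max_dist=200, hidden=100, n_filters=100, n_classes=5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # relative distance of each token to each of the three entities
        self.pos_emb = nn.ModuleList(
            [nn.Embedding(2 * max_dist + 1, pos_dim) for _ in range(3)])
        in_dim = word_dim + 3 * pos_dim
        self.lstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(2 * hidden, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, words, dists):          # words: (B, T); dists: (B, 3, T)
        x = [self.word_emb(words)]
        for i, emb in enumerate(self.pos_emb):
            x.append(emb(dists[:, i]))
        x = torch.cat(x, dim=-1)              # (B, T, word_dim + 3*pos_dim)
        h, _ = self.lstm(x)                   # long-range sequential context
        c = torch.relu(self.conv(h.transpose(1, 2)))  # local n-gram features
        pooled = c.max(dim=-1).values         # keep most informative features
        return self.out(pooled)               # relation logits

model = LSTMCNN()
logits = model(torch.randint(0, 10000, (2, 40)), torch.randint(0, 401, (2, 3, 40)))
print(logits.shape)  # torch.Size([2, 5])
```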
Likewise, Neural Relation Extraction Within and Across Sentence Boundaries (Oct 2018) proposed an architecture for relation extraction in entity pairs spanning multiple sentences: inter-sentential dependency-based neural networks (iDepNN). iDepNN modeled the shortest and augmented dependency paths via recurrent and recursive neural networks to extract relationships within (intra-) and across (inter-) sentence boundaries. Compared to SVM and neural network baselines, iDepNN was more robust to false positives in relationships spanning sentences. The authors evaluated their model on four datasets from the newswire (MUC6) and medical (BioNLP shared task) domains, achieving state of the art performance and a better balance of precision and recall for inter-sentential relationships – performing better than 11 teams participating in the BioNLP Shared Task 2016, with a gain of 5.2% (0.587 vs. 0.558) in $\small F_1$ over the winning team. They also released the cross-sentence annotations for MUC6.
Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) proposed a novel segmental hypergraph representation to model overlapping entity mentions that are prevalent in many practical datasets. They showed that their model was able to capture features and interactions (the model was robust in handling both overlapping and non-overlapping mentions) that could not be captured by previous models, while maintaining a low time complexity for inference.
In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (above), Learning to Recognize Discontiguous Entities (Oct 2018) focused on recognizing discontiguous entities, which can also overlap. They proposed a novel hypergraph representation to jointly encode discontiguous entities of unbounded length, which could overlap with one another. Empirical results showed that their model was able to achieve significantly better results when evaluated on standard data with many discontiguous entities.
Most modern Information Extraction (IE) systems are implemented as sequential taggers and focus on modelling local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. GraphIE: A Graph-Based Framework for Information Extraction (Oct 2018) [code] is a framework that operates over a graph representing both local and non-local dependencies between textual units (i.e. words or sentences). The algorithm propagated information between connected nodes through graph convolutions and exploited the richer representation to improve word level predictions. Results on three different tasks – social media, textual and visual information extraction – showed that GraphIE outperformed a competitive baseline (BiLSTM+CRF) in all tasks by a significant margin.
While character-based neural models have proven useful for many NLP tasks, there is a gap of sophistication between methods for learning representations of sentences and words. While most character models for learning representations of sentences are deep and complex, models for learning representations of words are shallow and simple. Also, it is still not clear which kind of architecture is the best for capturing character-to-word representations. To address these questions, Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018) first investigated the gaps between methods for learning word and sentence representations. Furthermore, they proposed IntNet, a funnel-shaped wide convolutional neural architecture with no down-sampling for learning representations of the internal structure of words by composing their characters from limited, supervised training corpora. They evaluated their model on six sequence labeling datasets, including named entity recognition, part-of-speech tagging, and syntactic chunking. IntNet significantly outperformed other character embedding models and obtained new state of the art performance without relying on any external knowledge or resources.
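For context, here is a generic character-CNN word encoder of the kind IntNet is compared against (the common baseline design, not the funnel-shaped IntNet architecture itself; all sizes are illustrative):

```python
# Generic character-CNN word encoder: characters -> convolution -> max-pool.
# The resulting per-word vector is typically concatenated with a word embedding
# before a BiLSTM-CRF sequence labeller.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=80, char_dim=30, n_filters=50, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=kernel // 2)

    def forward(self, char_ids):                          # (batch, n_words, n_chars)
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c))        # (b*w, n_chars, char_dim)
        x = self.conv(x.transpose(1, 2))                  # (b*w, n_filters, n_chars)
        x = torch.relu(x).max(dim=-1).values              # per-word character vector
        return x.view(b, w, -1)                           # (batch, n_words, n_filters)

enc = CharCNNWordEncoder()
print(enc(torch.randint(1, 80, (2, 10, 15))).shape)  # torch.Size([2, 10, 50])
```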
State-of-the-art studies have demonstrated the superiority of joint modelling over pipeline implementation for medical named entity recognition and normalization, due to the mutual benefits between the two processes. A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization (Dec 2018) [code (empty repo, 2018-12-17)] proposed a novel deep neural multitask learning framework with explicit feedback strategies to jointly model recognition and normalization. Their method benefitted from the general representations of both tasks provided by multitask learning, and successfully converted hierarchical tasks into a parallel multitask setting while maintaining the mutual support between tasks. Their method performed significantly better than state of the art approaches on two publicly available medical literature datasets.
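A minimal sketch of the general multitask pattern (one shared encoder with separate heads for entity tagging and normalization, trained with a summed loss; sizes and label counts are illustrative, and this is not the authors' feedback-strategy model):

```python
# Sketch of a multi-task setup with a shared encoder and two task heads:
# NER tags (BIO scheme) and normalisation to concept identifiers.
import torch
import torch.nn as nn

class JointNERNorm(nn.Module):
    def __init__(self, vocab=20000, emb=100, hidden=100, n_tags=9, n_concepts=500):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.ner_head = nn.Linear(2 * hidden, n_tags)        # BIO tagging
        self.norm_head = nn.Linear(2 * hidden, n_concepts)   # concept linking

    def forward(self, tokens):
        h, _ = self.encoder(self.emb(tokens))
        return self.ner_head(h), self.norm_head(h)

model = JointNERNorm()
tokens = torch.randint(0, 20000, (2, 30))
ner_logits, norm_logits = model(tokens)
ner_gold = torch.randint(0, 9, (2, 30))
norm_gold = torch.randint(0, 500, (2, 30))
# The two task losses are simply summed; weighting them is a common variant.
loss = (nn.functional.cross_entropy(ner_logits.reshape(-1, 9), ner_gold.reshape(-1))
        + nn.functional.cross_entropy(norm_logits.reshape(-1, 500), norm_gold.reshape(-1)))
print(loss)
```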
Additional Reading
-
Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (2016) [code; discussion]
“Relation classification is an important semantic processing task in the field of natural language processing (NLP). State of the art systems still rely on lexical resources such as WordNet or NLP systems like dependency parser and named entity recognizers (NER) to get high-level features. Another challenge is that important information can appear at any position in the sentence. To tackle these problems, we propose Attention-Based Bidirectional Long Short-Term Memory Networks (Att-BLSTM) to capture the most important semantic information in a sentence. The experimental results on the SemEval-2010 relation classification task show that our method outperforms most of the existing methods, with only word vectors.”
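A sketch of the core Att-BLSTM idea: a learned attention vector pools the BiLSTM hidden states into a single sentence representation for relation classification (dimensions are illustrative; 19 classes corresponds to the SemEval-2010 Task 8 label set):

```python
# Word-level attention on top of a BiLSTM for sentence-level relation
# classification: score every hidden state, softmax the scores, pool.
import torch
import torch.nn as nn

class AttBiLSTM(nn.Module):
    def __init__(self, vocab=10000, emb=100, hidden=100, n_classes=19):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.att_vector = nn.Linear(2 * hidden, 1, bias=False)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, tokens):                    # (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))        # (batch, seq_len, 2*hidden)
        scores = self.att_vector(torch.tanh(h))   # (batch, seq_len, 1)
        alpha = torch.softmax(scores, dim=1)      # attention weights over words
        sentence = (alpha * h).sum(dim=1)         # weighted pooling
        return self.out(torch.tanh(sentence))     # relation logits

model = AttBiLSTM()
print(model(torch.randint(0, 10000, (4, 25))).shape)  # torch.Size([4, 19])
```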
-
Gradual Machine Learning for Entity Resolution (Oct 2018)
“Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning [GML], which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.”
[Figure legend: GML = gradual machine learning (this paper); UR = unsupervised rule-based; UC = unsupervised clustering; SVM = support vector machine; DNN = deep learning.]
Our evaluation is conducted on three real datasets, which are described as follows:
- DS (DBLP-Scholar): The DS dataset contains the publication entities from DBLP and the publication entities from Google Scholar. The experiments match the DBLP entries with the Scholar entries.
- AB (Abt-Buy): The AB dataset contains the product entities from both Abt.com and Buy.com. The experiments match the Abt entries with the Buy entries.
- SG (Songs): The SG dataset contains song entities, some of which refer to the same songs. The experiments match the song entries in the same table.
-
Neural CRF Transducers for Sequence Labeling (Nov 2018)
“Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) are developed to implement the non-linear node potentials in CRFs, but still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consists of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experiment results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all the four tasks, and can improve state-of-the-art results.”
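For reference, here is a minimal linear-chain CRF layer (the linear-chain NCRF baseline that the transducer paper extends, not the transducer itself), showing the forward-algorithm partition function and the gold-path score used in the training loss:

```python
# Minimal linear-chain CRF scoring on top of per-token emission scores.
# Training minimises: log Z - score(gold path).
import torch

def crf_log_partition(emissions, transitions):
    """emissions: (seq_len, n_tags); transitions: (n_tags, n_tags)."""
    alpha = emissions[0]                              # path scores ending at each tag, step 0
    for t in range(1, emissions.size(0)):
        # logsumexp over previous tag i of alpha[i] + transitions[i, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)              # log Z

def crf_sequence_score(emissions, transitions, tags):
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

# Loss for one sentence: negative log-likelihood of the gold tag path.
emissions = torch.randn(6, 5)        # e.g. BiLSTM/RNN outputs: 6 tokens, 5 tags
transitions = torch.randn(5, 5)
gold = torch.tensor([0, 1, 1, 2, 0, 4])
nll = crf_log_partition(emissions, transitions) - crf_sequence_score(emissions, transitions, gold)
print(nll)
```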
-
Comparison of Named Entity Recognition Methodologies in Biomedical Documents (Nov 2018)
“Background. Biomedical named entity recognition (Bio-NER) is a fundamental task in handling biomedical text terms, such as RNA, protein, cell type, cell line, and DNA. Bio-NER is one of the most elementary and core tasks in biomedical knowledge discovery from texts. The system described here is developed by using the BioNLP/NLPBA 2004 shared task. Experiments are conducted on a training and evaluation set provided by the task organizers.
Results. Our results show that, compared with a baseline having a 70.09% $\small F_1$ score, the RNN Jordan-type and Elman-type algorithms have $\small F_1$ scores of approximately 60.53% and 58.80%, respectively. When we use CRF as a machine learning algorithm, CCA, GloVe, and word2vec have $\small F_1$ of 72.73%, 72.74%, and 72.82%, respectively.”
“In this paper, we use five categories (protein, DNA, RNA, cell type, and cell line) instead of the categories used in the ordinary NER process. An example of the NER tagged sentence is as follows: ‘IL-2 [ B-protein ] responsiveness requires three distinct elements [ B-DNA ] within the enhancer [ B-DNA ].’”
-
Quantifying Uncertainties in Natural Language Processing Tasks (Nov 2018)
“Reliable uncertainty quantification is a first step towards building explainable, transparent, and accountable artificial intelligent systems. Recent progress in Bayesian deep learning has made such quantification realizable. In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful at enhancing model performances in various NLP tasks.”
“In this work, we evaluate the benefits of quantifying uncertainties in modern neural network models applied in the context of three different natural language processing tasks. We conduct experiments on sentiment analysis, named entity recognition, and language modeling tasks with convolutional and recurrent neural network models. We show that by quantifying both uncertainties, model performances are improved across the three tasks. We further investigate the characteristics of inputs with high and low data uncertainty measures in Yelp 2013 and CoNLL 2003 datasets. For both datasets, our model estimates higher data uncertainties for more difficult predictions.”
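One common recipe for obtaining such model-uncertainty estimates is Monte Carlo dropout: keep dropout active at inference time and aggregate several stochastic forward passes (a generic sketch under that assumption, not necessarily the paper's exact formulation):

```python
# Monte Carlo dropout: repeated stochastic forward passes give a predictive
# mean and a simple spread-based uncertainty estimate per input.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(50, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 2))

def predict_with_uncertainty(model, x, n_samples=20):
    model.train()                      # keep dropout stochastic at inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)   # prediction, uncertainty

x = torch.randn(4, 50)                 # e.g. pooled sentence representations
mean, std = predict_with_uncertainty(model, x)
print(mean.shape, std.shape)           # torch.Size([4, 2]) torch.Size([4, 2])
```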
-
Bayesian Compression for Natural Language Processing (Dec 2018) [code]
“In natural language processing, a lot of the tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, which size grows proportionally to the vocabulary length. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN dozens or hundreds of times without time-consuming hyperparameters tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable”
-
An Attention-Based BiLSTM-CRF Approach To Document-Level Chemical Named Entity Recognition (Apr 2018) [code] proposed a neural network approach, an attention-based bidirectional long short-term memory with a conditional random field layer (Att-BiLSTM-CRF), to document-level chemical NER. The approach leveraged document-level global information obtained by attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. Their method used word and character embeddings as basic features. In addition, to investigate the effects of traditional features for deep learning methods, POS, chunking and dictionary features were added into the models as additional features. Att-BiLSTM-CRF achieved better performance with little feature engineering than other state of the art methods on the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus and the BioCreative V chemical-disease relation (CDR) task corpus (F-scores of 91.14 and 92.57%, respectively).
- This work was evaluated in Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition (Aug 2018).
-
Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition (Aug 2018) compared the use of LSTM-based and CNN-based character-level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus showed that the use of either type of character-level word embeddings in conjunction with the BiLSTM-CRF models led to comparable state of the art performance. However, the models using CNN-based character-level word embeddings had a computational performance advantage, increasing training time over word-based models by only 25%, while the LSTM-based character-level word embeddings more than doubled the required training time.
See also:
- An Attention-Based BiLSTM-CRF Approach To Document-Level Chemical Named Entity Recognition (Apr 2018)
- LSTMVoter: Chemical Named Entity Recognition Using a Conglomerate of Sequence Labeling Tools (Jan 2019) (also an attention-based BiLSTM-CRF model)
-
LSTMVoter: Chemical Named Entity Recognition Using a Conglomerate of Sequence Labeling Tools (Jan 2019) [code] introduced LSTMVoter, a bidirectional long short-term memory (LSTM) tagger that utilized a conditional random field (CRF) layer in conjunction with attention-based feature modeling. Their approach explored information about features that is modeled by means of an attention mechanism. LSTMVoter outperformed each individual extractor integrated into it. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieved an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieved an F1-score of 89.01%. [This model is very similar to but outperformed by the model described in Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition.]
-
Invariant Information Clustering for Unsupervised Image Classification and Segmentation (University of Oxford: Jul 2018, updated Mar 2019) [code; discussion (reddit)]
“We present a novel clustering objective that learns a neural network classifier from scratch, given only unlabelled data samples. The model discovers clusters that accurately match semantic classes, achieving state-of-the-art results in eight unsupervised clustering benchmarks spanning image classification and segmentation. These include STL10, an unsupervised variant of ImageNet, and CIFAR10, where we significantly beat the accuracy of our closest competitors by 8 and 9.5 absolute percentage points respectively. The method is not specialised to computer vision and operates on any paired dataset samples; in our experiments we use random transforms to obtain a pair from each image. The trained network directly outputs semantic labels, rather than high dimensional representations that need external processing to be usable for semantic clustering. The objective is simply to maximise mutual information between the class assignments of each pair. It is easy to implement and rigorously grounded in information theory, meaning we effortlessly avoid degenerate solutions that other clustering methods are susceptible to. In addition to the fully unsupervised mode, we also test two semi-supervised settings. The first achieves 88.8% accuracy on STL10 classification, setting a new global state-of-the-art over all existing methods (whether supervised, semi supervised or unsupervised). The second shows robustness to 90% reductions in label coverage, of relevance to applications that wish to make use of small amounts of labels. [GitHub]”
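The IIC objective itself is compact enough to sketch: compute the empirical joint distribution over the cluster assignments of two paired views and maximize its mutual information (batch size, cluster count and eps below are illustrative):

```python
# Sketch of the Invariant Information Clustering objective: maximise the
# mutual information between the soft cluster assignments of two views
# (e.g. an image and a random transform of it, or two related samples).
import torch

def iic_loss(p1, p2, eps=1e-8):
    """p1, p2: (batch, n_clusters) softmax outputs for the two views."""
    joint = p1.unsqueeze(2) * p2.unsqueeze(1)       # (batch, k, k)
    joint = joint.mean(dim=0)                       # empirical joint P(z, z')
    joint = (joint + joint.t()) / 2                 # symmetrise
    marg_1 = joint.sum(dim=1, keepdim=True)         # marginal P(z)
    marg_2 = joint.sum(dim=0, keepdim=True)         # marginal P(z')
    mi = (joint * (torch.log(joint + eps)
                   - torch.log(marg_1 + eps)
                   - torch.log(marg_2 + eps))).sum()
    return -mi                                      # minimise negative MI

p1 = torch.softmax(torch.randn(32, 10), dim=-1)
p2 = torch.softmax(torch.randn(32, 10), dim=-1)
print(iic_loss(p1, p2))
```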
RMDL : Random Multimodel Deep Learning for Classification (May 2018) [code | discussion (reddit)]
- “This paper introduces Random Multimodel Deep Learning (RMDL ): a new ensemble, deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RMDL can accept as input a variety data to include text, video, images, and symbolic. This paper describes RMDL and shows test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20newsgroup. These test results show that RMDL produces consistently better performance than standard methods over a broad range of data types and classification problems.”
Neural Vector Conceptualization for Word Vector Space Interpretation (Apr 2019) [code]
“Distributed word vector spaces are considered hard to interpret which hinders the understanding of natural language processing (NLP) models. In this work, we introduce a new method to interpret arbitrary samples from a word vector space. To this end, we train a neural model to conceptualize word vectors, which means that it activates higher order concepts it recognizes in a given vector. Contrary to prior approaches, our model operates in the original vector space and is capable of learning non-linear relations between word vectors and concepts. Furthermore, we show that it produces considerably less entropic concept activation profiles than the popular cosine similarity.”
Simple BERT Models for Relation Extraction and Semantic Role Labeling (Apr 2019)
- “We present simple BERT -based models for relation extraction and semantic role labeling. In recent years, state-of-the-art performance has been achieved using neural models by incorporating lexical and syntactic features such as part-of-speech tags and dependency trees. In this paper, extensive experiments on datasets for these two tasks show that without using any external features, a simple BERT -based model can achieve state-of-the-art performance. To our knowledge, we are the first to successfully apply BERT in this manner. Our models provide strong baselines for future research.”
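A sketch of the general recipe (encode a sentence with a pretrained BERT and classify the pooled representation) using the Hugging Face transformers library; the entity markers, label count and pooling choice are illustrative, and this is not the authors' code:

```python
# Simple BERT-based relation classifier sketch: encode, pool [CLS], classify.
# Assumes a recent version of the `transformers` library.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 5)   # 5 hypothetical relation types

# Mark the two entities with simple textual markers (one common convention).
text = "The [E1] L858E [/E1] mutation confers sensitivity to [E2] gefitinib [/E2]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]          # [CLS] representation
logits = classifier(cls_vector)
print(logits.shape)                                   # torch.Size([1, 5])
```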
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents (2018) [code; mention (reddit)]
- “Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers [PubMed; arXiv] show that our model significantly outperforms state-of-the-art models.”
Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding (Apr 2019) established “strong baselines for event temporal relation extraction on two under-explored story narrative datasets… We demonstrate that neural network-based models can outperform some strong traditional linguistic feature-based models. We also conduct comparative studies to show the contribution of adopting contextualized word embeddings (BERT ) for event temporal relation extraction from stories….”
X-BERT : eXtreme Multi-label Text Classification with BERT (May 2019) [“all the data split and source codes will be made publicly available”]
- “Extreme multi-label text classification (XMC) aims to tag each input text with the most relevant labels from an extremely large label set, such as those that arise in product categorization and e-commerce recommendation. Recently, pretrained language representation models such as BERT achieve remarkable state-of-the-art performance across a wide range of NLP tasks including sentence classification among small label sets (typically fewer than thousands). Indeed, there are several challenges in applying BERT to the XMC problem. The main challenges are: (i) the difficulty of capturing dependencies and correlations among labels, whose features may come from heterogeneous sources, and (ii) the tractability to scale to the extreme label setting as the model size can be very large and scale linearly with the size of the output space. To overcome these challenges, we propose X-BERT , the first feasible attempt to finetune BERT models for a scalable solution to the XMC problem. Specifically, X-BERT leverages both the label and document text to build label representations, which induces semantic label clusters in order to better model label dependencies. At the heart of X-BERT is finetuning BERT models to capture the contextual relations between input text and the induced label clusters. Finally, an ensemble of the different BERT models trained on heterogeneous label clusters leads to our best final model. Empirically, on a Wiki dataset with around 0.5 million labels, X-BERT achieves new state-of-the-art results where the precision@1 reaches 67.80%, a substantial improvement over 32.58%/60.91% of deep learning baseline fastText and competing XMC approach Parabel, respectively. This amounts to a 11.31% relative improvement over Parabel, which is indeed significant since the recent approach SLICE only has 5.53% relative improvement.”
Enhancing Clinical Concept Extraction with Contextual Embedding (May 2019)
- “Neural network-based representations (“embeddings”) have dramatically advanced natural language processing (NLP) tasks, including clinical NLP tasks such as concept extraction. Recently, however, more advanced embedding methods and representations (e.g., ELMo, BERT) have further pushed the state-of-the-art in NLP, yet there are no common best practices for how to integrate these representations into clinical tasks. The purpose of this study, then, is to explore the space of possible options in utilizing these new models for clinical concept extraction, including comparing these to traditional word embedding methods (word2vec, GloVe, fastText). Both off-the-shelf, open-domain embeddings and pre-training clinical embeddings from MIMIC-III are evaluated. We explore a battery of embedding methods consisting of traditional word embeddings and contextual embeddings, and compare these on four concept extraction corpora: i2b2 2010, i2b2 2012, SemEval 2014, and SemEval 2015. We also analyze the impact of the pre-training time of a large language model like ELMo or BERT on the extraction performance. Finally, we present an intuitive way to understand the semantic information encoded by contextual embeddings. Contextual embeddings pre-trained on a large clinical corpus achieves new state-of-the-art performances across all concept extraction tasks. The best-performing model outperforms all state-of-the-art methods with respective $\small F_1$-measures of 90.25, 93.18 (partial), 80.74, and 81.65. We demonstrate the potential of contextual embeddings through the state-of-the-art performance these methods achieve on clinical concept extraction. Additionally, we demonstrate contextual embeddings encode valuable semantic information not accounted for in traditional word representations.”
CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor (May 2019)
- “Extracting key information from documents, such as receipts or invoices, and preserving the interested texts to structured data is crucial in the document-intensive streamline processes of office automation in areas that includes but not limited to accounting, financial, and taxation areas. To avoid designing expert rules for each specific type of document, some published works attempt to tackle the problem by learning a model to explore the semantic context in text sequences based on the Named Entity Recognition (NER) method in the NLP field. In this paper, we propose to harness the effective information from both semantic meaning and spatial distribution of texts in documents. Specifically, our proposed model, Convolutional Universal Text Information Extractors (CUTIE ), applies convolutional neural networks on gridded texts where texts are embedded as features with semantical connotations. We further explore the effect of employing different structures of convolutional neural network and propose a fast and portable structure. We demonstrate the effectiveness of the proposed method on a dataset with up to 4,484 labelled receipts, without any pre-training or post-processing, achieving state of the art performance that is much higher than BERT but with only 1/10 parameters and without requiring the 3,300M word dataset for pre-training. Experimental results also demonstrate that the CUTIE being able to achieve state of the art performance with much smaller amount of training data.”
Using Ontologies To Improve Performance In Massively Multi-label Prediction Models (Stanford | Google Brain: May 2019)
“Massively multi-label prediction/classification problems arise in environments like health-care or biology where very precise predictions are useful. One challenge with massively multi-label problems is that there is often a long-tailed frequency distribution for the labels, which results in few positive examples for the rare labels. We propose a solution to this problem by modifying the output layer of a neural network to create a Bayesian network of sigmoids which takes advantage of ontology relationships between the labels to help share information between the rare and the more common labels. We apply this method to the two massively multi-label tasks of disease prediction (ICD-9 codes) and protein function prediction (Gene Ontology terms) and obtain significant improvements in per-label AUROC and average precision for less common labels.”
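The core idea (expressing each label's probability as a product of conditional sigmoids along its ontology path, so rare labels borrow statistical strength from common ancestors) can be sketched as follows, with a toy ontology and illustrative sizes:

```python
# Sketch of the "Bayesian network of sigmoids" idea: each label's probability
# is the product of conditional sigmoids along its path in the ontology,
# rather than an independent sigmoid per label.
import torch
import torch.nn as nn

parents = {0: None, 1: 0, 2: 0, 3: 1}      # toy ontology: label -> parent label
scores = nn.Linear(128, len(parents))      # one conditional logit per label

def label_probabilities(features):
    cond = torch.sigmoid(scores(features))             # P(label | parent, x)
    probs = torch.zeros_like(cond)
    for label in sorted(parents):                       # parents before children
        parent = parents[label]
        probs[:, label] = cond[:, label] * (probs[:, parent] if parent is not None else 1.0)
    return probs                                        # marginal P(label | x)

print(label_probabilities(torch.randn(4, 128)))
```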
Transforming Complex Sentences into a Semantic Hierarchy (Jun 2019) [code | Graphene project, code]
“We present an approach for recursively splitting and rephrasing complex English sentences into a novel semantic hierarchy of simplified sentences, with each of them presenting a more regular structure that may facilitate a wide variety of artificial intelligence tasks, such as machine translation (MT) or information extraction (IE). … the proposed syntactic simplification approach outperforms the state of the art in structural text simplification. Moreover, an extrinsic evaluation shows that when applying our framework as a preprocessing step the performance of state-of-the-art Open IE systems can be improved by up to 346% in precision and 52% in recall. …”
Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Aug 2018) [project, code]
- “We introduce a multi-task setup of identifying entities, relations, and coreference clusters in scientific articles. We create SCIERC, a dataset that includes annotations for all three tasks and develop a unified framework called SCIIE with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.”
Document-Level $\small N$-ary Relation Extraction with Multiscale Representation Learning (Stanford University | Microsoft Research: Jun 2019) [code: “Our code and data will be available at hanover.azurewebsites.net/”]
- “Most information extraction methods focus on binary relations expressed within single sentences. In high-value domains, however, $\small n$-ary relations are of great demand (e.g., drug-gene-mutation interactions in precision oncology). Such relations often involve entity mentions that are far apart in the document, yet existing work on cross-sentence relation extraction is generally confined to small text spans (e.g., three consecutive sentences), which severely limits recall. In this paper, we propose a novel multiscale neural architecture for document-level $\small n$-ary relation extraction. Our system combines representations learned over various text spans throughout the document and across the subrelation hierarchy. Widening the system’s purview to the entire document maximizes potential recall. Moreover, by integrating weak signals across the document, multiscale modeling increases precision, even in the presence of noisy labels from distant supervision. Experiments on biomedical machine reading show that our approach substantially outperforms previous $\small n$-ary relation extraction methods.”
NSEEN : Neural Semantic Embedding for Entity Normalization (Jun 2019). “Much of human knowledge is encoded in text, available in scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization, which consists of mapping noisy entity mentions in text to canonical entities in well-known reference sets. … we have developed a general, scalable solution based on a deep Siamese neural network model to embed the semantic information about the entities, as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and empirically show the effectiveness of our framework in challenging bio-entity normalization datasets.”
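A sketch of the general Siamese setup for entity normalization: one encoder embeds both the noisy mention and a candidate reference name, trained with a contrastive loss (the character-level GRU encoder and margin are illustrative choices, not NSEEN's exact model):

```python
# Siamese ranking sketch for entity normalisation: shared encoder, cosine
# distance between mention and reference embeddings, contrastive training loss.
import torch
import torch.nn as nn

class MentionEncoder(nn.Module):
    def __init__(self, n_chars=128, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb, padding_idx=0)
        self.gru = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, char_ids):                 # (batch, max_len)
        _, h = self.gru(self.emb(char_ids))
        return nn.functional.normalize(h[-1], dim=-1)

encoder = MentionEncoder()
mention = torch.randint(1, 128, (8, 20))         # noisy mentions from text
reference = torch.randint(1, 128, (8, 20))       # canonical reference names
label = torch.randint(0, 2, (8,)).float()        # 1 = same entity, 0 = different

distance = 1 - (encoder(mention) * encoder(reference)).sum(dim=-1)   # cosine distance
loss = (label * distance.pow(2)
        + (1 - label) * torch.clamp(0.5 - distance, min=0).pow(2)).mean()  # contrastive
print(loss)
```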
MinIE : Minimizing Facts in Open Information Extraction (2017) [code]. “The goal of Open Information Extraction (OIE) is to extract surface relations and their arguments from natural-language text in an unsupervised, domain-independent manner. In this paper, we propose MinIE , an OIE system that aims to provide useful, compact extractions with high precision and recall. MinIE approaches these goals by (1) representing information about polarity, modality, attribution, and quantities with semantic annotations instead of in the actual extraction, and (2) identifying and removing parts that are considered overly specific. We conducted an experimental study with several real-world datasets and found that MinIE achieves competitive or higher precision and recall than most prior systems, while at the same time producing shorter, semantically enriched extractions.”
-
MinIE was the best performing IE system in: WiRe57 : A Fine-Grained Benchmark for Open Information Extraction (University of Montreal: Aug 2019) [code]. “We build a reference for the task of Open Information Extraction, on five documents. We tentatively resolve a number of issues that arise, including inference and granularity. We seek to better pinpoint the requirements for the task. We produce our annotation guidelines specifying what is correct to extract and what is not. In turn, we use this reference to score existing Open IE systems. We address the non-trivial problem of evaluating the extractions produced by systems against the reference tuples, and share our evaluation script. Among seven compared extractors, we find the MinIE system to perform best.”
- Hamming Sentence Embeddings for Information Retrieval (RheinMain University of Applied Sciences: Aug 2019) [code | results]. “In retrieval applications, binary hashes are known to offer significant improvements in terms of both memory and speed. We investigate the compression of sentence embeddings using a neural encoder-decoder architecture, which is trained by minimizing reconstruction error. Instead of employing the original real-valued embeddings, we use latent representations in Hamming space produced by the encoder for similarity calculations. In quantitative experiments on several benchmarks for semantic similarity tasks, we show that our compressed hamming embeddings yield a comparable performance to uncompressed embeddings (Sent2Vec , InferSent , Glove-BoW ), at compression ratios of up to 256:1. We further demonstrate that our model strongly decorrelates input features, and that the compressor generalizes well when pre-trained on Wikipedia sentences. We publish the source code on Github and all experimental results.”
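A sketch of the overall recipe: compress real-valued sentence embeddings with an encoder-decoder trained on reconstruction, binarize the latent code, and retrieve by Hamming distance (the straight-through binarization and dimensions are illustrative, not the paper's exact model):

```python
# Binary-code autoencoder sketch for Hamming-space retrieval of sentence
# embeddings: train on reconstruction, compare binarised codes at query time.
import torch
import torch.nn as nn

class BinaryAutoencoder(nn.Module):
    def __init__(self, emb_dim=700, code_bits=128):
        super().__init__()
        self.encoder = nn.Linear(emb_dim, code_bits)
        self.decoder = nn.Linear(code_bits, emb_dim)

    def forward(self, x):
        h = torch.tanh(self.encoder(x))
        # straight-through estimator: binarise on the forward pass,
        # let gradients flow through the tanh on the backward pass
        b = torch.sign(h) + (h - h.detach())
        return self.decoder(b), b

model = BinaryAutoencoder()
x = torch.randn(16, 700)                       # e.g. Sent2Vec / InferSent vectors
recon, codes = model(x)
loss = nn.functional.mse_loss(recon, x)        # reconstruction training loss

# Retrieval: Hamming distance between binary codes (count of bit mismatches).
bits = (codes.detach() > 0).float()
hamming = (bits[:1] != bits).sum(dim=-1)       # distance of item 0 to all items
print(hamming)
```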