### Technical Review

# NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages and, in particular, with programming computers to fruitfully process large natural language corpora. NLP is foundational to all information extraction and natural language tasks.

Recent reviews of NLP relevant to this TECHNICAL REVIEW include:

Regarding the latter review, note my comments in the reddit thread Recent Trends in Deep Learning Based Natural Language Processing, which point to an “issue” affecting any review (or proposed work) in the NLP and machine learning domains: the extraordinarily rapid rate of progress. While I was preparing this REVIEW, highly relevant literature and developments appeared almost daily on arXiv.org, my RSS feeds, and other sources. I firmly believe that this rapid progress represents outstanding research opportunities rather than barriers (e.g., proposing ML research that may quickly become “dated”).

Lastly, high-profile Ph.D. student/blogger Sebastian Ruder actively tracks progress in numerous NLP subdomains at NLP Progress (alternate link).

Basic steps associated with NLP include text retrieval and preprocessing, such as:

• sentence splitting
• tokenization
• named entity recognition
• addressing polysemy and named entity disambiguation (e.g. “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”)
• word sense disambiguation
• event extraction
• part-of-speech tagging
• syntactic/dependency parsing
• for background on the preceding steps, see
• Complex Event Extraction at PubMed Scale
• PKDE4J: Entity and Relation Extraction for Public Knowledge Discovery (Oct 2015) [code  |  pkde4j.zip (local copy)] [Summary]Due to an enormous number of scientific publications that cannot be handled manually, there is a rising interest in text-mining techniques for automated information extraction, especially in the biomedical field. Such techniques provide effective means of information search, knowledge discovery, and hypothesis generation. Most previous studies have primarily focused on the design and performance improvement of either named entity recognition or relation extraction. In this paper, we present PKDE4J, a comprehensive text-mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. Starting with the Stanford CoreNLP, we developed the system to cope with multiple types of entities and relations. The system also has fairly good performance in terms of accuracy as well as the ability to configure text-processing components. We demonstrate its competitive performance by evaluating it on many corpora and found that it surpasses existing systems with average F-measures of 85% for entity extraction and 81% for relation extraction.
Used in: Relation extraction for biological pathway construction using node2vec (Jun 2018)
• Application of Public Knowledge Discovery Tool (PKDE4J) to Represent Biomedical Scientific Knowledge (Feb 2018) [Summary]In today’s era of information explosion, extracting entities and their relations in large-scale, unstructured collections of text to better represent knowledge has emerged as a daunting challenge in biomedical text mining. To respond to the demand to automatically extract scientific knowledge with higher precision, the public knowledge discovery tool PKDE4J (Song et al., 2015) was proposed as a flexible text-mining tool. In this study, we propose an extended version of PKDE4J to represent scientific knowledge for literature-based knowledge discovery. Specifically, we assess the performance of PKDE4J in terms of three extraction tasks: entity, relation, and event detection. We also suggest applications of PKDE4J along three lines: (1) knowledge search, (2) knowledge linking, and (3) knowledge inference. We first describe the updated features of PKDE4J and report on tests of its performance. With additional options in the processes of named entity extraction, verb expansion, and event detection, we expect that the enhanced PKDE4J can be utilized for literature-based knowledge discovery.
• Extracting Information from Text
• relation extraction
• for background on relation extraction, see
• basic clustering approaches
• see (e.g.) the support vector machine and conditional random field descriptions in

Additional NLP preprocessing steps may be included (or some of the steps above may be omitted), and the order of some of those steps may vary slightly.
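The first two steps in the list above can be sketched with nothing more than regular expressions. This is a minimal illustration under stated assumptions, not how toolkits such as Stanford CoreNLP implement these steps: the function names `split_sentences` and `tokenize`, both patterns, and the sample text are chosen for this example only, and the naive splitter will wrongly break after abbreviations such as “e.g.” when a capitalised word follows.

```python
import re


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'
    and precedes an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenizer: alphanumeric runs (internal hyphens allowed)
    or single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)


# Illustrative sample text (an assumption for this sketch).
text = ("The deletion mutation on exon-19 of EGFR gene was present in 16 patients. "
        "All patients were treated with gefitinib.")

sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent_tokens in tokens:
    print(sent_tokens)
```

Later steps in the list (named entity recognition, part-of-speech tagging, parsing) consume token sequences like these, which is why errors made at this stage tend to propagate downstream.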

Some recent ML approaches to NLP tasks include:

• Event detection/extraction:
• Joint Extraction of Events and Entities within a Document Context [Summary]Events and entities are closely related; entities are often actors or participants in events and events without entities are uncommon. The interpretation of events and entities is highly contextually dependent. Existing work in information extraction typically models events separately from entities, and performs inference at the sentence level, ignoring the rest of the document. In this paper, we propose a novel approach that models the dependencies among variables of events, entities, and their relations, and performs joint inference of these variables across a document. The goal is to enable access to document-level contextual information and facilitate context-aware predictions. We demonstrate that our approach substantially outperforms the state-of-the-art methods for event extraction as well as a strong baseline for entity extraction. [Code]
• Semi-Supervised Event Extraction with Paraphrase Clusters
• Event Detection with Neural Networks: A Rigorous Empirical Evaluation
• Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation (Oct 2018) [code] [Summary]“… we propose a novel Jointly Multiple Events Extraction (JMEE) framework to jointly extract multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information.” Employs embedding vectors, Bi-LSTM layer, GCN layer, self-attention.
• Biomedical Event Extraction Based on GRU Integrating Attention Mechanism (2018)
• One for All: Neural Joint Modeling of Entities and Events [Summary]“The previous work for event extraction has mainly focused on the predictions for event triggers and argument roles, treating entity mentions as being provided by human annotators. This is unrealistic as entity mentions are usually predicted by some existing toolkits whose errors might be propagated to the event trigger and argument role recognition. Few of the recent work has addressed this problem by jointly predicting entity mentions, event triggers and arguments. However, such work is limited to using discrete engineering features to represent contextual information for the individual tasks and their interactions. In this work, we propose a novel model to jointly perform predictions for entity mentions, event triggers and arguments based on the shared hidden representations from deep learning. The experiments demonstrate the benefits of the proposed method, leading to the state-of-the-art performance for event extraction.”

“… In the future, we plan to improve the end-to-end model so EE can be solved from just the raw sentences and the word embeddings.”
• Bidirectional Long Short-Term Memory with CRF for Detecting Biomedical Event Trigger in FastText Semantic Space (Dec 2018)   [Summary]BACKGROUND: In biomedical information extraction, event extraction plays a crucial role. Biological events are used to describe the dynamic effects or relationships between biological entities such as proteins and genes. Event extraction is generally divided into trigger detection and argument recognition. The performance of trigger detection directly affects the results of the event extraction. In general, the traditional method is used to address the trigger detection as a classification task, as well as the use of machine learning or rules method, which construct many features to improve the classification results. Moreover, the classification model only recognizes triggers composed of single words, whereas for multiple words, the result is unsatisfactory.
RESULTS: The corpus of our model is MLEE. If we were to only use the biomedical LSTM and CRF model without other features, the F-score would reach about 78.08%. Comparing entity to part of speech (POS), we find the entity features more conducive to the improvement of performance of detection, with the F-score potentially reaching about 80%. Furthermore, we also experiment on the other three corpora (BioNLP 2009, BioNLP 2011, and BioNLP 2013) to verify the generalization of our model. Hence, F-scores can reach more than 60%, which are better than the comparative experiments.
CONCLUSIONS: The trigger recognition method based on the sequence annotation model does not require initial complex feature engineering, and only requires a simple labeling mechanism to complete the training. Therefore, generalization of our model is better compared to other traditional models. Secondly, this method can identify multi-word triggers, thereby improving the F-scores of trigger recognition. Thirdly, details on the entity have a crucial impact on trigger detection. Finally, the combination of character-level word embedding and word-level word embedding provides increasingly effective information for the model; therefore, it is a key to the success of the experiment.
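Treating trigger detection as sequence labeling means that every token receives a tag, and multi-word triggers fall out of decoding runs of consecutive tags. The sketch below assumes the common BIO scheme (`B-` begins a span, `I-` continues it, `O` is outside any span); the summary above only mentions “a simple labeling mechanism”, so the scheme, the function name `bio_to_spans`, the label names and the example sentence are all assumptions made for illustration.

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (start, end, label, text) spans; end is exclusive.
    Stray I- tags that do not continue an open span are ignored."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical sentence and tags, not taken from any corpus.
tokens = ["IL-2", "expression", "is", "up", "regulated", "by", "NF-kB"]
tags = ["O", "B-Gene_expression", "O",
        "B-Positive_regulation", "I-Positive_regulation", "O", "O"]

spans = bio_to_spans(tokens, tags)
for span in spans:
    print(span)
```

A per-word classifier would have to label “up” and “regulated” independently, whereas the tag sequence lets the two tokens be recovered as a single trigger.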

• Named entity recognition:
• Joint Extraction of Events and Entities within a Document Context [Summary]Events and entities are closely related; entities are often actors or participants in events and events without entities are uncommon. The interpretation of events and entities is highly contextually dependent. Existing work in information extraction typically models events separately from entities, and performs inference at the sentence level, ignoring the rest of the document. In this paper, we propose a novel approach that models the dependencies among variables of events, entities, and their relations, and performs joint inference of these variables across a document. The goal is to enable access to document-level contextual information and facilitate context-aware predictions. We demonstrate that our approach substantially outperforms the state-of-the-art methods for event extraction as well as a strong baseline for entity extraction. [Code]
• Deep Learning with Word Embeddings Improves Biomedical Named Entity Recognition (Jul 2017) [code  |  corpora] [Summary]Motivation. Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
Results. We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
Availability and implementation. The source code for LSTM-CRF is available at GitHub and the links to the corpora are available at Corposaurus.
• Label-aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition (Apr 2018) [code (n/a, Dec 2018)] [Summary]We study the problem of named entity recognition (NER) from electronic medical records, which is one of the most fundamental and critical problems for medical text mining. Medical records which are written by clinicians from different specialties usually contain quite different terminologies and writing styles. The difference of specialties and the cost of human annotation makes it particularly difficult to train a universal medical NER system. In this paper, we propose a label-aware double transfer learning framework (La-DTL) for cross-specialty NER, so that a medical NER system designed for one specialty could be conveniently applied to another one with minimal annotation efforts. The transferability is guaranteed by two components: (i) we propose label-aware MMD for feature representation transfer, and (ii) we perform parameter transfer with a theoretical upper bound which is also label aware. We conduct extensive experiments on 12 cross-specialty NER tasks. The experimental results demonstrate that La-DTL provides consistent accuracy improvement over strong baselines. Besides, the promising experimental results on non-medical NER scenarios indicate that La-DTL is potential to be seamlessly adapted to a wide range of NER tasks.
• GRAM-CNN: A Deep Learning Approach with Local Context for Named Entity Recognition in Biomedical Text (May 2018) [code] [Summary]… We propose a novel end-to-end deep learning approach for biomedical NER tasks that leverages the local contexts based on n-gram character and word embeddings via convolutional neural network (CNN). We call this approach GRAM-CNN. To automatically label a word, this method uses the local information around a word. Therefore, the GRAM-CNN method does not require any specific knowledge or feature engineering and can be theoretically applied to a wide range of existing NER problems. The GRAM-CNN approach was evaluated on three well-known biomedical datasets containing different BioNER entities. It obtained an F1 score of 87.26% on the Biocreative II dataset, 87.26% on the NCBI dataset and 72.57% on the JNLPBA dataset. Those results put GRAM-CNN in the lead of the biological NER methods. To the best of our knowledge, we are the first to apply CNN based structures to BioNER problems.
• Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition (Aug 2018); see also:
• Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition
• Learning Named Entity Tagger using Domain-Specific Dictionary  [code]
• Wide-scope Biomedical Named Entity Recognition and Normalization with CRFs, Fuzzy Matching and Character level Modeling
• Neural Entity Reasoner for Global Consistency in Named Entity Recognition
• A Byte-sized Approach to Named Entity Recognition  [code]
• CollaboNet: Collaboration of Deep Neural Networks for Biomedical Named Entity Recognition  [code]
• Neural Segmental Hypergraphs for Overlapping Mention Recognition
• Learning to Recognize Discontiguous Entities
• Neural Adaptation Layers for Cross-domain Named Entity Recognition
• Named Entity Analysis and Extraction with Uncommon Words
• An Instance Transfer based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition
• Clinical Concept Extraction with Contextual Word Embedding [code] [Summary]Automatic extraction of clinical concepts is an essential step for turning the unstructured data within a clinical note into structured and actionable information. In this work, we propose a clinical concept extraction model for automatic annotation of clinical problems, treatments, and tests in clinical notes utilizing domain-specific contextual word embedding. A contextual word embedding model is first trained on a corpus with a mixture of clinical reports and relevant Wikipedia pages in the clinical domain. Next, a bidirectional LSTM-CRF model is trained for clinical concept extraction using the contextual word embedding model. We tested our proposed model on the I2B2 2010 challenge dataset. Our proposed model achieved the best performance among reported baseline models and outperformed the state-of-the-art models by 3.4% in terms of F1-score.
Keywords: ELMo; Bi-RNN; NER
• GraphIE: A Graph-Based Framework for Information Extraction
• Learning Better Internal Structure of Words for Sequence Labeling
• Gradual Machine Learning for Entity Resolution
• Comparison of Named Entity Recognition Methodologies in Biomedical Documents
• Few-shot Learning for Named Entity Recognition in Medical Text [Summary]Deep neural network models have recently achieved state-of-the-art performance gains in a variety of natural language processing (NLP) tasks (Young, Hazarika, Poria, & Cambria, 2017). However, these gains rely on the availability of large amounts of annotated examples, without which state-of-the-art performance is rarely achievable. This is especially inconvenient for the many NLP fields where annotated examples are scarce, such as medical text. To improve NLP models in this situation, we evaluate five improvements on named entity recognition (NER) tasks when only ten annotated examples are available: (1) layer-wise initialization with pre-trained weights, (2) hyperparameter tuning, (3) combining pre-training data, (4) custom word embeddings, and (5) optimizing out-of-vocabulary (OOV) words. Experimental results show that the F1 score of 69.3% achievable by state-of-the-art models can be improved to 78.87%.
• Unnamed Entity Recognition of Sense Mentions [Summary]We consider the problem of recognizing mentions of human senses in text. Our contribution is a method for acquiring labeled data, and a learning method that is trained on this data. Experiments show the effectiveness of our proposed data labeling approach and our learning model on the task of sense recognition in text.
• MER: A Shell Script and Annotation Server for Minimal Named Entity Recognition and Linking (Dec 2018) [code  |  demo] [Summary]Named-entity recognition aims at identifying the fragments of text that mention entities of interest, that afterwards could be linked to a knowledge base where those entities are described. This manuscript presents our minimal named-entity recognition and linking tool (MER ), designed with flexibility, autonomy and efficiency in mind. To annotate a given text, MER only requires: (1) a lexicon (text file) with the list of terms representing the entities of interest; (2) optionally a tab-separated values file with a link for each term; (3) and a Unix shell. Alternatively, the user can provide an ontology from where MER will automatically generate the lexicon and links files. The efficiency of MER derives from exploring the high performance and reliability of the text processing command-line tools grep and awk, and a novel inverted recognition technique. MER was deployed in a cloud infrastructure using multiple Virtual Machines to work as an annotation server and participate in the Technical Interoperability and Performance of annotation Servers task of BioCreative V.5. The results show that our solution processed each document (text retrieval and annotation) in less than 3 sec on average without using any type of cache. MER was also compared to a state-of-the-art dictionary lookup solution obtaining competitive results not only in computational performance but also in precision and recall. MER is publicly available in a GitHub repository and through a RESTful Web service.
• End-to-end Joint Entity Extraction and Negation Detection for Clinical Text (Dec 2018) [Summary]Negative medical findings are prevalent in clinical reports, yet discriminating them from positive findings remains a challenging task for information extraction. Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and rule-based negation detection. We consider this as a multi-task problem and present a novel end-to-end neural model to jointly extract entities and negations. We extend a standard hierarchical encoder-decoder NER model and first adopt a shared encoder followed by separate decoders for the two tasks. This architecture performs considerably better than the previous rule-based and machine learning-based systems. To overcome the problem of increased parameter size especially for low-resource settings, we propose the *Conditional Softmax Shared Decoder* architecture which achieves state-of-art results for NER and negation detection on the 2010 i2b2/VA challenge dataset and a proprietary de-identified clinical dataset.
• Dynamic Transfer Learning for Named Entity Recognition (Dec 2018) [Summary]State-of-the-art named entity recognition (NER) systems have been improving continuously using neural architectures over the past several years. However, many tasks including NER require large sets of annotated data to achieve such performance. In particular, we focus on NER from clinical notes, which is one of the most fundamental and critical problems for medical text analysis. Our work centers on effectively adapting these neural architectures towards low-resource settings using parameter transfer methods. We complement a standard hierarchical NER model with a general transfer learning framework consisting of parameter sharing between the source and target tasks, and showcase scores significantly above the baseline architecture. These sharing schemes require an exponential search over tied parameter sets to generate an optimal configuration. To mitigate the problem of exhaustively searching for model optimization, we propose the Dynamic Transfer Networks (DTN), a gated architecture which learns the appropriate parameter sharing scheme between source and target datasets. DTN achieves the improvements of the optimized transfer learning framework with just a single training setting, effectively removing the need for exponential search.
• A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization (Dec 2018)
• A Neural Network Approach to Chemical and Gene/Protein Entity Recognition in Patents (Dec 2018) [Summary][Code not provided.] In biomedical research, patents contain the significant amount of information, and biomedical text mining has received much attention in patents recently. To accelerate the development of biomedical text mining for patents, the BioCreative V.5 challenge organized three tracks, i.e., chemical entity mention recognition (CEMP), gene and protein related object recognition (GPRO) and technical interoperability and performance of annotation servers, to focus on biomedical entity recognition in patents. This paper describes our neural network approach for the CEMP and GPRO tracks. In the approach, a bidirectional long short-term memory with a conditional random field layer is employed to recognize biomedical entities from patents. To improve the performance, we explored the effect of additional features (i.e., part of speech, chunking and named entity recognition features generated by the GENIA tagger) for the neural network model. In the official results, our best runs achieve the highest performances (a precision of 88.32%, a recall of 92.62%, and an F-score of 90.42% in the CEMP track; a precision of 76.65%, a recall of 81.91%, and an F-score of 79.19% in the GPRO track) among all participating teams in both tracks.
• Chemlistem: Chemical Named Entity Recognition using Recurrent Neural Networks (Dec 2018) [code] [Summary]Chemical named entity recognition (NER) has traditionally been dominated by conditional random fields (CRF)-based approaches but given the success of the artificial neural network techniques known as “deep learning” we decided to examine them as an alternative to CRFs. We present here several chemical named entity recognition systems. The first system translates the traditional CRF-based idioms into a deep learning framework, using rich per-token features and neural word embeddings, and producing a sequence of tags using bidirectional long short term memory (LSTM) networks – a type of recurrent neural net. The second system eschews the rich feature set – and even tokenisation – in favour of character labelling using neural character embeddings and multiple LSTM layers. The third system is an ensemble that combines the results of the first two systems. Our original BioCreative V.5 competition entry was placed in the top group with the highest F scores, and subsequent using transfer learning have achieved a final F score of 90.33% on the test data (precision 91.47%, recall 89.21%).
• Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning (Jan 2018; updated Oct 2018) [code  |  updated code] [Summary][University of Illinois – Urbana-Champaign  |  UCLA  |  Stanford University]
Motivation: State-of-the-art biomedical named entity recognition (BioNER) systems often require handcrafted features specific to each entity type, such as genes, chemicals and diseases. Although recent studies explored using neural network models for BioNER to free experts from manual feature engineering, the performance remains limited by the available training data for each entity type.
Results: We propose a multi-task learning framework for BioNER to collectively use the training data of different types of entities and improve the performance on each of them. In experiments on 15 benchmark BioNER datasets, our multi-task model achieves substantially better performance compared with state-of-the-art BioNER systems and baseline neural sequence labeling models. Further analysis shows that the large performance gains come from sharing character- and word-level information among relevant biomedical entities across differently labeled corpora.
• A Survey on Deep Learning for Named Entity Recognition (Dec 2018)
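Several summaries above quote precision, recall and F-score together. Assuming those F-scores are the usual balanced F1 (the harmonic mean of precision and recall), the reported triples can be checked with a few lines. The function name `f_score` is chosen for this sketch; the figures are the ones quoted in the BioCreative V.5 patent (CEMP, GPRO) and Chemlistem summaries.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall, reported F-score) in percent, as quoted in the summaries above.
reported = {
    "CEMP track": (88.32, 92.62, 90.42),
    "GPRO track": (76.65, 81.91, 79.19),
    "Chemlistem": (91.47, 89.21, 90.33),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed F1 = {f_score(p, r):.2f}, reported = {f:.2f}")
```

Under that assumption, all three computed values agree with the reported F-scores to two decimal places. Because the harmonic mean is pulled toward the lower of the two inputs, a system cannot reach a high F-score by maximising recall alone.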

• Polysemy, hypernymy:
• POS tagging:
• Improving part-of-speech tagging via multi-task learning and character-level word representations
• [Karin Verspoor]  An Improved Neural Network Model for Joint POS Tagging and Dependency Parsing (Aug 2018) [code] [Summary]We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating a BiLSTM-based tagging component to produce automatically predicted POS tags for the parser. On the benchmark English Penn treebank, our model obtains strong UAS and LAS scores at 94.51% and 92.87%, respectively, producing 1.5+% absolute improvements to the BIST graph-based parser, and also obtaining a state-of-the-art POS tagging accuracy at 97.97%. Furthermore, experimental results on parsing 61 “big” Universal Dependencies treebanks from raw texts show that our model outperforms the baseline UDPipe (Straka and Straková, 2017) with 0.8% higher average POS tagging score and 3.6% higher average LAS score. In addition, with our model, we also obtain state-of-the-art downstream task scores for biomedical event extraction and opinion analysis applications. Our code is available together with all pre-trained models here.
• Updates [Karin Verspoor]  A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing (Jun 2017) [Summary] We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared for both POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments, on 19 languages from the Universal Dependencies project, show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models here.
• Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018) [Summary]Character-based neural models have recently proven very useful for many NLP tasks. However, there is a gap of sophistication between methods for learning representations of sentences and words. While most character models for learning representations of sentences are deep and complex, models for learning representations of words are shallow and simple. Also, in spite of considerable research on learning character embeddings, it is still not clear which kind of architecture is the best for capturing character-to-word representations. To address these questions, we first investigate the gaps between methods for learning word and sentence representations. We conduct detailed experiments and comparisons of different state-of-the-art convolutional models, and also investigate the advantages and disadvantages of their constituents. Furthermore, we propose IntNet, a funnel-shaped wide convolutional neural architecture with no down-sampling for learning representations of the internal structure of words by composing their characters from limited, supervised training corpora. We evaluate our proposed model on six sequence labeling datasets, including named entity recognition, part-of-speech tagging, and syntactic chunking. Our in-depth analysis shows that IntNet significantly outperforms other character embedding models and obtains new state-of-the-art performance without relying on any external knowledge or resources.
• [Karin Verspoor]  From POS Tagging to Dependency Parsing for Biomedical Event Extraction (Jan 2019) [code] [Summary]Background: Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance. Results: We perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core natural language processing tasks of part-of-speech (POS) tagging and dependency parsing on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically no detailed analysis of neural models on this data is available. Experimental results show that in general, the neural models outperform the feature-based models on two benchmark biomedical corpora GENIA and CRAFT. We also perform a task-oriented evaluation to investigate the influences of these models in a downstream application on biomedical event extraction, and show that better intrinsic parsing performance does not always imply better extrinsic event extraction performance. Conclusion: We have presented a detailed empirical study comparing traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context, and also investigated the influence of parser selection for a biomedical event extraction downstream task. Availability of data and material: We make the retrained models available here.
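The UAS and LAS figures quoted in the parsing summaries above are attachment scores: the unlabeled attachment score counts tokens whose predicted head is correct, and the labeled attachment score additionally requires the correct dependency label. The following is a minimal sketch of that computation; the function name `attachment_scores` and the four-token parse are hypothetical, and real evaluation scripts differ in details such as punctuation handling.

```python
def attachment_scores(gold, predicted):
    """gold and predicted are lists of (head_index, relation_label), one per token.
    Returns (UAS, LAS) as percentages."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * labeled_hits / n


# Hypothetical parse of a four-token sentence; head index 0 denotes the root.
gold = [(2, "nsubj"), (0, "root"), (4, "compound"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "amod"), (3, "obj")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}%, LAS = {las:.1f}%")
```

In this toy case the third token has the right head but the wrong label, so it counts toward UAS only, and the fourth token has the wrong head, so it counts toward neither.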

• Word sense disambiguation:

Again, that is not an exhaustive list – merely some articles that I have recently encountered that are relevant to my interests.

## NLP: Selected Papers

Cross-sentence $\small n$-ary relation extraction detects relations among $\small n$ entities across multiple sentences. Typical methods formulate an input as a document graph, integrating various intra-sentential and inter-sentential dependencies. The previous state-of-the-art method split the input graph into two DAGs [directed acyclic graphs], adopting a DAG-structured LSTM for each. Although that approach can model rich linguistic knowledge by leveraging graph edges, important information can be lost in the splitting procedure. Song et al. (August 2018: N-ary Relation Extraction using Graph State LSTM  [code]) proposed a graph-state LSTM model, which used a parallel state to model each word, recurrently enriching state values via message passing. Compared with DAG LSTM, their graph LSTM kept the original graph structure, and sped up computation by allowing more parallelization. For example, given

“The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.”

… their model conveyed the fact that cancers caused by the L858E mutation in the EGFR gene can respond to the anticancer drug gefitinib: the three entity mentions appeared in separate sentences yet formed a ternary relation. On a standard benchmark, their model outperformed a bidirectional DAG LSTM baseline by 5.9% in accuracy, overtaking the state of the art system of Peng et al. (2017) by 1.2%.

Song et al.’s code was an implementation of Peng et al.’s Cross-Sentence N-ary Relation Extraction with Graph LSTMs (different authors/project; [project/code]), modified in regard to the edge labels (discussed by Song et al. in their paper).

Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction (Nov 2018) proposed a combined model of Long Short Term Memory and Convolutional Neural Networks (LSTM-CNN) that exploited word embeddings and positional embeddings for cross-sentence $\small n$-ary relation extraction. The proposed model brought together the properties of both LSTMs and CNNs to simultaneously exploit long-range sequential information and capture the most informative features, both of which are essential for cross-sentence $\small n$-ary relation extraction. Their LSTM-CNN model was evaluated on standard datasets for cross-sentence $\small n$-ary relation extraction, where it significantly outperformed baselines such as CNNs, LSTMs and also a combined CNN-LSTM model. The paper also showed that the LSTM-CNN model outperformed the state-of-the-art methods of the time on cross-sentence $\small n$-ary relation extraction.

• “However, relations can exist between more than two entities that appear across consecutive sentences. For example, in the text span comprising the two consecutive sentences in $\small \text{LISTING 1}$, there exists a ternary relation response across three entities: $\small \text{EGFR}$ , $\small \text{L858E}$ , $\small \text{gefitnib}$ appearing across sentences. This relation extraction task, focusing on identifying relations between more than two entities – either appearing in a single sentence or across sentences, is known as cross-sentence $\small n$-ary relation extraction.

$\small \text{Listing 1: Text span of two consecutive sentences}$

'The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10.   All patients were treated with gefitnib and showed a partial response. '

“This paper focuses on the cross-sentence $\small n$-ary relation extraction task. Formally, let $\small \{e_1, \ldots ,e_n\}$ be the set of entities in a text span $\small S$ containing $\small t$ number of consecutive sentences. For example, in the text span comprising 2 sentences ($\small t = 2$) in $\small \text{Listing 1}$ above, given cancer patients with mutation $\small v$ (EGFR) in gene $\small g$ (L858E), the patients showed a partial response to drug $\small d$ (gefitnib). Thus, a ternary relation response ( $\small \text{EGFR}$, $\small \text{L858E}$, $\small \text{gefitnib}$) exists among the three entities spanning across the two sentences in $\small \text{Listing 1}$.”
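The task definition above can be made concrete with a minimal sketch: it encodes the text span of Listing 1 as tokens, locates the three entity mentions, and computes each token's signed distance to every entity, which is one common way to derive the indices that positional embeddings are looked up from. The whitespace tokenization, the `relative_positions` helper and the distance encoding are illustrative assumptions, not the authors' preprocessing.

```python
# Sketch only: whitespace tokenization and signed-distance position features
# are assumptions for illustration, not the paper's preprocessing.

SPAN = [
    "The deletion mutation on exon-19 of EGFR gene was present in 16 patients , "
    "while the L858E point mutation on exon-21 was noted in 10 .",
    "All patients were treated with gefitnib and showed a partial response .",
]  # t = 2 consecutive sentences, spelled as in Listing 1

ENTITIES = ["EGFR", "L858E", "gefitnib"]  # n = 3 entity mentions


def relative_positions(tokens, entity_index):
    """Signed distance of every token to the entity at entity_index."""
    return [i - entity_index for i in range(len(tokens))]


# Concatenate the sentences so that positions can cross the sentence boundary.
tokens = [tok for sentence in SPAN for tok in sentence.split()]
entity_index = {e: tokens.index(e) for e in ENTITIES}
features = {e: relative_positions(tokens, idx) for e, idx in entity_index.items()}

# One candidate instance of the ternary relation "response".
candidate = ("response", tuple(ENTITIES))

print(entity_index)
print(features["gefitnib"][:5])
```

Because the two sentences are concatenated before positions are computed, the drug mention in the second sentence still receives a well-defined distance to the gene and mutation mentions in the first.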

Likewise, Neural Relation Extraction Within and Across Sentence Boundaries (Oct 2018) proposed an architecture for relation extraction in entity pairs spanning multiple sentences: inter-sentential dependency-based neural networks (iDepNN). iDepNN modeled the shortest and augmented dependency paths via recurrent and recursive neural networks to extract relationships within (intra-) and across (inter-) sentence boundaries. Compared to SVM and neural network baselines, iDepNN was more robust to false positives in relationships spanning sentences. The authors evaluated their model on four datasets from the newswire (MUC6) and medical (BioNLP shared tasks) domains, where it achieved state of the art performance and showed a better balance of precision and recall for inter-sentential relationships. It performed better than the 11 teams participating in the BioNLP Shared Task 2016, with a relative gain of 5.2% in $\small F_1$ (0.587 vs 0.558) over the winning team. They also released the cross-sentence annotations for MUC6.
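As a rough illustration of a dependency path that crosses a sentence boundary, the sketch below runs a breadth-first search over two toy dependency trees. The sentences, the hand-written edges, and the choice to join the roots of adjacent sentences with an extra edge are all assumptions made for this example; a real pipeline would take the edges from a dependency parser and use token positions, not word strings, as node identifiers.

```python
from collections import deque

# Hypothetical, hand-written dependency edges (head, dependent) for two toy
# sentences; a real system would obtain them from a dependency parser.
SENT_1 = [("joined", "Smith"), ("joined", "Acme")]       # "Smith joined Acme ."
SENT_2 = [("became", "He"), ("became", "chairman")]      # "He became chairman ."
ROOTS = ["joined", "became"]


def build_graph(sentences, roots):
    """Undirected graph over words; roots of adjacent sentences are linked."""
    graph = {}
    edges = [edge for sent in sentences for edge in sent]
    edges += list(zip(roots, roots[1:]))  # assumed inter-sentence link
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the node sequence or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph([SENT_1, SENT_2], ROOTS)
path = shortest_path(graph, "Smith", "chairman")
print(path)
```

The returned path passes through both sentence roots, which is what lets a path-based model relate an entity in one sentence to an entity in the next.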

Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) proposed a novel segmental hypergraph representation to model the overlapping entity mentions that are prevalent in many practical datasets. The authors showed that their model captured features and interactions that previous models could not, was robust in handling both overlapping and non-overlapping mentions, and maintained a low time complexity for inference.

In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (above), Learning to Recognize Discontiguous Entities (Oct 2018) focused on recognizing discontiguous entities, which may also overlap with one another. The authors proposed a novel hypergraph representation to jointly encode discontiguous entities of unbounded length that can overlap. Empirical results showed that their model achieved significantly better results when evaluated on standard data with many discontiguous entities.
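To see why overlapping and discontiguous mentions are awkward for a tagger that assigns one label per token, the following sketch represents each mention as a tuple of token indices and checks the two properties directly. The sentence and both mentions are hypothetical, chosen only for illustration, and this is not the hypergraph encoding proposed in either paper.

```python
# Hypothetical example sentence and mentions, chosen only for illustration.
tokens = ["left", "and", "right", "atrium", "dilated"]

mentions = {
    "left atrium": (0, 3),    # skips "and right": discontiguous
    "right atrium": (2, 3),   # contiguous
}


def is_discontiguous(indices):
    """True if the token indices do not form one unbroken span."""
    ordered = sorted(indices)
    return any(b - a > 1 for a, b in zip(ordered, ordered[1:]))


def overlap(a, b):
    """Token indices shared by two mentions."""
    return sorted(set(a) & set(b))


for name, indices in mentions.items():
    print(name, [tokens[i] for i in indices], is_discontiguous(indices))
print(overlap(mentions["left atrium"], mentions["right atrium"]))
```

Here the token "atrium" belongs to both mentions, so a single BIO label for that token cannot record both memberships; representations that allow shared tokens are needed for such cases.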

Most modern Information Extraction (IE) systems are implemented as sequential taggers and focus on modelling local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. GraphIE: A Graph-Based Framework for Information Extraction (Oct 2018) is a framework that operates over a graph representing both local and non-local dependencies between textual units (i.e. words or sentences). The algorithm propagated information between connected nodes through graph convolutions and exploited the richer representation to improve word level predictions. Results on three different tasks – social media, textual and visual information extraction – showed that GraphIE outperformed a competitive baseline (BiLSTM+CRF) in all tasks by a significant margin.
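The propagation step described above can be sketched with a single graph convolution layer. The version below assumes the widely used symmetric normalization with self-loops and a ReLU activation; the toy graph, features and weights are hypothetical, and GraphIE's actual layer may differ in its details.

```python
import math

# Toy graph: 3 textual units; edges stand in for local or non-local links.
EDGES = [(0, 1), (1, 2)]
FEATURES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one row per node
WEIGHT = [[0.5, -1.0], [1.0, 0.5]]                # hypothetical 2x2 layer weights


def normalized_adjacency(n, edges):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    degree = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def graph_conv(adj, features, weight):
    """One propagation step: ReLU(adj @ features @ weight)."""
    hidden = matmul(matmul(adj, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


adj = normalized_adjacency(len(FEATURES), EDGES)
output = graph_conv(adj, FEATURES, WEIGHT)
print(output)
```

After one layer each node's representation mixes in its direct neighbours; stacking layers lets information travel further along the graph, which is how non-local context can reach a word-level prediction.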

While character-based neural models have proven useful for many NLP tasks, there is a gap of sophistication between methods for learning representations of sentences and words. While most character models for learning representations of sentences are deep and complex, models for learning representations of words are shallow and simple. Also, it is still not clear which kind of architecture is the best for capturing character-to-word representations. To address these questions, Learning Better Internal Structure of Words for Sequence Labeling (Oct 2018) first investigated the gaps between methods for learning word and sentence representations. Furthermore, they proposed IntNet, a funnel-shaped wide convolutional neural architecture with no down-sampling for learning representations of the internal structure of words by composing their characters from limited, supervised training corpora. They evaluated their model on six sequence labeling datasets, including named entity recognition, part-of-speech tagging, and syntactic chunking. IntNet significantly outperformed other character embedding models and obtained new state of the art performance without relying on any external knowledge or resources.

State-of-the-art studies have demonstrated the superiority of joint modelling over pipeline implementation for medical named entity recognition and normalization due to the mutual benefits between the two processes. A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization (Dec 2018) [code  (empty repo, 2018-12-17)] proposed a novel deep neural multitask learning framework with explicit feedback strategies to jointly model recognition and normalization. Their method benefitted from the general representations of both tasks provided by multitask learning, and successfully converted hierarchical tasks into a parallel multitask setting while maintaining the mutual support between tasks. Their method performed significantly better than state of the art approaches on two publicly available medical literature datasets.

Natural Language Processing:

• “Relation classification is an important semantic processing task in the field of natural language processing (NLP). State of the art systems still rely on lexical resources such as WordNet or NLP systems like dependency parser and named entity recognizers (NER) to get high-level features. Another challenge is that important information can appear at any position in the sentence. To tackle these problems, we propose Attention-Based Bidirectional Long Short-Term Memory Networks (Att-BLSTM) to capture the most important semantic information in a sentence. The experimental results on the SemEval-2010 relation classification task show that our method outperforms most of the existing methods, with only word vectors.”
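The idea of letting attention pick out the most informative positions in a sentence can be sketched as a weighted sum over per-token hidden vectors. The hidden states and the attention vector below are hypothetical numbers, and the scoring function (a dot product with the `tanh` of each hidden vector, followed by a softmax) is a simplified assumption, not a reproduction of the Att-BLSTM model.

```python
import math

# Hypothetical BiLSTM outputs: one 3-dimensional hidden vector per token.
HIDDEN = [[0.1, 0.3, -0.2], [0.9, -0.4, 0.5], [0.0, 0.2, 0.1]]
QUERY = [1.0, -1.0, 0.5]  # hypothetical learned attention vector


def softmax(scores):
    """Normalize scores into weights that sum to one."""
    peak = max(scores)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden, query):
    """Weight each position by softmax(query . tanh(h)) and sum."""
    scores = [sum(q * math.tanh(v) for q, v in zip(query, h)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return weights, pooled


weights, sentence_vector = attention_pool(HIDDEN, QUERY)
print(weights, sentence_vector)
```

Because the weights depend on content and not on position, an informative token contributes strongly to the sentence vector wherever it appears in the sentence.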

[Image legend. GML: gradual machine learning (this paper); UR: unsupervised rule-based; UC: unsupervised clustering; SVM: support vector machine; DNN: Deep Learning.]

Our evaluation is conducted on three real datasets, which are described as follows:

• DS (DBLP-Scholar): The DS dataset contains the publication entities from DBLP and the publication entities from Google Scholar. The experiments match the DBLP entries with the Scholar entries.
• AB (Abt-Buy): The AB dataset contains the product entities from both Abt.com and Buy.com. The experiments match the Abt entries with the Buy entries.
• SG (Songs): The SG dataset contains song entities, some of which refer to the same songs. The experiments match the song entries in the same table.
• “Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) are developed to implement the non-linear node potentials in CRFs, but still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consists of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experiment results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all the four tasks, and can improve state-of-the-art results.”
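For readers unfamiliar with the linear-chain structure mentioned above, the sketch below shows Viterbi decoding, the standard way to find the best label sequence when scores decompose into per-position (node) and adjacent-label (edge) terms. All scores and the three-label set are hypothetical; in a neural CRF the node scores would come from a network, and this sketch does not implement the NCRF transducer itself, which relaxes the linear-chain assumption.

```python
# Hypothetical scores; in a neural CRF the node scores come from a network.
LABELS = ["O", "B", "I"]
NODE = [            # NODE[t][y]: score of label y at position t
    [2.0, 1.0, 0.0],
    [0.5, 2.0, 0.0],
    [0.0, 0.5, 2.0],
]
EDGE = [            # EDGE[prev][cur]: transition score
    [0.5, 0.5, -5.0],   # O -> I is strongly penalized
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.5],
]


def viterbi(node, edge):
    """Highest-scoring label sequence under a linear-chain model."""
    n_labels = len(node[0])
    score = list(node[0])
    back = []
    for t in range(1, len(node)):
        new_score, pointers = [], []
        for cur in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + edge[p][cur])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + edge[best_prev][cur] + node[t][cur])
        score = new_score
        back.append(pointers)
    best = max(range(n_labels), key=lambda y: score[y])
    path = [best]
    for pointers in reversed(back):      # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return list(reversed(path)), score[best]


path, total = viterbi(NODE, EDGE)
print([LABELS[y] for y in path], total)
```

The dynamic program only ever looks one label back, which is exactly the limitation that models capturing longer-range label dependencies aim to remove.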

• “Background. Biomedical named entity recognition (Bio-NER) is a fundamental task in handling biomedical text terms, such as RNA, protein, cell type, cell line, and DNA. Bio-NER is one of the most elementary and core tasks in biomedical knowledge discovery from texts. The system described here is developed by using the BioNLP/NLPBA 2004 shared task. Experiments are conducted on a training and evaluation set provided by the task organizers.
Results. Our results show that, compared with a baseline having a 70.09% $\small F_1$ score, the RNN Jordan-type and Elman-type algorithms have $\small F_1$ scores of approximately 60.53% and 58.80%, respectively. When we use CRF as a machine learning algorithm, CCA, GloVe, and word2vec have $\small F_1$ of 72.73%, 72.74%, and 72.82%, respectively.”

“In this paper, we use five categories (protein, DNA, RNA, cell type, and cell line) instead of the categories used in the ordinary NER process. An example of the NER tagged sentence is as follows: ‘IL-2 [ B-protein ] responsiveness requires three distinct elements [ B-DNA ] within the enhancer [ B-DNA ].’”
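The tagged example sentence and the $\small F_1$ figures quoted above can be connected with a small sketch that converts BIO tags into entity spans and scores a prediction against the gold spans. The gold tags follow the quoted example; the predicted tags are a hypothetical system output, and exact-match span scoring is an assumption about the evaluation, not a reproduction of the shared task's scorer.

```python
def bio_to_spans(tags):
    """Convert BIO tags to a set of (start, end_exclusive, type) spans."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def f1_score(gold, predicted):
    """Exact-match F1 over two sets of spans."""
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


tokens = ["IL-2", "responsiveness", "requires", "three", "distinct",
          "elements", "within", "the", "enhancer", "."]
gold_tags = ["B-protein", "O", "O", "O", "O", "B-DNA", "O", "O", "B-DNA", "O"]
# Hypothetical system output that misses the last DNA mention.
pred_tags = ["B-protein", "O", "O", "O", "O", "B-DNA", "O", "O", "O", "O"]

gold = bio_to_spans(gold_tags)
pred = bio_to_spans(pred_tags)
print(sorted(gold))
print(round(f1_score(gold, pred), 4))
```

In this toy case precision is 2/2 and recall is 2/3, so a single missed mention already lowers $\small F_1$ to 0.8.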

• “Reliable uncertainty quantification is a first step towards building explainable, transparent, and accountable artificial intelligent systems. Recent progress in Bayesian deep learning has made such quantification realizable. In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful at enhancing model performances in various NLP tasks.”

“In this work, we evaluate the benefits of quantifying uncertainties in modern neural network models applied in the context of three different natural language processing tasks. We conduct experiments on sentiment analysis, named entity recognition, and language modeling tasks with convolutional and recurrent neural network models. We show that by quantifying both uncertainties, model performances are improved across the three tasks. We further investigate the characteristics of inputs with high and low data uncertainty measures in Yelp 2013 and CoNLL 2003 datasets. For both datasets, our model estimates higher data uncertainties for more difficult predictions.”
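The distinction between model and data uncertainty can be illustrated with one common Monte Carlo formulation: run several stochastic forward passes, each predicting a mean and a variance, then treat the spread of the means as model uncertainty and the average predicted variance as data uncertainty. The numbers below are hypothetical, and whether this matches the paper's exact estimators is an assumption.

```python
from statistics import mean, pvariance

# Hypothetical outputs of T = 4 stochastic forward passes (for example with
# dropout kept active at test time) for a single input.
PASS_MEANS = [3.0, 3.4, 2.8, 3.2]
PASS_VARIANCES = [0.20, 0.25, 0.15, 0.20]


def decompose_uncertainty(means, variances):
    """Split predictive variance into model and data components."""
    prediction = mean(means)
    model_uncertainty = pvariance(means)      # spread across passes
    data_uncertainty = mean(variances)        # average predicted noise
    return prediction, model_uncertainty, data_uncertainty


prediction, model_u, data_u = decompose_uncertainty(PASS_MEANS, PASS_VARIANCES)
print(prediction, model_u, data_u, model_u + data_u)
```

Under this formulation, collecting more training data would be expected to shrink the model term, while the data term reflects noise inherent in the input and remains.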

• “In natural language processing, a lot of the tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, which size grows proportionally to the vocabulary length. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN dozens or hundreds of times without time-consuming hyperparameters tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable.”

• An Attention-Based BiLSTM-CRF Approach To Document-Level Chemical Named Entity Recognition (Apr 2018) [code] proposed a neural network approach to document-level chemical NER: an attention-based bidirectional long short-term memory network with a conditional random field layer (Att-BiLSTM-CRF). The approach leveraged document-level global information obtained by an attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. Their method used word and character embeddings as basic features. In addition, to investigate the effects of traditional features on deep learning methods, POS, chunking and dictionary features were added to the models. With little feature engineering, Att-BiLSTM-CRF achieved better performance than other state of the art methods on the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus and the BioCreative V chemical-disease relation (CDR) task corpus (F-scores of 91.14% and 92.57%, respectively).

• LSTMVoter: Chemical Named Entity Recognition Using a Conglomerate of Sequence Labeling Tools (Jan 2019) [code] introduced LSTMVoter, a bidirectional long short-term memory (LSTM) tagger that utilized a conditional random field (CRF) layer in conjunction with attention-based feature modeling. The approach used an attention mechanism to model information about the features. LSTMVoter outperformed each individual extractor integrated into it. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieved an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieved an F1-score of 89.01%. [This model is very similar to, but was outperformed by, the model described in Comparing CNN and LSTM Character-Level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition.]
