### Contents

The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time (Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning). Even within the more focused health, technical and scientific domains we face a continuous onslaught of new information and knowledge from which we must filter out the non-relevant information, seeking to retain (or hoping to find again) knowledge that is relevant to us.

As I discussed in my post, PubMed Publications by Year, the volume of publications in PubMed and PubMed Central is staggering: as of 2018-11-27 there were 29,111,301 indexed publications, of which 1,287,836 appeared in 2017 alone!  [PubMed queried in my browser on 2018-11-27: `all [sb]` returned the total count; `2017 [dp]` returned the 2017 count ("dp" = date of publication).]

Information overload is characterized by the difficulty of understanding an issue and effectively making decisions when one has too much information about that issue. In our infocentric world, we have an increasing dependency on relevant, accurate information that is buried in the avalanche of continuously generated information. Coincident with information overload is the phenomenon of attention overload: we have limited attention and we’re not always sure where to direct it. It can be difficult to limit how much information we consume when there’s always something new waiting for a click; before we know it, an abundance of messy and complex information has infiltrated our minds. If our processing strategies don’t keep pace, our online explorations create strained confusion instead of informed clarity. Hence, more information is not necessarily better.

• When Choice is Demotivating: Can One Desire Too Much of a Good Thing? discussed findings from 3 experimental studies that starkly challenged the implicit assumption that having more choices is more intrinsically motivating than having fewer choices. Those experiments, which were conducted in both field and laboratory settings, showed that people are more likely to make purchases or undertake optional classroom essay assignments when offered a limited array of 6 choices, rather than a more extensive array of 24 or 30 choices. Moreover, participants reported greater subsequent satisfaction with their selections and wrote better essays when their original set of options had been limited.

• Information overload is a long-standing issue: in her 2010 book Too Much To Know: Managing Scholarly Information before the Modern Age, Harvard Department of History Professor Ann Blair argued that the early modern methods of selecting, summarizing, sorting, and storing text (the 4S’s) are at the root of the techniques we use today in information management.

• For more discussion, see the first part of the blog post Information Overload, Fake News, and Invisible Gorillas  [local copy].

The construction of a well-crafted biomedical textual knowledge store (TKS) – with a focus on high quality, high impact material – partly addresses the issue of information overload. A TKS provides a curated source of preselected and stored textual material, upon which programmatic approaches such as text mining and data visualization can build a more focused, deeper understanding. Advanced NLP and ML methods applied to a TKS will assist the processing (e.g. clustering; dimensionality reduction; ranking; summarization) and understanding of PubMed/PubMed Central (PM/PMC) and other textual sources, based on user-defined interests.

As well, techniques such as active learning and analyses of user query patterns and preferences could assist refined queries to external sources (PubMed and other online search engines), and the processing of those new data, in an increasingly focused iterative approach. The incorporation of vector space approaches and other “fuzzy” search paradigms could incidentally assist knowledge discovery. Algorithms and software acting as intelligent agents could automatically, autonomously and adaptively scan PM/PMC and other sources for new knowledge in multiple subject/topic areas; for example, monitoring the biomedical literature for new information relevant to genomic variants.

The new information that is retrieved from a TKS may also be cross-queried against knowledge graphs to recover additional knowledge and discover new relations. The increasing availability of personalized genomic information and other personalized medical information will drive a demand for access to high quality TKS and methods to efficiently query and process those knowledge stores. Intelligently designed graphical user interfaces will allow the querying and viewing of those data (text; graphs; pathways and networks; images; etc.), per the user's needs.

## Text Classification

There is an increasing need for semi-supervised and unsupervised tools that can preprocess, analyse and classify raw text to extract interpretable content: for example, identifying topics and content-driven groupings of articles. One approach to information overload is to classify documents, grouping related information together. Accurate document classification is a key component of ensuring the quality of any digital library: unclassified documents impede systems – and hence users – from finding useful information.
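As a point of reference for the embedding-based methods discussed below, here is a minimal TF-IDF sketch (toy corpus, whitespace tokenization, plain Python with no external libraries) showing the classic way a document's characteristic terms are scored before any grouping or classification:

```python
import math
from collections import Counter

# Toy corpus: three short "documents".
docs = [
    "gene expression in tumor cells",
    "tumor suppressor gene mutations",
    "neural networks for text classification",
]

def tfidf(doc_tokens, all_docs_tokens):
    """TF-IDF scores for one document's terms against a small corpus."""
    n = len(all_docs_tokens)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in all_docs_tokens if term in d)  # document frequency
        scores[term] = (count / len(doc_tokens)) * math.log(n / df)
    return scores

tokenized = [d.split() for d in docs]
scores = tfidf(tokenized[0], tokenized)
# Terms unique to the document (e.g. "expression") score highest;
# "gene" and "tumor", shared with another document, score lower.
```

Note that this scoring is purely lexical: two documents with no words in common get zero similarity, which is precisely the weakness the Word Mover's Distance (below) addresses.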

The Word Mover's Distance (WMD) – introduced in From Word Embeddings To Document Distances (2015) [code | tutorial: Python implementation with WMD paper coauthor Matt Kusner] – is a novel distance function between text documents, based on word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences.

The WMD distance measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of another document. Although two documents may not share any words in common, WMD can still measure their semantic similarity by considering their word embeddings, whereas bag-of-words and term frequency-inverse document frequency (TF-IDF) methods measure similarity only through the literal overlap of words.

The WMD metric has no hyperparameters and is straightforward to implement. Furthermore, on eight real world document classification data sets, in comparison with seven state of the art baselines, the WMD metric demonstrated unprecedented low $\small k$-nearest neighbor document classification error rates.
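The transport intuition can be sketched in a few lines. In the special case of two equal-length documents with uniform word weights, the optimal transport underlying WMD reduces to a minimum-cost one-to-one matching between the two sets of embedded words; the 2-D "embeddings" below are made-up values for illustration, not trained vectors:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical 2-D "embeddings" (made-up values for illustration).
emb = {
    "cat":    np.array([0.0, 0.0]),
    "sits":   np.array([1.0, 0.0]),
    "kitten": np.array([0.0, 0.1]),
    "rests":  np.array([1.0, 0.1]),
}

def toy_wmd(doc_a, doc_b):
    """WMD in the special case of equal-length documents with uniform
    word weights, where the optimal transport reduces to a minimum-cost
    one-to-one matching between the two sets of embedded words."""
    cost = np.array([[np.linalg.norm(emb[a] - emb[b]) for b in doc_b]
                     for a in doc_a])
    rows, cols = linear_sum_assignment(cost)    # min-cost assignment
    return cost[rows, cols].sum() / len(doc_a)  # each word carries weight 1/n

# No shared words, yet the distance is small because the embeddings align.
print(toy_wmd(["cat", "sits"], ["kitten", "rests"]))
```

The general WMD allows unequal document lengths and word weights proportional to term frequency, which makes it a full optimal-transport (earth mover's distance) problem rather than an assignment problem.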


In the biomedical domain, Bridging the Gap: Incorporating a Semantic Similarity Measure for Effectively Mapping PubMed Queries to Documents (Nov 2017) presented a query-document similarity measure motivated by WMD. Unlike other similarity measures, their method relied on neural word embeddings to compute the distance between words, which helped identify related words when no direct matches were found between a query and a document (e.g., as shown in Fig. 1 in that paper).


In Representation Learning of Entities and Documents from Knowledge Base Descriptions – jointly by Studio Ousia and collaborators – the authors described TextEnt, a neural network model that learned distributed representations of entities and documents directly from a knowledge base. Given a document in a knowledge base consisting of words and entity annotations, they trained their model to predict the entity that the document described, mapping the document and its target entity close to each other in a continuous vector space. Their model was trained using a large number of documents extracted from Wikipedia. TextEnt (which performed somewhat better than their Wikipedia2Vec baseline model) used the last, fully-connected layer to classify documents into a set of pretrained classes.

This is similar to how the fully-connected layer in various ImageNet image classification models classify images into predefined categories – and for which, removing the last fully-connected layer enabled the use of those models for transfer learning, as described in the Stanford cs231n course page Transfer Learning  [local copy].

Recent word embedding methods such as word2vec [Efficient Estimation of Word Representations in Vector Space (Sep 2013)], introduced by Tomas Mikolov et al. at Google in 2013, are capable of learning semantic meaning and similarity between words in an entirely unsupervised manner using a contextual window, doing so much faster than previous methods. [For an interesting discussion, see Word2Vec and FastText Word Embedding with Gensim.]
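The "contextual window" can be illustrated by the (target, context) pair generation at the heart of skip-gram training. This sketch only generates the training pairs; it does not train the embedding model itself:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs the way word2vec's
    skip-gram variant slides a contextual window over a sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sent = "the cat sat on the mat".split()
print(skipgram_pairs(sent, window=1))
```

Each pair becomes one training example: the model learns to predict the context word from the target word, and words occurring in similar contexts end up with similar vectors.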


While vector space representations of words succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, the origins of that success remained opaque. GloVe: Global Vectors for Word Representation (2014) [project] – by Jeffrey Pennington, Richard Socher and Christopher D. Manning at Stanford University – analyzed and made explicit the model properties needed for such regularities to emerge in word vectors. The result was a new global log-bilinear regression model (GloVe ) that combined the advantages of the two major model families in the literature: global matrix factorization, and local context window methods. GloVe efficiently leveraged statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produced a vector space with meaningful substructure, as evidenced by its performance of 75% on a word analogy task. GloVe also outperformed related models on similarity tasks and named entity recognition.

FastText  [GitHub] – a library for efficient learning of word representations and sentence classification – is an extension to word2vec that was introduced by Tomas Mikolov and colleagues at Facebook AI Research in a series of papers in 2016:

• fastText, a library for efficient learning of word representations and sentence classification, is an extension to word2vec. The model is described in Enriching Word Vectors with Subword Information (Jul 2016, updated Jun 2017).  “In this paper, we propose to learn representations for character $\small n$-grams, and to represent words as the sum of the $\small n$-gram vectors.”  [See Section 3.2 in that paper.]

FastText, including character based representations, is well-described in the following blog posts.

• “Each word is represented as a bag of character $\small n$-grams in addition to the word itself, so for example, for the word matter, with $\small n = 3$, the fastText representations for the character $\small n$-grams is $\small \text{<ma, mat, att, tte, ter, er>}$.  $\small <$ and $\small >$ are added as boundary symbols to distinguish the $\small n$-gram of a word from a word itself, so for example, if the word $\small \text{mat}$ is part of the vocabulary, it is represented as $\small \text{<mat>}$. This helps preserve the meaning of shorter words that may show up as $\small n$-grams of other words. Inherently, this also allows you to capture meaning for suffixes/prefixes.”
• This two-part post [local copy]:

[Figure: character-based language model.]


Like word2vec, fastText is an unsupervised learning algorithm for obtaining vector representations for words; unlike word2vec, fastText can also classify text. FastText learns word embeddings in a manner very similar to word2vec, except that fastText enriches word vectors with subword information, using character $\small n$-grams of variable length. For example, the trigrams for the word “apple” are “app”, “ppl”, and “ple” (ignoring the boundary symbols at the start and end of the word); the word embedding vector for “apple” is the sum of the vectors of all of its $\small n$-grams. These character $\small n$-grams allow the algorithm to identify prefixes, suffixes, stems, and other phonological, morphological and syntactic structure in a manner that does not rely on words being used in similar contexts (and thus being represented in similar regions of the vector space). After training, rare words can be properly represented, since it is highly likely that some of their $\small n$-grams also appear in other words; fastText represents an out-of-vocabulary term as the normalized sum of the vector representations of its $\small n$-grams.
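The subword scheme from Section 3.2 of that paper can be sketched as follows; this is a simplification, since real fastText extracts n-grams for a range of lengths (typically 3 to 6) and hashes them into buckets:

```python
def char_ngrams(word, n=3):
    """fastText-style character n-grams: the word is wrapped in < and >
    boundary symbols, and the full bracketed word is kept as a feature
    in addition to its n-grams."""
    w = f"<{word}>"
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams + [w]

print(char_ngrams("matter"))
# The word's embedding is then the sum of the vectors of these n-grams,
# which is how unseen words still receive a representation.
```

This reproduces the "matter" example quoted above: the n-gram `<ma` is distinct from any interior trigram, and a short word like "mat" would be represented as `<mat>`, distinct from the trigram `mat`.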

• Regarding fastText and out-of-vocabulary (OOV) words, note that text input into fastText is lowercased (affecting the embeddings).

• The effectiveness of word embeddings for downstream NLP tasks is limited by OOV words, for which embeddings do not exist. In 2017 Pinter et al. (Mimicking Word Embeddings using Subword RNNs  [code]) presented MIMICK, an approach to generating OOV word embeddings compositionally by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK did not require retraining on the original word embedding corpus; instead, learning was performed at the type level.


• Sebastian Ruder also discussed OOV. Regarding Mimicking Word Embeddings using Subword RNNs, he stated:

“Another interesting approach to generating OOV word embeddings is to train a character-based model to explicitly re-create pretrained embeddings (Pinter et al., 2017). This is particularly useful in low-resource scenarios, where a large corpus is inaccessible and only pretrained embeddings are available.”

FastText was compared to other approaches on the classification of PubMed abstracts in Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research (Apr 2017), where it performed very well. Interestingly, the embeddings learned by fastText on the entire English Wikipedia worked very well in that task, indicating that the diverse topics covered by Wikipedia provided a rich corpus from which to learn text semantics. In addition, Wikipedia contains documents related to biomedical research, such that its vocabulary is not as limited in that domain, compared to models trained on corpora from Freebase and GoogleNews. Performance using GoogleNews embeddings was comparable to the PubMed and PubMed+Wiki embeddings. These results suggested that learning embeddings on a domain-specific corpus is not a requirement for success in these tasks. That conclusion was echoed in A Comparison of Word Embeddings for the Biomedical Natural Language Processing (Jul 2018), which among its conclusions found that word embeddings trained on biomedical domain corpora do not necessarily perform better than those trained on general domain corpora for downstream biomedical NLP tasks.


A recent probabilistic extension of fastText, Probabilistic FastText for Multi-Sense Word Embeddings (Jun 2018) [code], produced accurate representations of rare, misspelled, and unseen words. Probabilistic FastText outperformed both fastText (which has no probabilistic model) and dictionary-level probabilistic embeddings (which do not incorporate subword structures) on several word-similarity benchmarks, including English rare word and foreign language datasets. It also achieved state of the art performance on benchmarks that measure the ability to discern different meanings. The proposed model was the first to achieve multi-sense representations while having enriched semantics on rare words.


Work extending the concept of word embeddings to sentence, paragraph and document embeddings was introduced in 2014 by Quoc V. Le and Tomas Mikolov at Google as Paragraph Vectors in Distributed Representations of Sentences and Documents (May 2014) [media: A gentle introduction to Doc2Vec], commonly known as doc2vec. However, there was some controversy as to whether doc2vec could outperform centroid methods, and others struggled to reproduce those results, leading Lau and Baldwin in An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation to perform an extensive comparison between various document embedding methods across different domains. They found that doc2vec performed robustly when using models trained on large external corpora, and could be further improved by using pretrained word embeddings. The general consensus was that different methods are best suited for different tasks; for example, centroids performed well on tweets, but were outperformed on longer documents. [For good discussions of various approaches, see Document Embedding, and The Current Best of Universal Word Embeddings and Sentence Embeddings.]
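The centroid baseline mentioned above is simple to sketch; the random vectors below stand in for pretrained word embeddings:

```python
import numpy as np

# Random stand-ins for pretrained word embeddings (illustration only).
rng = np.random.default_rng(0)
vocab = "cats sit on warm mats".split()
emb = {w: rng.normal(size=8) for w in vocab}

def centroid(tokens):
    """'Centroid' document embedding: the mean of the word vectors,
    a strong baseline for short texts such as tweets."""
    return np.mean([emb[t] for t in tokens], axis=0)

doc_vec = centroid("cats sit on mats".split())
```

In contrast, doc2vec learns a dedicated paragraph vector jointly with the word vectors, which lets it capture word order and document-level context that a plain average discards.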



More recent work on document embedding was presented in Word Mover’s Embedding: From Word2Vec to Document Embedding (Oct 2018) [data, code]. While the celebrated word2vec technique yields semantically rich representations for individual words, there has been relatively less success in extending it to generate unsupervised sentence or document embeddings. Recent work demonstrated that the Word Mover's Distance (WMD), a distance measure between documents that aligns semantically similar words, yields unprecedented kNN classification accuracy. However, WMD is expensive to compute, and it is hard to extend its use beyond a kNN classifier. This paper proposed Word Mover’s Embedding (WME), a novel approach to building an unsupervised document (or sentence) embedding from pretrained word embeddings. In their experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matched or outperformed state of the art techniques, with significantly higher accuracy on problems of short length.


Google’s 2015 publication Semi-Supervised Sequence Learning (Nov 2015) [code] used two approaches to improve sequence learning with long short-term memory (LSTM) recurrent networks. The first predicted what came next in a sequence (a conventional language model); the second used a sequence autoencoder, which read the input sequence into a vector and then predicted the input sequence again. Both algorithms could be used as an unsupervised pretraining step for a later supervised sequence learning algorithm. An important result from their experiments was that using more unlabeled data from related tasks in the pretraining improved the generalization (e.g. classification accuracy) of a subsequent supervised model. This was equivalent to adding substantially more labeled data, supporting the thesis that it is possible to use unsupervised learning with more unlabeled data to improve supervised learning. They also found that after being pretrained with the two approaches, LSTMs are more stable and generalize better. Thus, this paper showed that it is possible to use LSTMs for NLP tasks such as document classification, and that a language model or a sequence autoencoder can help stabilize learning in LSTMs. On five benchmarks, the LSTM reached or surpassed the performance levels of all previous baselines.


In addition to the Google-provided code, the text-classification-models-tf repository also provides Tensorflow implementations of various text classification models. Another repository by that GitHub user, Transfer Learning for Text Classification with Tensorflow, provides a TensorFlow implementation of semi-supervised learning for text classification – an implementation of Google’s Semi-supervised Sequence Learning paper.

An independent implementation for vector representation of documents, Doc2VecC (note the extra “C” at the end; Efficient Vector Representation for Documents through Corruption (Jul 2017) [code]), represented each document as a simple average of word embeddings, ensuring that the representation captured the semantic meaning of the document during learning. Doc2VecC produced significantly better word embeddings than word2vec: the simple model architecture introduced by Doc2VecC matched or outperformed the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enabled training on billions of words per hour on a single machine; at the same time, the model was very efficient at generating representations of unseen documents at test time.
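A sketch of the Doc2VecC idea, with random stand-in embeddings: the document vector is an average of word vectors, with random word dropout (the "corruption" of the title) applied during training so the model does not over-rely on any single word:

```python
import numpy as np

# Random stand-ins for the word embeddings Doc2VecC would learn.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["deep", "learning", "is", "fun"]}

def doc2vecc_repr(tokens, keep_prob=0.7):
    """Doc2VecC-style document vector: average of word embeddings, with
    random word dropout ('corruption') applied during training.
    At inference time, keep_prob=1.0 gives the plain average."""
    kept = [w for w in tokens if rng.random() < keep_prob]
    if not kept:          # guard against dropping every word
        kept = tokens
    return np.mean([emb[w] for w in kept], axis=0)

vec = doc2vecc_repr(["deep", "learning", "is", "fun"], keep_prob=1.0)
```

Because the final representation is just an average, embedding an unseen document at test time costs one vector lookup per word, which is the source of the efficiency claims above.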


Le and Mikolov’s doc2vec model was used in a 2018 paper from Imperial College London, Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning (Aug 2018) [code here and here], which described a methodology that brought together powerful vector embeddings from NLP with tools from graph theory (that exploited diffusive dynamics on graphs) to reveal natural partitions across scales. Their framework used doc2vec to represent text in vector form, then applied a multi-scale community detection method (Markov Stability) to partition a similarity graph of document vectors. The method allowed them to obtain clusters of documents with similar content, at different levels of resolution, in an unsupervised manner. An analysis of a corpus of 9,000 news articles showed consistent groupings of documents according to content without a priori assumptions about the number or type of clusters to be found.


An Analysis of Hierarchical Text Classification Using Word Embeddings (Sep 2018) trained and evaluated the machine learning classifiers fastText, XGBoost, SVM, and Keras’ CNN, as well as the word embedding generation methods GloVe, word2vec and fastText, on publicly available data in a hierarchical classification task. FastText performed best as a classifier, and also provided very good results as a word embedding generator despite the relatively small amount of data provided.


In other work related to sentence embeddings, Unsupervised Learning of Sentence Representations Using Sequence Consistency (Sep 2018) from IBM Research proposed a simple yet powerful unsupervised method to learn universal distributed representations of sentences by enforcing consistency constraints on sequences of tokens, applicable to the classification of text and transfer learning. Their ConsSent model was compared to unsupervised methods (including GloVe, fastText, ELMo, etc.) and supervised methods (including InferSent, etc.) on a classification transfer task in their Table 1, where ConsSent performed very well, overall.

Sentence embedding is an important research topic in NLP: it is essential to generate a good embedding vector that fully reflects the semantic meaning of a sentence in order to achieve an enhanced performance for various NLP tasks. Although two sentences may employ different words or different structures, people will recognize them as the same sentence as long as the implied semantic meanings are highly similar. Hence, a good sentence embedding approach should satisfy the property that if two sentences have different structures but convey the same meaning (i.e., paraphrase sentences), then they should have the same (or at least similar) embedding vectors.

In 2018 Myeongjun Jang and Pilsung Kang at Korea University presented Paraphrase Thought: Sentence Embedding Module Imitating Human Language Recognition (Oct 2018) [code | Jupyter notebook], which introduced their P-thought model. Inspired by human language recognition, they proposed the concept of semantic coherence, which should be satisfied for good sentence embedding methods: similar sentences should be located close to each other in the embedding space.

P-thought was designed as a dual generation model, which received a single sentence as input and generated both the input sentence and its paraphrase sentence, simultaneously. Given a (sentence, paraphrase) sentence tuple, it should be possible to generate both the sentence itself and its paraphrase sentence from the representation vector of an input sentence. For the P-thought model, they employed a seq2seq structure with a gated recurrent unit (GRU) cell. The encoder transformed the sequence of words from an input sentence into a fixed-sized representation vector, whereas the decoder generated the target sentence based on the given sentence representation vector. The P-thought model had two decoders: when the input sentence was given, the first decoder, named “auto-decoder,” generated the input sentence as is. The second decoder, named “paraphrase-decoder,” generated the paraphrase sentence of the input sentence.


P-thought pursued maximal semantic coherence during training. Compared to a number of baselines (bag of words, Sent2Vec, etc.) on the MS-COCO dataset, InferSent and P-thought far surpassed the other models, with P-thought slightly outperforming InferSent. In the case of P-thought with a one-layer Bi-RNN, the P-coherence value was comparable to that of InferSent (0.7454 and 0.7432, respectively); P-thought with a two-layer forward RNN gave a score of 0.7899.

• Whereas P-thought with a two-layer Bi-RNN gave a much higher P-coherence score (0.9725), this was an overtraining artefact. The main limitation of that work was that there were insufficient paraphrase sentences for training the models: P-thought models with more complex encoder structures tended to overfit the MS-COCO datasets. Although this problem could be resolved by acquiring more paraphrase sentences, it was not easy for these authors to obtain a large number of paraphrase sentences. [In that regard, note my comments (above) on the DuoRC dataset, over which P-thought could be trained.]

In Google Brain/OpenAI’s Adversarial Training Methods for Semi-Supervised Text Classification (May 2017) [code | non-author code | media], adversarial training provided a means of regularizing supervised learning algorithms, while virtual adversarial training was able to extend supervised learning algorithms to the semi-supervised setting. However, both methods required making small perturbations to numerous entries of the input vector, which was inappropriate for sparse high-dimensional inputs such as one-hot word representations. The authors extended adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieved state of the art results on multiple benchmark semi-supervised and purely supervised tasks: the learned word embeddings were of higher quality, and the model was less prone to overfitting while training.
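The perturbation applied to the embeddings can be sketched as a step of size $\small \epsilon$ along the normalized gradient of the loss with respect to the embeddings (the gradient values below are arbitrary placeholders, not computed from a real model):

```python
import numpy as np

def adversarial_perturbation(grad, epsilon=0.02):
    """Adversarial perturbation applied to word *embeddings* rather
    than to one-hot inputs: r_adv = epsilon * g / ||g||_2, a small step
    in the direction that most increases the loss."""
    norm = np.linalg.norm(grad)
    return epsilon * grad / (norm + 1e-12)  # avoid division by zero

# Arbitrary example gradient with respect to an embedding vector.
g = np.array([3.0, 4.0])
r = adversarial_perturbation(g, epsilon=0.01)
# With g = (3, 4) and epsilon = 0.01, r = (0.006, 0.008).
```

During training, the model is then asked to classify the perturbed embedding `x + r` correctly as well, which acts as a regularizer; virtual adversarial training applies the same idea to unlabeled examples using the model's own predictions.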


Hierarchical approaches to document classification include HDLTex: Hierarchical Deep Learning for Text Classification (Oct 2017) [code]. Recently the performance of traditional supervised classifiers has degraded as the number of documents has increased, because growth in the number of documents is accompanied by growth in the number of categories. This paper approached the problem differently from current document classification methods, which view it as multiclass classification, instead performing “hierarchical classification.” Traditional multiclass classification techniques work well for a limited number of classes, but performance drops with an increasing number of classes, as is the case with hierarchically organized documents. Hierarchical deep learning solves this problem by creating neural architectures that specialize deep learning approaches for their level of the document hierarchy. HDLTex employed stacks of deep learning architectures (RNN, CNN) to provide specialized understanding at each level of the document hierarchy. Testing on a data set of documents obtained from the Web of Science showed that combinations of RNN at the higher level and DNN or CNN at the lower level produced accuracies consistently higher than those obtainable by conventional approaches using naïve Bayes or a support vector machine (SVM).
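The hierarchical routing idea can be sketched with hypothetical stand-in "models" (keyword rules here, in place of the paper's RNN/CNN stacks): a top-level model selects the parent domain, and a domain-specific child model selects the fine-grained class.

```python
# Toy stand-in for the level-1 model: pick the parent domain.
def parent_model(doc):
    return "biology" if "cell" in doc else "cs"

# One specialized level-2 model per parent domain (toy rules).
child_models = {
    "biology": lambda doc: "genomics" if "genome" in doc else "ecology",
    "cs":      lambda doc: "ml" if "training" in doc else "systems",
}

def classify(doc):
    """Route a document down the hierarchy: parent first, then the
    child model specialized for that parent."""
    parent = parent_model(doc)
    return parent, child_models[parent](doc)

print(classify("the cell genome was sequenced"))   # ('biology', 'genomics')
```

The payoff is that each child model only ever discriminates among a small set of sibling classes, which is what keeps accuracy high as the total number of categories grows.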


Hierarchical Attention Networks for Document Classification (2016; alternate link) [non-author code here, here, here and here] – by authors at Carnegie Mellon University and Microsoft Research – described a hierarchical attention network for document classification. Their model had two distinctive characteristics: a hierarchical structure that mirrored the hierarchical structure of documents, and two levels of attention mechanisms (word-level and sentence-level), enabling it to attend differentially to more and less important content when constructing the document representation. Experiments on six large scale text classification tasks demonstrated that the proposed architecture outperformed previous methods by a substantial margin. Visualization of the attention layers illustrated that the model selected qualitatively informative words and sentences.


Recent work (2018) from Stanford University (Training Classifiers with Natural Language Explanations (Aug 2018) [code | worksheet | author’s blog post | demo video | discussion]) proposed a framework, BabbleLabble, for training classifiers in which human annotators provided natural language explanations for each labeling decision. A semantic parser converted those explanations into programmatic labeling functions that generated noisy labels for an arbitrary amount of unlabeled data, which were then used to train a classifier. On three relation extraction tasks, users were able to train classifiers with comparable $\small F_1$ scores 5-100x faster by providing explanations instead of just labels: a classifier trained with BabbleLabble achieved the same $\small F_1$ score as a classifier trained only with end labels, using 5x to 100x fewer examples. On the spouse task, 30 explanations were worth around 5,000 labels; on the disease task, 30 explanations were worth around 1,500 labels; and on the protein task, 30 explanations were worth around 175 labels.
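The core mechanism, an explanation compiled into a programmatic labeling function, can be sketched with a toy rule. This hand-written function stands in for the output of BabbleLabble's semantic parser, given an explanation like "label TRUE because the words 'his wife' appear between the two people":

```python
def lf_spouse(sentence, person1, person2):
    """Toy labeling function for the spouse relation: label 1 (spouse)
    if a marriage keyword appears between the two person mentions,
    else 0. Produces noisy labels over unlabeled sentences."""
    between = sentence.split(person1)[-1].split(person2)[0]
    return 1 if "wife" in between or "husband" in between else 0

print(lf_spouse("Barack and his wife Michelle attended", "Barack", "Michelle"))
```

Many such noisy labeling functions are then aggregated (in BabbleLabble's case via the Snorkel-style label model mentioned below) to produce training labels for the final classifier.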

• BabbleLabble is part of the Snorkel project (a training data creation and management system focused on information extraction), the successor to the now-deprecated DeepDive project. Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.
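The labeling-function idea behind BabbleLabble and Snorkel can be sketched in a few lines. This is a minimal illustration only – the function names and data below are invented for the example, and Snorkel's real API and label model are considerably more sophisticated:

```python
# Minimal sketch of Snorkel-style programmatic labeling: each labeling
# function votes POSITIVE (1), NEGATIVE (-1), or ABSTAIN (0), and the
# noisy votes are combined here by simple majority.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

def lf_married_keyword(sentence):
    # e.g. derived from the explanation "because 'married' appears between them"
    return POSITIVE if "married" in sentence else ABSTAIN

def lf_sibling_keyword(sentence):
    # e.g. "because a sibling word appears in the sentence"
    return NEGATIVE if "brother" in sentence or "sister" in sentence else ABSTAIN

def majority_label(sentence, lfs):
    score = sum(lf(sentence) for lf in lfs)
    return POSITIVE if score > 0 else NEGATIVE if score < 0 else ABSTAIN

lfs = [lf_married_keyword, lf_sibling_keyword]
label = majority_label("Alice married Bob in 2010", lfs)  # -> 1 (POSITIVE)
```

Snorkel replaces the majority vote with a generative label model that estimates the accuracies and correlations of the labeling functions before training the end classifier on the resulting noisy labels.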

Document relevance ranking, also known as ad-hoc retrieval (Harman, 2005), is the task of ranking documents from a large collection using only the query and the text of each document. This contrasts with standard information retrieval (IR) systems that rely on text-based signals in conjunction with network structure and/or user feedback. Text-based ranking is particularly important when click-logs do not exist or are small, and when the network structure of the collection is non-existent or uninformative for query-focused relevance. Examples include various domains in digital libraries, e.g. patents or scientific literature (Wu et al., 2015; Tsatsaronis et al., 2015), enterprise search, and personal search.

Deep Relevance Ranking Using Enhanced Document-Query Interactions (Sep 2018) [code], by the Athens University of Economics and Business and Google AI, explored several new models for document relevance ranking, building upon the Deep Relevance Matching Model (DRMM) of Guo et al. (2016). Unlike DRMM, which used context-insensitive encodings of terms and query-document term interactions, they injected rich context-sensitive encodings throughout their models, which they extended in several ways, including multiple views of the query and document inputs. Their models outperformed BM25-based baselines on datasets from the BIOASQ question answering challenge and TREC ROBUST 2004.

Graph-based Deep-Tree Recursive Neural Network (DTRNN) for Text Classification (Sep 2018) employed a graph representation learning approach to text classification, where the text was provided as nodes in a graph. First, their novel graph-to-tree conversion mechanism, deep-tree generation (DTG), converted the graph-structured textual data into deep trees, generating a richer and more accurate representation for the nodes (vertices). DTG added flexibility in exploring the vertex neighborhood information, better reflecting the second order proximity and homophily [the tendency of similar people/objects to group together] equivalence in a graph. Then, a Deep-Tree Recursive Neural Network (DTRNN) method was used to classify vertices that contained text data in graphs. The model captured the neighborhood information of a node better than the traditional breadth-first search tree generation method. Experimental results on three citation datasets proved the effectiveness of the proposed DTRNN method, giving state of the art classification accuracy for graph structured text. They also trained graph data in the DTRNN with added attention models; however, the attention mechanism did not improve accuracy, because the DTRNN algorithm alone already captured more features of each node.

Graph Convolutional Networks for Text Classification (Oct 2018) proposed the use of graph convolutional networks (GCN) for text classification. They built a single text graph for a corpus based on word co-occurrence and document word relations, then learned a Text GCN for the corpus. Their Text GCN was initialized with one-hot representation for words and documents; it then jointly learned the embeddings for both words and documents, supervised by the known class labels for the documents. Experimental results on multiple benchmark datasets demonstrated that a vanilla Text GCN without any external word embeddings or knowledge outperformed state of the art methods for text classification. Text GCN also learned predictive word and document embeddings. Additionally, experimental results showed that the improvement of Text GCN over state of the art comparison methods became more prominent as the percentage of training data was lowered, suggesting the robustness of Text GCN to less training data in text classification.
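The core propagation step of a GCN layer, $\small H^{(l+1)} = \rho(\hat{A} H^{(l)} W^{(l)})$, can be sketched directly. The toy graph and the random (untrained) weight matrix below are stand-ins for illustration; Text GCN builds its adjacency from word co-occurrence and document-word relations, and learns the weights:

```python
# One GCN layer: H' = ReLU(A_hat @ H @ W), where A_hat is the
# symmetrically normalized adjacency matrix with self-loops added.
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)      # toy word/document graph
A_tilde = A + np.eye(3)                     # add self-loops
d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization

H = np.eye(3)                               # one-hot node features, as in Text GCN
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))             # layer weights (untrained stand-in)
H_next = np.maximum(A_hat @ H @ W, 0.0)     # ReLU(A_hat H W)
```

With one-hot inputs, the first layer effectively learns an embedding per node; stacking a second layer lets each document aggregate information from words two hops away.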

Text Classification:

• “Building explainable systems is a critical problem in the field of Natural Language Processing (NLP), since most machine learning models provide no explanations for the predictions. Existing approaches for explainable machine learning systems tend to focus on interpreting the outputs or the connections between inputs and outputs. However, the fine-grained information is often ignored, and the systems do not explicitly generate the human-readable explanations. To better alleviate this problem, we propose a novel generative explanation framework that learns to make classification decisions and generate fine-grained explanations at the same time. More specifically, we introduce the explainable factor and the minimum risk training approach that learn to generate more reasonable explanations. We construct two new datasets that contain summaries, rating scores, and fine-grained reasons. We conduct experiments on both datasets, comparing with several strong neural network baseline systems. Experimental results show that our method surpasses all baselines on both datasets, and is able to generate concise explanations at the same time.”

• “Extreme multi-label text classification (XMTC) is a task for tagging each given text with the most relevant multiple labels from an extremely large-scale label set. This task can be found in many applications, such as product categorization, web page tagging, news annotation and so on. Many methods have been proposed so far for solving XMTC, while most of the existing methods use traditional bag-of-words (BOW) representation, ignoring word context as well as deep semantic information. XML-CNN, a state-of-the-art deep learning-based method, uses convolutional neural network (CNN) with dynamic pooling to process the text, going beyond the BOW-based approaches but failing to capture 1) the long-distance dependency among words and 2) different levels of importance of a word for each label. We propose a new deep learning-based method, AttentionXML, which uses bidirectional long short-term memory (LSTM) and a multi-label attention mechanism for solving the above 1st and 2nd problems, respectively. We empirically compared AttentionXML with six other state-of-the-art methods over five benchmark datasets. AttentionXML outperformed all competing methods under all experimental settings except only a couple of cases. In addition, a consensus ensemble of AttentionXML with the second best method, Parabel, could further improve the performance over all five benchmark datasets.”

## Text Summarization

Approximately 1.28 million articles were added to PubMed in 2017, including ~0.36 million full-text articles added to PubMed Central, at the rate of ~3,485 new articles per day (queried 2018-06-29; see also my blog post).

• Of those, ~122,381 included the word “cancer” in the title or abstract, i.e. ~335 papers/day (PubMed query 2017[dp] AND cancer[tiab] executed 2018-06-29; note the capitalized Boolean).

• Narrowing the search to 2017[dp] AND 'breast cancer'[tiab] or 2017[dp] AND 'acute myeloid leukemia'[tiab] returned 16,706 and 2,030 articles (45.77 and 5.56 articles/day), respectively.

The following command-line query shows the numbers of PubMed publications per indicated year (queried on the indicated date: PubMed continually adds older, previously non-indexed articles):
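The exact command from the original post is not preserved in this excerpt; a reconstruction using NCBI's E-utilities `esearch` endpoint (the year range below is illustrative) might look like:

```shell
# Per-year PubMed publication counts via NCBI E-utilities (a sketch;
# the original post's exact command is not reproduced here).
for year in $(seq 2002 2018); do
  xml=$(curl -s -m 10 "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=${year}%5Bdp%5D&rettype=count")
  count=$(echo "$xml" | sed -n 's:.*<Count>\([0-9]*\)</Count>.*:\1:p')
  echo "$year $count"
  sleep 1   # stay under NCBI's request-rate limit
done > ~/pm.dat
```

Each line of `~/pm.dat` then holds a year and its publication count, ready for the gnuplot one-liner below.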

With those data in ~/pm.dat, executing gnuplot -p -e 'plot "~/pm.dat" notitle' gives this plot:

Those data show linear growth over a ~13 year span (ca. 2002-2015), tapering off recently. Contrary to numerous assertions in various research papers and the media, there is no exponential growth in this literature; nevertheless, the output is staggering.

Accurate text summarization is needed to address, in part, the information overload arising from the enormous volume and overall growth of the PM/PMC biomedical literature. Text summarization generally falls into one of two categories:

• extractive summarization, which summarizes text by copying parts of the input, and

• abstractive summarization, which generates new phrases (possibly rephrasing or using words that were not in the original text).

Abstractive summarization tends to be more concise than extractive summarization (which tends to be more repetitive and burdened with non-relevant text). However, extractive summarization is much easier to implement, and can provide unaltered evidentiary snippets of text.
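The extractive side of this distinction can be made concrete with a toy summarizer that scores sentences by average word frequency and copies the top-scoring ones verbatim. This is an illustration of the category only, not any of the systems reviewed below:

```python
# Toy extractive summarizer: rank sentences by the average corpus
# frequency of their words, then copy the top-k sentences unaltered,
# preserving document order.
import re
from collections import Counter

def extractive_summary(text, k=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(s):
        toks = re.findall(r'[a-z]+', s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]   # keep document order

text = "Cancer research grows quickly. Cancer publications doubled. The weather is nice."
summary = extractive_summary(text, k=2)
```

Note that the output sentences are copied unaltered – the defining property of extraction – whereas an abstractive system would be free to rephrase them.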

### Extractive Summarization

The Word Mover's Distance (WMD) was applied to the extractive summarization task in Efficient and Effective Single-Document Summarizations and A Word-Embedding Measurement of Quality (Oct 2017). WMD uses word2vec as its word embedding representation, and measures the dissimilarity between two documents as the minimum cumulative distance that the embedded words of one document must “travel” to reach the embedded words of the other.
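The exact WMD requires solving an optimal-transport problem; a cheap lower bound, the *relaxed* WMD of Kusner et al. (2015), lets each word simply travel to its nearest counterpart in the other document. A sketch with toy 2-d embeddings standing in for word2vec vectors:

```python
# Relaxed Word Mover's Distance lower bound: each word in one document
# moves wholesale to its nearest embedded word in the other document;
# the bound is symmetrized by taking the max of the two directions.
import numpy as np

def relaxed_wmd(doc1, doc2, emb):
    X = np.array([emb[w] for w in doc1])
    Y = np.array([emb[w] for w in doc2])
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise distances
    return max(D.min(axis=1).mean(), D.min(axis=0).mean())

emb = {"cat": [0.0, 1.0], "feline": [0.1, 1.0],
       "dog": [1.0, 0.0], "canine": [1.0, 0.1]}    # toy embeddings
d_near = relaxed_wmd(["cat"], ["feline"], emb)      # semantically close
d_far  = relaxed_wmd(["cat"], ["dog"], emb)         # semantically distant
```

Semantically similar documents yield small distances even with no words in common, which is precisely what makes WMD attractive for sentence/summary similarity.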

WMD has also been used in:

Likewise, Data-driven Summarization of Scientific Articles (Apr 2018) [datasets | slides] applied WMD to the biomedical domain, comparing that approach to others in a very interesting and revealing study on extractive and abstractive summarization. The examples presented in their Tables 4 and 5 demonstrated very clearly the differences in extractive summarization over full length articles, for title and abstract generation from the full length texts. While the results for title generation were promising, the models struggled with generating the abstract, highlighting the necessity of developing novel models capable of efficiently dealing with long input and output sequences while preserving the quality of the generated sentences.

Ranking Sentences for Extractive Summarization with Reinforcement Learning (Apr 2018) [code | live demo] conceptualized extractive summarization as a sentence ranking task. While many extractive summarization systems are trained using cross entropy loss in order to maximize the likelihood of the ground-truth labels, they do not necessarily learn to rank sentences based on their importance, due to the absence of a ranking-based objective. [Cross entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1.]

• In this paper the authors argued that models trained on cross entropy training are prone to generating verbose summaries with unnecessarily long sentences and redundant information. They proposed overcoming these difficulties by globally optimizing the ROUGE evaluation metric and learning to rank sentences for summary generation through a reinforcement learning objective.

Their neural summarization model, REFRESH (REinFoRcement Learning-based Extractive Summarization), consisted of a hierarchical document encoder and a hierarchical sentence extractor. During training, it combined the maximum-likelihood cross entropy loss with rewards from policy gradient reinforcement learning to directly optimize the evaluation metric relevant for the summarization task. The model was applied to the CNN and DailyMail datasets, on which it outperformed baseline state of the art extractive and abstractive systems when evaluated automatically and by humans. They showed that their global optimization framework rendered extractive models better at discriminating among sentences for the final summary, and that the state of the art abstractive systems evaluated lagged behind the extractive ones when the latter were globally trained.
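The policy-gradient idea can be sketched with a toy REINFORCE loop: sample an extract, score it with a unigram-overlap reward (a crude stand-in for ROUGE), and nudge sentence-selection logits toward higher-reward extracts. The sentences, reward, and update rule below are invented for illustration and differ from the paper's model:

```python
# Toy REINFORCE for extractive sentence selection: Bernoulli inclusion
# probabilities per sentence, a unigram-overlap reward against a
# reference, and a running-mean baseline to reduce gradient variance.
import numpy as np

rng = np.random.default_rng(0)
sentences = ["the drug shrank tumours", "the weather was mild",
             "patients improved rapidly"]
reference = set("the drug shrank tumours and patients improved".split())

def reward(extract):
    hyp = set(" ".join(extract).split())
    return len(reference & hyp) / len(reference)   # crude ROUGE-1-like recall

logits = np.zeros(len(sentences))
baseline = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-logits))              # inclusion probabilities
    pick = rng.random(len(p)) < p                  # sample an extract
    r = reward([s for s, m in zip(sentences, pick) if m])
    logits += 0.2 * (pick.astype(float) - p) * (r - baseline)  # REINFORCE step
    baseline = 0.9 * baseline + 0.1 * r            # running-mean baseline
```

After training, the two reference-relevant sentences acquire high inclusion probabilities while the irrelevant one does not, illustrating how a ranking emerges from reward rather than from per-sentence labels.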

Iterative Document Representation Learning Towards Summarization with Polishing (Sep 2018) [code] introduced Iterative Text Summarization (ITS), an iteration based model for supervised extractive text summarization, inspired by the observation that it is often necessary for a human to read an article multiple times in order to fully understand and summarize its contents. Current summarization approaches read through a document only once to generate a document representation, resulting in a sub-optimal representation. To address this issue they introduced a model which iteratively polished the document representation on many passes through the document. As part of their model, they also introduced a selective reading mechanism that decided more accurately the extent to which each sentence in the model should be updated. Experimental results on the CNN/DailyMail and DUC2002 datasets demonstrated that their model significantly outperformed state of the art extractive systems when evaluated by machines and by humans.

Comparing tables in those respective papers (above), note that ITS (Iterative Document Representation Learning Towards Summarization with Polishing) outperformed REFRESH (Ranking Sentences for Extractive Summarization with Reinforcement Learning).

Extractive Summarization:

• “We showed the use of a deep LSTM based model in a sequence learning problem to encode sentences with common semantic information to similar vector representations. The presented latent representation of sentences has been shown useful for sentence paraphrasing and document summarization. We believe that reversing the encoder sentences helped the model learn long dependencies over long sentences. One of the advantages of our simple and straightforward representation is the applicability into a variety of tasks. Further research in this area can lead into higher quality vector representations that can be used for more challenging sequence learning tasks.”

• “This paper discusses some central caveats of summarisation, incurred in the use of the ROUGE metric for evaluation, with respect to optimal solutions. The task is NP-hard, of which we give the first proof. Still, as we show empirically for three central benchmark datasets for the task, greedy algorithms empirically seem to perform optimally according to the metric. Additionally, overall quality assurance is problematic: there is no natural upper bound on the quality of summarisation systems, and even humans are excluded from performing optimal summarisation.”

• “Extractive summarization is very useful for physicians to better manage and digest Electronic Health Records (EHRs). However, the training of a supervised model requires disease-specific medical background and is thus very expensive. We studied how to utilize the intrinsic correlation between multiple EHRs to generate pseudo-labels and train a supervised model with no external annotation. Experiments on real-patient data validate that our model is effective in summarizing crucial disease-specific information for patients.”

#### Probing the Effectiveness of Extractive Summarization

Content Selection in Deep Learning Models of Summarization (Oct 2018) [code] experimented with deep learning models of summarization across the domains of news, personal stories, meetings, and medical articles in order to understand how content selection was performed. They found that many sophisticated features of state of the art extractive summarizers did not improve performance over simpler models, suggesting that it is easier to create a summarizer for a new domain than previous work suggests, and bringing into question the benefit of deep learning models of summarization for those domains that do have massive datasets (i.e., news). At the same time, their results suggested important directions for new research in summarization: namely, new forms of sentence representation or external knowledge sources better suited to the summarization task.

“PubMed. We created a corpus of 25,000 randomly sampled medical journal articles from the PubMed Open Access Subset. We only included articles if they were at least 1000 words long and had an abstract of at least 50 words in length. We used the article abstracts as the ground truth human summaries.”

### Abstractive Summarization

In Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond (Aug 2016) [non-author code], researchers at IBM Watson (Ramesh Nallapati et al.) and the Université de Montréal (Caglar Gulcehre) modeled abstractive text summarization using attentional encoder-decoder recurrent neural networks.

That approach was extended by Richard Socher and colleagues at SalesForce in A Deep Reinforced Model for Abstractive Summarization (Nov 2017), which described a sophisticated, highly performant reinforcement learning-based system for abstractive text summarization that set the state of the art in this domain, circa mid-2017:

Socher’s work also used an attention mechanism and a new machine learning objective to address the “repeating phrase” problem, via:

• an intra-temporal attention mechanism in the bidirectional long short-term memory (Bi-LSTM) encoder that recorded previous attention weights for each of the input tokens (words), while a sequential intra-attention model in the LSTM decoder took into account which words had already been generated by the decoder – i.e., an encoder-decoder network; and,

• a new objective function that combined the maximum-likelihood cross entropy loss used in prior work with rewards from policy gradient reinforcement learning, to reduce exposure bias.
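The temporal normalization at the heart of the intra-temporal attention mechanism penalizes input positions that the decoder has already attended to; following Paulus et al. (2017), the exponentiated score of each input token at step $\small t$ is divided by the sum of its exponentiated scores at all previous steps. A sketch with toy attention scores (not the paper's implementation):

```python
# Intra-temporal attention: e'_{ti} = exp(e_{ti}) for t = 1, and
# exp(e_{ti}) / sum_{j<t} exp(e_{ji}) thereafter, followed by a softmax
# over input positions. Repeated attention to a token is damped.
import numpy as np

def intra_temporal_weights(scores):
    """scores: array of shape (decoder_steps, encoder_positions)."""
    exp_s = np.exp(scores)
    weights, past = [], np.zeros(exp_s.shape[1])
    for t, row in enumerate(exp_s):
        ep = row if t == 0 else row / past   # temporal normalization
        weights.append(ep / ep.sum())        # normalize over input positions
        past = past + row
    return np.vstack(weights)

scores = np.array([[2.0, 0.0, 0.0],   # step 1 attends strongly to token 0
                   [2.0, 0.0, 0.0]])  # identical raw scores at step 2
W = intra_temporal_weights(scores)
```

With identical raw scores at both steps, the weight on token 0 drops from step 1 to step 2 – exactly the behavior that discourages the decoder from re-describing the same input.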

The encoder-decoder employed in Socher's work allowed the model to generate new words that were not part of the input article, while the copy-mechanism allowed the model to copy over important details from the input even if these symbols were rare in the training corpus. At each decoding step the intra-temporal attention function attended over specific parts of the encoded input sequence in addition to the decoder’s own hidden state and the previously-generated word. This kind of attention prevented the model from attending over the same parts of the input on different decoding steps. Intra-temporal attention could also reduce the amount of repetition when attending over long documents.

• While this intra-temporal attention function ensured that different parts of the encoded input sequence were used, their decoder could still generate repeated phrases based on its own hidden states, especially when generating long sequences. To prevent that, the authors incorporated more information about the previously decoded sequence into the decoder. To generate a token [i.e. word], the decoder used either a token-generation softmax layer or a pointer mechanism to copy rare or unseen tokens from the input sequence. [In this regard, note that the probabilistic fastText algorithm could also deal with rare and out-of-vocabulary (OOV) words.] A switch function decided at each decoding step whether to use the token generation, or the pointer.

• This is a proprietary system and code for this work is not available, but there are four Python implementations available on GitHub (keyphrase search “Deep Reinforced Model for Abstractive Summarization”), as well as an OpenNMT implementation that also links to GitHub.
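The generation/copy switch described above amounts to mixing two distributions: $\small P(w) = p_{gen} \cdot P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, x_i = w} a_i$. A sketch with toy numbers (illustrating the mechanism, not any specific system):

```python
# Pointer-generator mixture: interpolate the decoder's vocabulary
# softmax with copying from the source text via the attention
# distribution, which lets the model emit out-of-vocabulary tokens.
import numpy as np

vocab = ["the", "tumour", "shrank", "<unk>"]
p_vocab = np.array([0.5, 0.2, 0.25, 0.05])   # decoder softmax over vocab
source = ["the", "rare-gene"]                # "rare-gene" is OOV
attention = np.array([0.3, 0.7])             # attention over source tokens
p_gen = 0.6                                  # switch: generate vs. copy

extended = vocab + ["rare-gene"]             # extended vocabulary
P = np.zeros(len(extended))
P[:len(vocab)] = p_gen * p_vocab             # generation path
for a, w in zip(attention, source):          # copy path
    P[extended.index(w)] += (1 - p_gen) * a
```

Even though "rare-gene" has no vocabulary entry, the copy path assigns it substantial probability – this is how rare or unseen tokens survive decoding.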

Follow-on work (2018) by Socher and colleagues, Improving Abstraction in Text Summarization (Aug 2018), proposed two techniques to improve the level of abstraction of generated summaries. First, they decomposed the decoder into a contextual network that retrieved relevant parts of the source document, and a pretrained language model that incorporated prior knowledge about language generation. The decoder generated tokens by interpolating between selecting words from the source document via a pointer network as well as selecting words from a fixed output vocabulary. The contextual network had the sole responsibility of extracting and compacting the source document, whereas the language model was responsible for the generation of concise paraphrases. Second, they proposed a novelty metric that was optimized directly through policy learning (a reinforcement learning reward) to encourage the generation of novel phrases (summary abstraction).

In related work – described by Junyang Lin et al. in Global Encoding for Abstractive Summarization (Jun 2018) [code] – researchers at Peking University developed a model with an encoder similar to that employed in the Socher/SalesForce approach (above), employing a Bi-LSTM decoder that generated summary words. Their approach differed from Socher’s method [not cited] in that Lin et al. fed their encoder output at each time step into a convolutional gated unit, which – with a self-attention mechanism – allowed the encoder output at each time step to become a new representation vector with further connection to the global source-side information. Self-attention encouraged the model to learn long-term dependencies without adding much computational complexity. Since the convolutional module could extract n-gram features of the whole source text and self-attention learned the long-term dependencies among the components of the input source text, the gate (based on the output of the CNN and self-attention modules over the source representations from the RNN encoder) could perform global encoding of the encoder outputs. Based on the output of the CNN and self-attention modules, the logistic sigmoid function output a vector of values between 0 and 1 at each dimension: if a value was close to 0, the gate removed most of the information at the corresponding dimension of the source representation; if it was close to 1, it retained most of the information. The model thus performed neural abstractive summarization through a global encoding framework, which controlled the information flow from the encoder to the decoder based on the global information of the source context, generating summaries of higher quality while reducing repetition.

Christopher Manning’s group at Stanford University, in collaboration with Google Brain, also employed pointer-generator networks (used by Socher/Salesforce, above) in their well-cited abstractive summarization method, Get to the Point: Summarization with Pointer-Generator Networks. Coauthor Abigail See discussed this work in her excellent post Taming Recurrent Neural Networks for Better Summarization. This approach first used a hybrid pointer-generator network that could copy words from the source text via pointing, which aided accurate reproduction of information while retaining the ability to produce novel words through the generator. The approach then used “coverage” to keep track of what had been summarized, which discouraged repetition. [“Coverage” refers to a coverage vector [see Tu et al., Modeling Coverage for Neural Machine Translation (Aug 2016)] that keeps track of the attention history.]
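The coverage mechanism can be sketched directly from its definition: the coverage vector is the sum of the attention distributions over all previous decoder steps, and the coverage loss at each step penalizes overlap between the current attention and that history. The toy attention rows below are invented for illustration:

```python
# Coverage (See et al., 2017): c_t = sum of attention distributions at
# steps < t; coverage loss at step t = sum_i min(a_{t,i}, c_{t,i}).
# Re-attending to an already-covered position incurs a penalty.
import numpy as np

attn = np.array([[0.7, 0.2, 0.1],    # step 1 attention over the input
                 [0.6, 0.3, 0.1]])   # step 2 re-attends to position 0

coverage = np.zeros(attn.shape[1])
cov_loss = 0.0
for a in attn:
    cov_loss += np.minimum(a, coverage).sum()  # overlap with attention history
    coverage += a
```

Here the second step's heavy re-attention to position 0 produces most of the loss, which during training pushes the model to spread attention – and hence summary content – over unused parts of the source.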

Although it is a commercial (closed source) project, Primer.ai’s August 2018 blog article Machine-Generated Knowledge Bases introduced an abstractive summarization approach that was applied to create “missing” biographies that should exist in Wikipedia, including an interesting product demonstration. That approach could assist with addressing the information overload associated with the volume of the PubMed/PubMed Central literature; the tools they used (TLDR: Re-Imagining Automatic Text Summarization; Building seq-to-seq Models in TensorFlow (and Training Them Quickly)) could be implemented relatively easily via the approaches described in this REVIEW. Much of that work, for example, is based on See and Manning’s Get To The Point: Summarization with Pointer-Generator Networks approach, discussed in the preceding paragraph; consequently, approaches to reimplement/extend Primer.ai’s Quicksilver abstractive summarization project via seq2seq models with attention are well in hand.

A very interesting and promising project from Google Brain, Generating Wikipedia by Summarizing Long Sequences (Jan 2018) [code | OpenReview | media], considered English Wikipedia as a supervised machine learning task for multi-document summarization, where the input comprised a Wikipedia topic (title of article) and a collection of non-Wikipedia reference documents, and the target was the Wikipedia article text. They described the first attempt to abstractively generate the first section (lead) of Wikipedia articles conditioned on reference text. They used extractive summarization to coarsely identify salient information, and a neural abstractive model to generate the article. In addition to running strong baseline models on the task, they modified their Transformer architecture to consist only of a decoder, which performed better than RNN and Transformer encoder-decoder models in the case of longer input sequences. They showed that their modeling improvements allowed them to generate entire Wikipedia articles.

[In NLP, perplexity refers to the prediction capability of a language model: a less perplexed model is a better model.]

Another very interesting paper, Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization (August 2018) [dataset, code, and demo], introduced extreme summarization (XSum), a new single-document summarization task which did not favor extractive strategies and called for an abstractive modeling approach. The idea was to create a short, one-sentence news summary answering the question “What is the article about?”. Their novel abstractive model, conditioned on the article’s topics, was based entirely on CNN. They demonstrated that this architecture captured long-range dependencies in a document and recognized pertinent content, outperforming an oracle extractive system and state of the art abstractive approaches when evaluated automatically and by humans. The example illustrated in their Fig. 1 is very impressive, indeed (note, e.g., the substitution of “a small recreational plane” with “light aircraft”):

As the example in their Fig. 1 shows, the summary is very different from a headline: it drew on information interspersed in various parts of the document and displayed multiple levels of abstraction, including paraphrasing, fusion, synthesis, and inference. That work built upon a dataset for the proposed task, created by harvesting online news articles that often included a first-sentence summary. They further proposed a novel deep learning model for the extreme summarization task: unlike most existing abstractive approaches, which relied on RNN-based encoder-decoder architectures, they presented a topic-conditioned neural model based entirely on CNN. Convolution layers captured long-range dependencies between words in the document more effectively than RNN, allowing the model to perform document-level inference, abstraction, and paraphrasing. The convolutional encoder associated each word with a topic vector (capturing whether it was representative of the document’s content), while the convolutional decoder conditioned each word prediction on a document topic vector.

Recently, neural abstractive text summarization with sequence-to-sequence (seq2seq) models has gained popularity. Generally speaking, most of these techniques differ in one of three categories: network structure, parameter inference, and decoding/generation. Other concerns include efficiency and parallelism for training a model. Neural Abstractive Text Summarization with Sequence-to-Sequence Models (Dec 2018) [code] provided a comprehensive literature and technical survey of different seq2seq models for abstractive text summarization from the viewpoint of network structures, training strategies, and summary generation algorithms. As many models were first proposed for language modeling and generation tasks such as machine translation – and later applied to abstractive text summarization – they also provided a brief review of those models. As part of their survey, they developed an open source library, the Neural Abstractive Text Summarizer (NATS) toolkit, for abstractive text summarization. NATS is equipped with several important features, including attention, a pointing mechanism, repetition handling, and beam search. Experiments on the CNN/Daily Mail dataset examined the effectiveness of several different neural network components. Finally, they benchmarked two models implemented in NATS on two recently released datasets: Newsroom and Bytecup.
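Beam search, one of the NATS features listed above, can be sketched with a toy next-token model. The transition table and token names below are invented for illustration; NATS's implementation operates over neural decoder states:

```python
# Minimal beam search: at each step, expand every unfinished hypothesis
# with all possible next tokens and keep only the k highest
# log-probability partial sequences.
import math

# toy conditional log-probabilities: P(next token | last token)
logp = {
    "<s>":  {"the": math.log(0.6), "a": math.log(0.4)},
    "the":  {"cell": math.log(0.7), "end": math.log(0.3)},
    "a":    {"cell": math.log(0.9), "end": math.log(0.1)},
    "cell": {"end": math.log(1.0)},
}

def beam_search(k=2, steps=3):
    beams = [(0.0, ["<s>"])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "end":                 # finished hypothesis: keep as-is
                candidates.append((score, seq))
                continue
            for tok, lp in logp[seq[-1]].items():
                candidates.append((score + lp, seq + [tok]))
        beams = sorted(candidates, reverse=True)[:k]   # prune to top-k
    return beams[0][1]

best = beam_search()
```

With beam width 1 this degenerates to greedy decoding; wider beams trade computation for the chance to recover sequences whose best continuation starts from a locally sub-optimal token.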

Abstractive Summarization:

• “In this study, we constructed a dataset focused on summaries with three sentences. We annotated and analyzed the structure of the summaries in the considered dataset. In particular, we proposed a structure-aware summarization model combining the summary structure classification model and summary-specific summarization sub-models. Through our experiment, we demonstrated that our proposed model improves summarization performance over the baseline model.”

• “In this work, we used a bidirectional encoder-decoder architecture; each of which is a bidirectional recurrent neural network consists of two recurrent layers, one for learning history textual context and the other for learning future textual context. The output of the forward encoder was fed as input into the backward decoder while the output of the backward encoder was fed into the forward decoder. Then, a bidirectional beam search mechanism is used to generate tokens for the final summary one at a time. The experimental results have shown the effectiveness and the superiority of the proposed model compared to the state of the art models. Even though the pointer-generator network has alleviated the OOV problem, finding a way to tackle the problem while encouraging the model to generate summaries with more novelty and high level of abstraction is an exciting research problem. Furthermore, we believe that there is a real need to propose an evaluation metric besides ROUGE to optimize on summarization models, especially for long sequences.”

• Abstractive Summarization Using Attentive Neural Techniques (Oct 2018) modified and optimized a translation model with self-attention for generating abstractive sentence summaries. The effectiveness of this base model along with attention variants was compared and analyzed in the context of standardized evaluation sets and test metrics. However, those metrics were found to be limited in their ability to effectively score abstractive summaries, and the authors proposed a new approach, based on the intuition that an abstractive model requires an abstractive evaluation.

• “To improve the quality of summary evaluation, we introduce the “VERT” metric [GitHub], an evaluation tool that scores the quality of a generated hypothesis summary as compared to a reference target summary. …

• “The effect of modern attention mechanisms as applied to sentence summarization has been tested and analyzed. We have shown that a self-attentive encoder-decoder can perform the sentence summarization task without the use of recurrence or convolutions, which are the primary mechanisms in state of the art summarization approaches today. An inherent limitation of these existing systems is the computational cost of training associated with recurrence. The models presented can be trained on the full Gigaword dataset in just 4 hours on a single GPU. Our relative dot-product self-attention model generated the highest quality summaries among our tested models and displayed the ability of abstracting and reducing complex dependencies. We also have shown that n-gram evaluation using ROUGE metrics falls short in judging the quality of abstractive summaries. The VERT metric has been proposed as an alternative to evaluate future automatic summarization based on the premise that an abstractive summary should be judged in an abstractive manner.”

• “We address the problem of abstractive summarization in two directions: proposing a novel dataset and a new model. First, we collect the Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit. We use such informal crowd-generated posts as our text source because we empirically observe that existing datasets mostly use formal documents (such as news articles) as source text; thus, they may suffer from biases in which key sentences are usually located at the beginning of the text and favorable summary candidates already appear in the text in nearly exact form. Such biases can not only be structural clues that extractive methods exploit, but also obstacles that hinder abstractive methods from learning their text abstraction capability. Second, we propose a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of the text at different levels of abstraction. With quantitative evaluation and user studies via Amazon Mechanical Turk, we show that the Reddit TIFU dataset is highly abstractive and that the MMN outperforms state-of-the-art summarization models.”

• “Most abstractive summarization methods employ sequence-to-sequence (seq2seq ) models where an RNN encoder embeds an input document and another RNN decodes a summary sentence. Our MMN has two major advantages over seq2seq-based models. First, RNNs accumulate information in a few fixed-length memories at every step regardless of the length of an input sequence, and thus may fail to utilize far-distant information due to vanishing gradient. … Second, RNNs cannot build representations of different ranges, since hidden states are sequentially connected over the whole sequence. This still holds even with hierarchical RNNs that can learn multiple levels of representation. In contrast, our model exploits a set of convolution operations with different receptive fields; hence, it can build representations of not only multiple levels but also multiple ranges (e.g. sentences, paragraphs, and the whole document). …”
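
The MMN quote above turns on a simple property of stacked convolutions: each additional layer widens the receptive field, so stacks of different depths cover different ranges of the input (words, phrases, longer spans). A small sketch of that arithmetic – the layer configurations are illustrative, not the MMN's actual architecture:

```python
def receptive_field(layers):
    """Receptive field of stacked 1-D convolutions.
    `layers` is a list of (kernel_size, stride) pairs; returns how many
    input positions influence a single output position."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field...
        jump *= stride              # ...and striding multiplies the step
    return rf

# Hypothetical stacks loosely analogous to "multiple ranges":
short_range = [(3, 1)]                 # one conv: ~word/short-phrase scale
longer_range = [(3, 1)] * 4            # four stacked convs: wider context
print(receptive_field(short_range))    # 3
print(receptive_field(longer_range))   # 9
```

This is why a memory built from several such stacks can expose representations at multiple ranges simultaneously, whereas an RNN's hidden state mixes the whole history into one fixed-length vector.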

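The complaint above – that n-gram evaluation with ROUGE falls short for abstractive summaries – is easy to see concretely: a paraphrase can preserve meaning while sharing almost no surface n-grams with the reference. A minimal sketch of ROUGE-N recall (simplified; real ROUGE also applies stemming, stopword options, and F-scores):

```python
from collections import Counter

def rouge_n(hypothesis, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams also present in the
    hypothesis, with clipped counts as in the standard definition."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "the cabinet approved the budget"
paraphrase = "ministers signed off on the spending plan"  # same meaning
print(rouge_n(paraphrase, reference, n=1))  # 0.2 despite semantic overlap
```

The paraphrase scores only 0.2 on unigram recall (the lone shared token is “the”), which is the intuition behind proposing semantically oriented metrics such as VERT.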