### These Contents

Although multitask learning and transfer learning have similarities, they are not the same. Transfer learning aims only at achieving high performance in the target task by transferring knowledge from the source task, while multitask learning tries to learn the target and the source tasks simultaneously. For an excellent overview of transfer learning, see Sebastian Ruder’s Transfer Learning. More Effective Transfer Learning for NLP also provides a concise introduction to language models and transfer learning. For a review of deep transfer learning, see A Survey on Deep Transfer Learning.

Transfer learning (a subfield of which is domain adaptation) is the reuse of a pretrained model on a new problem. In transfer learning, the knowledge of an already trained machine learning model is applied to a different but related problem. For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects, such as sunglasses. With transfer learning, we try to exploit what has been learned in one task to improve generalization in another: we transfer the weights that a network has learned on Task A to a new Task B. The general idea is to take knowledge that a model has learned from a task where a lot of labeled training data is available and use it in a new task where little data is available. Instead of starting the learning process from scratch, you start from patterns that have been learned from solving a related task.
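The weight-transfer idea can be sketched in a few lines. This is a toy illustration only, not a method from any of the papers discussed here: `task_a_weights` is a hypothetical stand-in for a feature layer learned on Task A, and only a new logistic-regression head is trained on a handful of Task B examples while the transferred weights stay frozen.

```python
import math

def features(x, shared_weights):
    """Frozen feature extractor: one tanh layer whose weights came from Task A."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in shared_weights]

def train_head(data, shared_weights, epochs=200, lr=0.5):
    """Fit a new logistic-regression head for Task B; shared_weights is never updated."""
    head = [0.0] * len(shared_weights)
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            h = features(x, shared_weights)
            z = sum(a * b for a, b in zip(head, h)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss with respect to z
            head = [a - lr * grad * b for a, b in zip(head, h)]
            bias -= lr * grad
    return head, bias

def predict(x, shared_weights, head, bias):
    """Return the predicted Task B label (0 or 1) for input x."""
    h = features(x, shared_weights)
    return 1 if sum(a * b for a, b in zip(head, h)) + bias > 0 else 0

# Hypothetical weights "learned on Task A" (assumed, not taken from any real model).
task_a_weights = [[1.0, 0.0], [0.0, 1.0]]

# Task B has only four labeled examples.
task_b_data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([2.0, 1.0], 1), ([-1.0, 0.5], 0)]

head, bias = train_head(task_b_data, task_a_weights)
print([predict(x, task_a_weights, head, bias) for x, _ in task_b_data])
```

In practice the transferred layers are often unfrozen later and fine-tuned at a lower learning rate; the sketch keeps them fixed to make the reuse explicit.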

A 2016 paper by Socher and colleagues (SalesForce), A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks [OpenReview; code; non-author code; online demo (not working, 2018-11-13)], described a multitask NLP model. In that earlier work they introduced a joint many-task model together with a strategy for successively growing its depth (stacked Bi-LSTM layers: POS tagging in the first Bi-LSTM layer; word chunking in layer 2; dependency parsing in layer 3; semantic relatedness and textual entailment in layers 4 and 5, respectively) to solve increasingly complex tasks. Higher layers included shortcut connections – i.e., residual layers: see Deep Residual Learning for Image Recognition – to lower-level task predictions to reflect linguistic hierarchies. They used a simple regularization term to allow for optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference with the other tasks.
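The text above does not spell out the form of that regularization term. One common choice, assumed here purely for illustration, is a squared-distance penalty that discourages the shared weights from drifting far from the values they held after the previous task was trained. The names `successive_penalty`, `regularized_loss`, and `delta` are hypothetical.

```python
def successive_penalty(theta, theta_prev, delta):
    """delta times the squared L2 distance between current and earlier shared weights."""
    return delta * sum((a - b) ** 2 for a, b in zip(theta, theta_prev))

def regularized_loss(task_loss, theta, theta_prev, delta=0.01):
    """Task loss plus a penalty for moving the shared weights (assumed form)."""
    return task_loss + successive_penalty(theta, theta_prev, delta)

theta_prev = [0.5, -1.0, 2.0]   # shared weights after training the lower-level task
theta_same = [0.5, -1.0, 2.0]   # unchanged weights incur no penalty
theta_moved = [1.5, -1.0, 2.0]  # one weight moved by 1.0

print(regularized_loss(0.3, theta_same, theta_prev, delta=0.1))
print(regularized_loss(0.3, theta_moved, theta_prev, delta=0.1))
```

A larger `delta` protects the earlier tasks more strongly, at the cost of leaving the current task less freedom to adapt the shared layers.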

• Their single end-to-end model obtained state-of-the-art or competitive results [late 2016] on five different tasks: tagging, chunking, parsing, relatedness, and entailment. The inference procedure of their model began at the lowest level and worked upward to higher layers and more complex tasks; their model handled the five different tasks in the order of POS tagging, chunking, dependency parsing, semantic relatedness, and textual entailment. Their hypothesis was that it was important to start from the lower-level tasks and gradually move to higher-level tasks. For example, introducing multi-task syntactic supervision (e.g. POS: part-of-speech tags) at the lower levels of a deep LSTM can improve overall performance of higher-level tasks such as dependency parsing.


Table 1 in A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks shows the very interesting results on the test sets for the five tasks (tagging; chunking; parsing; relatedness; entailment) described in that paper.


• The column “Single” showed the results of handling each task separately using single-layer Bi-LSTMs, while the column “JMT all” showed the results of their Joint Many-Task (JMT) model. The single-task settings used only the annotations of their own tasks; for example, when handling dependency parsing as a single task, the POS and chunking tags were not used. The results for all five tasks improved in the JMT model, which showed that the five different tasks could be handled in a single model. The JMT model also allowed the authors to access arbitrary information learned from the different tasks. To use the model just as a POS tagger, one could use only the first Bi-LSTM layer.

• Table 1 also showed the results of five subsets of the different tasks. For example, in the case of “JMT ABC”, only the first three layers of the Bi-LSTMs were used to handle the three tasks. In the case of “JMT DE”, only the top two layers were used as a two-layer Bi-LSTM, omitting all information from the first three layers. The results of the closely related tasks (“AB”, “ABC”, and “DE”) showed that the JMT model improved both the high-level and the low-level tasks. The results of “JMT CD” and “JMT CE” showed that the parsing task could be improved by the semantic tasks.

• Other results suggested that the Bi-LSTMs efficiently captured global information necessary for dependency parsing. Moreover, their single-task [dependency parsing] result already achieved high accuracy without the POS and chunking information. Those results also provided insight into the effectiveness of the layers in ELMo’s bidirectional language model (biLM). Section 5.3 of the ELMo paper explored the different types of contextual information captured in biLMs and used two intrinsic evaluations to show that syntactic information was better represented at lower layers, while semantic information was captured at higher layers.

Computer vision has benefited from initializing multiple deep layers with weights pretrained on large supervised training sets like ImageNet, whereas NLP typically sees initialization of only the lowest layer of deep models with pretrained word vectors. A 2017 paper by Socher and colleagues (SalesForce), Learned in Translation: Contextualized Word Vectors (Jun 2018) [code; author’s blog], introduced an approach (CoVe: context vectors) to transferring knowledge from an encoder pretrained on machine translation to a variety of downstream NLP tasks. CoVe used a deep LSTM encoder from an attentional seq2seq model trained for machine translation to contextualize word vectors. Adding those context vectors (CoVe) improved performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis, question classification, entailment, and question answering. The ability to share a common representation of words in the context of the sentences that include them could further improve transfer learning in NLP.


In regard to multitask learning, MQAN (the multitask question answering network) made use of a novel dual coattention and multi-pointer-generator decoder to multitask across all tasks in decaNLP. The Natural Language Decathlon: Multitask Learning as Question Answering demonstrated that jointly training an MQAN on all tasks with the right anti-curriculum strategy could achieve performance comparable to that of ten separate MQANs, each trained separately. As stated in the paper (paraphrased here):

"... multitask learning has been successful when models are able to capitalize on relatedness amongst tasks while mitigating interference from dissimilarities. When tasks are sufficiently related, they can provide an inductive bias that forces models to learn more generally useful representations. By unifying tasks under a single perspective, it is possible to explore these relationships. "

The “decaNLP/MQAN” paper offered some insight into how MQAN attained robust multitask learning. Some key observations from the paper in this regard include the following:

• MQAN was able to perform nearly as well as, or better than, the single-task setting when trained in the multitask setting for each task, despite being capped at the same number of trainable parameters in both (see that paper). A collection of MQANs trained for each task individually would use far more trainable parameters than a single MQAN trained jointly on decaNLP. This suggests that MQAN uses trainable parameters more efficiently in the multitask setting by learning to pack or share parameters in a way that limits catastrophic forgetting.

• MQAN trained on decaNLP learns to generalize beyond the specific domains for any one task, while also learning representations that make learning completely new tasks easier.

• In an analysis of how MQAN chose to output answer words: “The model’s reliance on the question pointer for SST [the Stanford Sentiment Treebank, i.e. the “sentiment analysis” task] allows it to copy different, but related class labels with little confusion. This suggests these multitask models are more robust to slight variations in questions and tasks and can generalize to new and unseen classes. These results demonstrate that models trained on decaNLP have potential to simultaneously generalize to out-of-domain contexts and questions for multiple tasks and even adapt to unseen classes for text classification. This kind of zero-shot domain adaptation in both input and output spaces suggests that the breadth of tasks in decaNLP encourages generalization beyond what can be achieved by training for a single task.”

• MQAN trained on decaNLP is the first single model to achieve reasonable performance on such a wide variety of complex NLP tasks without task-specific modules or parameters, with little evidence of catastrophic interference, and without parse trees, chunks, POS tags, or other intermediate representations.

• Appendix D discusses round-robin curriculum learning: “Our results demonstrate that training the MQAN jointly on all tasks with the right anti-curriculum strategy can achieve performance comparable to that of ten separate MQANs, each trained separately.” [See Section 4.2 in that paper.] Finding that anti-curriculum learning benefited models in decaNLP also validated intuitions outlined in [Caruana, 1997]: tasks that are easily learned may not lead to development of internal representations that are useful to other tasks. “Our results actually suggest a stronger claim: including easy tasks early on in training makes it more difficult to learn internal representations that are useful to other tasks. [ … snip … ] By ordering the tasks differently, it is possible to improve performance on some of the tasks but that improvement is not without a concomitant drop in performance for others.”
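The anti-curriculum idea from the last bullet can be sketched as a task schedule: train on the harder task(s) alone for a warm-up period, then switch to round-robin sampling over all tasks. The function name, the choice of which task counts as hard, and the warm-up length below are illustrative assumptions, not values from the paper.

```python
from itertools import cycle, islice

def anti_curriculum_schedule(hard_tasks, all_tasks, warmup_steps, total_steps):
    """Yield one task name per training step.

    Hard tasks are visited round-robin during warm-up; afterwards every
    task is visited round-robin for the remaining steps.
    """
    warmup_steps = min(warmup_steps, total_steps)
    yield from islice(cycle(hard_tasks), warmup_steps)
    yield from islice(cycle(all_tasks), total_steps - warmup_steps)

# Hypothetical task names; "qa" is assumed to be the hard task.
tasks = ["qa", "translation", "summarization", "sentiment"]
schedule = list(anti_curriculum_schedule(["qa"], tasks, warmup_steps=3, total_steps=8))
print(schedule)
# ['qa', 'qa', 'qa', 'qa', 'translation', 'summarization', 'sentiment', 'qa']
```

Each yielded name would select which task's batch feeds the next gradient step. A plain curriculum would simply pass the easy tasks as the warm-up set instead.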

Beyond its multitask results, MQAN pretrained on decaNLP also showed improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. Though not explicitly designed for any one task, MQAN proved to be a strong model in the single-task setting as well, achieving state-of-the-art results on the semantic parsing component of decaNLP (i.e., transfer learning). Section 5 of that paper summarized the state of the art (mid-2018) for transfer learning in NLP (paraphrased here):

• “Most success in making use of the relatedness between natural language tasks stems from transfer learning. Word2vec, skip-thought vectors and GloVe yield pretrained embeddings that capture useful information about natural language. The embeddings, intermediate representations, and weights of language models can be transferred to similar architectures and classification tasks. Intermediate representations from supervised machine translation models improve performance on question answering, sentiment analysis, and natural language inference. Question answering datasets support each other as well as entailment tasks, and high-resource machine translation can support low-resource machine translation. This work shows that the combination of MQAN and decaNLP makes it possible to transfer an entire end-to-end model that can be adapted for any NLP task cast as question answering.”

In transfer learning, virtually all systems trained on one dataset have trouble when applied to datasets that differ even slightly: even switching from Wall Street Journal (WSJ) text to New York Times text can slightly hurt parsing performance. It is anticipated that the recent availability of pretrained language models will improve transfer and multi-task learning. ELMo and Finetuned Transformer LM represent language models in the multitask learning domain, while ULMFiT represents a language model in the transfer learning domain (where the LM is first trained on a general-domain corpus to capture general features of the language in different layers, then fine-tuned on the target task). A more recent model, BERT, also shows great promise.

Regarding ELMo, a very interesting paper from the Allen Institute for AI, Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (May 2018) [code here and here; demo], used ELMo to train a span-based parser with partial annotations.

• First, they showed that recent advances in word representations greatly diminished the need for domain adaptation [transfer learning] when the target domain was syntactically similar to the source domain. As evidence, they trained a parser on the WSJ alone that achieved over 90% $\small F_1$ on the Brown corpus, a standard benchmark used to assess WSJ-trained parsers outside of the newswire domain. For more syntactically distant domains, they provided a simple way to adapt a parser using only dozens of partial annotations. For instance, they increased the percentage of error-free geometry-domain parses in a held-out set from 45% to 73% using approximately five dozen training examples. In the process, they demonstrated a new state of the art single model result on the WSJ test set of 94.3%, an absolute increase of 1.7% over the previous state of the art of 92.6%.

• In a similar experiment using biomedical and chemistry text, they partially annotated 134 sentences and randomly split them into BiochemTrain (72 sentences) and BiochemDev (62 sentences). In BiochemTrain, they made an average of 4.2 constituent declarations per sentence (and no non-constituent declarations). Again, they started with an RSP parser trained on WSJ-Train and fine-tuned it on minibatches containing annotations from 50 randomly selected WSJ-Train sentences, plus all of BiochemTrain. As with the geometry domain, they obtained significant improvements using only dozens of partially annotated training sentences.
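The minibatch recipe in the last bullet (a random sample of source-domain sentences plus the entire small target-domain set) can be sketched as follows. The function name and the placeholder sentence identifiers are hypothetical; only the sizes, 50 source sentences and 72 BiochemTrain sentences, come from the text above.

```python
import random

def mixed_minibatch(source_sentences, target_sentences, n_source=50, rng=random):
    """Sample n_source source-domain sentences and append every target-domain sentence."""
    n = min(n_source, len(source_sentences))
    sampled = rng.sample(source_sentences, n)
    return sampled + list(target_sentences)

# Placeholder identifiers standing in for annotated sentences.
wsj_train = [f"wsj-{i}" for i in range(1000)]
biochem_train = [f"biochem-{i}" for i in range(72)]

batch = mixed_minibatch(wsj_train, biochem_train, rng=random.Random(0))
print(len(batch))  # 122
```

Re-sampling the source sentences for every minibatch keeps the parser anchored to its original domain while the full target set is seen at every step.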


Word-Level Loss Extensions for Neural Temporal Relation Classification (Aug 2018) [code; see also OpenNRE] employed pretrained word embeddings for the extraction of narrative containment relations (CR) from clinical text. The aim of CR extraction is to determine, given events A and B, whether event A is temporally contained in event B (i.e., whether event A happens within the time span of event B); this is a form of temporal relation extraction. The patient timeline is crucial for making a good patient prognosis and for clinical decision support. In this work the authors proposed a neural RC model that learned its word representations jointly on the main task (supervised, on labeled data) and on the auxiliary task (unsupervised, on unlabeled data) in a multi-task setting – ensuring that the embeddings contained valuable information for the main task while leveraging the unlabeled data for more general feature learning. The proposed models used only unlabeled data and a general (news, out-of-domain) part-of-speech tagger as external resources. This work showed that training the word-level representations jointly on the main task plus an auxiliary objective resulted in better representations for classification, compared to using pretrained variants. It also constituted a new state of the art for temporal relation extraction on the THYME corpus (a temporally annotated corpus of clinical notes in the brain cancer and colon cancer domains), even without dedicated clinical preprocessing.


Using Multi-task and Transfer Learning to Solve Working Memory Tasks (Sep 2018) proposed a new architecture called Memory-Augmented Encoder-Solver (MAES) that enabled transfer learning to solve complex working memory tasks adapted from cognitive psychology. It used dual RNN controllers, inside the encoder and solver respectively, that interfaced with a shared memory module, and it was completely differentiable. The trained MAES models achieved task-size generalization: they were capable of handling sequential inputs 50 times longer than those seen during training, given appropriately large memory modules. MAES far outperformed existing and well-known models such as LSTM, NTM [neural Turing machines] and DNC [differentiable neural computers] on the entire suite of tasks evaluated.


• “In this paper, we present mean-max AAE [attention autoencoder] to learn universal sentence representations from unlabelled data. Our model applies the MultiHead self-attention mechanism both in the encoder and decoder, and employs a mean-max pooling strategy to capture more diverse information of the input. To avoid the impact of ‘teacher forcing training,’ our decoder performs attention over the encoding representations dynamically. To evaluate the effectiveness of sentence representations, we conduct extensive experiments on 10 transfer tasks. The experimental results show that our model obtains state-of-the-art performance among the unsupervised single models. Furthermore, it is fast to train a high-quality generic encoder due to the paralleling operation. In the future, we will adapt our mean-max AAE to other low-resource languages for learning universal sentence representations.”
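A mean-max pooling step of the kind named in the quotation can be sketched as below. The quotation does not say how the two pooled vectors are combined; concatenation is assumed here, and the function name is hypothetical.

```python
def mean_max_pool(hidden_states):
    """Concatenate the element-wise mean and max over a sequence of equal-length vectors."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    columns = list(zip(*hidden_states))           # one tuple per feature dimension
    mean = [sum(col) / len(col) for col in columns]
    peak = [max(col) for col in columns]
    return mean + peak                            # output has twice the input width

# Three token representations of width 2 (toy values).
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
sentence_vector = mean_max_pool(states)
print(sentence_vector)
```

Mean pooling summarizes the whole sequence while max pooling keeps the strongest activation per dimension, which is one reading of "more diverse information of the input".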

• “… we introduce Transfer Incremental Learning using Data Augmentation (TILDA). TILDA is based on pre-trained DNNs as feature extractor, robust selection of feature vectors in subspaces using a nearest-class-mean based technique, majority votes and data augmentation at both the training and the prediction stages. Experiments on challenging vision datasets demonstrate the ability of the proposed method for low complexity incremental learning, while achieving significantly better accuracy than existing incremental counterparts.”
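The nearest-class-mean step mentioned in the quotation can be illustrated on its own. This sketch covers only that step, operating on feature vectors assumed to come from a pretrained network; the subspace selection, majority voting, and data augmentation parts of TILDA are not shown, and all names and values are hypothetical.

```python
def class_means(features, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_ncm(vec, means):
    """Return the label whose class mean is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(means, key=lambda label: sq_dist(vec, means[label]))

# Toy 2-D "features" standing in for the output of a pretrained extractor.
feats = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["cat", "cat", "dog", "dog"]
means = class_means(feats, labels)

print(predict_ncm([1.0, 1.0], means))  # cat
print(predict_ncm([9.0, 9.0], means))  # dog
```

Adding a class later only requires computing one more mean, which is why this kind of classifier suits incremental learning.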


• “We propose an effective multitask learning setup for reducing distant supervision noise by leveraging sentence-level supervision. We show how sentence-level supervision can be used to improve the encoding of individual sentences, and to learn which input sentences are more likely to express the relationship between a pair of entities. We also introduce a novel neural architecture for collecting signals from multiple input sentences, which combines the benefits of attention and maxpooling. The proposed method increases AUC by 10% (from 0.261 to 0.284), and outperforms recently published results on the FB-NYT dataset.”


• Multi-Task Graph Autoencoders (Nov 2018) [code] examined two fundamental tasks associated with graph representation learning: link prediction and node classification. They presented a new autoencoder architecture capable of learning a joint representation of local graph structure and available node features for the simultaneous multi-task learning of unsupervised link prediction and semi-supervised node classification. Their simple yet effective and versatile model was efficiently trained end-to-end in a single stage, whereas previous related deep graph embedding methods required multiple training steps that were difficult to optimize. Empirical evaluation on five benchmark relational, graph-structured datasets demonstrated significant improvement over three strong baselines for graph representation learning.
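One way to picture single-stage training on both tasks is a joint loss that adds a link-prediction term to a node-classification term, each computed only where supervision exists. The masking with `None` and the weighted sum below are assumptions made for illustration; the summary above does not specify the authors' loss.

```python
import math

def link_loss(edge_probs, edge_targets):
    """Mean binary cross-entropy over node pairs whose edge status is known."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p, t in zip(edge_probs, edge_targets) if t is not None]
    return sum(terms) / len(terms)

def node_loss(class_probs, node_labels):
    """Mean cross-entropy over labeled nodes only; unlabeled nodes carry None."""
    terms = [-math.log(probs[label])
             for probs, label in zip(class_probs, node_labels) if label is not None]
    return sum(terms) / len(terms)

def joint_loss(edge_probs, edge_targets, class_probs, node_labels, weight=1.0):
    """Link-prediction loss plus weighted node-classification loss (assumed combination)."""
    return link_loss(edge_probs, edge_targets) + weight * node_loss(class_probs, node_labels)

# Toy model outputs: two candidate edges (one unobserved), two nodes (one unlabeled).
edge_probs = [0.5, 0.5]
edge_targets = [1, None]
class_probs = [[0.5, 0.5], [0.9, 0.1]]
node_labels = [0, None]

print(joint_loss(edge_probs, edge_targets, class_probs, node_labels))
```

Both helper functions assume at least one supervised entry; a real implementation would need to guard against an empty mask.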


• “We present two architectures for multi-task learning with neural sequence models. Our approach allows the relationships between different tasks to be learned dynamically, rather than using an ad-hoc pre-defined structure as in previous work. We adopt the idea from message-passing graph neural networks and propose a general graph multi-task learning framework in which different tasks can communicate with each other in an effective and interpretable way. We conduct extensive experiments in text classification and sequence labeling to evaluate our approach on multi-task learning and transfer learning. The empirical results show that our models not only outperform competitive baselines but also learn interpretable and transferable patterns across tasks.”