New work by Kaiming He et al. at NeurIPS: defining a new paradigm for transfer learning

**Link to paper:** https://arxiv.org/abs/1806.05662

**[Abstract]** The dominant approach to deep-learning-based transfer learning is to learn generic feature vectors from one task that can be transferred to other tasks, such as word embeddings in language and pre-trained convolutional features in vision (**for example, pre-training on ImageNet is itself a form of transfer**), i.e., transfer along the **feature dimension**. However, these approaches typically transfer only unary features and largely ignore more structured, graph-like representations. This paper explores the possibility of learning, from large-scale unlabeled data (**unsupervised learning**), generic latent relational graphs that capture dependencies between pairs of data units (**e.g., words or pixels**), and of passing these graphs on to downstream tasks. The proposed transfer learning framework improves performance on a variety of tasks, including question answering, natural language inference, sentiment analysis, and image classification. The experiments also show that the learned graphs are generic: without retraining, they can be transferred to different embeddings (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units) or to units with no embeddings (such as image pixels).

Figure 1: Title of the paper

Progress in deep learning relies heavily on complex network architectures such as convolutional networks, recurrent networks, and attention mechanisms. Thanks to their built-in inductive biases, these architectures have a high degree of representational power, but they operate mainly on grid-like or sequential structures. CNNs and RNNs therefore rely on strong representational capacity to model complex structural phenomena, rather than making explicit use of structured, graph-based representations.

In contrast, various kinds of real-world data exhibit a much richer relational graph structure than simple grid-like or sequential structures. In language, for example, linguists use parse trees to represent syntactic dependencies between words, and information retrieval systems use knowledge graphs to capture entity relationships. The prevalence of such structures in almost all natural language data hints at the possibility of cross-task transfer. These observations also generalize to other fields such as vision, where modeling the relationships between pixels has proven useful.

Figure 2: Comparison of traditional transfer learning and this paper's transfer learning

As shown in Figure 2: traditional transfer learning versus the new transfer learning framework. GLoMo transfers not features but the graphs output by a network. The graphs are multiplied with task-specific features (task B features, e.g., embeddings or hidden states) to produce structure-aware features for the downstream task (task B). The so-called graph is in fact a dependency matrix, called an affinity matrix.

1. It breaks with the established norm of feature-based deep transfer learning and defines a new transfer learning paradigm.

2. It proposes a new unsupervised latent graph learning framework called GLoMo (Graphs from LOw-level unit MOdeling).

3. The framework decouples graphs and features and learns generic structures in data in a data-driven way. The goal is to learn transferable latent relational graphs, where the nodes of a latent graph are the input units, e.g., all the words in a sentence. Latent relational graph learning amounts to learning an affinity matrix whose weights (possibly zero) capture the dependency between any pair of input units.

4. Experimental results show that GLoMo improves performance on various language tasks such as question answering, natural language inference, and sentiment analysis. The learned graphs are generic: without retraining, they work well with various feature sets (GloVe embeddings [28], ELMo embeddings [29], and task-specific RNN states). Classification experiments in the image domain also confirm the framework's effectiveness.

Given a one-dimensional input x = (x_1, …, x_T), where x_t is the unit at position t and T is the length of the input sequence, the goal is to learn a (T × T) affinity matrix G. The matrix G is asymmetric, and G_ij captures the dependency between x_i and x_j.

The entire framework is divided into an **unsupervised learning phase** and a **transfer phase**. Two networks are trained in the unsupervised learning phase: a graph predictor network g and a feature predictor network f.

For g, the input is x and the output is G = g(x). G is a three-dimensional tensor of shape (L × T × T), where L is the number of layers of the graph-producing network and each T × T slice is an affinity matrix.

Then G and the original input x are fed into f, the feature predictor network.

In the transfer phase, the input is x'. First, g produces the affinity matrices G = g(x'); then G is multiplied by the task-specific features (embeddings or hidden states), and the result serves as input to the downstream model. The network f is discarded at this point.
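To make the transfer-phase mixing concrete, here is a minimal numpy sketch (not the authors' code): the affinity matrix G is random here rather than produced by a trained g, and the task features H are random stand-ins for embeddings or hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                      # sequence length, feature dimension

# Stand-in for an affinity matrix produced by g(x'); columns sum to 1
A = rng.random((T, T))
G = A / A.sum(axis=0, keepdims=True)

# Stand-in for task-specific features (embeddings or hidden states)
H = rng.standard_normal((T, d))

# Structure-aware features: position t aggregates all units weighted by G[:, t]
H_aware = G.T @ H                # shape (T, d)
print(H_aware.shape)
```

The downstream model then consumes `H_aware` (possibly concatenated with `H`) in place of the raw features.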

The proposed framework

The graph predictor consists of **multi-layer CNNs**: a **key CNN** and a **query CNN**. The input is x; the key CNN outputs the sequence (k_1, …, k_T) and the query CNN outputs the sequence (q_1, …, q_T). G at layer l is then computed by the following equation.

(Equation for the layer-l affinity matrix, shown as an image — 4.png — in the original.)
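Based on the paper's description (a squared ReLU of key–query dot products, normalized over positions), the affinity computation can be sketched as follows. The key/query CNN outputs are replaced by random vectors, and the exact normalization details are an assumption here.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
keys = rng.standard_normal((T, d))     # stand-in for key-CNN outputs k_1..k_T
queries = rng.standard_normal((T, d))  # stand-in for query-CNN outputs q_1..q_T
bias = -0.1                            # scalar bias

# Squared ReLU scores: negative key-query matches become exactly zero
scores = np.maximum(keys @ queries.T + bias, 0.0) ** 2

# Normalize each column so the weights over keys sum to (at most) 1
G = scores / (scores.sum(axis=0, keepdims=True) + 1e-9)
print(G.shape)
```

Because ReLU (rather than softmax) gates the scores, many entries of G are exactly zero, giving a sparse graph.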

The feature predictor takes the initial feature sequence F and the affinity matrices G as input, and iteratively computes the feature sequence layer by layer, as in the following equations.

(Feature-update equations, shown as images — 5.png and 6.png — in the original.)

The operation v in the above equation can be, for example, a GRU cell.

D denotes the length of the predicted context. At the top layer, at each position t, the hidden state of an RNN is initialized with the corresponding element f_t of the feature sequence F, and the prediction for the context of x_t is then produced.

(Equation, shown as an image — 7.png — in the original.)
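One layer of the feature predictor can be sketched in numpy as follows. Each position aggregates the other units through the affinity matrix, and the compose operation v is stood in for by a GRU-style gated update with random, untrained weights (purely illustrative; the paper allows v to be, e.g., a GRU cell).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
T, d = 5, 4
F = rng.standard_normal((T, d))        # features f_1..f_T from the previous layer
A = rng.random((T, T))
G = A / A.sum(axis=0, keepdims=True)   # this layer's affinity matrix (columns sum to 1)

# Each position aggregates the other units, weighted by its column of G
M = G.T @ F                            # shape (T, d)

# GRU-style gated update standing in for the operation v (random weights,
# purely illustrative; a real model would learn them)
cat = np.concatenate([M, F], axis=1)   # shape (T, 2d)
Wz = rng.standard_normal((2 * d, d))
Wh = rng.standard_normal((2 * d, d))
z = sigmoid(cat @ Wz)                  # update gate
h_tilde = np.tanh(cat @ Wh)            # candidate features
F_next = (1 - z) * F + z * h_tilde     # next-layer features, shape (T, d)
print(F_next.shape)
```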

Several design decisions are essential in the framework:

1. decoupling graphs and features;
2. sparsity (using ReLU rather than softmax);
3. hierarchical graph representations;
4. unit-level objectives;
5. sequential prediction (instead of only predicting the next unit, this paper predicts a context of length up to D).
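A quick numerical illustration of point 2: softmax assigns nonzero weight to every unit, whereas a ReLU-based normalization drives negative scores to exactly zero, yielding a sparse graph. The scores below are made-up numbers.

```python
import numpy as np

scores = np.array([2.0, -1.0, 0.5, -3.0])

# softmax: every entry stays strictly positive
softmax = np.exp(scores) / np.exp(scores).sum()

# squared-ReLU normalization: negative scores become exactly zero
relu_sq = np.maximum(scores, 0.0) ** 2
relu_norm = relu_sq / relu_sq.sum()

print((softmax > 0).sum())    # 4 nonzero weights
print((relu_norm > 0).sum())  # 2 nonzero weights
```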

This section describes how to transfer the graph predictor g to a downstream task. The input features x' come from the downstream task; the formula (8.png) shows that each layer produces a G, and m denotes the mixing weights.
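The per-layer graphs can be combined into a single graph using the mixing weights m. A sketch with random graphs, where softmax-normalizing m (so the weights are positive and sum to 1) is an assumption; in the paper the weights are learned for the downstream task.

```python
import numpy as np

rng = np.random.default_rng(3)
L, T = 3, 5

# One column-normalized affinity matrix per layer
Gs = rng.random((L, T, T))
Gs = Gs / Gs.sum(axis=1, keepdims=True)

# Mixing weights m over layers (assumed softmax-normalized here)
m_logits = rng.standard_normal(L)
m = np.exp(m_logits) / np.exp(m_logits).sum()

# Weighted sum of the per-layer graphs
G_mixed = np.einsum('l,lij->ij', m, Gs)
print(G_mixed.shape)
```

Since each per-layer graph is column-normalized and the weights form a convex combination, the mixed graph's columns still sum to 1.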

(Transfer-phase equations, shown as images — 8.png and 9.png — in the original.)

Accuracy improved in both the question answering and the image classification comparison experiments.

(Experimental result tables, shown as images in the original.)

**Notes:** Some small parts of this article were taken from Heart of the Machine.