Deep transfer learning in natural language processing: text pre-training

[Introduction] How can text be pre-trained the way images are? This is a guide to universal sentence encoders.

Author | Dipanjan (DJ) Sarkar

Compile | Xiaowen

Introduction

Transfer learning is an exciting concept in which we use prior knowledge from one domain and task to tackle another. The inspiration comes from us humans: we have an innate ability not to learn everything from scratch. We transfer and reuse what we have learned in the past to handle a wide variety of tasks. In computer vision, we have excellent large datasets such as ImageNet, on top of which we get a set of world-class, state-of-the-art pre-trained models to leverage for transfer learning. But what about natural language processing? Text data is diverse, noisy, and unstructured, so this is a serious challenge. We have recently had some success with text embeddings, including Word2vec, GloVe, and FastText, all of which I covered in my article "Feature engineering of text data" [1].

In this article, we will showcase several state-of-the-art universal sentence embedding encoders, which tend to give surprisingly good performance on transfer learning tasks, especially with small amounts of data, when compared to word embedding models. We will cover the following models:

  • Baseline Averaged Sentence Embeddings
  • Doc2Vec
  • Neural-Net Language Models (Hands-on Demo!)
  • Skip-Thought Vectors
  • Quick-Thought Vectors
  • InferSent
  • Universal Sentence Encoder

We will try to cover the basic concepts and show some hands-on examples using Python and TensorFlow, focusing on a text classification problem: sentiment analysis.

Why are we so crazy about embeddings?

An embedding is a fixed-length vector that is typically used to encode and represent an entity (document, sentence, word, graph).

I covered the need for embeddings in the context of text data and NLP in a previous post [2], but I will briefly repeat it here for convenience. In speech or image recognition systems, the information already comes in the form of rich, dense feature vectors embedded in high-dimensional datasets, such as audio spectrograms and image pixel intensities. With raw text data, however, especially in count-based models such as bag of words, we deal with individual words that each have their own identifier and that do not capture the semantic relationships between words. This leads to huge, sparse word vectors for text data, so if we do not have enough data we may end up with poor models or even overfit the data because of the curse of dimensionality.

Contrasting feature representations of images, audio, and text

Predictive methods, for example neural network-based language models, attempt to predict words from their neighbors by observing sequences of words in the corpus. In the process they learn distributed representations, which give us dense word embeddings.

Now you're probably thinking: we've got a bunch of vectors from the text, now what? If we have a good numerical representation of text data that captures even context and semantics, we can use it for a wide variety of downstream real-world tasks, such as sentiment analysis, text classification, clustering, summarization, translation, and so on. Machine learning and deep learning models operate on numbers, and embeddings are the key to encoding text data so that these models can use it.

Text embedding

One of the big trends here is the search for so-called universal embeddings: pre-trained embeddings obtained by training deep learning models on a huge corpus. This lets us use these pre-trained (universal) embeddings in a wide variety of tasks, including scenarios with constraints such as a lack of sufficient data. This is a perfect example of transfer learning, using prior knowledge from pre-trained embeddings to solve a completely new task! The following figure shows some recent trends in universal word and sentence embeddings.

Recent trends in universal word and sentence embeddings

Source:

There are some interesting trends in the chart above, including Google's Universal Sentence Encoder, which we will explore in detail in this article. Before we dive into universal sentence encoders, let us briefly go over the trends and developments in word and sentence embedding models.

Trends in word embedding models

Word embedding models (WEMs) are some of the more mature models, developed since Word2vec in 2013. The three most common models that use deep learning (unsupervised methods) to embed word vectors in a continuous vector space based on semantic and contextual similarity are:

  • Word2Vec
  • GloVe
  • FastText

These models are based on the distributional hypothesis from the field of distributional semantics, which tells us that words that occur and are used in the same context are semantically similar to one another and have similar meanings.

Another interesting model recently developed in this area is ELMo (Embeddings from Language Models), from the Allen Institute for Artificial Intelligence.

Basically, ELMo gives us word embeddings learned from a deep bidirectional language model (biLM), which is typically pre-trained on a large text corpus, enabling transfer learning so that these embeddings can be used across different NLP tasks. Allen AI tells us that ELMo representations are contextual, deep, and character-based, which lets them use morphological cues to form representations even for out-of-vocabulary (OOV) tokens.

Trends in universal sentence embedding models

The concept of sentence embeddings is not a very new one: once word embeddings were available, one of the easiest ways to build a baseline sentence embedding model was simply to average them.

A baseline sentence embedding model can be built by averaging the individual word embeddings of each sentence (somewhat analogous to a bag of words where we lose the context and word sequences inherent in the sentence). The diagram below shows how this is achieved.

Of course, there are more sophisticated methods, such as taking a linear weighted combination of the embeddings of the words in a sentence.

Doc2Vec is also a very popular method, proposed by Mikolov et al. They propose paragraph vectors, an unsupervised algorithm that learns fixed-length feature embeddings from variable-length text (e.g., sentences, paragraphs, and documents).
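To make the baseline concrete, here is a minimal sketch of averaged and linearly weighted sentence embeddings. Everything in it is illustrative: the 4-dimensional word vectors, the `token_weights` values, and the function names are my own assumptions, not values from any pre-trained model.

```python
import numpy as np

# Toy 4-dimensional word embeddings; the numbers are made up for illustration.
word_vectors = {
    "the":   np.array([0.1, 0.0, 0.2, 0.1]),
    "movie": np.array([0.7, 0.3, 0.1, 0.5]),
    "was":   np.array([0.0, 0.1, 0.1, 0.0]),
    "great": np.array([0.9, 0.8, 0.2, 0.4]),
}
DIM = 4

def average_embedding(tokens, vectors):
    """Baseline: unweighted mean of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(DIM)
    return np.mean(known, axis=0)

def weighted_embedding(tokens, vectors, weights):
    """Linear weighted combination; weights are normalized to sum to 1."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return np.zeros(DIM)
    w = np.array([weights.get(t, 1.0) for t in known])
    w = w / w.sum()
    return np.sum([wi * vectors[t] for wi, t in zip(w, known)], axis=0)

sentence = "the movie was great".split()
avg_vec = average_embedding(sentence, word_vectors)

# Hypothetical weights that down-weight frequent function words.
token_weights = {"the": 0.1, "was": 0.1, "movie": 1.0, "great": 1.0}
weighted_vec = weighted_embedding(sentence, word_vectors, token_weights)

print(avg_vec)
print(weighted_vec)
```

Note that reversing the word order gives exactly the same average, which is the loss of sequence information mentioned above.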

Word2Vec vs. Doc2Vec (Source:

Building on the above description, the model represents each document with a dense vector that is trained to predict words in the document, with the only difference being the use of paragraph or document ids along with regular word tokens to construct the embedding representation. Such a design allows this model to overcome the disadvantages of the bag-of-words model.

The Neural Network Language Model (NNLM) was proposed by Bengio et al. in 2003. They discuss learning a distributed representation for words, which allows each training sentence to inform the model about semantically neighboring sentences. The model simultaneously learns a distributed representation for each word along with the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets a high probability if it is made of words that are similar to the words forming an already seen sentence.

Google has built a universal sentence embedding model, nnlm-en-dim128 [3], a token-based text embedding trained with a three-layer feed-forward neural network language model on the English Google News 200B corpus. The model maps arbitrary text to 128-dimensional embeddings. We will use it in our hands-on demo shortly.

Skip-Thought Vectors were also one of the earliest models in the field of unsupervised learning-based sentence encoders. In the original paper, the authors use the continuity of text to train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are mapped to similar vector representations.

Skip-Thought Vectors (Source:

This is like the skip-gram model, but for sentences, i.e. we are trying to predict the surrounding sentences for a given source sentence.
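As a rough illustration of that idea, the sketch below only prepares the training examples: for each interior sentence it pairs the source sentence with its previous and next neighbors, which are the reconstruction targets. The encoder-decoder model itself is not shown, and the function name and the toy passage are my own assumptions.

```python
def neighbor_triples(sentences):
    """Return (source, previous, next) triples for every interior sentence.

    The first and last sentences are skipped as sources because they
    lack one of the two neighbors.
    """
    triples = []
    for i in range(1, len(sentences) - 1):
        triples.append((sentences[i], sentences[i - 1], sentences[i + 1]))
    return triples

# Toy passage, made up for illustration.
passage = [
    "The lights went down.",
    "The opening scene was stunning.",
    "The plot slowed in the middle.",
    "The ending made up for it.",
]

triples = neighbor_triples(passage)
for source, previous, following in triples:
    print(f"encode: {source!r} -> decode: {previous!r} and {following!r}")
```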

Quick-Thought Vectors are a more recent method for learning sentence representations. The details are in the original paper, An Efficient Framework for Learning Sentence Representations. Interestingly, the authors reformulate the problem of predicting the context in which a sentence appears as a classification problem, replacing the decoder in the regular encoder-decoder architecture with a classifier.

Quick Thought Vectors (Source:

Thus, given a sentence and the context in which it appears, the classifier distinguishes context sentences from other contrastive sentences based on their embedding representations. An input sentence is first encoded using some function, but instead of generating a target sentence, the model chooses the correct target sentence from a set of candidate sentences. Viewing generation as choosing a sentence from all possible sentences, this can be seen as a discriminative approximation to the generation problem.

InferSent is a supervised learning method for learning universal sentence embeddings from natural language inference data. This is hard-core supervised transfer learning: just as models are pre-trained on the ImageNet dataset for computer vision, here universal sentence representations are trained on supervised data from the Stanford Natural Language Inference (SNLI) dataset. SNLI consists of 570k human-generated English sentence pairs and captures natural language inference, which is useful for understanding sentence semantics.
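The selection step can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: the embeddings are hand-written stand-ins for encoder output, and candidates are scored with a plain dot product followed by a softmax.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Stand-in embeddings (made up); a real model would produce these with an encoder.
source = np.array([0.9, 0.1, 0.3])
candidates = np.array([
    [0.8, 0.2, 0.4],    # the true context sentence
    [-0.5, 0.9, 0.0],   # contrastive sentence
    [0.0, -0.7, 0.6],   # contrastive sentence
])
true_index = 0

scores = candidates @ source          # one similarity score per candidate
probs = softmax(scores)               # classifier output over the candidates
predicted = int(np.argmax(probs))     # the candidate the model selects
loss = -np.log(probs[true_index])     # cross-entropy against the true context

print(probs, predicted, loss)
```

Training would adjust the encoder so that this loss shrinks, which pushes sentences and their true contexts closer together in the embedding space.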

InferSent training scheme (Source:

Based on the architecture in the figure above, we can see that it uses a shared sentence encoder that outputs a representation u for the premise and v for the hypothesis. Once the sentence vectors are generated, three matching methods are applied to extract the relations between u and v:

  • Concatenation (u, v)
  • Element-wise product u ∗ v
  • Absolute element-wise difference |u − v|

The resulting vector is then fed into a 3-class classifier consisting of multiple fully connected layers.

Google's Universal Sentence Encoder is one of the latest and best universal sentence embedding models, released in early 2018. It encodes arbitrary text into 512-dimensional embeddings that can be used for a variety of NLP tasks, including text classification, semantic similarity, and clustering. It is trained on a variety of data sources and a variety of tasks, with the aim of dynamically accommodating a wide variety of natural language understanding tasks that require modeling the meaning of sequences of words, not just individual words. The authors' main finding is that transfer learning using sentence embeddings tends to outperform transfer learning at the word embedding level.
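The three matching methods are easy to show with NumPy. In this sketch the vectors `u` and `v` are made-up 4-dimensional stand-ins for encoder output, the order of the pieces inside the feature vector is my own choice, and the classifier is reduced to a single untrained linear layer plus softmax (the real model stacks several fully connected layers), so the printed probabilities carry no meaning.

```python
import numpy as np

def match_features(u, v):
    """Concatenate (u, v), the element-wise product and the absolute difference."""
    return np.concatenate([u, v, u * v, np.abs(u - v)])

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Made-up sentence vectors for a premise (u) and a hypothesis (v).
u = np.array([0.2, -0.4, 0.7, 0.1])
v = np.array([0.3, -0.1, 0.5, -0.2])

features = match_features(u, v)       # length 4 * 4 = 16

# Untrained stand-in for the classifier: one linear layer with random weights.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(features.size, 3))
b = np.zeros(3)
probs = softmax(features @ W + b)

labels = ["entailment", "contradiction", "neutral"]
print(dict(zip(labels, probs.round(3))))
```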

Understanding our text classification problem

Now it's time to put these universal sentence encoders into action with a hands-on demonstration. Our demonstration today focuses on a very popular NLP task: classifying text in the context of sentiment analysis. The dataset used in the following demonstration can be downloaded from [4] or [5].

This dataset has a total of 50,000 movie reviews, 25k with positive sentiment and 25k with negative sentiment. We will train our model on 30,000 reviews, validate on 5,000 reviews, and use 15,000 reviews as our test dataset. The main objective is to correctly predict whether the sentiment of each review is positive or negative.

Universal sentence embeddings in action

Now that we've clarified our main goal, let's put the universal sentence encoder into action! My setup is a machine with 8 CPUs, 30 GB of RAM, a 250 GB SSD, and an Nvidia Quadro P4000 GPU.


We start by installing tensorflow-hub, which lets us use these sentence encoders easily.

Next, load the modules that will be used in this tutorial.

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd

The following commands help you check whether TensorFlow will use a GPU (if you have one set up).

In [12]: tf.test.is_gpu_available()
Out[12]: True
In [13]: tf.test.gpu_device_name()
Out[13]: '/device:GPU:0'

Loading and viewing datasets

We can now load the dataset and view it using pandas.

We encode the sentiment column as 1 (positive) and 0 (negative).
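A minimal sketch of this step with pandas follows. The three reviews are a made-up stand-in for the real file, and the column names `review` and `sentiment` as well as the label strings `positive` and `negative` are assumptions about its layout.

```python
import pandas as pd

# Tiny in-memory stand-in for the movie review dataset (made-up rows).
dataset = pd.DataFrame({
    "review": [
        "A wonderful little film.",
        "Dull and far too long.",
        "Loved every minute of it.",
    ],
    "sentiment": ["positive", "negative", "positive"],
})

# Encode the label column: 1 for positive, 0 for negative.
dataset["sentiment"] = (dataset["sentiment"] == "positive").astype(int)

dataset.info()
print(dataset.head())
```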

Our movie review dataset

Building training, validation and test datasets

Before we start modeling, we will create training, validation and test datasets. We will use 30,000 reviews for training, 5,000 for validation, and 15,000 for testing. You can use train_test_split() from scikit-learn. I was just lazy and split the dataset with simple list slicing.
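The slicing can be sketched as follows. The arrays here are synthetic placeholders with the same size as the real data, and the variable names are my own; plain slicing like this also assumes the rows are not sorted by label.

```python
import numpy as np

# Synthetic placeholders for the 50,000 reviews and their 0/1 labels.
reviews = np.array([f"review {i}" for i in range(50000)])
sentiments = np.arange(50000) % 2

# First 30,000 rows for training, next 5,000 for validation, last 15,000 for testing.
train_reviews, train_sentiments = reviews[:30000], sentiments[:30000]
val_reviews, val_sentiments = reviews[30000:35000], sentiments[30000:35000]
test_reviews, test_sentiments = reviews[35000:], sentiments[35000:]

print((train_reviews.shape, val_reviews.shape, test_reviews.shape))
```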

((30000,), (5000,), (15000,))

Basic text processing

We need some basic text pre-processing to remove noise from our text, such as unnecessary special characters and HTML tags.

The following code helps us build a simple but effective text pre-processing system.
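Here is a hedged sketch of such a system using only the Python standard library. The function names and the exact set of cleaning steps (tag stripping, accent removal, lowercasing, special-character removal, whitespace collapsing) are my own assumptions for illustration.

```python
import html
import re
import unicodedata

def strip_html_tags(text):
    """Replace anything that looks like an HTML tag, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def remove_accented_chars(text):
    """Map accented characters to their closest ASCII equivalent."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

def remove_special_characters(text):
    """Keep only ASCII letters, digits and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

def pre_process_document(document):
    """Run all cleaning steps and collapse repeated whitespace."""
    document = strip_html_tags(document)
    document = remove_accented_chars(document)
    document = document.lower()
    document = remove_special_characters(document)
    return re.sub(r"\s+", " ", document).strip()

sample = "<br />This café scene was <b>GREAT</b>!! Rock &amp; roll, 10 out of 10."
cleaned = pre_process_document(sample)
print(cleaned)
```

To clean a whole column of reviews, the same function can be applied to each document in turn, for example with a list comprehension.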

Now let's pre-process the dataset using the functions implemented above.

Constructing data ingestion functions

Since we will be implementing our model in TensorFlow using the tf.estimator API, we need to define some functions that build the data and feature engineering pipeline and feed the data into our model during training. We make use of numpy_input_fn(), which helps feed a dict of numpy arrays into the model.

We are now ready to build our model!

Deep learning modeling with a generic sentence encoder

Before building the model, we first need to define the sentence embedding feature that uses the universal sentence encoder. We can use the following code to do this.

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.

We will build a simple feed-forward DNN with just two hidden layers, a standard model without too much sophistication, since we want to see how well these embeddings perform even on a simple model. Here we are leveraging transfer learning in the form of pre-trained embeddings.
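To show what such a network computes, here is a plain NumPy sketch of the forward pass only. It is not the tf.estimator model: the weights are random and untrained, the hidden layer sizes (64 and 16) are arbitrary assumptions, and the input batch is random noise standing in for 512-dimensional sentence embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

def init_layer(n_in, n_out):
    """Random weights scaled by the fan-in, plus zero biases."""
    weights = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    return weights, np.zeros(n_out)

# 512 matches the sentence embedding size; the hidden sizes are assumptions.
embedding_dim, hidden1, hidden2 = 512, 64, 16
W1, b1 = init_layer(embedding_dim, hidden1)
W2, b2 = init_layer(hidden1, hidden2)
W3, b3 = init_layer(hidden2, 1)

def predict_proba(sentence_embeddings):
    """Two hidden ReLU layers followed by a sigmoid output unit."""
    h1 = relu(sentence_embeddings @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    logits = (h2 @ W3 + b3)[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))

# Random stand-in for a batch of 4 encoded reviews.
batch = rng.normal(size=(4, embedding_dim))
probs = predict_proba(batch)
print(probs)  # probability of the positive class for each review
```

Because the embeddings already carry most of the semantic information, the classifier on top can stay this small.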

Model training

On our validation dataset, we obtained an overall accuracy close to 87%, with an AUC of 94% on such a simple model, which is pretty good!

Model assessment

Now, let us evaluate the overall performance of our model on the training and test datasets.

We obtained an overall accuracy close to 87% on the test data, consistent with what we observed earlier on the validation dataset. This should give you an idea of how easy it is to leverage pre-trained universal sentence embeddings without having to worry about the hassle of feature engineering or complex modeling.

Bonus: transfer learning with different universal sentence embeddings

Now let us try to build different deep learning classifiers based on different sentence embeddings. We will try the following:

  • NNLM-128
  • USE-512

We will also cover the two most prominent approaches to transfer learning here:

  • Build a model using frozen pre-trained sentence embeddings
  • Build a model in which we fine-tune and update the pre-trained sentence embeddings during training

We can now use the method defined above to train our model.

I have highlighted the important evaluation metrics in the output above, and as you can see, we get some good results with our models. The following table summarizes these comparative results.

Comparing the different universal sentence encoders

It looks like fine-tuning Google's Universal Sentence Encoder gave us the best results on the test data. Let's load this saved model and evaluate it on the test data.

[0, 1, 0, 1, 1, 0, 1, 1, 1, 1]

One of the best ways to assess model performance is to visualize model predictions in the form of confusion matrices.

Confusion matrix from our best model predictions

We can also output other important metrics, including precision, recall, and F1-score.

We obtained a good overall model accuracy and F1-score of 90% on the test data.
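For reference, all of these numbers can be derived from the four cells of a binary confusion matrix. The sketch below uses ten made-up labels and predictions, not the model's actual output, so the values it prints say nothing about the results in this article.

```python
import numpy as np

# Made-up ground truth and predictions for ten reviews (1 = positive).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
confusion = np.array([[tn, fp], [fn, tp]])

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(confusion)
print(accuracy, precision, recall, f1)
```

In practice, scikit-learn's confusion_matrix and classification_report functions compute the same quantities directly from the label arrays.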


Universal sentence embeddings are certainly a big step forward in enabling transfer learning for diverse NLP tasks. In fact, we have seen models like ELMo, the Universal Sentence Encoder, and ULMFiT make headlines by demonstrating that pre-trained models can be used to achieve state-of-the-art results on NLP tasks. Sebastian Ruder, a well-known research scientist and blogger, tweeted about the same thing.

I'm very excited about NLP becoming even more general in the future, enabling us to solve complex tasks with ease!

Reference link.






Link to original article.

