A step-by-step breakdown of Attention Is All You Need!

In this article, we break down the paper Attention Is All You Need step by step, combining a detailed walkthrough with code.

The paper can be downloaded at: https://arxiv.org/abs/1706.03762

Some of the images in this article are taken from the following article, which is very well written: https://mp.weixin.qq.com/s/RLxWevVWHXgX-UcoxDS70w

This article explains the details alongside working code, which can be found at: https://github.com/princewen/tensorflow_practice/tree/master/basic/Basic-Transformer-Demo

The data can be downloaded from: https://pan.baidu.com/s/14XfprCqjmBKde9NmNZeCNg Password: lfwu

Okay, without further ado, let's get to the point! We present the structure of the model step by step, from simple to complex.

The overall framework of the model is as follows.

The overall architecture may seem complex, but it is actually a Seq2Seq structure that, simplified, looks like this.

The encoder and the decoder are connected as follows: the output of the last encoder layer is fed into every layer of the decoder.

Our main concern is the internal structure of each Encoder layer and each Decoder layer, shown in the figure below.

As the figure shows, each Encoder layer has two operations, Self-Attention and Feed Forward, while each Decoder layer has three: Self-Attention, Encoder-Decoder Attention, and Feed Forward. Both Self-Attention and Encoder-Decoder Attention use the Multi-Head Attention mechanism, which is the focus of this article.

Before presenting the model, we present our data, which looks as follows after processing.

Quite simply, the top part is x, the input to the encoder, and the bottom part is y, the input to the decoder. This is machine translation data: each id in x is the id of a word in one language, and each id in y is the id of a word in the other language. The trailing 0s are padding, added when a sentence is shorter than the maximum length we set.
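The padding described above can be sketched in a few lines of plain Python. This is only an illustration: `pad_batch` is a hypothetical helper, the word ids are made up, and the truncation of over-long sentences is an assumption rather than something taken from the accompanying code.

```python
# Hypothetical helper: pad (or truncate) lists of word ids to a fixed length.
PAD_ID = 0  # id 0 is reserved for padding, as in the data described above


def pad_batch(sentences, maxlen):
    """Right-pad each list of ids with PAD_ID up to maxlen; truncate longer ones."""
    padded = []
    for ids in sentences:
        ids = ids[:maxlen]  # assumption: over-long sentences are simply cut off
        padded.append(ids + [PAD_ID] * (maxlen - len(ids)))
    return padded


# Two made-up sentences of different lengths
x = pad_batch([[12, 7, 95], [4, 8, 15, 16, 23]], maxlen=6)
print(x)  # [[12, 7, 95, 0, 0, 0], [4, 8, 15, 16, 23, 0]]
```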

Given our input data, we first convert it to the corresponding embedding. Since we want to mask out the padding later when computing attention, we directly set the embedding of the padded positions to 0. The embedding function is as follows.

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API (tf.contrib, tf.layers)

    def embedding(inputs, vocab_size, num_units, zero_pad=True, scale=True,
                  scope="embedding", reuse=None):
        with tf.variable_scope(scope, reuse=reuse):
            lookup_table = tf.get_variable('lookup_table',
                                           dtype=tf.float32,
                                           shape=[vocab_size, num_units],
                                           initializer=tf.contrib.layers.xavier_initializer())
            if zero_pad:
                # Force the embedding of id 0 (the padding id) to all zeros
                lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                          lookup_table[1:, :]), 0)
            outputs = tf.nn.embedding_lookup(lookup_table, inputs)
            if scale:
                outputs = outputs * (num_units ** 0.5)
        return outputs
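To make the effect of `zero_pad` and `scale` concrete, here is a minimal NumPy sketch of the same lookup. The sizes and ids are toy values chosen only for illustration, and a random table stands in for the trained `lookup_table`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_units = 5, 4            # toy sizes, chosen only for illustration

# Lookup table with row 0 (the padding id) forced to zeros, as zero_pad=True does.
table = rng.normal(size=(vocab_size, num_units))
table[0] = 0.0

ids = np.array([[3, 1, 0, 0]])          # one sentence: two real words, two paddings
emb = table[ids] * np.sqrt(num_units)   # lookup, then scale by sqrt(num_units)

print(emb.shape)                        # (1, 4, 4)
print(np.abs(emb).sum(axis=-1))         # the last two entries are exactly 0 (padding)
```

Because the padded rows are exactly 0, summing a row over its last axis is enough to tell padding from real words, which is what the masks later in the article rely on.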

In the paper, the embedding is not an ordinary embedding but one with position information added, which we call Position Embedding. Because the model no longer contains a recurrent structure, it cannot capture the order of the sequence by itself. Order information is very important, since it represents the global structure of the sequence, so the relative or absolute position of each token must be injected explicitly. The formula for the position information is as follows.
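For reference, the sinusoidal encoding defined in the paper can be written as follows, with d_model denoting the embedding dimension (`num_units` in the code of this article):

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
```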

Here pos is the position of the word in the sequence and i indexes the dimension of the embedding. The code for this part is as follows, and for the padding part we still use all zeros.

    def positional_encoding(inputs, num_units, zero_pad=True, scale=True,
                            scope="positional_encoding", reuse=None):
        N, T = inputs.get_shape().as_list()  # needs a static batch size and length
        with tf.variable_scope(scope, reuse=reuse):
            position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])
            # Dimensions 2i and 2i+1 share the exponent 2i / num_units
            position_enc = np.array([
                [pos / np.power(10000, 2. * (i // 2) / num_units) for i in range(num_units)]
                for pos in range(T)])
            position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
            position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1
            # Cast to float32 so the table can be concatenated with tf.zeros below
            lookup_table = tf.convert_to_tensor(position_enc.astype(np.float32))
            if zero_pad:
                lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                          lookup_table[1:, :]), 0)
            outputs = tf.nn.embedding_lookup(lookup_table, position_ind)
            if scale:
                outputs = outputs * num_units ** 0.5
        return outputs

So for the input, we add the word embedding and the position embedding to get the final Position Embedding result. The snippet below looks up the position indices in a second, learned embedding table; the sinusoidal positional_encoding function above can be used in its place.

    self.enc = embedding(self.x,
                         vocab_size=len(de2idx),
                         num_units=hp.hidden_units,
                         zero_pad=True,  # keep the padding embedding at 0
                         scale=True,
                         scope="enc_embed")
    self.enc += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0),
                                  [tf.shape(self.x)[0], 1]),
                          vocab_size=hp.maxlen,
                          num_units=hp.hidden_units,
                          zero_pad=False,
                          scale=False,
                          scope="enc_pe")

Attention is really about computing a degree of relevance; see the following example.

Attention can usually be described as mapping a query (Q) and a set of key-value pairs to an output, where the query, each key, and each value are vectors. The output is a weighted sum of all values in V, where the weights are computed from the query and each key. The computation has three steps.

1) Compute the similarity between Q and each key K, using a similarity function f.

2) Normalize the resulting similarities with softmax to obtain the weights.

3) Use the weights to compute the weighted sum of all values, which gives the Attention vector.
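The three steps above can be sketched for a single query in NumPy. This is a minimal illustration that uses the dot product as the similarity function f; the `attention` function and the toy vectors are made up for this example.

```python
import numpy as np


def attention(query, keys, values):
    """Single-query attention: similarity -> softmax -> weighted sum of values."""
    scores = keys @ query                      # 1) dot-product similarity with each key
    weights = np.exp(scores - scores.max())    # 2) softmax (shifted for numerical stability)
    weights /= weights.sum()
    return weights @ values, weights           # 3) weighted sum of the values


query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

output, weights = attention(query, keys, values)
print(weights)  # keys 0 and 2 match the query equally well, key 1 gets less weight
print(output)
```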

There are four methods for calculating similarity as follows.

In the paper, the similarity is computed in the first way, the dot product. The Attention mechanism proposed in the paper is called Multi-Head Attention, but before that, we introduce its simpler version, Scaled Dot-Product Attention.

Computing Attention starts with the query, key and value. As mentioned earlier, the Encoder's attention is self-attention, while the Decoder first applies self-attention and then encoder-decoder attention. The two types differ in where the query and the key-value pairs come from. For self-attention, the query and the key-value pairs are computed from the same input, because the sequence attends to itself. For encoder-decoder attention, the query is computed from the decoder input, while the key-value pairs are computed from the encoder output, because we need the similarity between each decoder position and each encoder position.

The following explanation is therefore based on self-attention. For encoder-decoder attention, only the inputs change; the rest of the process is the same.

Scaled Dot-Product Attention is illustrated as follows.

Next, we take the above process apart step by step.

Given our input data, we first convert it to the corresponding position embedding. The effect is shown below, with the green part representing the padded positions, whose values are all 0.

We have already described how the embedding is obtained, so we will not repeat it here.

Computing Attention starts with the Query, Key and Value, and we obtain all three through a linear transformation of the input. Our input is the position embedding, and the process is as follows.

The code is also simple. For self-attention, the query input and the key-value input are the same embedding. Note that the code below applies a ReLU activation after each projection, whereas the paper uses a plain linear projection. The padded part is still shown in green: its input is all 0, so it remains 0 in the result as long as the bias of the dense layer is 0.

    # Linear projection
    Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C)
    K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
    V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)

The next step is to compute the similarity. As we said before, the paper uses the dot product, so it is sufficient to take the dot product of Q and K. The procedure is as follows.

The paper also divides the similarity by the square root of d_k, where d_k is the dimension of the key vectors.

The code for this section is as follows.

    outputs = tf.matmul(Q, tf.transpose(K, [0, 2, 1]))        # (N, T_q, T_k)
    outputs = outputs / (K.get_shape().as_list()[-1] ** 0.5)  # scale by sqrt(d_k)
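The shapes involved in this step are easy to check with a small NumPy sketch. The sizes below are toy values chosen only for illustration, and random arrays stand in for the projected Q and K.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_q, T_k, d_k = 2, 3, 4, 8           # toy sizes, chosen only for illustration

Q = rng.normal(size=(N, T_q, d_k))      # stand-in for the projected queries
K = rng.normal(size=(N, T_k, d_k))      # stand-in for the projected keys

# Batched Q K^T: swap the last two axes of K, then scale by sqrt(d_k).
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)

print(scores.shape)  # (2, 3, 4): one row per query, one column per key
```

Entry `scores[n, i, j]` is the scaled dot product between query i and key j of sentence n.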

You may have noticed that this gives an attention matrix in which each row holds the similarity between one query and all keys. For self-attention, it looks as follows.

We have not applied the softmax normalization yet, because the attention matrix just obtained still needs some processing, mainly:

- Some positions of the query and key are padding and need to be masked out. A simple way is to assign them a very large negative value (so that softmax gives them a weight close to 0) or to set them directly to 0.
- The decoder must not see future information, so each position of the decoder input may only be compared with itself and the positions before it.

We first mask the padded part of the key. As introduced before, the embedding of the padded positions is set to all zeros, so we build the mask directly from that: we sum all dimensions of each embedding vector to get a scalar, and if the scalar is 0 the position is padding, otherwise it is not.

The code for this section is as follows.

    key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))                        # (N, T_k)
    key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (N, T_q, T_k)
    paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)  # very large negative score
    outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)
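The effect of the key mask can be reproduced in NumPy. This is a minimal sketch with made-up embeddings and scores; the `softmax` helper is written out here only so the example is self-contained.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# One sentence, 3 key positions; the last one is padding (all-zero embedding).
keys = np.array([[[0.5, -1.0], [2.0, 0.3], [0.0, 0.0]]])    # (N=1, T_k=3, C=2)
scores = np.array([[[1.0, 2.0, 3.0],
                    [0.5, 0.1, 4.0]]])                       # (N=1, T_q=2, T_k=3)

key_mask = np.sign(np.abs(keys.sum(axis=-1)))                # (1, 3): 1 = real, 0 = padding
key_mask = np.repeat(key_mask[:, None, :], scores.shape[1], axis=1)  # (1, 2, 3)

masked = np.where(key_mask == 0, -2.0 ** 32 + 1, scores)     # same constant as the article
weights = softmax(masked)
print(weights[0])  # the third column is 0: no query attends to the padded key
```

Note that this test treats any embedding whose components sum to 0 as padding, which is the heuristic used by the code above.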

After this step, the result is as follows, with the parts we have masked out represented by dark grey in the image below.

The next operation applies only to the Decoder's self-attention. We first build a lower triangular matrix that is 1 on and below the main diagonal and 0 elsewhere. Then, depending on whether the mask is 1 or 0, we either keep the original output or fill in a very large negative number.

    diag_vals = tf.ones_like(outputs[0, :, :])                              # (T_q, T_k)
    tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()       # lower triangular
    masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (N, T_q, T_k)
    paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
    outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
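The same lower triangular mask can be sketched with `np.tril`. The scores below are toy values (simply 0 to 15) chosen so that the masked entries are easy to spot.

```python
import numpy as np

T = 4
scores = np.arange(T * T, dtype=float).reshape(1, T, T)   # toy scores, (N=1, T_q, T_k)

tril = np.tril(np.ones((T, T)))                        # 1 on and below the diagonal, 0 above
masked = np.where(tril == 0, -2.0 ** 32 + 1, scores)   # the mask broadcasts over the batch axis

print(tril)
print(masked[0, 0])  # the first position can only see itself
```

Row i of the mask keeps columns 0 to i, so position i can attend to itself and to earlier positions only.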

The result obtained is shown in the following figure.

Next, we mask the query part, along the same lines as the key mask. Here, however, instead of replacing the values with a very large negative number, we simply set the padded part to 0:

    query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1)))                     # (N, T_q)
    query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])  # (N, T_q, T_k)
    outputs *= query_masks

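After this step, the final similarity matrices of the Encoder and Decoder are as follows, with the Encoder result at the top and the Decoder result at the bottom.

Next, we apply the softmax operation. Note that in the complete function given later, softmax is applied before the query mask, so that the rows of padded queries end up exactly 0.

    outputs = tf.nn.softmax(outputs)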
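The order of the query mask and the softmax matters, and a small NumPy sketch makes the difference visible. The scores and the mask below are made-up values, and `softmax` is a helper written out for this example.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


scores = np.array([[2.0, 1.0, 0.5],
                   [0.3, 0.9, 1.5]])           # (T_q=2, T_k=3), toy values
query_mask = np.array([[1.0], [0.0]])          # the second query is padding

before = softmax(scores * query_mask)          # mask first, then softmax
after = softmax(scores) * query_mask           # softmax first, then mask

print(before[1])  # uniform weights: the padded query still attends to every key
print(after[1])   # all zeros: the padded query contributes nothing
```

Zeroing a row before softmax turns it into a uniform distribution rather than removing it, which is why the complete function applies the query mask after the softmax.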

Having obtained the Attention weight matrix, we multiply it with Value to obtain the attention-weighted result.

This part is a simple matrix multiplication. The code is as follows.

    outputs = tf.matmul(outputs, V)  # (N, T_q, C)

This is not the final result, however. The paper also uses a residual connection, i.e., the result is added to the input queries.

    outputs += queries

The complete code for Scaled Dot-Product Attention is as follows.

    def scaled_dotproduct_attention(queries, keys, num_units=None,
                                    num_heads=0,
                                    dropout_rate=0,
                                    is_training=True,
                                    causality=False,
                                    scope="mulithead_attention",
                                    reuse=None):
        with tf.variable_scope(scope, reuse=reuse):
            if num_units is None:
                num_units = queries.get_shape().as_list()[-1]
            # Linear projection
            Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C)
            K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
            V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
            # Scaled dot-product similarity
            outputs = tf.matmul(Q, tf.transpose(K, [0, 2, 1]))              # (N, T_q, T_k)
            outputs = outputs / (K.get_shape().as_list()[-1] ** 0.5)
            # Key mask: padded keys have all-zero embeddings (set in the embedding step),
            # so their sum is 0; those positions get a very large negative score.
            key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))
            key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])
            paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)
            # Causal mask: hide future positions from the model
            if causality:
                diag_vals = tf.ones_like(outputs[0, :, :])
                tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()
                masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])
                paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
                outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
            outputs = tf.nn.softmax(outputs)
            # Query mask
            query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1)))
            query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])
            outputs *= query_masks
            # Dropout
            outputs = tf.layers.dropout(outputs, rate=dropout_rate,
                                        training=tf.convert_to_tensor(is_training))
            # Weighted sum
            outputs = tf.matmul(outputs, V)
            # Residual connection
            outputs += queries
            # Normalize (normalize is a helper from the accompanying code, not shown here)
            outputs = normalize(outputs)
        return outputs

Multi-Head Attention performs Scaled Dot-Product Attention h times and then combines the outputs. In the paper, it is structured as follows.

The schematic of this part is shown below, where we repeat the same operation 3 times to get one result matrix per head, then concatenate the result matrices and apply one more linear transformation to get the final result.

Scaled Dot-Product Attention can be seen as Multi-Head Attention with only one head. The code is much the same as for Scaled Dot-Product Attention, so we post it directly.

    def multihead_attention(queries, keys, num_units=None,
                            num_heads=8,  # the paper uses 8 heads; 0 would make tf.split fail
                            dropout_rate=0,
                            is_training=True,
                            causality=False,
                            scope="mulithead_attention",
                            reuse=None):
        with tf.variable_scope(scope, reuse=reuse):
            if num_units is None:
                num_units = queries.get_shape().as_list()[-1]
            # Linear projection
            Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C)
            K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
            V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
            # Split and concat: heads are stacked along the batch axis
            Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h)
            K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
            V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
            outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))    # (h*N, T_q, T_k)
            outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
            # Key mask: padded keys have all-zero embeddings (set in the embedding step),
            # so their sum is 0; those positions get a very large negative score.
            key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))
            key_masks = tf.tile(key_masks, [num_heads, 1])
            key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])
            paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)
            # Causal mask: hide future positions from the model
            if causality:
                diag_vals = tf.ones_like(outputs[0, :, :])
                tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()
                masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])
                paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
                outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
            outputs = tf.nn.softmax(outputs)
            # Query mask
            query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1)))
            query_masks = tf.tile(query_masks, [num_heads, 1])
            query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])
            outputs *= query_masks
            # Dropout
            outputs = tf.layers.dropout(outputs, rate=dropout_rate,
                                        training=tf.convert_to_tensor(is_training))
            # Weighted sum
            outputs = tf.matmul(outputs, V_)
            # Restore shape: (h*N, T_q, C/h) -> (N, T_q, C)
            outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)
            # Residual connection
            outputs += queries
            # Normalize (normalize is a helper from the accompanying code, not shown here)
            outputs = normalize(outputs)
        return outputs
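The split-and-concat trick in the function above, and the step that restores the shape at the end, can be checked with a small NumPy sketch. The sizes are toy values chosen only for illustration.

```python
import numpy as np

N, T, C, num_heads = 2, 3, 8, 4         # toy sizes, chosen only for illustration
Q = np.arange(N * T * C, dtype=float).reshape(N, T, C)

# Split the channel axis into heads and stack the heads along the batch axis:
# (N, T, C) -> (num_heads * N, T, C / num_heads)
Q_ = np.concatenate(np.split(Q, num_heads, axis=2), axis=0)
print(Q_.shape)  # (8, 3, 2)

# Restore: split the batch axis back into heads and concatenate the channels.
restored = np.concatenate(np.split(Q_, num_heads, axis=0), axis=2)
print(np.array_equal(restored, Q))  # True
```

After the split, the first N rows of `Q_` hold head 0 for the whole batch, the next N rows hold head 1, and so on. Tiling the masks with `[num_heads, 1]` produces the same layout, so each head sees the mask of its own sentence.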

After the Attention operation, each layer in the encoder and decoder contains a fully connected feed-forward network that is applied to the vector at each position separately and identically. It consists of two linear transformations with a ReLU activation in between.

The code is as follows.

    def feedforward(inputs, num_units=[2048, 512], scope="multihead_attention", reuse=None):
        with tf.variable_scope(scope, reuse=reuse):
            # Inner layer: a 1x1 convolution, i.e. the same dense layer at every position
            params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                      "activation": tf.nn.relu, "use_bias": True}
            outputs = tf.layers.conv1d(**params)
            # Readout layer
            params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                      "activation": None, "use_bias": True}
            outputs = tf.layers.conv1d(**params)
            # Residual connection
            outputs += inputs
            # Normalize (normalize is a helper from the accompanying code, not shown here)
            outputs = normalize(outputs)
        return outputs
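A convolution with `kernel_size` 1 applies the same weights to every position independently, which is what "position-wise" means here. The NumPy sketch below illustrates this with random stand-in weights and toy sizes; it leaves out the residual connection and the normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff = 3, 4, 16             # toy sizes, chosen only for illustration

# Random stand-ins for the trained weights of the two layers
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)


def ffn(x):
    """Two linear maps with a ReLU in between, applied to the last axis."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2


x = rng.normal(size=(T, d_model))
whole = ffn(x)                                           # all positions at once
per_position = np.stack([ffn(x[t]) for t in range(T)])   # one position at a time

print(np.allclose(whole, per_position))  # True: positions do not interact
```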

The Encoder has N (default is 6) layers, and each layer consists of two sub-layers: 1) the first sub-layer is the multi-head self-attention mechanism, which computes the self-attention of the input; 2) the second sub-layer is simply a fully connected network. Each sub-layer is wrapped in a residual connection, as in the following network schematic.

Using the functions we just defined, the complete code is as follows.

    with tf.variable_scope("encoder"):
        # Embedding
        self.enc = embedding(self.x,
                             vocab_size=len(de2idx),
                             num_units=hp.hidden_units,
                             zero_pad=True,  # keep the padding embedding at 0
                             scale=True,
                             scope="enc_embed")
        # Positional encoding
        if hp.sinusoid:
            self.enc += positional_encoding(self.x,
                                            num_units=hp.hidden_units,
                                            zero_pad=False,
                                            scale=False,
                                            scope="enc_pe")
        else:
            self.enc += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0),
                                          [tf.shape(self.x)[0], 1]),
                                  vocab_size=hp.maxlen,
                                  num_units=hp.hidden_units,
                                  zero_pad=False,
                                  scale=False,
                                  scope="enc_pe")
        # Dropout
        self.enc = tf.layers.dropout(self.enc, rate=hp.dropout_rate,
                                     training=tf.convert_to_tensor(is_training))
        # Blocks
        for i in range(hp.num_blocks):
            with tf.variable_scope("num_blocks_{}".format(i)):
                # Multi-head self-attention
                self.enc = multihead_attention(queries=self.enc,
                                               keys=self.enc,
                                               num_units=hp.hidden_units,
                                               num_heads=hp.num_heads,
                                               dropout_rate=hp.dropout_rate,
                                               is_training=is_training,
                                               causality=False)
                # Feed forward
                self.enc = feedforward(self.enc, num_units=[4 * hp.hidden_units, hp.hidden_units])

The Decoder has N (default is 6) layers, and each layer consists of three sub-layers: 1) The first is masked multi-head self-attention, which also computes the self-attention of the input. Because decoding is a generative process, at time step i only the results of earlier time steps are available, not those after i, so a mask is needed. 2) The second sub-layer computes attention over the encoder output. It is still a multi-head attention structure, except that the queries come from the decoder and the keys and values come from the output of the encoder. 3) The third sub-layer is the fully connected network, the same as in the Encoder.

The network is shown in the following diagram.

The code is as follows.

    with tf.variable_scope("decoder"):
        # Embedding
        self.dec = embedding(self.decoder_inputs,
                             vocab_size=len(en2idx),
                             num_units=hp.hidden_units,
                             scale=True,
                             scope="dec_embed")
        # Positional encoding (positional_encoding takes no vocab_size argument)
        if hp.sinusoid:
            self.dec += positional_encoding(self.decoder_inputs,
                                            num_units=hp.hidden_units,
                                            zero_pad=False,
                                            scale=False,
                                            scope="dec_pe")
        else:
            self.dec += embedding(
                tf.tile(tf.expand_dims(tf.range(tf.shape(self.decoder_inputs)[1]), 0),
                        [tf.shape(self.decoder_inputs)[0], 1]),
                vocab_size=hp.maxlen,
                num_units=hp.hidden_units,
                zero_pad=False,
                scale=False,
                scope="dec_pe")
        # Dropout
        self.dec = tf.layers.dropout(self.dec, rate=hp.dropout_rate,
                                     training=tf.convert_to_tensor(is_training))
        # Blocks
        for i in range(hp.num_blocks):
            with tf.variable_scope("num_blocks_{}".format(i)):
                # Multi-head attention (masked self-attention)
                self.dec = multihead_attention(queries=self.dec,
                                               keys=self.dec,
                                               num_units=hp.hidden_units,
                                               num_heads=hp.num_heads,
                                               dropout_rate=hp.dropout_rate,
                                               is_training=is_training,
                                               causality=True,
                                               scope="self_attention")
                # Multi-head attention (encoder-decoder attention)
                self.dec = multihead_attention(queries=self.dec,
                                               keys=self.enc,
                                               num_units=hp.hidden_units,
                                               num_heads=hp.num_heads,
                                               dropout_rate=hp.dropout_rate,
                                               is_training=is_training,
                                               causality=False,
                                               scope="vanilla_attention")
                # Feed forward
                self.dec = feedforward(self.dec, num_units=[4 * hp.hidden_units, hp.hidden_units])

The output of the decoder goes through a fully connected layer and a softmax to get the final result, schematically as follows.
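This last step can be sketched in NumPy as follows. It is only an illustration under stated assumptions: the sizes are toy values, `dec_out` is a random stand-in for the decoder output, `W` is a hypothetical projection matrix, and picking the most likely word with `argmax` is one simple way to read off a prediction.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
T, hidden_units, vocab_size = 3, 4, 6            # toy sizes, chosen only for illustration

dec_out = rng.normal(size=(T, hidden_units))     # stand-in for the decoder output
W = rng.normal(size=(hidden_units, vocab_size))  # hypothetical projection weights

logits = dec_out @ W                             # fully connected layer: (T, vocab_size)
probs = softmax(logits)                          # one distribution over the vocabulary per position
preds = probs.argmax(axis=-1)                    # most likely word id at each position

print(probs.shape)  # (3, 6)
print(preds.shape)  # (3,)
```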

This completes our introduction to the full Transformer architecture. If anything in the text is unclear or incorrect, you are welcome to leave a comment with corrections!

1. Original paper: https://arxiv.org/abs/1706.03762
2. https://mp.weixin.qq.com/s/RLxWevVWHXgX-UcoxDS70w
3. https://github.com/princewen/tensorflow_practice/tree/master/basic/Basic-Transformer-Demo

**About the author**

Shi Xiaowen is a graduate student at the School of Information, Renmin University of China, and an algorithm intern at Meituan Takeaway.

Jianshu ID: Shi Xiaowen's Learning Diary (https://www.jianshu.com/u/c5df9e229a67)

Tianshan (Hellobi) Community: https://www.hellobi.com/u/58654/articles

Tencent Cloud: https://cloud.tencent.com/developer/user/1622140