Further understanding of recurrent neural networks

**Parameter Learning**

Parameter learning for recurrent neural networks can be performed with the Backpropagation Through Time (BPTT) algorithm [Werbos, 1990]. An example of backpropagation over time is given in Figure 6.6.

Take stochastic gradient descent as an example. Given a training sample $(x, y)$, where $x = (x_1, \dots, x_T)$ is an input sequence of length $T$ and $y = (y_1, \dots, y_T)$ is a label sequence of the same length, there is a supervision signal $y_t$ at every time step $t$, and we define the loss at time $t$ as

$$\mathcal{L}_t = \mathcal{L}\big(y_t, g(h_t)\big),$$

where $g(h_t)$ is the network output at time $t$.

Here $\mathcal{L}$ is a differentiable loss function, such as the cross-entropy. The loss on the whole sequence is then

$$\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t.$$
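As a concrete illustration, the per-step losses and their sum over the sequence can be sketched as follows (a minimal sketch, not the book's code: the function names and the one-hot toy data are assumptions, and the per-step loss is taken to be cross-entropy):

```python
import numpy as np

def cross_entropy(y_true, y_prob):
    """Per-time-step loss L(y_t, g(h_t)); y_true is a one-hot label vector."""
    return -np.sum(y_true * np.log(y_prob + 1e-12))

def sequence_loss(ys, y_probs):
    """Whole-sequence loss: the sum of the per-time-step losses L_t, t = 1..T."""
    return sum(cross_entropy(y_t, p_t) for y_t, p_t in zip(ys, y_probs))

# toy sequence of length T = 2 with 3 output classes
ys      = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
y_probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
total = sequence_loss(ys, y_probs)   # -log 0.7 - log 0.8
```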

The gradient of the whole-sequence loss $\mathcal{L}$ with respect to the parameter $u_{ij}$ is

$$\frac{\partial \mathcal{L}}{\partial u_{ij}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial^{+} z_k}{\partial u_{ij}} \frac{\partial \mathcal{L}_t}{\partial z_k}, \tag{6.11}$$

where $\frac{\partial^{+} z_k}{\partial u_{ij}}$ denotes the "direct" partial derivative, i.e., the derivative with respect to the $u_{ij}$ that appears explicitly in Eq. (6.2), with the hidden state $h_{k-1}$ held fixed. This gives

$$\frac{\partial^{+} z_k}{\partial u_{ij}} = \big[\,0, \dots, 0, [h_{k-1}]_j, 0, \dots, 0\,\big]^{\top}, \tag{6.12}$$

where $[h_{k-1}]_j$ is the $j$-th dimension of the hidden state at time $k-1$; the vector is zero in every row except the $i$-th.

We define $\delta_{t,k} = \frac{\partial \mathcal{L}_t}{\partial z_k}$ as the derivative of the loss at time $t$ with respect to the net input $z_k$ of the hidden neurons at time $k$. Then

$$\delta_{t,k} = \frac{\partial h_k}{\partial z_k} \frac{\partial z_{k+1}}{\partial h_k} \frac{\partial \mathcal{L}_t}{\partial z_{k+1}} = \operatorname{diag}\big(f'(z_k)\big)\, U^{\top} \delta_{t,k+1}. \tag{6.15}$$

Substituting Eqs. (6.15) and (6.12) into Eq. (6.11) yields

$$\frac{\partial \mathcal{L}}{\partial u_{ij}} = \sum_{t=1}^{T} \sum_{k=1}^{t} [\delta_{t,k}]_i \, [h_{k-1}]_j. \tag{6.16}$$

Written in matrix form, this becomes

$$\frac{\partial \mathcal{L}}{\partial U} = \sum_{t=1}^{T} \sum_{k=1}^{t} \delta_{t,k}\, h_{k-1}^{\top}. \tag{6.17}$$

Similarly, the gradients of $\mathcal{L}$ with respect to the weights $W$ and the bias $b$ are

$$\frac{\partial \mathcal{L}}{\partial W} = \sum_{t=1}^{T} \sum_{k=1}^{t} \delta_{t,k}\, x_k^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b} = \sum_{t=1}^{T} \sum_{k=1}^{t} \delta_{t,k}.$$
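The whole computation — the backward recursion for the per-step error term $\delta_{t,k} = \partial \mathcal{L}_t / \partial z_k$ and the gradient sums for $U$, $W$, and $b$ — can be sketched in NumPy. This is an illustrative sketch, not the book's code: the function name, the placeholder-at-index-0 list layout, and the toy inputs are assumptions; `f_prime` is the elementwise derivative of the activation $f$.

```python
import numpy as np

def bptt_grads(xs, hs, zs, deltas_out, U, f_prime):
    """Sketch of BPTT: apply delta_{t,k} = diag(f'(z_k)) U^T delta_{t,k+1}
    and accumulate dL/dU = sum_t sum_{k<=t} delta_{t,k} h_{k-1}^T
    (and likewise for W and b).

    Lists are indexed by time; hs[0] is the initial hidden state, and
    xs[0], zs[0], deltas_out[0] are unused placeholders.
    deltas_out[t] is the loss gradient dL_t/dz_t injected at time t.
    """
    T = len(xs) - 1
    dU = np.zeros_like(U)
    dW = np.zeros((U.shape[0], xs[1].shape[0]))
    db = np.zeros(U.shape[0])
    for t in range(1, T + 1):
        delta = deltas_out[t]                    # delta_{t,t}
        for k in range(t, 0, -1):
            dU += np.outer(delta, hs[k - 1])
            dW += np.outer(delta, xs[k])
            db += delta
            if k > 1:                            # step back to delta_{t,k-1}
                delta = f_prime(zs[k - 1]) * (U.T @ delta)
    return dU, dW, db

# toy usage: 2 hidden units, scalar inputs, T = 2 time steps
U = np.eye(2)
xs = [None, np.array([1.0]), np.array([2.0])]
hs = [np.array([1.0, 1.0]), np.array([2.0, 3.0]), None]
zs = [None, np.zeros(2), None]
deltas_out = [None, np.array([1.0, 0.0]), np.array([0.0, 1.0])]
dU, dW, db = bptt_grads(xs, hs, zs, deltas_out, U, lambda z: np.ones_like(z))
```

With $U = I$ and $f' \equiv 1$ the backward recursion simply copies each $\delta$, so the three accumulated gradients can be checked by hand against Eqs. (6.11)–(6.17).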

**The Long-Term Dependency Problem**

Expanding the recursion in Eq. (6.15) yields

$$\delta_{t,k} = \left( \prod_{\tau=k}^{t-1} \operatorname{diag}\big(f'(z_\tau)\big)\, U^{\top} \right) \delta_{t,t}.$$

Define $\gamma \triangleq \big\| \operatorname{diag}\big(f'(z_\tau)\big)\, U^{\top} \big\|$; the product above then behaves like $\gamma^{t-k}$. If $\gamma > 1$, then $\gamma^{t-k} \to \infty$ as $t-k \to \infty$, which makes the system unstable; this is known as the Gradient Exploding Problem. Conversely, if $\gamma < 1$, then $\gamma^{t-k} \to 0$ as $t-k \to \infty$, and a gradient vanishing problem similar to that of deep feedforward neural networks occurs (the Gradient Vanishing Problem).

In practice, recurrent neural networks usually take the logistic or tanh function as the nonlinear activation, whose derivative is everywhere less than 1; moreover, the weight matrix $U^{\top}$ is usually not very large either, so the gradient vanishing problem occurs far more often. Defining $\big\|U^{\top}\big\| \le \gamma_u \le 1$ and $\big\|\operatorname{diag}\big(f'(z_\tau)\big)\big\| \le \gamma_f \le 1$, we have

$$\|\delta_{t,k}\| \le (\gamma_u \gamma_f)^{t-k}\, \|\delta_{t,t}\|.$$

After $t-k$ steps of backpropagation the gradient is thus attenuated by a factor of at most $(\gamma_u \gamma_f)^{t-k}$; when the time interval $t-k$ is large, $\delta_{t,k}$ converges to 0.
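The effect of $\gamma$ is easy to observe numerically. The toy below (an illustration only: $U$ is taken as a scaled identity and the net input $z$ is held fixed, which is not the case in a real network) repeatedly applies the per-step factor $\operatorname{diag}(f'(z))\,U^{\top}$ to a backpropagated vector:

```python
import numpy as np

# f = tanh, so f'(z) = 1 - tanh(z)^2; hold a net input z fixed for illustration
n = 10
z = np.linspace(-1.0, 1.0, n)
d = 1 - np.tanh(z) ** 2                  # entries of diag(f'(z))

def backprop_norm(scale, steps=50):
    """Norm of delta after `steps` applications of diag(f'(z)) U^T, U = scale * I."""
    U = scale * np.eye(n)
    delta = np.ones(n)
    for _ in range(steps):
        delta = d * (U.T @ delta)        # one step of the backward recursion
    return np.linalg.norm(delta)

print(backprop_norm(0.5))   # gamma < 1: the gradient vanishes
print(backprop_norm(3.0))   # gamma > 1: the gradient explodes
```

After only 50 steps the two regimes already differ by tens of orders of magnitude, which is exactly the $\gamma^{t-k}$ behaviour described above.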

Although a simple recurrent network can in theory establish dependencies between states separated by long time intervals, in practice it can only learn short-term dependencies because of the gradient exploding or vanishing problem. This is the so-called Long-Term Dependencies Problem.

**Improvement Schemes**

To avoid the gradient exploding or vanishing problem, one has to keep $\operatorname{diag}\big(f'(z_i)\big)\, U^{\top} \approx 1$. The most straightforward way is to choose suitable parameters and use a non-saturating activation function, but this requires considerable manual tuning experience, which limits the model's wider use.

A better approach than tuning parameters is to change the model itself, for example by fixing $U = I$ while requiring $f'(z_i) = 1$, i.e.,

$$h_t = h_{t-1} + g(x_t; W),$$

where $W$ is the weight parameter and $g(\cdot)$ is a nonlinear activation function.

This improved model makes the dependence between $h_t$ and $h_{t-1}$ linear with weight coefficient 1, so the gradient neither explodes nor vanishes. However, the change also removes the nonlinear activation of the neuron on the feedback edge, and therefore reduces the representational power of the model.
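The claim about the gradient can be checked directly: in $h_t = h_{t-1} + g(x_t; W)$ the Jacobian of $h_t$ with respect to $h_{t-1}$ along the feedback path is the identity, so the backpropagated gradient is unchanged over any time interval (a minimal sketch under that assumption):

```python
import numpy as np

n, T = 4, 100
J = np.eye(n)              # dh_t/dh_{t-1} = I for h_t = h_{t-1} + g(x_t; W)
grad = np.ones(n)
for _ in range(T):
    grad = J.T @ grad      # 100 backward steps leave the gradient unchanged
```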

To avoid this drawback, we can adopt a more effective improvement strategy: introduce an additional state $c_t$ dedicated to linear feedback propagation, while the information in $c_t$ is passed to $h_t$ nonlinearly:

$$c_t = c_{t-1} + g(x_t, h_{t-1}; W),$$
$$h_t = f(c_t),$$

where $h_t$ is still the state of the neuron at time $t$, and $c_t$ is a new memory cell that stores information from past time steps.

However, this improvement still has a problem: the memory cell $c$ keeps accumulating new input information and eventually saturates. Suppose $g(\cdot)$ is the logistic function; then $c$ grows steadily as time goes on, driving $h$ into saturation. In other words, the amount of information a memory cell can store, its Network Capacity, is limited; as the cell stores more and more content, it also loses more and more information. To solve this problem, Hochreiter and Schmidhuber [1997] proposed an elegant solution: introduce a gating mechanism that controls the rate at which information accumulates and can selectively forget previously accumulated information. This is the long short-term memory (LSTM) network introduced in the next section.
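The saturation effect is easy to reproduce with a scalar toy (an illustrative sketch, assuming $g$ is the logistic function and the nonlinear read-out $f$ is tanh; the constant input is an assumption for simplicity):

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

# scalar toy memory cell: c accumulates linearly, h reads it out nonlinearly
c, hs = 0.0, []
for t in range(100):
    c = c + logistic(0.5)   # every increment lies in (0, 1), so c only grows
    hs.append(np.tanh(c))   # once c is large, h is pinned near 1

# h quickly saturates and then stops responding to new inputs,
# i.e. the cell's capacity is exhausted
```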

Author: Qiu Xipeng, Associate Professor at the School of Computer Science, Fudan University, and Chief Scientist of Rhino Language Technology.

His main research interests are deep learning, natural language processing, question answering systems, and representation learning. He has published more than 40 academic papers in CCF-A-ranked journals and conferences such as ACL, EMNLP, and IJCAI, and is the project leader and lead developer of the open-source natural language processing toolkit FudanNLP.