
Natural Language Processing - Issue 16 - Neural Network Translation: Beam Search


Background

Starting from the twelfth issue, we introduced the Seq2Seq model and the Attention Mechanism. These frameworks are core components of neural machine translation (NMT) and chatbots. This issue brings you the Beam Search part.

Review

Seq2Seq Model

Seq2Seq is essentially Encoder-Decoder + Sequence: encode one sequence, then decode it into another sequence.

But one obvious problem with Seq2Seq is that when decoding a single vector into a sequence, if the sentence is too long, the result will be less than ideal (because of gradient vanishing or exploding).

Attention Mechanism

The problem Seq2Seq faces is that the Encoder outputs only one vector, and from that single vector we need to decode an entire long sentence.

The Attention Mechanism's solution is also very straightforward: add another vector, the Context Vector, on top of the original one (in the literal sense of telling the Decoder where to pay attention).

This Context Vector is a weighted combination of the Hidden States of the individual words in the source sequence. When we translate "I", most of the weight is assigned to the Hidden State output for the word "I".

Put another way, it achieves a soft alignment: each translated word is softly paired with the source word(s) it translates.

Put yet another way, this works like a Dynamic Memory Network, which can dynamically retrieve the Hidden State that is needed.
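To make the weighted-combination idea concrete, here is a minimal NumPy sketch of computing a Context Vector from the encoder Hidden States. The dot-product scoring and every name here are illustrative assumptions, not the exact formulation of any particular attention paper.

```python
# A minimal sketch of a Context Vector: a weighted combination of the encoder
# Hidden States, with weights given by a softmax over alignment scores.
import numpy as np

def context_vector(decoder_state, encoder_states):
    # encoder_states: (src_len, hidden_dim), one Hidden State per source word
    # decoder_state:  (hidden_dim,), the current Decoder state
    scores = encoder_states @ decoder_state        # one alignment score per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax -> attention weights
    return weights @ encoder_states                # weighted combination of Hidden States

# Toy usage: 4 source words, hidden size 8
encoder_states = np.random.randn(4, 8)
decoder_state = np.random.randn(8)
print(context_vector(decoder_state, encoder_states).shape)  # (8,)
```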

More details on these can be found in issues 12-15.

Beam Search

From the review above, we can see that both Seq2Seq and the Attention Mechanism address the problem of what input the Decoder receives.

With the basic Seq2Seq model, the Decoder does not get enough Hidden State information, so the Attention Mechanism supplies an additional, dynamic vector.

But one question that hasn't been addressed in previous issues is: once the Decoder receives that vector, how does it actually unfold it into a complete sentence?

In fact, what the Decoder does at each step is, given the input vector, compute a probability distribution over the vocabulary.

Below is a screenshot from Andrew Ng's course, where he computes the probability of each word in a 10,000-word vocabulary.

Greedy Search

Now we know that the Decoder's task is to compute a probability distribution over words. The most straightforward method is simply to take the word with the highest probability at each step. This is known as Greedy Search.

However, Greedy Search doesn't work well, and that's understandable: it doesn't consider the probability gap. When two words have very close probabilities, choosing the one with the slightly higher probability is rather arbitrary.
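To pin down what "take the word with the highest probability" means in practice, here is a minimal sketch of Greedy Search. The decoder_step function, the token ids, and the vocabulary size are hypothetical stand-ins for a real trained Decoder.

```python
# A minimal Greedy Search sketch: at every step, pick the single most probable word.
import numpy as np

VOCAB_SIZE = 10_000   # assumed vocabulary size, as in the course example
EOS_ID = 1            # assumed end-of-sentence token id
MAX_LEN = 50

def decoder_step(prev_token, state):
    """Stand-in for one Decoder step: returns (probabilities over the vocabulary, new state)."""
    logits = np.random.randn(VOCAB_SIZE)           # a real model would compute these
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state

def greedy_decode(start_token=0, state=None):
    tokens = [start_token]
    for _ in range(MAX_LEN):
        probs, state = decoder_step(tokens[-1], state)
        next_token = int(np.argmax(probs))         # always take the single most probable word
        tokens.append(next_token)
        if next_token == EOS_ID:
            break
    return tokens

print(greedy_decode())
```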

Beam Search

Before explaining Beam Search, let's look at an example, which I found very helpful for understanding it.

Nowadays, many translation models predict not words but letters. So suppose that, during decoding, we have already generated the four letters "happ"; the following are likely continuations:

Happy

Happiness

Happen

For this example, Greedy Search simply chooses the letter with the highest probability. Personally, I'd guess the probability is highest for "y", since Happy has a higher word frequency (this is just a casual hypothesis; the actual probability depends mostly on the training text, not on our personal intuitions). Greedy Search will then "decisively" choose "y", i.e. Happy.

But there's actually a more sensible way to do it.

Step 1. When predicting the fifth letter, keep the two most likely choices, assumed here to be (i, e).

Step 2. For each of the two choices from Step 1, find the two most probable next letters, say (in, im, en, er).

Step 3. Decide which letter should appear in the fifth position based on the two-letter combinations, by calculating the probability of each of (in, im, en, er).
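Here is a toy walkthrough of those three steps for the "happ" example. All the probabilities are made up for illustration; a real Decoder would produce them.

```python
# Toy illustration of the three steps above (all probabilities are invented).

# Step 1: the two most likely fifth letters (assumed probabilities)
step1 = {"i": 0.40, "e": 0.35}

# Step 2: for each candidate, the two most likely following letters (assumed)
step2 = {
    "i": {"n": 0.50, "m": 0.20},
    "e": {"n": 0.45, "r": 0.30},
}

# Step 3: score every two-letter combination by its joint probability
combos = {
    first + second: p1 * p2
    for first, p1 in step1.items()
    for second, p2 in step2[first].items()
}
best = max(combos, key=combos.get)
print(combos)          # joint probabilities: in=0.20, im=0.08, en=0.1575, er=0.105 (up to float rounding)
print("happ" + best)   # 'happin' -- heading toward "happiness" with these made-up numbers
```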

For language in particular, this approach has obvious advantages. Because, as you know, English has many roots and suffixes, such as er, on, ion, ing, ow, and so on, which all appear as combinations of letters.

Beam Search is one such method. Its biggest improvement over Greedy Search is that it considers not only the probability of a single word, but also the probability of several words taken together. This can be very effective for language, which has established patterns of its own.

In addition, Beam Search lets you specify the number of beams (the beam width), for example two or three.

To summarize.

The most basic Beam Search is actually a very simple idea: when computing the probability distribution, don't keep only the single most probable word. Instead, keep the few most probable words as alternatives and look one step further ahead, then score each alternative by combining its probability with that of the next few words.
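Putting that summary into code, here is a minimal sketch of Beam Search with a configurable beam width, under the same assumptions as the Greedy Search sketch above: decoder_step and the token ids are hypothetical stand-ins for a real model.

```python
# A minimal Beam Search sketch: keep the beam_width best hypotheses at every step,
# scored by their cumulative log-probability.
import numpy as np

VOCAB_SIZE = 10_000
EOS_ID = 1
MAX_LEN = 50

def decoder_step(prev_token, state):
    """Stand-in for one Decoder step: returns (probabilities over the vocabulary, new state)."""
    logits = np.random.randn(VOCAB_SIZE)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state

def beam_search(start_token=0, beam_width=3):
    # Each hypothesis is (tokens, cumulative log-probability, decoder state)
    beams = [([start_token], 0.0, None)]
    for _ in range(MAX_LEN):
        candidates = []
        for tokens, score, state in beams:
            if tokens[-1] == EOS_ID:               # finished hypotheses are carried over
                candidates.append((tokens, score, state))
                continue
            probs, new_state = decoder_step(tokens[-1], state)
            # Expand only the beam_width most probable next words for this hypothesis
            for tok in np.argsort(probs)[-beam_width:]:
                candidates.append((tokens + [int(tok)],
                                   score + float(np.log(probs[tok])),
                                   new_state))
        # Keep the beam_width best hypotheses overall, not just the single best
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == EOS_ID for tokens, _, _ in beams):
            break
    return beams[0][0]                             # tokens of the highest-scoring hypothesis

print(beam_search())
```

Hypotheses are scored by summed log-probabilities rather than multiplied probabilities, which combines the per-step probabilities while avoiding numerical underflow on long sentences.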

Next Issue Preview

Beam Search, like Greedy Search, actually has a problem: predicting every single word requires traversing the whole vocabulary, which is costly for large vocabularies. And Beam Search increases the workload further by requiring several such traversals.

Although I don't fully understand them yet, there are certainly improved methods for this problem. I'll bring them to you once I've figured them out.

