Attention Model in natural language processing: what and why


/* Copyright Notice: This article may be freely reproduced; please credit the original source and the author. */

Author: Zhang Junlin


If you follow the progress of deep learning in natural language processing, you have surely come across the term attention model (sometimes abbreviated as the AM model). The AM model is arguably one of the most important advances in NLP over the past year and has proven effective in many scenarios. It sounds impressive, but its basic idea is actually quite intuitive and concise. The author can swear on a lamp: after reading this lengthy article and its sequels, you will thoroughly understand what AM is and be able to read with ease any seemingly complicated formulas in the papers. That is an appealing promise, especially for readers who suffer from "mathematical-formula Parkinson's" — whose hands shake at the sight of an equation.

Before the main act begins, a brief humorous digression.

| Introduction and Nonsense

You may often have heard a man caught in the act lament: a woman's sixth sense is usually dead-on. Of course, "woman" here generally means the man's wife or girlfriend, or possibly a boyfriend of feminine temperament. In my view, a man's sixth sense is not bad either (the "man" here refers to the author himself, not the "male" mentioned above — a special note to avoid confusion). When I first saw the name "attention model" in machine learning, my first instinct was: this is a concept borrowed from the brain's attention mechanism in cognitive psychology. A few years ago, in my youthful ignorance of the ways of love, I was for a while obsessed with how the human brain works; I read a great many books and papers on cognitive psychology, and attention generally gets a chapter of its own in those books. So please allow me to show off some of that profound knowledge.

Attention is actually very interesting, yet easy to overlook. Let us look at the brain's attention model in a visual way. First, open your eyes and confirm that you are in a conscious state. Second, find the phrase "Attention Model" in this sentence and stare at it for three seconds. Now suppose time stops: what is in your eyes and in your head during these three seconds? Right — the two words "Attention Model". But you should realize that your eyes actually take in the whole page, not just those two words; it is only that during those three seconds your attention is locked onto them, as if the world contained nothing but "Attention Model". What is going on? This is the brain's attention model at work: you see the whole picture, but at a given moment your mind and attention focus on one part of it, while the rest is still in your field of view yet receives very few of your attention resources. In fact, as long as your eyes are open, the attention model is constantly at work. When you cross the road, your attention is distributed mostly to the traffic lights and the vehicles, although you see the whole scene. When you run into an attractive member of the opposite sex, your attention is distributed mostly to that person — at that moment you still see the whole world, but the rest of it might as well not exist.

This is the brain's attention model. At bottom, it is a resource-allocation model: at any given moment, your attention is always focused on some part of the picture, and you are effectively blind to the rest.

In fact, the attention mechanism in deep learning is essentially the same as this attention-allocation mechanism — the same one at work when hormones direct your gaze.

Well, the preamble is over; the main act begins.


| Encoder-decoder Frame

This article discusses the AM model only in the text-processing field. Many tasks in image processing, such as image caption generation, also apply the AM model in a variety of scenarios, but we restrict ourselves to the text domain; the mechanism in the image domain is in fact the same.

To introduce the AM model in text processing, we must first talk about the Encoder-Decoder framework, because most present AM models are attached to it. Note, however, that the AM model should be regarded as a general idea: it does not depend on the Encoder-Decoder model.

The Encoder-Decoder framework can be regarded as a research pattern in the text-processing field with a very wide range of applications; it deserves a careful discussion of its own, but since the focus of this article is the AM model, we cover only the essentials here and leave a detailed treatment of Encoder-Decoder models for a later article. Figure 1 is the most abstract representation of the Encoder-Decoder framework commonly used in text processing:

Figure 1. Abstract Encoder-Decoder framework

The Encoder-Decoder framework can be understood intuitively as a generic processing model that takes one sentence (or document) and produces another sentence (or document). For a sentence pair <X, Y>, the goal is, given the input sentence X, to generate the target sentence Y through the Encoder-Decoder framework. X and Y may be in the same language or in two different languages, and each is composed of a sequence of words:

X = (x1, x2, ..., xm),  Y = (y1, y2, ..., yn)

The Encoder, as the name implies, encodes the input sentence X, transforming it through a nonlinear transformation into an intermediate semantic representation C:

C = F(x1, x2, ..., xm)

The task of the Decoder is to generate the word yi at step i, based on the intermediate semantic representation C of sentence X and the history y1, y2, ..., y(i-1) of words already generated:

yi = f(C, y1, y2, ..., y(i-1))

Each yi is produced in turn, so that the whole system generates the target sentence Y from the input sentence X.
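The encode-then-decode loop above can be sketched in a few lines of Python. This is a toy illustration, not the article's model: the "encoder" is just a mean of random embeddings and the "decoder" a greedy argmax over a linear projection; VOCAB, HIDDEN, E, and W are all made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: vocabulary of 10 word ids, hidden size 4 (hypothetical values).
VOCAB, HIDDEN = 10, 4
E = rng.normal(size=(VOCAB, HIDDEN))   # toy embedding table
W = rng.normal(size=(HIDDEN, VOCAB))   # toy output projection

def encode(x_ids):
    """Encoder: map the input word sequence X to one semantic vector C.
    A simple mean of embeddings stands in for the nonlinear transform F."""
    return E[x_ids].mean(axis=0)

def decode_step(C, prev_ids):
    """Decoder: pick the next word y_i from C and the words generated so far."""
    state = C + (E[prev_ids].sum(axis=0) if prev_ids else 0.0)
    logits = state @ W
    return int(np.argmax(logits))

def generate(x_ids, max_len=5):
    C, y = encode(x_ids), []
    for _ in range(max_len):
        y.append(decode_step(C, y))
    return y

print(generate([1, 2, 3]))  # a list of 5 generated word ids
```

Any real Encoder/Decoder pair (RNN, LSTM, CNN, ...) can be dropped in behind the same `encode` / `decode_step` interface; only the internals change.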

Encoder-Decoder is a very general computational framework; which concrete models play the roles of Encoder and Decoder is up to the researcher — common choices include CNN, RNN, BiRNN, GRU, LSTM, deep LSTM, and so on. There is a lot of room for variation here, and a new combination may well yield a paper, so sometimes innovation in research is that simple. For example, I use a CNN as the Encoder and an RNN as the Decoder; you use a BiRNN as the Encoder and a deep LSTM as the Decoder — and that counts as a novelty. So the student standing on the roof, crushed by the pressure to produce something new, can be saved: come down (by the stairs, of course — no jumping), take a good look at this model, and try the various permutations and combinations. As long as you can propose a new combination and show that it works, congratulations, benefactor: you can graduate.

But I digress; let us pull back to the topic.

Encoder-Decoder is an innovation-generating powerhouse. On the one hand, as mentioned above, it admits all sorts of model combinations; on the other, it has a great many application scenarios. For machine translation, <X, Y> are sentences in different languages — say X is an English sentence and Y its Chinese translation. For text summarization, X is an article and Y the corresponding summary. For a dialogue robot, X is what someone says and Y is the robot's reply. In short, the list goes on. So, benefactor, heed these words and hurry down from the rooftop: countless innovations await your exploration.

| Attention Model

The Encoder-Decoder model shown in Figure 1 does not reflect the "attention model", so it can be regarded as a distracted, attention-free model. Why call it distracted? Observe how each word of the target sentence Y is generated:

y1 = f(C)
y2 = f(C, y1)
y3 = f(C, y1, y2)

Here f is the nonlinear transformation function of the Decoder. As you can see, whichever target word is being generated — y1, y2, or y3 — the semantic encoding C of sentence X that they use is exactly the same. Since C is produced by the Encoder from every word of sentence X, this means that whichever word is being generated, every word in X has the same influence on it; there is no distinction. (In fact, if the Encoder is an RNN, later input words in theory have more influence, not equal influence — this is presumably the reason behind the small trick, reported in Google's Sequence to Sequence paper, that feeding the input sentence in reverse order improves translation.) That is why the model reflects no attention: it is like seeing the whole scene in front of you without focusing on any part of it.

The lack of attention in the Encoder-Decoder framework is easier to understand with a machine-translation example. Suppose the input is the English sentence "Tom chase Jerry", and the Encoder-Decoder framework generates the Chinese words one by one: "Tom", "Chase", "Jerry". When translating the word "Jerry", the distraction model treats every English word as equally important to it, which is clearly unreasonable: "Jerry" obviously matters more for translating "Jerry", but the distraction model cannot reflect that. That is why it carries no notion of attention.

When the input sentence is short, the absence of attention is probably not a big problem; but when the input sentence is long, all of its semantics must be squeezed through a single intermediate semantic vector, the individual words are lost, and it is easy to imagine how much detail disappears. This is the important reason for introducing the attention model.

In the example above, if the AM model is introduced, then when translating "Jerry" it should reflect how much each English word influences the translation of the current Chinese word, for instance by assigning a probability distribution such as:

(Tom, 0.3)  (Chase, 0.2)  (Jerry, 0.5)

The probability of each English word represents how much attention the attention-allocation model assigns to that word when translating the current word "Jerry". This certainly helps translate the target word correctly, since new information has been introduced. In the same vein, every word in the target sentence should learn the attention-distribution probabilities over the words of the source sentence. This means that when each word yi is generated, the intermediate semantic representation, originally the same fixed C, is replaced by a Ci that keeps changing according to the word currently being generated. The key to understanding the AM model lies exactly here: the fixed intermediate semantic representation C is replaced by a Ci that adjusts, according to the current output word, to incorporate the attention model. The Encoder-Decoder framework with the AM model added is shown in Figure 2.

Figure 2. Encoder-Decoder framework with the AM model introduced

That is, the process of generating the target-sentence words becomes:

y1 = f(C1)
y2 = f(C2, y1)
y3 = f(C3, y1, y2)

Each Ci may correspond to a different attention distribution over the source-sentence words. For the English-Chinese translation above, the corresponding information might be, for example:

C(Tom)   = g(0.6 · f2("Tom"), 0.2 · f2("Chase"), 0.2 · f2("Jerry"))
C(Jerry) = g(0.3 · f2("Tom"), 0.2 · f2("Chase"), 0.5 · f2("Jerry"))

and similarly for "Chase".

Here f2 is the transformation the Encoder applies to an input English word; for example, if the Encoder is an RNN, the result of f2 is often the hidden-layer state after the word xj is fed in at its time step. g is the function by which the Encoder synthesizes the intermediate semantic representation of the whole sentence from the per-word representations; in general, g is a weighted sum of its constituent elements, that is, the formula often seen in papers:

Ci = Σ(j=1..Tx) aij · hj

Assuming the i in Ci refers to "Tom" above, Tx = 3 is the length of the input sentence, h1 = f2("Tom"), h2 = f2("Chase"), h3 = f2("Jerry"), and the corresponding attention-model weights are 0.6, 0.2, and 0.2 respectively; g is thus a weighted-sum function. Expressed pictorially, the formation of the intermediate semantic representation Ci when translating the Chinese word "Tom" looks like this:

Figure 3. Formation process of Ci
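The weighted sum that g performs can be made concrete with the numbers from the example. The three-dimensional vectors standing in for h1, h2, h3 below are hypothetical toy values; only the weights 0.6, 0.2, 0.2 come from the text.

```python
import numpy as np

# Hidden states h_j for "Tom", "Chase", "Jerry" (toy 3-dim vectors, hypothetical).
h = np.array([[1.0, 0.0, 0.0],   # h1 = f2("Tom")
              [0.0, 1.0, 0.0],   # h2 = f2("Chase")
              [0.0, 0.0, 1.0]])  # h3 = f2("Jerry")

# Attention weights a_ij for the target word "Tom", as given in the article.
a = np.array([0.6, 0.2, 0.2])

# g is a weighted sum: Ci = sum_j a_ij * h_j
c_tom = a @ h
print(c_tom)  # [0.6 0.2 0.2]
```

With these basis-vector stand-ins, Ci simply reproduces the weights, which makes the role of g as a weighted average easy to see.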

Here another question arises: when generating a word in the target sentence, say "Tom", how do we know the probability distribution that the attention model assigns over the words of the input sentence? That is, the distribution corresponding to "Tom":

(Tom, 0.6)  (Chase, 0.2)  (Jerry, 0.2)

How is it obtained?

For the sake of illustration, assume we refine the non-AM Encoder-Decoder framework of Figure 1 by using an RNN for the Encoder and an RNN for the Decoder — a fairly common model configuration. Figure 1 then becomes:

Figure 4. Encoder-Decoder framework with RNNs as the concrete models

With this, a general computational process for the attention-distribution probabilities can be explained conveniently:

Figure 5. Calculating the AM attention-distribution probabilities

For a Decoder that uses an RNN: to generate the word yi at time i, we already know the decoder's hidden-layer state Hi at time i, before yi is produced. Our aim is to compute the attention probabilities of the input words "Tom", "Chase", "Jerry" with respect to yi. We can therefore compare the hidden state Hi, one by one, with the RNN hidden state hj corresponding to each word of the input sentence — that is, obtain an alignment score between the target word yi and each input word through a function F(hj, Hi). (F may be defined differently in different papers.) The outputs of F are then normalized with Softmax, yielding an attention distribution that satisfies the requirements of a probability distribution. Figure 5 shows the alignment probabilities over the input-sentence words when the output word is "Tom". Most AM models use this computational framework to obtain the attention-distribution information; the differences usually lie only in the definition of F.
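The scoring-plus-Softmax step can be sketched as follows. The dot product used for the alignment function F here is just one common choice (papers define F differently), and the toy hidden states are random stand-ins, not values from any real model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_weights(H_i, encoder_states, F=np.dot):
    """Score each source hidden state h_j against the decoder state H_i with
    an alignment function F, then normalize with Softmax so the scores form
    a probability distribution."""
    scores = np.array([F(h_j, H_i) for h_j in encoder_states])
    return softmax(scores)

rng = np.random.default_rng(0)
enc = rng.normal(size=(3, 4))   # h_j for "Tom", "Chase", "Jerry" (toy values)
H_i = rng.normal(size=4)        # decoder hidden state Hi at step i (toy value)
a = attention_weights(H_i, enc)
print(a)  # three probabilities that sum to 1
```

Swapping in a different F — say, a small feed-forward network over the concatenation of hj and Hi — changes only the scoring line; the Softmax normalization stays the same.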

What has been described above is the basic idea of what papers often call the Soft Attention Model; most AM models in the literature are essentially of this form, differing mainly in the application problem they are used to solve. So how should one understand the physical meaning of the AM model? The literature generally finds it reasonable to view the AM model as a word-alignment model. The probability distribution over the input-sentence words that accompanies each generated target word can be understood as the alignment probability between the input words and the target word being generated. This is very intuitive in the machine-translation context: traditional statistical machine translation usually includes a dedicated phrase-alignment step in its pipeline, and the attention model in fact plays the same role. In other applications it is likewise natural to interpret the AM model as the alignment probabilities between input words and target-sentence words.

Conceptually, I also find it reasonable to understand the AM model as an influence model: when generating a target word, how much does each word of the input sentence influence its generation? This is another good way to grasp the physical meaning of the AM model.

Figure 6 is from the paper "A Neural Attention Model for Sentence Summarization", in which Rush uses an AM model for abstractive summarization; it gives a very intuitive example of AM at work.

Figure 6. Sentence summarization example

In this example, the input sentence to the Encoder-Decoder framework is: "Russian defense minister Ivanov called Sunday for the creation of a joint front for combating global terrorism", corresponding to the sentence on the vertical axis of the figure. The system-generated summary is: "Russia calls for joint front against terrorism", corresponding to the sentence on the horizontal axis. The model has clearly extracted the main body of the sentence correctly. Each column of the matrix shows, for a generated target word, the AM-assigned probability over the words of the input sentence; the darker the color, the greater the assigned probability. This example is helpful for understanding AM intuitively.

Finally, an advertisement: besides this article, a sequel will appear next week — from AM to a discussion of two modes of research innovation. Please don't flip the table; stay tuned, and thank you.
