Reposted from: http://blog.csdn.net/malefactor/article/details/51078135
CNN and RNN are currently the two most common deep learning models for natural language processing. Figure 1 shows a typical network structure for using a CNN model on an NLP task. In general, each input word (or character) is represented by a word embedding, so that the one-dimensional text input is converted into a two-dimensional input structure: assuming the input x contains m words and each word's embedding has length d, the input is an m*d two-dimensional matrix.
Figure 1. Typical network structure of a CNN model in natural language processing
As can be seen here, because sentence lengths in NLP vary, the size of the CNN's input matrix is not fixed; it depends on how large m is. The convolution layer is essentially a feature extraction layer. A hyperparameter F specifies how many feature extractors (filters) it contains. For a single filter, imagine a window of size k*d sliding over the input matrix from front to back, where k is the window size specified for the filter and d is the length of the word embedding. For each position of the window, a nonlinear transformation in the neural network converts the input values inside the window into a single feature value; as the window slides backward, the filter keeps producing feature values, which together form that filter's feature vector. This is how the convolution layer extracts features. Each filter operates in this way, and each thus forms a different feature extractor. The pooling layer then reduces the dimensionality of each filter's output to form the final features. In general, a fully connected network is attached after the pooling layer to perform the final classification.
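To make the sliding-window picture concrete, here is a minimal NumPy sketch (not from the original post) of one convolution layer over an m*d embedding matrix; the function name, shapes, and the tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

def conv_features(x, filters, b):
    """x: (m, d) embedding matrix; filters: (F, k, d); b: (F,) bias.
    Returns an (F, m - k + 1) array: one feature vector per filter."""
    m, d = x.shape
    F, k, _ = filters.shape
    out = np.empty((F, m - k + 1))
    for i in range(m - k + 1):              # slide the k*d window down the input
        window = x[i:i + k, :]              # (k, d) slice under the window
        # each filter turns the window into one feature value via a nonlinearity
        out[:, i] = np.tanh(np.tensordot(filters, window, axes=([1, 2], [0, 1])) + b)
    return out

# toy usage: m=7 words, d=5 embedding size, F=3 filters, window size k=2
x = np.random.randn(7, 5)
filters = np.random.randn(3, 2, 5)
feats = conv_features(x, filters, np.zeros(3))
print(feats.shape)                          # (3, 6): 3 filters, 6 window positions
```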
As can be seen, convolution and pooling are the two most important steps in a CNN. Below we focus on the common pooling methods used by CNN models in NLP.
| Max Pooling over time operation in CNN
Max pooling over time is the most common down-sampling operation in CNN models for NLP. It means that, of the series of feature values extracted by a given filter, only the one with the highest score is kept by the pooling layer and all the others are discarded: the maximum value represents only the strongest occurrence of the feature, and weaker occurrences of the same feature are thrown away.
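A minimal sketch of this operation, continuing the NumPy example above (the array values are made up for illustration):

```python
import numpy as np

def max_pooling_over_time(features):
    """features: (F, n) feature values per filter -> (F,) pooled vector."""
    return features.max(axis=1)   # one scalar per filter; positions are discarded

features = np.array([[0.1, 0.9, 0.3],    # filter 1's feature vector
                     [0.5, 0.2, 0.7]])   # filter 2's feature vector
print(max_pooling_over_time(features))   # [0.9 0.7]
```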
There are several advantages to using the max pooling operation in a CNN. First, this operation makes the feature representation invariant to position (and, for images, rotation): no matter where a strong feature appears, it can be detected regardless of its location. For image processing, this position and rotation invariance is a good property, but for NLP it is not necessarily a good thing, because in many NLP applications the position at which a feature appears carries important information. For example, the subject usually appears at the beginning of a sentence and the object generally appears at the end; such positional information is sometimes important for classification tasks, but max pooling basically throws it away.
Second, max pooling reduces the number of model parameters, which helps reduce overfitting. After the pooling operation, a 2D or 1D array is typically reduced to a single value, so for the subsequent convolution layer or fully connected hidden layer, the number of parameters per filter, or the number of hidden-layer neurons, is clearly reduced.
Furthermore, max pooling has an additional benefit for NLP tasks: it turns the variable-length input x into a fixed-length representation. A CNN usually ends with a fully connected layer whose number of neurons must be determined in advance, and if the input length varies, it is hard to design the network structure. As mentioned earlier, the input length of the CNN model is indeterminate; through the pooling operation, each filter is reduced to exactly one value, so the pooling layer has as many neurons as there are filters (Figure 2), and the number of neurons feeding the fully connected layer is thereby fixed. This advantage is also very important.
Figure 2. The number of neurons in the pooling layer equals the number of filters
However, taking max pooling over time in a CNN model also has some notable drawbacks. First, as mentioned above, the positional information of the feature is completely lost in this step. The convolution layer actually preserves where each feature occurred, but by keeping only the single maximum value, the pooling layer now knows only what the maximum is; where it occurred is not preserved. The other obvious disadvantage is that sometimes a strong feature occurs several times. In the familiar TF-IDF formula, for example, TF is the number of times a feature occurs, and more occurrences indicate a stronger feature; but because max pooling keeps only one maximum value, a feature that appears multiple times can now be seen only once, so information about the feature's intensity is lost. These are the two typical drawbacks of max pooling over time.
In fact, as the saying goes, the word "crisis" can be read optimistically as "danger plus opportunity". In the same vein, discovering a model's shortcomings is a good thing, because innovation often comes from improving on those shortcomings. So how can the pooling mechanism be improved to alleviate the above problems? The two common improved pooling mechanisms below do exactly that.
| K-max Pooling
K-max Pooling extends this idea: whereas the original max pooling over time keeps only the single strongest value from the series of feature values produced by the convolution layer, K-max Pooling keeps the top-K feature values by score and preserves their original order (Figure 3 shows 2-max pooling), thereby retaining some additional feature information for subsequent stages.
Figure 3. 2-max Pooling
Obviously, K-max pooling can express the same kind of feature multiple times, i.e., it can express the intensity of a certain kind of feature; and because the relative order of these top-K feature values is preserved, it can be said to retain part of the positional information, although this positional information is only the relative order among the features, not their absolute positions.
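A minimal sketch of K-max pooling in NumPy, keeping the top-k values of each filter's feature vector in their original left-to-right order (k=2 here, matching Figure 3; the values are illustrative):

```python
import numpy as np

def k_max_pooling(features, k):
    """features: (F, n) -> (F, k): top-k values per row, in original order."""
    idx = np.argsort(features, axis=1)[:, -k:]   # positions of the k largest values
    idx = np.sort(idx, axis=1)                   # restore original left-to-right order
    return np.take_along_axis(features, idx, axis=1)

features = np.array([[0.1, 0.9, 0.3, 0.8],
                     [0.5, 0.2, 0.7, 0.4]])
print(k_max_pooling(features, 2))
# [[0.9 0.8]
#  [0.5 0.7]]
```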
| Chunk-max Pooling
The idea of Chunk-max Pooling is to segment the feature vector that the convolution layer produces for a given filter into several chunks, and to take one maximum feature value from each chunk. For example, cut a filter's feature vector into 3 chunks, then take one maximum value in each chunk, obtaining 3 feature values (shown in Figure 4, where different colors represent different chunks).
Figure 4. Chunk-max Pooling
At first glance the idea of Chunk-max Pooling looks similar to K-max Pooling, since both take K feature values from the convolution layer. The main difference is that K-max Pooling is a global top-K operation, whereas Chunk-max Pooling first segments the features and then takes the maximum within each segment, so it is in fact a local top-K feature extraction method.
As for how to divide the chunks, there are different approaches. The number of segments can be set in advance, which is a static chunking strategy; alternatively, the chunk boundaries can be determined dynamically depending on the input, which could be called the dynamic Chunk-max method (this is my own informal name for it, please note).
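A minimal sketch of the static variant in NumPy, splitting each filter's feature vector into a fixed number of chunks and taking the maximum within each (dynamic chunking would instead choose the boundaries from the input itself; the values below are illustrative):

```python
import numpy as np

def chunk_max_pooling(features, num_chunks):
    """features: (F, n) -> (F, num_chunks): one local max per chunk."""
    chunks = np.array_split(features, num_chunks, axis=1)  # roughly equal chunks
    return np.stack([c.max(axis=1) for c in chunks], axis=1)

features = np.array([[0.1, 0.9, 0.3, 0.8, 0.2, 0.6],
                     [0.5, 0.2, 0.7, 0.4, 0.9, 0.1]])
print(chunk_max_pooling(features, 3))
# [[0.9 0.8 0.6]
#  [0.5 0.7 0.9]]
```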
Chunk-max Pooling obviously retains the relative order of multiple local maximum feature values. Although it does not preserve absolute positions, because it first divides the vector into chunks and then takes each maximum separately, it retains relatively coarse-grained, fuzzy positional information. And of course, if a strong feature occurs more than once, it can also capture the feature's intensity.
The paper "Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks" proposes a chunk pooling variant along the lines of dynamic Chunk-max Pooling, and its experiments show a performance improvement. The paper "Local Translation Prediction with Global Sentence Representation" likewise shows that static Chunk-max improves application performance over max pooling over time in machine translation.
If you think about it, you will find that when the positional information of the key features required for classification matters, a mechanism like Chunk-max Pooling, which retains positional information at a coarse granularity, should improve classification performance to some extent; but for many classification problems, plain max pooling over time is probably sufficient.
For example, in sentiment classification the Chunk-max strategy should presumably help, because for an expression pattern like:

"Blah blah blah... praising you at length, but... you are essentially a scumbag."

versus an expression pattern like:

"Although you are a scumbag, but... blah blah blah... I still think you are the best, because you are the most handsome."

the positional information clearly helps to determine the overall sentiment, so introducing positional information should be beneficial.
So, analyze your own problem and see whether position is an important feature. If it is, then apply the Chunk-max strategy and performance should improve, as in the sentiment classification problem discussed above.
MORE: http://blog.csdn.net/stdcoutzyx/article/details/41596663
[Reposted] Understanding convolution && pooling in NLP