Abstract: As the core technology of most computer vision systems, CNNs have made great contributions to the field of image classification. Starting from computer vision use cases, this article introduces CNNs, then explains how they are applied to natural language processing and what advantages they offer there.
When we hear about convolutional neural networks (CNNs), we tend to think of computer vision. CNNs have made great contributions to image classification and are the core technology of most computer vision systems today, from Facebook's automatic photo tagging to self-driving cars.
More recently, we have started to apply CNNs to natural language processing (NLP) and have achieved some notable results. In this article I will summarize what CNNs are and how they are applied to NLP. The intuitions behind CNNs are easier to understand for computer vision, so I will start there and then slowly transition to natural language processing.
What is a convolution operation?
For me, the easiest way to understand convolution is to think of it as a sliding window function applied to a matrix. That is a bit of a mouthful, but it becomes intuitive once you see it visualized.
Convolution with a 3x3 filter. Image source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Think of the matrix on the left as a black-and-white image. Each element corresponds to one pixel, 0 for black and 1 for white (for a grayscale image the pixel values typically range from 0 to 255). The sliding window is called a kernel, filter, or feature detector. Here we use a 3x3 filter: we multiply its values element-wise with the part of the matrix it covers, then sum the products. We slide the window so that it sweeps over every position in the matrix, which gives us the convolution of the entire image.
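The sliding window described above can be sketched in a few lines of plain Python. This is a minimal illustration, not an efficient implementation: the function name convolve2d and the 5x5 image and 3x3 filter values are assumptions chosen for this example. As is usual in deep learning, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, step of 1) and return the
    sum of element-wise products at every window position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Illustrative binary image (0 = black, 1 = white) and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

A 5x5 input and a 3x3 filter give a 3x3 output, because the window fits in only three positions along each axis.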
You may be wondering what this operation is actually good for. Let's look at a few intuitive examples.
Blurring an image is achieved by replacing each pixel with the mean of its neighboring pixel values:
Edge detection is achieved by replacing each pixel with the difference between its own value and the values of its neighbors:
(To understand this intuitively, think about the smooth parts of an image, where a pixel has almost the same color as the pixels around it: the sum comes out close to 0, which corresponds to black. If there is a sharp edge, such as a boundary between black and white, the differences between pixel values are large and the result corresponds to white.)
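To make the two effects concrete, here is a small sketch that applies a mean filter and a difference filter to a tiny image with one vertical edge. The two kernels are common textbook choices and are assumptions of this example, not necessarily the ones used in the original figures.

```python
def convolve2d(image, kernel):
    """Sum of element-wise products at every window position (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


# Mean of the 3x3 neighborhood: blurs the image.
blur_kernel = [[1 / 9] * 3 for _ in range(3)]

# Eight times the center minus its eight neighbors: responds to differences.
edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

# Left half black (0), right half white (1): a vertical edge in the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

blurred = convolve2d(image, blur_kernel)
edges = convolve2d(image, edge_kernel)

print([round(v, 2) for v in blurred[0]])  # [0.0, 0.33, 0.67, 1.0]
print(edges[0])                           # [0, -3, 3, 0]
```

The blur turns the hard 0-to-1 jump into a gradual ramp. The edge filter outputs 0 in the flat regions and a large magnitude only next to the boundary.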
There are more examples in the GIMP manual. To learn more about the fundamentals of convolution, I recommend reading Chris Olah's blog posts on the topic.
What is a convolutional neural network?
Now you know what a convolution is. But what is a CNN? A CNN is essentially several layers of convolutions with a non-linear activation function, such as ReLU or tanh, applied to the output of each layer. In a traditional feedforward neural network, we connect each input neuron to every output neuron in the next layer. This is called a fully connected layer, or affine layer. CNNs do not do this. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to one neuron in the output. Each layer applies many different filters, typically hundreds or thousands, and combines their results. There are also pooling (subsampling) layers, which I will explain later. During training, a CNN automatically learns the values of its filters based on the task you want to perform. For example, in image classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial contours, in deeper layers. The last layer is a classifier that uses these high-level features.
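As a rough sketch of a single convolutional layer, the code below applies several filters to the same input, adds a bias, and passes the result through ReLU. The filter values are set by hand purely for illustration; in a real CNN they would be learned during training. The names conv_layer and relu are assumptions of this example.

```python
def convolve2d(image, kernel):
    """Sum of element-wise products at every window position (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


def relu(x):
    """Non-linear activation: negative responses are clipped to zero."""
    return max(0.0, x)


def conv_layer(image, filters, biases):
    """Apply every filter to the same input, add its bias, then ReLU.
    Returns one feature map per filter."""
    feature_maps = []
    for kernel, bias in zip(filters, biases):
        raw = convolve2d(image, kernel)
        feature_maps.append([[relu(v + bias) for v in row] for row in raw])
    return feature_maps


# Hand-set 2x2 filters (a trained network would learn these values).
vertical_edge = [[-1, 1], [-1, 1]]
horizontal_edge = [[-1, -1], [1, 1]]

# 4x4 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1] for _ in range(4)]

maps = conv_layer(image, [vertical_edge, horizontal_edge], biases=[0.0, 0.0])
print(maps[0][0])  # vertical-edge filter fires at the boundary
print(maps[1][0])  # horizontal-edge filter stays silent
```

Each output value depends only on a 2x2 patch of the input, which is the local connectivity described above. A fully connected layer would instead connect every output to all 16 pixels.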
Two aspects of this computation are worth our attention: location invariance and compositionality. Say you want to classify whether or not there is an elephant in an image. Because the filters slide over the whole image, you do not need to care exactly where the elephant appears. In practice, pooling also contributes some invariance to translation, rotation, and scaling. The second key aspect is (local) compositionality. Each filter composes a local patch of lower-level features into a higher-level representation. This is why CNNs are so effective in computer vision. It makes intuitive sense that lines are built from pixels, basic shapes from lines, and more complex objects from basic shapes.
So, how do you use them for NLP?
The input to most NLP tasks is not image pixels but a sentence or document represented as a matrix. Each row of the matrix corresponds to one token, typically a word, though it could also be a character. In other words, each row is a vector that represents a word. Typically these vectors are word embeddings (low-dimensional representations) such as word2vec or GloVe, but they can also be one-hot vectors that index the word in a vocabulary. For a 10-word sentence using 100-dimensional embeddings, we get a 10x100 matrix as input. This matrix is our "image".
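Here is a minimal sketch of how such an input matrix can be assembled. Random vectors stand in for real word2vec or GloVe embeddings, and the sentence, the vocabulary, and all names are made up for this example.

```python
import random

random.seed(0)
EMBEDDING_DIM = 100

# A 10-word example sentence ("the" occurs twice).
sentence = "the quick brown fox jumps over the lazy dog today".split()
vocabulary = sorted(set(sentence))

# Hypothetical embedding table: random vectors stand in for word2vec/GloVe.
embeddings = {
    word: [random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
    for word in vocabulary
}

# One row per word: a 10x100 sentence matrix.
sentence_matrix = [embeddings[word] for word in sentence]
print(len(sentence_matrix), len(sentence_matrix[0]))  # 10 100

# Alternative input: one-hot rows indexed by position in the vocabulary.
word_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    vector = [0] * len(vocabulary)
    vector[word_index[word]] = 1
    return vector


one_hot_matrix = [one_hot(word) for word in sentence]
print(len(one_hot_matrix), len(one_hot_matrix[0]))  # 10 9
```

With one-hot rows the matrix width equals the vocabulary size, which is why dense embeddings are usually the more compact choice for realistic vocabularies.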
In computer vision, our filters slide over small local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (whole words). Therefore, the width of the filter is equal to the width of the input matrix. The height, or region size, may vary, but sliding windows over 2 to 5 words at a time are typical. Putting it all together, a convolutional neural network for NLP looks like this (take a few minutes to understand the picture and how the dimensions change; you can ignore the pooling operation for now, we will explain it later):
Convolutional neural network (CNN) architecture for sentence classification. Here we use three filter region sizes: 2, 3, and 4 rows, each with two filters. Every filter performs a convolution over the sentence matrix and generates a feature map (of variable length). Then 1-max pooling is applied to each feature map, that is, only the largest value of each feature map is recorded. In this way, a univariate feature is generated from each of the six maps, and these six features are concatenated to form the feature vector for the penultimate layer. The final softmax layer receives this feature vector as input and uses it to classify the sentence; we assume binary classification here, so there are two possible output states. Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
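The forward pass in the figure can be sketched roughly as follows. This is not the authors' code: the weights are random and untrained, the sentence size (7 words, 5 dimensions) and the ReLU activation are assumptions, and all function names are made up for this example. It only shows how the dimensions change from sentence matrix to class probabilities.

```python
import math
import random

random.seed(42)


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def conv_over_words(sentence_matrix, filt):
    """Slide a full-width filter down the rows (words) of the sentence matrix."""
    height, width = len(filt), len(filt[0])
    feature_map = []
    for start in range(len(sentence_matrix) - height + 1):
        total = sum(
            sentence_matrix[start + i][j] * filt[i][j]
            for i in range(height) for j in range(width)
        )
        feature_map.append(max(0.0, total))  # ReLU activation (assumed)
    return feature_map


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


NUM_WORDS, EMBED_DIM, NUM_CLASSES = 7, 5, 2
sentence_matrix = random_matrix(NUM_WORDS, EMBED_DIM)

# Three region sizes (2, 3, 4 words), two filters each: six filters in total.
filters = [random_matrix(h, EMBED_DIM) for h in (2, 3, 4) for _ in range(2)]

feature_maps = [conv_over_words(sentence_matrix, f) for f in filters]
print([len(fm) for fm in feature_maps])  # [6, 6, 5, 5, 4, 4]

# 1-max pooling: keep one number per feature map, then concatenate.
pooled = [max(fm) for fm in feature_maps]

# Softmax classifier on the 6-dimensional feature vector (untrained weights).
class_weights = random_matrix(NUM_CLASSES, len(pooled))
scores = [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]
probabilities = softmax(scores)
print(len(pooled), len(probabilities))  # 6 2
```

The feature maps have different lengths because taller filters fit in fewer positions, yet pooling reduces each of them to a single value.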
What about the nice intuitions we had for computer vision? Location invariance and local compositionality made intuitive sense for images, but not so much for NLP. You probably do care a lot about where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same is not always true for words. In many languages, the parts of a phrase can be separated by several other words. The compositional aspect is not obvious either. Words clearly compose in some ways, such as an adjective modifying a noun, but what the higher-level representations actually mean is not as obvious as it is in computer vision.
Given all this, CNNs do not seem to be a good fit for NLP tasks. Recurrent neural networks make more intuitive sense. They resemble how we process language (or at least how we think we do): reading sequentially from left to right. Fortunately, this does not mean that CNNs do not work. All models are wrong, but some are useful. It turns out that CNNs applied to NLP problems perform quite well. The simple bag-of-words model is obviously an oversimplification built on incorrect assumptions, yet it has been the standard approach in NLP for years and has led to pretty good results.
A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of computer graphics and are implemented at the hardware level on GPUs. Compared to something like n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything beyond 3-grams quickly becomes expensive. Even Google does not go beyond 5-grams. Convolutional filters learn good representations automatically, without needing to represent the whole vocabulary. It is perfectly reasonable to use filters larger than 5 rows. I like to think that many of the filters learned in the first layer capture features quite similar (but not limited) to n-grams, while representing them in a more compact way.
CNN hyperparameters
Before explaining how CNNs are applied to NLP tasks, let's look at some of the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in the field.
Narrow convolution vs wide convolution
When I explained convolutions above, I neglected a small detail of how we apply the filter. Applying a 3x3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the top-left element, which has no neighbors above it or to its left? The solution is zero-padding: all elements that would fall outside of the matrix are taken to be zero. This lets you apply the filter to every element of the input matrix and get an output of the same size or larger. Using zero-padding is also called wide convolution, and not using it is called narrow convolution. An example in 1D looks like this:
Narrow vs. wide convolution. The filter has size 5 and the input has size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)
Wide convolution is useful, or even necessary, when the filter is large relative to the input. In the example above, the narrow convolution yields an output of size (7-5)+1=3, and the wide convolution an output of size (7+2*4-5)+1=11. More generally, the output size is n_out = (n_in + 2*n_padding - n_filter) + 1.
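The arithmetic above is easy to check with a short sketch. The helper names output_length and convolve1d are made up for this example, and the input and filter values are arbitrary.

```python
def output_length(n_in, n_filter, n_padding=0):
    """Output size of a 1D convolution with a step of 1."""
    return (n_in + 2 * n_padding - n_filter) + 1


def convolve1d(signal, filt, padding=0):
    """1D convolution with `padding` zeros added on both sides."""
    padded = [0] * padding + list(signal) + [0] * padding
    return [
        sum(padded[start + k] * filt[k] for k in range(len(filt)))
        for start in range(len(padded) - len(filt) + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]   # input of size 7
filt = [1, 1, 1, 1, 1]           # filter of size 5

narrow = convolve1d(signal, filt)             # no padding
wide = convolve1d(signal, filt, padding=4)    # filter size - 1 zeros per side

print(len(narrow), output_length(7, 5))       # 3 3
print(len(wide), output_length(7, 5, 4))      # 11 11
print(narrow)                                 # [15, 20, 25]
```

With padding equal to the filter size minus one, every input element is seen by the filter in every possible alignment, which is why the wide output is longer than the input.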
Stride
Another hyperparameter of the convolution is the stride (step size), which defines how far the filter shifts at each step. In all the examples above the stride was 1, and consecutive applications of the filter overlapped. A larger stride means fewer applications of the filter and a smaller output. The following illustration from the Stanford cs231n course website shows strides of 1 and 2 applied to a one-dimensional input:
Convolution stride. Left: stride 1. Right: stride 2. Source: http://cs231n.github.io/convolutional-networks/
In the literature we typically see a stride of 1, but a larger stride may let you build a model that behaves somewhat like a recursive neural network, that is, one whose structure looks like a tree.
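A small sketch shows how the step size changes the output. The function name and the input and filter values are illustrative assumptions.

```python
def convolve1d_strided(signal, filt, stride=1):
    """Apply `filt` at every `stride`-th position of `signal` (no padding)."""
    positions = range(0, len(signal) - len(filt) + 1, stride)
    return [
        sum(signal[p + k] * filt[k] for k in range(len(filt)))
        for p in positions
    ]


signal = [0, 1, 2, -1, 1, -3, 0]
filt = [1, 0, -1]

step_1 = convolve1d_strided(signal, filt, stride=1)
step_2 = convolve1d_strided(signal, filt, stride=2)

print(step_1)  # [-2, 2, 1, 2, 1]
print(step_2)  # [-2, 1, 1]
```

With a step of 2 the filter is applied at every other position, so the output holds every other value of the step-1 result.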
Pooling Layer
A key aspect of convolutional neural networks is the pooling layer, typically applied after a convolutional layer. Pooling layers subsample their input. The most common way to pool is to apply a max operation to the output of each filter. You do not necessarily need to pool over the whole matrix; you can also pool over a window. For example, the following shows max pooling over a 2x2 window (in NLP we typically pool over the entire output, yielding just a single number per filter):
Max pooling in a CNN. Source: http://cs231n.github.io/convolutional-networks/#pool
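Max pooling over 2x2 windows can be sketched like this. The function name max_pool2d and the 4x4 input values are assumptions of this example.

```python
def max_pool2d(matrix, size=2, stride=2):
    """Keep the largest value in each `size` x `size` window."""
    pooled = []
    for top in range(0, len(matrix) - size + 1, stride):
        row = []
        for left in range(0, len(matrix[0]) - size + 1, stride):
            window = [
                matrix[top + i][left + j]
                for i in range(size) for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 8], [3, 4]]

# In NLP we usually pool over the whole output: one number per filter.
global_max = max(max(row) for row in feature_map)
print(global_max)  # 8
```

The 4x4 input shrinks to 2x2, and each output value is the strongest response found in its window.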
Why pool? There are several reasons.
One property of pooling is that it produces a fixed-size output matrix, which is typically required for classification. For example, if you have 1,000 filters and apply max pooling to the output of each, you get a 1000-dimensional output regardless of the size of your filters or the size of your input. This lets you use sentences of variable length and filters of variable size, yet always obtain output of the same dimensions to feed into the classifier.
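The fixed-size property is easy to demonstrate. In this sketch, 15 random filters of three different heights are applied to a 6-word and a 40-word sentence. All sizes and names are assumptions chosen for the example, and the sentences must be at least as long as the tallest filter.

```python
import random

random.seed(1)
EMBED_DIM = 4  # assumed embedding size for this toy example


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def feature_map(sentence_matrix, filt):
    """Responses of one full-width filter at every word position."""
    height = len(filt)
    return [
        sum(sentence_matrix[start + i][j] * filt[i][j]
            for i in range(height) for j in range(len(filt[0])))
        for start in range(len(sentence_matrix) - height + 1)
    ]


def encode(sentence_matrix, filters):
    """Max pooling over the whole feature map: one number per filter."""
    return [max(feature_map(sentence_matrix, f)) for f in filters]


# 15 filters: five each covering 2, 3, and 4 words.
filters = [random_matrix(h, EMBED_DIM) for h in (2, 3, 4) for _ in range(5)]

short_sentence = random_matrix(6, EMBED_DIM)    # 6 words
long_sentence = random_matrix(40, EMBED_DIM)    # 40 words

short_vector = encode(short_sentence, filters)
long_vector = encode(long_sentence, filters)
print(len(short_vector), len(long_vector))  # 15 15
```

The length of the encoded vector depends only on the number of filters, so the classifier that follows can have a fixed input size.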
Pooling also reduces the output dimensionality while (hopefully) keeping the most salient information. You can think of each filter as detecting a specific feature, such as whether the sentence contains a negation like "not amazing". If this phrase occurs somewhere in the sentence, the filter's output in that region will be large, while its output elsewhere will be small. By taking the maximum you keep the information about whether the feature appeared in the sentence, but you lose the information about exactly where it appeared. But is this location information not really useful? It is, and what happens here is somewhat similar to what a bag of n-grams model does. You lose global information about location (where in the sentence something happens), but you keep the local information captured by your filters, such as "not amazing" meaning something very different from "amazing not".
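A toy example makes this trade-off visible. The two-dimensional "embeddings" and the hand-written bigram filter below are invented for illustration; a real model would learn both.

```python
# Toy embeddings: every word other than "not" and "amazing" maps to zeros.
EMBEDDINGS = {"not": [1, 0], "amazing": [0, 1]}

# Two-word filter that responds most strongly to "not" followed by "amazing".
BIGRAM_FILTER = [[1, 0], [0, 1]]


def embed(sentence):
    return [EMBEDDINGS.get(word, [0, 0]) for word in sentence.split()]


def max_over_time(sentence):
    """Convolve the bigram filter over the sentence, then take the max."""
    rows = embed(sentence)
    responses = [
        sum(rows[start + i][j] * BIGRAM_FILTER[i][j]
            for i in range(2) for j in range(2))
        for start in range(len(rows) - 1)
    ]
    return max(responses)


print(max_over_time("not amazing at all"))        # 2
print(max_over_time("the film was not amazing"))  # 2
print(max_over_time("the film was amazing not"))  # 1
```

The first two sentences get the same score even though the phrase appears in different places, so global position is lost. The third sentence scores lower because the filter is sensitive to the local word order.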
In image recognition, pooling also provides basic invariance to translation (shifting) and rotation. When you pool over a region, the output stays approximately the same even if the image is shifted or rotated by a few pixels, because the max operation picks out the same value regardless.
Channels
The last concept we need to understand is channels. Channels are different "views" of your input data. For example, in image recognition you typically have RGB (red, green, blue) channels. You can apply convolutions across channels, with either different or equal weights. In NLP you could imagine having multiple channels as well: separate channels for different word embeddings (for example word2vec and GloVe), or a channel for the same sentence represented in different languages.
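One simple way to combine channels is to convolve each channel with its own filter and sum the responses at every position. The sketch below does that for two tiny "embedding" channels; the function name, the values, and the choice to sum are assumptions of this example.

```python
def conv_multichannel(channels, filters_per_channel):
    """channels: sentence matrices of identical shape, one per channel.
    filters_per_channel: one full-width filter per channel (same height).
    The per-channel responses are summed at every word position."""
    height = len(filters_per_channel[0])
    num_rows = len(channels[0])
    feature_map = []
    for start in range(num_rows - height + 1):
        total = 0
        for matrix, filt in zip(channels, filters_per_channel):
            for i in range(height):
                for j in range(len(filt[i])):
                    total += matrix[start + i][j] * filt[i][j]
        feature_map.append(total)
    return feature_map


# Two views of the same 3-word sentence (stand-ins for two embedding types).
channel_a = [[1, 0], [0, 1], [1, 1]]
channel_b = [[2, 0], [0, 2], [1, 0]]

filter_a = [[1, 1], [1, 1]]
filter_b = [[1, 0], [0, 1]]

# Different weights per channel.
print(conv_multichannel([channel_a, channel_b], [filter_a, filter_b]))  # [6, 3]

# Equal (shared) weights across channels.
print(conv_multichannel([channel_a, channel_b], [filter_a, filter_a]))  # [6, 6]
```

Either way the result is a single feature map per filter set, so the rest of the network does not need to know how many channels the input had.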
Application of convolutional neural network in natural language processing
Let's now look at some concrete applications of CNNs to natural language processing. I will try to summarize some of the research results. I hope to cover most of the mainstream results, but I will inevitably miss many other interesting applications (please let me know in the comments).
CNNs are best suited to classification tasks such as sentiment analysis, spam detection, and topic categorization. Convolutions and pooling operations lose information about the local order of words, so a pure CNN architecture is harder to apply to sequence tagging tasks such as POS tagging or entity extraction (though not impossible, since you can add positional features to the input).
[1] evaluates a CNN architecture on various classification datasets, mostly comprising sentiment analysis and topic categorization tasks. The CNN architecture performs very well across datasets and sets a new state of the art on a few of them. Surprisingly, the network used in this paper is quite simple, yet it works very well. The input layer is a sentence matrix in which each row is a word2vec word embedding. It is followed by a convolutional layer with multiple filters, then a max pooling layer, and finally a softmax classifier. The paper also experiments with two different channels in the form of static and dynamic word embeddings, where one channel is adjusted during training and the other is not. A similar but somewhat more complex architecture was proposed in [2]. [6] adds an additional layer that performs semantic clustering to this network architecture.
Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification
[4] trains a CNN from scratch, without the need for pre-trained word vectors such as word2vec or GloVe. It applies convolutions directly to one-hot vectors. The author also proposes a space-efficient representation of the input data that reduces the number of parameters the network needs to learn. In [5] the author extends the model with an additional unsupervised "region embedding" that is learned using a CNN predicting the context of text regions. The approaches in these papers seem to work well for long texts (such as movie reviews), but their performance on short texts (such as tweets) is not clear. Intuitively, pre-trained word embeddings should yield larger gains for short texts than for long texts.
Building a CNN architecture means choosing many hyperparameters, some of which I presented above: input representations (word2vec, GloVe, one-hot), number and sizes of convolution filters, pooling strategies (max, average), and activation functions (ReLU, tanh). [7] performs an empirical evaluation of how varying these hyperparameters affects the performance and stability of CNN architectures, based on repeated experiments. If you want to implement your own CNN for text classification, the results of this paper are a good starting point. A few results stand out: max pooling outperforms average pooling; the ideal filter sizes are important but task-dependent; and regularization does not seem to make a big difference in the NLP tasks considered. One caveat is that the documents in the datasets used in this study are all of similar length, so the conclusions may not carry over to texts of very different lengths.
[8] explores CNNs for relation extraction and relation classification tasks. In addition to the word vectors, the authors use the relative positions of words to the entities of interest as input to the convolutional layer. This model assumes that the positions of the entities are given and that each example input contains one relation. [9] and [10] explore similar models.
Another interesting use case of CNNs in NLP can be found in [11] and [12], coming out of Microsoft Research. These papers describe how to learn semantically meaningful representations of sentences that can be used for information retrieval. The example given in the papers is recommending potentially interesting documents to users based on what they are currently reading. The sentence representations are trained on search engine log data.
Most CNN models learn vector representations of words and sentences in one way or another as part of the training process. Not all papers focus on this part of training or examine how meaningful the learned representations are. [13] presents a CNN model that predicts hashtags for Facebook posts. The word vectors it learns were then successfully applied to another task: recommending articles of interest to users based on click-through logs.
Character-level CNNs
So far, all of the models presented operate on words. There has also been research into applying CNNs directly to characters. [14] learns character-level embeddings and combines them with pre-trained word embeddings for part-of-speech tagging. [15] and [16] explore using CNNs to learn directly from characters, without the need for any pre-trained embeddings. Notably, the authors use a relatively deep network with a total of 9 layers and apply it to sentiment analysis and text categorization tasks. The results show that learning directly from character-level input works very well on large datasets (millions of examples) but underperforms simpler models on smaller datasets (hundreds of thousands of examples). [17] explores the application of character-level convolutions to language modeling, using the output of the character-level CNN as the input to an LSTM at each time step. The same model is applied to various languages.
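To give a feel for what character-level input looks like, here is a minimal sketch that turns a string into a matrix with one one-hot row per character. The alphabet, the fixed length, and the function name quantize are assumptions of this example and are not taken from the papers above, which define their own alphabets and sizes.

```python
import string

# Assumed alphabet: lowercase letters, digits, and the space character.
ALPHABET = string.ascii_lowercase + string.digits + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, max_length):
    """One row per character; unknown characters and padding stay all-zero."""
    rows = []
    for ch in text.lower()[:max_length]:
        row = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:
            row[CHAR_INDEX[ch]] = 1
        rows.append(row)
    while len(rows) < max_length:
        rows.append([0] * len(ALPHABET))
    return rows


matrix = quantize("CNNs for NLP!", max_length=16)
print(len(matrix), len(matrix[0]))  # 16 37
```

The resulting matrix plays the same role as the word-level sentence matrix: filters slide over rows, except that each row is now a character rather than a word.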
What is amazing is that essentially all of the papers above were published within the last two years. CNNs have clearly performed well in NLP, and new results and state-of-the-art systems keep appearing at a rapid pace.
If you have any questions or feedback, please leave a comment in the comments section. Thank you for reading!
References
[1] Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 1746–1751.
[2] Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. ACL, 655–665.
[3] Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In COLING-2014 (pp. 69–78).
[4] Johnson, R., & Zhang, T. (2015). Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. To appear: NAACL-2015, (2011).
[5] Johnson, R., & Zhang, T. (2015). Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding.
[6] Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings ACL 2015, 352–357.
[7] Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
[8] Nguyen, T. H., & Grishman, R. (2015). Relation Extraction: Perspective from Convolutional Neural Networks. Workshop on Vector Modeling for NLP, 39–48.
[9] Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., & Wang, X. (2015). Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation, (IJCAI), 1333–1339.
[10] Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via Convolutional Deep Neural Network. COLING, (2011), 2335–2344.
[11] Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2014). Modeling Interestingness with Deep Neural Networks.
[12] Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14), 101–110.
[13] Weston, J., & Adams, K. #TagSpace: Semantic Embeddings from Hashtags, 1822–1827.
[14] Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech Tagging. Proceedings of the 31st International Conference on Machine Learning, ICML-14 (2011), 1818–1826.
[15] Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification, 1–9.
[16] Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. arXiv e-prints, 3, 011102.
[17] Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language Models.
Original article: Understanding Convolutional Neural Networks for NLP (translator: Zhao; reviewers: heredity, Zhu Zhengju; editor: Zhou Jianding)
Translator profile: Zhao, computational advertising engineer at Sogou and former biomedical engineer, focusing on recommendation algorithms and machine learning.
Citation: http://www.csdn.net/article/2015-11-11/2826192
http://blog.sciencenet.cn/blog-1225851-935359.html