How CNN applies to NLP
I won't explain here what convolution or a convolutional neural network is; Google that if needed. Let's start directly with the application to natural language processing (so, how does any of this apply to NLP?).
Unlike image pixels, the input in natural language processing is a matrix representing a sentence or a passage, where each row of the matrix corresponds to one token, either a word or a character. Each row is therefore a vector: it can be a word embedding such as word2vec or GloVe, or it can be a one-hot vector. If a sentence has 10 words and each word is a 100-dimensional word vector, we get a 10x100 matrix, which plays the role that the image plays in image recognition (it is the input).
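As a concrete illustration, here is a minimal sketch of turning a sentence into such a matrix. The toy vocabulary, the random embeddings and the 100-dimensional size are assumptions for illustration, not taken from the article.

```python
import numpy as np

# Toy setup (assumed): a tiny vocabulary and random 100-dimensional
# "word vectors" standing in for word2vec/GloVe embeddings.
rng = np.random.default_rng(0)
vocab = {"the": 0, "movie": 1, "was": 2, "not": 3, "amazing": 4}
embedding_dim = 100
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def sentence_to_matrix(tokens):
    """Stack one embedding row per token: shape (num_tokens, embedding_dim)."""
    return np.stack([embeddings[vocab[t]] for t in tokens])

sentence = ["the", "movie", "was", "not", "amazing"]
matrix = sentence_to_matrix(sentence)
print(matrix.shape)  # (5, 100) -- a 10-word sentence would give (10, 100)
```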
In images, the filter slides over local patches of the image, while in NLP the filter slides over complete rows of the input matrix. That is, the width of the filter is the same as the width of the input matrix (in other words, the width of the filter equals the dimension of the word vector). Its height is typically a sliding window covering 2-5 words. Putting this together, a CNN for NLP looks like this:
There are 3 filter sizes, with sliding-window heights of 2, 3 and 4, and 2 filters of each size. Next comes a discussion of CNN's shortcomings for NLP (which I don't fully understand). RNNs fit our intuition about how language is understood better, so there is a gap between the model and the way language actually works, yet CNNs still perform well in NLP. The same complaint applies to the bag-of-words model: it also works, and who knows why.
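To make the mechanics concrete, here is a minimal numpy sketch of a single full-width filter (height 3, width equal to the embedding dimension) sliding down a sentence matrix. The matrix and filter values are random, and the ReLU activation is my choice for illustration, not something specified above.

```python
import numpy as np

rng = np.random.default_rng(1)
sentence_matrix = rng.normal(size=(10, 100))          # 10 words x 100-dim embeddings
filter_height = 3                                     # window covers 3 words
conv_filter = rng.normal(size=(filter_height, 100))   # full-width filter
bias = 0.0

# Narrow convolution with stride 1: slide the window down the sentence.
features = []
for start in range(sentence_matrix.shape[0] - filter_height + 1):
    window = sentence_matrix[start:start + filter_height]       # (3, 100) patch
    features.append(np.maximum(np.sum(window * conv_filter) + bias, 0.0))  # ReLU
feature_map = np.array(features)
print(feature_map.shape)  # (8,) -- one value per window position: 10 - 3 + 1
```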
Another advantage of CNNs is speed; the comparison here is with n-gram models. We all know how frighteningly large the dimensionality of a 3-gram vector-space model gets, and even Google cannot handle models beyond 5-grams. This is where CNNs have an advantage: using a window of size n in the CNN input layer is similar to using n-grams. (Couldn't agree more; personally I think part of the credit goes to word embeddings. But not all of it, because even with one-hot vectors the dimensionality does not grow with the window size, whereas in n-gram models it grows with n.)

CNN hyperparameters
(This is the hands-on part; it is essential for understanding the models and the code.)

Narrow vs. wide convolution
In a narrow convolution, the convolution starts at the first position and the window slides by a fixed stride; the left part of the figure below is a narrow convolution. Notice that positions near the edges get covered by fewer windows. Hence the wide convolution, where the edges are padded with zeros before convolving. Two cases are common: full padding, as in the right part of the figure, where the output is larger than the input; and just enough zero padding so that the output and the input have the same size. The original article gives a formula for the output size: n_out = n_in + 2*n_padding - n_filter + 1 (for stride 1). For full padding, n_padding = n_filter - 1; for output equal to input you need n_padding = (n_filter - 1)/2, which mainly comes down to parity and is presumably one reason convolution kernels are usually odd-sized. (Think about why.)
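These three cases map directly onto numpy's 1-D convolution modes, which gives a quick numerical check of the formula. A toy 1-D signal stands in for a sentence here, with a kernel of width 5; the values are arbitrary.

```python
import numpy as np

signal = np.arange(10, dtype=float)   # n_in = 10 positions
kernel = np.ones(5)                   # n_filter = 5 (odd)

print(len(np.convolve(signal, kernel, mode="valid")))  # 6  = 10 - 5 + 1           (narrow)
print(len(np.convolve(signal, kernel, mode="same")))   # 10 = same size as input
print(len(np.convolve(signal, kernel, mode="full")))   # 14 = 10 + 2*(5-1) - 5 + 1 (wide)
```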
Stride size
This parameter is very simple: it is how far the convolution kernel moves at each step. In the two figures below, the stride on the left is 1 and on the right is 2. (Can you spot what the convolution kernel is?)
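A minimal sketch of the effect of the stride on the number of output positions, assuming a toy 1-D input of length 5 and a kernel of width 3 (values chosen only for illustration):

```python
import numpy as np

x = np.arange(5, dtype=float)          # positions 0..4
kernel = np.array([1.0, 1.0, 1.0])

def conv1d_positions(x, kernel, stride):
    """Return (window start, output) pairs for a narrow convolution."""
    k = len(kernel)
    starts = range(0, len(x) - k + 1, stride)
    return [(s, float(np.dot(x[s:s + k], kernel))) for s in starts]

print(conv1d_positions(x, kernel, stride=1))  # 3 windows: starts 0, 1, 2
print(conv1d_positions(x, kernel, stride=2))  # 2 windows: starts 0, 2
```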
The stride is usually set to 1; larger strides appear in some models that behave more like RNNs.

Pooling layers
A pooling layer generally follows the convolution layer. The most common choice is max-pooling (take the largest value), and the stride is generally the same as the size of the max-pooling window. (A representative operation in NLP is to pool over the entire output, so that each filter produces only one value.)
Why pool at all? Two reasons. First, it provides a fixed-size output, which the subsequent fully connected layer needs. Second, it reduces the dimensionality while (hopefully) preserving most of the information. The article notes that this effectively asks whether a word appears in the sentence without caring where in the sentence it appears, which is the same idea as the bag-of-words model. The difference is that local information is kept: "not amazing" and "amazing not" look very different to the model. (Worth thinking about; noting it here.)
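Here is a minimal sketch of the max-over-time pooling just described, applied to the feature map of one filter; the feature-map values are made up for illustration.

```python
import numpy as np

# One filter applied to an 8-position sentence gives an 8-value feature map;
# max-over-time pooling keeps only the single largest activation.
feature_map = np.array([0.1, 0.0, 2.3, 0.4, 0.0, 1.1, 0.0, 0.2])
pooled = feature_map.max()   # one value per filter
print(pooled)                # 2.3 -- "did this filter's pattern occur anywhere?"
```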
Channels

There is not much to say here: channels are just several layers of input. Images generally have 1 or 3 channels (grayscale and RGB, respectively). NLP inputs can also have multiple channels, for example different word embeddings for the same sentence, or even different languages.
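A minimal sketch of a two-channel NLP input, stacking two different embedding matrices for the same sentence; both matrices are random stand-ins for, say, word2vec and GloVe vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
num_words, dim = 10, 100
word2vec_like = rng.normal(size=(num_words, dim))   # channel 1 (stand-in)
glove_like = rng.normal(size=(num_words, dim))      # channel 2 (stand-in)

# Shape (channels, words, embedding_dim), analogous to an RGB image's
# (channels, height, width); a 2-D convolution layer would take in_channels=2.
two_channel_input = np.stack([word2vec_like, glove_like])
print(two_channel_input.shape)  # (2, 10, 100)
```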
Applications of CNN in NLP

In NLP, CNNs are most often used for text classification, such as sentiment analysis, spam detection and topic classification. Because the pooling operation discards the position information of words, CNNs are hard to apply to POS tagging and entity extraction. That is not to say it cannot be done; you just need to add position information to the features. Below are the papers on CNNs for NLP that the author reviewed.
Here is an example from paper [1]; the model is very simple. The input layer is the sentence represented by word2vec word vectors, followed by a convolutional layer, then a max-pooling layer, and finally a fully connected softmax classifier. The paper also experiments with two channels, one static and one dynamic; the dynamic one changes during training (the word vectors are updated; which parameters stay fixed is unclear to me, noting it here). Papers [2][6] use more than one layer to achieve "semantic clustering."
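Below is a minimal PyTorch sketch of the kind of architecture described in [1]: embedding lookup, parallel convolutions with several window heights, max-over-time pooling, dropout and a linear softmax classifier. The vocabulary size, the number of filters, the window heights (3, 4, 5) and the use of randomly initialized rather than pretrained word2vec embeddings are assumptions for illustration; this is a sketch of the general architecture, not a reproduction of the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """Kim-style sentence classifier: embed -> conv (several window sizes)
    -> max-over-time pooling -> dropout -> linear layer (softmax via the loss)."""

    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=100,
                 window_sizes=(3, 4, 5), num_classes=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv2d per window size; each filter spans the full embedding width.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, kernel_size=(h, embed_dim))
            for h in window_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                   # (batch, sentence_len)
        x = self.embedding(token_ids).unsqueeze(1)  # (batch, 1, len, embed_dim)
        pooled = []
        for conv in self.convs:
            feature_maps = F.relu(conv(x)).squeeze(3)      # (batch, filters, len-h+1)
            pooled.append(feature_maps.max(dim=2).values)  # max over time
        out = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(out)                         # class scores (logits)

# Usage sketch: a batch of 4 sentences, each padded to 20 token ids.
model = SentenceCNN()
logits = model(torch.randint(0, 10000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```

Training this with a cross-entropy loss applies the softmax implicitly; the paper's two-channel (static plus dynamic) variant would feed a second embedding of the same sentence as an extra input channel to each Conv2d.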
[4] does not pre-train embeddings such as word2vec first; it simply and crudely uses one-hot vectors. In [5] the authors say their model performs very well on long texts. In summary, word vectors help more on short texts than on long texts.
What you have to decide to build a CNN model: 1) how to vectorize the input; 2) the size and number of the convolution kernels; 3) the type of pooling layer; 4) the activation function. Building a good model takes many experiments, and the authors say that if you cannot build a better model yourself, imitating theirs is good enough. They also offer a few pieces of experience: 1) max-pooling works better than average-pooling; 2) the filter size matters a lot; 3) regularization does not seem to matter much; 4) ideally the texts should be of similar length.
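As a compact illustration of that checklist, here is how the four decisions plus the practical tips might look as a hypothetical configuration; the concrete values are my assumptions, not recommendations from the papers.

```python
# Hypothetical hyperparameter choices mirroring the checklist above.
cnn_config = {
    "input_representation": "word2vec",  # 1) how to vectorize the input
    "window_sizes": (3, 4, 5),           # 2) convolution kernel sizes...
    "num_filters": 100,                  #    ...and how many of each
    "pooling": "max-over-time",          # 3) pooling type (max > average, per tip 1)
    "activation": "relu",                # 4) activation function
    "pad_to_length": 50,                 # keep text lengths similar (tip 4)
}
print(cnn_config)
```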
I won't go through the remaining papers here.
[1] Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of EMNLP 2014, 1746–1751.
[2] Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. ACL, 655–665.
[3] Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. COLING 2014, 69–78.
[4] Johnson, R., & Zhang, T. (2015). Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. NAACL 2015.
[5] Johnson, R., & Zhang, T. (2015). Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding.
[6] Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings of ACL 2015, 352–357.
[7] Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
[8] Nguyen, T. H., & Grishman, R. (2015). Relation Extraction: Perspective from Convolutional Neural Networks. Workshop on Vector Space Modeling for NLP, 39–48.
[9] Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., & Wang, X. (2015). Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation. IJCAI, 1333–1339.
[10] Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via Convolutional Deep Neural Network. COLING, 2335–2344.
[11] Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2014). Modeling Interestingness with Deep Neural Networks.
[12] Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. Proceedings of CIKM '14, 101–110.
[13] Weston, J., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags, 1822–1827.
[14] Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech Tagging. Proceedings of ICML-14, 1818–1826.
[15] Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification, 1–9.
[16] Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. arXiv e-prints.
[17] Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language Models.
Article reproduced from: http://blog.csdn.net/zhdgk19871218/article/details/51387197