The text classification task can use a CNN to extract key information from sentences, similar to n-grams.
The detailed process schematic of TextCNN is shown below:
Keras Code:
```python
# Rebuilt from the original snippet; assumes Keras 2-style imports.
from keras.layers import (Input, Embedding, Dense, Activation, BatchNormalization,
                          Conv1D, GlobalMaxPooling1D, concatenate, TimeDistributed,
                          Bidirectional, GRU, Dropout)
from keras.models import Model


def convs_block(data, convs=[3, 3, 4, 5, 5, 7, 7], f=256):
    """TextCNN block: parallel Conv1D branches with different kernel sizes,
    each followed by max-over-time pooling, concatenated into one vector."""
    pools = []
    for c in convs:
        conv = Activation(activation="relu")(BatchNormalization()(
            Conv1D(filters=f, kernel_size=c, padding="valid")(data)))
        pool = GlobalMaxPooling1D()(conv)
        pools.append(pool)
    return concatenate(pools)


def rnn_v1(seq_length, embed_weight, pretrain=False):
    """Companion model from the original snippet: a BiGRU classifier on top of
    a frozen pre-trained embedding (the pretrain flag is kept from the original)."""
    main_input = Input(shape=(seq_length,), dtype='float64')
    in_dim, out_dim = embed_weight.shape
    embedding = Embedding(input_dim=in_dim, weights=[embed_weight],
                          output_dim=out_dim, trainable=False)
    content = Activation(activation="relu")(
        BatchNormalization()(TimeDistributed(Dense(256))(embedding(main_input))))
    content = Bidirectional(GRU(256))(content)
    content = Dropout(0.3)(content)
    fc = Activation(activation="relu")(
        BatchNormalization()(Dense(256)(content)))
    main_output = Dense(3, activation='softmax')(fc)

    model = Model(inputs=main_input, outputs=main_output)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()
    return model
```
The description is as follows:
- Input layer
Suppose the sentence has n words and each word vector has dimension k; the input is then an n × k matrix.
The matrix can be static or dynamic (non-static). In static mode the word vectors are fixed; in dynamic mode the word vectors are treated as parameters to be optimized during training, so back-propagation changes their values. This process is called fine-tuning.
For out-of-vocabulary (unknown) words, the vectors can be filled with zeros or small random values.
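A minimal Keras sketch of this input layer (the sizes, the OOV indices and the way embedding_matrix is filled are illustrative assumptions, not values from the paper); the trainable flag switches between static and non-static mode:

```python
import numpy as np
from keras.layers import Input, Embedding

vocab_size, embed_dim, seq_length = 10000, 300, 50        # hypothetical sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))
# Rows for known words would be copied from pre-trained vectors here;
# out-of-vocabulary words instead get zeros or small random values.
oov_ids = [7, 42]                                         # hypothetical OOV row ids
for i in oov_ids:
    embedding_matrix[i] = np.random.uniform(-0.25, 0.25, embed_dim)

tokens = Input(shape=(seq_length,), dtype='int32')
# trainable=False -> static mode; trainable=True -> non-static (fine-tuned) mode
sentence_matrix = Embedding(input_dim=vocab_size, output_dim=embed_dim,
                            weights=[embedding_matrix],
                            trainable=False)(tokens)
# sentence_matrix has shape (batch, seq_length, embed_dim), i.e. the n x k matrix
```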
- Convolutional layer
Feature maps are obtained by applying convolutions to the input layer. The convolution window has size m × k, where m plays the role of the n in n-gram; the convolution produces F feature maps, each a single column, where F is the number of convolution kernels (filters).
- Pooling layer
The paper uses a method called max-over-time pooling, which simply takes the maximum value of each feature map from the previous layer; the interpretation is that the maximum value represents the most important signal in that feature map.
The output of the pooling layer is therefore the maximum of each feature map, i.e. a one-dimensional vector of size 1 × F.
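Continuing the sketch above from the sentence_matrix tensor, the convolution plus max-over-time pooling stage could look like this (F and the window sizes are illustrative):

```python
from keras.layers import Conv1D, GlobalMaxPooling1D, concatenate

F = 100                                   # number of filters per window size
pooled = []
for m in (3, 4, 5):                       # m plays the role of the n in n-gram
    # Each filter spans m words and the full embedding width k,
    # giving a feature map of length n - m + 1.
    feature_map = Conv1D(filters=F, kernel_size=m, padding='valid',
                         activation='relu')(sentence_matrix)
    # Max-over-time pooling keeps only the strongest signal of each feature map.
    pooled.append(GlobalMaxPooling1D()(feature_map))
sentence_vector = concatenate(pooled)     # one fixed-length vector per sentence
```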
- Fully connected + softmax layer
The one-dimensional output vector of the pooling layer is fed through a fully connected layer into a softmax layer.
In the actual implementation, dropout can be applied to the penultimate (fully connected) layer, together with an L2-norm constraint on the weights of the fully connected layer. The benefit is that it prevents the hidden units from co-adapting, thereby reducing overfitting.
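Continuing the same sketch, the last stage applies dropout to the penultimate layer and a max-norm constraint on the fully connected weights (num_classes is illustrative; this mirrors the norm constraint described above rather than an L2 penalty added to the loss):

```python
from keras.layers import Dense, Dropout
from keras.constraints import max_norm

num_classes = 2                                     # illustrative
dropped = Dropout(0.5)(sentence_vector)             # dropout on the penultimate layer
probabilities = Dense(num_classes, activation='softmax',
                      kernel_constraint=max_norm(3))(dropped)   # constrain weight norm
```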
Experimental section
1. Data
The data sets used in the experiments are as follows (for the specific names and sources, refer to the paper):
2. Model Training and parameter tuning
- Rectified linear units (ReLU) as the activation function;
- Filter window sizes h of 3, 4, 5, with 100 feature maps for each size;
- Dropout rate of 0.5, and an L2 constraint that keeps the weight norm from exceeding 3;
- The size of the mini-batch is 50;
These parameters were chosen by grid search on the SST-2 dev set. Training uses stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012).
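Wiring the pieces of the running sketch into a model and training it with these settings might look as follows (x_train, y_train, x_dev and y_dev are placeholder arrays; the number of epochs is illustrative):

```python
from keras.models import Model

model = Model(inputs=tokens, outputs=probabilities)
# Adadelta update rule over shuffled mini-batches of size 50, as reported above.
model.compile(optimizer='adadelta', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=50, epochs=10,
          shuffle=True, validation_data=(x_dev, y_dev))
```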
3. Pre-trained word vectors
The word vectors are publicly available vectors trained with the continuous bag-of-words (CBOW) model on Google News. Vectors for out-of-vocabulary words are randomly initialized.
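A hedged sketch of loading such vectors with gensim and filling the embedding matrix (the file name is the commonly distributed GoogleNews release; word_index, mapping tokens to row ids, is an assumption):

```python
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                         binary=True)
embedding_matrix = np.random.uniform(-0.25, 0.25, (vocab_size, 300))
for word, i in word_index.items():       # word_index: token -> row id (assumed)
    if word in w2v:
        embedding_matrix[i] = w2v[word]  # known words get the pre-trained vector
    # out-of-vocabulary words keep their random initialization
```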
4. Experimental results
The experimental results are as follows:
The first four models are variants of the basic model presented above (a sketch of how they differ in the embedding setup follows the list):
- CNN-rand: all word vectors are randomly initialized and trained as parameters;
- CNN-static: word vectors come from Google's word2vec tool (CBOW model) and are not trained;
- CNN-non-static: word vectors come from Google's word2vec tool (CBOW model) but are fine-tuned during training;
- CNN-multichannel: a mix of CNN-static and CNN-non-static, i.e. two types of input channels.
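The variants mainly differ in how the Embedding layer is set up; a hedged sketch (vocab_size and embedding_matrix are the placeholders from the earlier sketches):

```python
from keras.layers import Embedding

# CNN-rand: random initialization, vectors trained as ordinary parameters
emb_rand = Embedding(vocab_size, 300, trainable=True)

# CNN-static: pre-trained word2vec vectors, frozen during training
emb_static = Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)

# CNN-non-static: pre-trained word2vec vectors, fine-tuned during training
emb_non_static = Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=True)

# CNN-multichannel: a static and a non-static copy applied to the same tokens
# and treated as two input channels (see the channel sketch further below).
```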
5. Conclusion
- CNN-static is better than CNN-rand, showing that pre-trained word vectors do give a substantial boost (the pre-trained vectors obviously exploit much larger text corpora);
- CNN-non-static is better than CNN-static in most cases, showing that appropriate fine-tuning also helps, because it brings the vectors closer to the specific task;
- CNN-multichannel performs better than the single-channel models on the smaller data sets. CNN-multichannel actually embodies a compromise: it does not let the fine-tuned vectors drift too far from their original values, while still leaving them room to change.
It is noteworthy that comparing the static and non-static vectors (see the table below) reveals some interesting phenomena (a small sketch of such a nearest-word query follows the list):
- In the original word2vec training results, the word closest to bad is good, because the two words are syntactically very similar (one can be substituted for the other without breaking the sentence). In the non-static version, the word most similar to bad is terrible: during fine-tuning the vector values shift to better fit the data set (a sentiment classification data set), so words that are close in sentiment move closer together;
- The token ! is closest to words with more emphatic expressions, such as lush, while the token , is close to connective words; both observations match our subjective intuition.
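The nearest-word comparisons above amount to cosine-similarity queries over a vector table; a minimal numpy sketch (vocab and vectors are placeholders for either the original or the fine-tuned embedding table):

```python
import numpy as np

def most_similar(query, vocab, vectors, topn=4):
    """Return the topn words whose vectors are closest (by cosine) to `query`."""
    q = vectors[vocab.index(query)]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    ranked = np.argsort(-sims)
    return [vocab[i] for i in ranked if vocab[i] != query][:topn]

# e.g. compare most_similar('bad', vocab, original_vectors)
#      with    most_similar('bad', vocab, fine_tuned_vectors)
```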
Kim Y's model is simple, yet performs well. Denny later implemented a simple version of the model in TensorFlow (see this blog post), and Ye Zhang et al. ran a large number of experiments on the model and gave tuning recommendations (see this paper).
The following summarizes the conclusions Ye Zhang et al. drew from their extensive parameter-tuning experiments based on Kim Y's model:
- Because of randomness in training, such as random weight initialization, mini-batch sampling, and the stochastic gradient descent optimization algorithm, the model's results on a data set fluctuate: accuracy can vary by up to 1.5% and AUC by up to 3.4%;
- Whether word2vec or GloVe word vectors are used has some influence on the results; which is better depends on the task itself;
- The filter size has a relatively large effect on model performance, and the filter parameters should be trainable;
- The number of feature maps also has some effect, but the training efficiency of the model has to be taken into account;
- 1-max pooling is good enough compared with other pooling methods;
- The effect of regularization is negligible.
Ye Zhang et al.'s recommendations for tuning the model are as follows:
- Using the non-static version of word2vec or GloVe vectors gives much better results than a simple one-hot representation;
- To find the optimal filter size, a line search can be used (see the sketch after this list). The filter size usually lies between 1 and 10; of course, for long sentences larger filters may also be necessary;
- The number of feature maps should be between 100 and 600;
- Try as many activation functions as possible; the experiments found that ReLU and tanh perform best;
- Simple 1-max pooling is enough; there is no need to design overly complex pooling schemes;
- When increasing the number of feature maps causes model performance to drop, consider strengthening the regularization, for example by raising the dropout probability;
- To assess the performance level of the model, repeated cross-validation is necessary to make sure its high performance is not accidental.
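As a sketch of the recommended line search over filter sizes (build_text_cnn is a hypothetical helper that builds and compiles a model for the given settings; the data arrays are placeholders):

```python
import numpy as np

best_size, best_score = None, -np.inf
for size in range(1, 11):                        # typical range is 1 to 10
    scores = []
    for run in range(5):                         # repeat to smooth out training noise
        model = build_text_cnn(filter_sizes=(size,), feature_maps=100)
        model.fit(x_train, y_train, batch_size=50, epochs=5, verbose=0)
        _, acc = model.evaluate(x_dev, y_dev, verbose=0)
        scores.append(acc)
    if np.mean(scores) > best_score:
        best_size, best_score = size, np.mean(scores)
print("best filter size:", best_size, "dev accuracy:", best_score)
```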
The appendix of the paper also includes a variety of tuning results; interested readers can refer to it.
TextCNN detailed process: the first layer is the leftmost 7 × 5 sentence matrix, where each row is a word vector of dimension 5; this can be likened to the raw pixels of an image. It is then passed through a one-dimensional convolution layer with filter_size = (2, 3, 4), each filter_size having two output channels. The third layer is a 1-max pooling layer, so that sentences of different lengths become a fixed-length representation after pooling. Finally, a fully connected softmax layer outputs the probability of each category.
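A compact sketch of exactly this toy configuration (a sentence of 7 tokens, 5-dimensional vectors, filter sizes 2, 3 and 4 with two output channels each, then 1-max pooling and a softmax over two classes; the vocabulary size is an assumption):

```python
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, concatenate, Dense
from keras.models import Model

tokens = Input(shape=(7,), dtype='int32')               # sentence of 7 words
x = Embedding(input_dim=1000, output_dim=5)(tokens)     # the 7 x 5 sentence matrix
branches = [GlobalMaxPooling1D()(                       # 1-max pooling per branch
                Conv1D(filters=2, kernel_size=fs,       # two output channels each
                       activation='relu')(x))
            for fs in (2, 3, 4)]
output = Dense(2, activation='softmax')(concatenate(branches))
toy_model = Model(tokens, output)
```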
Feature: the feature here is the word vector, which comes in static and non-static modes. Static mode uses pre-trained word vectors such as word2vec and does not update them during training; this is essentially transfer learning, and when the amount of data is small, static word vectors often give good results. Non-static mode updates the word vectors during training. The recommended approach is non-static with fine-tuning: initialize the word vectors with pre-trained word2vec vectors and adjust them during training, which speeds up convergence. Of course, with enough training data and resources, randomly initializing the word vectors directly also works.
Channels: an image can use (R, G, B) as different channels, while for text the input channels are usually different embeddings (such as word2vec or GloVe). In practice, using static word vectors and fine-tuned word vectors as different channels is also common.
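A hedged sketch of the two-channel idea, using a frozen and a fine-tuned copy of the same embedding (seq_length, vocab_size and embedding_matrix are the placeholders from the earlier sketches); for simplicity the two channels are concatenated along the feature axis and fed through a shared Conv1D, rather than implementing a true multi-channel convolution:

```python
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, concatenate

tokens = Input(shape=(seq_length,), dtype='int32')
static_ch = Embedding(vocab_size, 300, weights=[embedding_matrix],
                      trainable=False)(tokens)      # channel 1: frozen word2vec
tuned_ch = Embedding(vocab_size, 300, weights=[embedding_matrix],
                     trainable=True)(tokens)        # channel 2: fine-tuned copy
# Simplification: stack the channels along the feature axis so one shared
# Conv1D sees both, instead of a genuine two-channel convolution.
both = concatenate([static_ch, tuned_ch], axis=-1)
feature_map = Conv1D(filters=100, kernel_size=3, activation='relu')(both)
pooled = GlobalMaxPooling1D()(feature_map)
```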
One-dimensional convolution (conv-1d): an image is two-dimensional data, while the word-vector representation of text is one-dimensional, so the convolution in TextCNN is one-dimensional. A consequence is that views of different widths have to be obtained by designing filters with different filter_size values.
Pooling layer: many other papers also use CNNs for text classification, for example A Convolutional Neural Network for Modelling Sentences. Its most interesting change is in the pooling step, which uses (dynamic) k-max pooling: the pooling stage keeps the k largest values, preserving global sequence information. Take a sentiment analysis scenario, for example:
"I think the scenery of this place is pretty nice, but there are really far too many people."
Although the sentiment of the first half is positive, the text as a whole expresses a negative sentiment; k-max pooling can capture this kind of information well.
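A hedged sketch of a k-max pooling layer (assuming a TensorFlow backend; note that tf.nn.top_k returns values sorted by magnitude, whereas the dynamic k-max pooling of the cited paper keeps the k values in their original order):

```python
import tensorflow as tf
from keras.layers import Layer

class KMaxPooling(Layer):
    """Keep the k largest activations of each feature map along the time axis."""
    def __init__(self, k=3, **kwargs):
        super(KMaxPooling, self).__init__(**kwargs)
        self.k = k

    def call(self, inputs):
        # inputs: (batch, time, channels) -> move time last so top_k scans it
        swapped = tf.transpose(inputs, [0, 2, 1])
        top_k = tf.nn.top_k(swapped, k=self.k).values   # (batch, channels, k)
        return tf.transpose(top_k, [0, 2, 1])           # (batch, k, channels)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.k, input_shape[2])
```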
"Convolutional neural Networks for sentence classification" speed Reading