Source: http://mp.weixin.qq.com/s?__biz=MzA3MDg0MjgxNQ==&mid=2652391534&idx=1&sn=901d5e55971349697e023f196037675d&chksm=84da48beb3adc1a886e2a0d9d45ced1e8d89d4add88a9b6595f21784fcc461938b19a7385684&mpshare=1&scene=23&srcid=0904tm0ogmvdf8vkgcmlvb7m#rd
Traditional text sentiment classification is simple, easy to understand and quite stable, but it has two limitations that are hard to overcome:
First, accuracy. The traditional approach is passable, and of course sufficient for ordinary applications, but there is no good way to push its accuracy much further.
Second, background knowledge. The traditional approach requires a good sentiment dictionary to be built in advance, and this step usually needs manual work to guarantee accuracy. In other words, the person doing it must be not only a data mining expert but also something of a linguist, and this dependence on background knowledge hinders progress in natural language processing.
Thankfully, deep learning solves this problem, at least to a large extent: it lets us model real problems in a domain with almost zero background knowledge. This article takes the text sentiment classification discussed in the previous article as an example and gives a brief account of a deep learning model for it; points already covered in detail in the previous article are not repeated.
Deep learning and natural language processing
In recent years, deep learning algorithms have been applied to natural language processing and have achieved better results than traditional models. For example, Bengio and other researchers built a neural probabilistic language model on deep learning ideas and went on to train language models with various deep neural networks on large-scale English corpora, obtaining better semantic representations and completing common natural language processing tasks such as syntactic analysis and sentiment classification. This offers a new approach to natural language processing in the era of big data.
In the author's own tests, sentiment analysis models based on deep neural networks often reach an accuracy above 95%, which shows the appeal and power of deep learning algorithms.
For further reading, please refer to the following literature:
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. 2003.
[2] A new language model: http://blog.sciencenet.cn/blog-795431-647334.html
[3] Deep Learning study notes: http://blog.csdn.net/zouxy09/article/details/8775360
[4] Deep Learning: http://deeplearning.net
[5] On automatic Chinese word segmentation and semantic recognition: http://www.matrix67.com/blog/archives/4212
[6] Applications of deep learning to Chinese word segmentation and POS tagging: http://blog.csdn.net/itplus/article/details/13616045
The representation of language
The most important step in modelling is feature extraction, and natural language processing is no exception. A core question in natural language processing is how to represent a sentence effectively in numerical form; once that is done, classifying sentences is no longer a problem. An obvious first idea is to give each word a unique number 1, 2, 3, 4, ... and then treat a sentence as a list of those numbers. For example, if 1, 2, 3, 4 stand for "I", "you", "love" and "hate", then "I love you" becomes [1, 3, 2] and "I hate you" becomes [1, 4, 2]. This looks workable but is in fact very problematic: a stable model would treat 3 as close to 4, so [1, 3, 2] and [1, 4, 2] should receive similar classification results, yet under our numbering 3 and 4 stand for words with exactly opposite meanings, and their classifications certainly should not be similar. Such an encoding is therefore unlikely to give good results.
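To make the problem concrete, here is a toy sketch (not from the original article) of the numbering scheme just described; a model that reads these ids as magnitudes sees the two opposite sentences as differing by just 1 in a single position.

word_to_id = {'I': 1, 'you': 2, 'love': 3, 'hate': 4}  # the hypothetical numbering above

i_love_you = [word_to_id[w] for w in ['I', 'love', 'you']]  # [1, 3, 2]
i_hate_you = [word_to_id[w] for w in ['I', 'hate', 'you']]  # [1, 4, 2]

print(i_love_you, i_hate_you)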
Readers may suggest grouping words with similar meanings together, that is, giving them similar numbers. Indeed, if there were a way to assign close numbers to semantically close words, it really would improve model accuracy. The problem is that giving each word one unique number, while requiring similar words to get similar numbers, implicitly assumes that semantics is one-dimensional. That is not the case: semantics should be multidimensional.
For example, from "home" some people will think of the synonym "family", and from "family" of "relatives"; these are all similar words. But from "home" others will think of "Earth", and from "Earth" of "Mars". In other words, both "relatives" and "Mars" can be regarded as second-order neighbours of "home", yet "relatives" and "Mars" themselves have no obvious connection. Likewise, in terms of meaning, "university" and "comfort" can also be regarded as second-order neighbours of "home". Clearly, with only a single unique number per word it is hard to place all of these words correctly.
Word2vec: here come the high dimensions
As the discussion above shows, the meaning of a word diverges in several directions rather than along a single line, so a single unique number is not particularly suitable. What about several numbers, then? In other words, map each word to a multidimensional vector. Yes, that is exactly the right idea.
Why are multidimensional vectors feasible? First, a multidimensional vector solves the problem of meanings diverging in many directions: even a two-dimensional vector can point in any direction through a full 360 degrees, let alone the higher dimensions used in practice (usually several hundred). Second, there is a more practical benefit: multidimensional vectors let us characterise words using only a small range of values. In Chinese, for instance, there are hundreds of thousands of words; giving each word a unique number means the numbers run from 1 up into the hundreds of thousands, a range so wide that model stability is hard to guarantee. With a high-dimensional vector, say 20 dimensions, values of just 0 and 1 are already enough to distinguish about a million (2^20) words, and such small variations help keep the model stable.
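The counting argument can be checked in a couple of lines. The sketch below only illustrates the arithmetic (it is not how word2vec actually represents words): 20 binary dimensions give 2^20 distinct codes.

n_dims = 20
print(2 ** n_dims)  # 1048576, about one million distinct codes

def to_binary_vector(word_id, n_dims=20):
    # encode a word index as a 20-dimensional 0/1 vector (illustration only)
    return [(word_id >> k) & 1 for k in range(n_dims)]

print(to_binary_vector(7))  # [1, 1, 1, 0, 0, ...]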
Having said all that, I still have not stated the key idea. The question now is how to map words to the right high-dimensional vectors, and the point is to do it with no linguistic background at all. (In other words, if I want to handle an English-language task, I do not need to learn English first; I only need to collect a large number of English articles. How convenient.) We cannot, and need not, go further into the underlying principles here; it is enough to introduce the well-known open-source tool from Google built on this idea: word2vec.
In simple terms, word2vec does exactly what we want: it represents words with high-dimensional vectors of real numbers (word vectors, word embeddings, not restricted to integers) and places words with similar meanings at nearby positions. All we need is a large corpus of text; with it we can train the model and obtain the word vectors. The benefits of word vectors were mentioned earlier, or rather, word vectors were introduced precisely to solve the problems mentioned earlier. A further advantage is that word vectors cluster easily: Euclidean distance or cosine similarity can be used to find words with similar meanings. This essentially solves the problem of synonymy, that is, one meaning expressed by many words (unfortunately, there still seems to be no good way to handle polysemy).
For the mathematical principles behind word2vec, readers can refer to this series of articles:
http://blog.csdn.net/itplus/article/details/37969519
As for implementations of word2vec, Google provides the official C source code, which readers can compile themselves. Python's Gensim library also ships a ready-made word2vec as a submodule (in fact, this version appears to be more powerful than the official one).
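As a quick illustration of the Gensim route, here is a minimal sketch, assuming a hypothetical file corpus_segmented.txt containing one pre-segmented sentence per line and the Gensim API of that period; it is not the code used in this article.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# hypothetical corpus: one pre-segmented sentence per line, tokens separated by spaces
sentences = LineSentence('corpus_segmented.txt')

model = Word2Vec(sentences, size=100, min_count=5)  # 100-dimensional word vectors
# (in gensim >= 4.0 the argument is vector_size and queries go through model.wv)

print(model.most_similar(u'家'))  # words closest to "home" by cosine similarity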
Representing a sentence: sentence vectors
The next problem to solve is this: we have segmented the text into words and converted each word into a high-dimensional vector, so a sentence now corresponds to a set of word vectors, that is, a matrix. This is analogous to image processing, where a digitised image corresponds to a matrix of pixels. But a model's input generally accepts only one-dimensional features, so what do we do? A fairly simple idea is to flatten the matrix, stringing the word vectors one after another into a single longer vector. This works, but it pushes the input dimension up to thousands or even tens of thousands, which is hard to handle in practice. (Even if tens of thousands of dimensions were no problem for today's computers, a 1000x1000 image would flatten to as much as a million dimensions.)
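A toy sketch of the flattening idea (the sizes are illustrative, not taken from the article's model): a 50-word sentence with 100-dimensional word vectors becomes one 5000-dimensional input.

import numpy as np

sentence_matrix = np.random.rand(50, 100)  # 50 words, each a 100-dimensional word vector (dummy values)
flattened = sentence_matrix.flatten()      # one long vector of 50 * 100 = 5000 dimensions
print(flattened.shape)                     # (5000,)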
In fact, image processing already has a mature answer to this: convolutional neural networks (CNNs), a kind of neural network designed specifically for matrix inputs, which encode a matrix-shaped input into a lower-dimensional one-dimensional vector while retaining most of the useful information. Convolutional neural networks can also be transplanted directly to natural language processing, and they work quite well, especially for text sentiment classification; see, for example, "Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts". But sentences and images are governed by different principles, and applying the image toolkit directly to language, although it helps a little, always feels neither fish nor fowl. So this is not the mainstream approach in natural language processing.
In natural language processing, the commonly used tool is the recurrent neural network (RNN). Its role is the same as a convolutional network's: it encodes a matrix-shaped input into a lower-dimensional one-dimensional vector while retaining most of the useful information. The difference is that convolutional networks focus on a global, fuzzy perception (much as we grasp the content of a picture as a whole rather than examining individual pixels), whereas RNNs focus on reconstruction from adjacent positions. This suggests that RNNs are more convincing for language tasks: language is built from adjacency, with neighbouring words forming phrases and neighbouring phrases forming sentences, so the information at adjacent positions needs to be integrated, or reconstructed, effectively.
When it comes to classifying models, there really is no end to it. Within the RNN family alone there are many variants, such as plain RNNs, GRU, LSTM and so on; readers can refer to Keras's official documentation at http://keras.io/models/. Keras is a deep learning library for Python that provides a large number of deep learning models, and its official documentation doubles as a help tutorial and a model list: it implements essentially all of the currently popular deep learning models.
Build LSTM Model
After so much talk, it is time for some real work. We now build a deep learning model for text sentiment classification based on LSTM (Long Short-Term Memory). Its structure is straightforward: an Embedding layer, an LSTM layer, Dropout, and a Dense output layer with a sigmoid activation, as the code at the end of the article shows.
The model structure is very simple, nothing complicated, and implementing it is just as easy: Keras provides the algorithm ready-made.
Now let's talk about two interesting steps.
The first step is collecting and labelling the corpus. Note that our model is trained with supervision (or at least semi-supervision), so we need to collect sentences that have already been labelled by class, and the more the better. For Chinese text sentiment classification this step is not easy, because labelled Chinese data is scarce. To train the model, the author pieced together, through various channels (some found and downloaded online, some paid for from a data marketplace), more than 20,000 labelled Chinese sentences covering six domains. (Shared at the end of the article.)
The second step is choosing the model's decision threshold. The trained model actually outputs a continuous real number in the interval [0, 1], and the program defaults to a threshold of 0.5: results above 0.5 are judged positive and results below 0.5 negative. In many cases this default is not the best choice. When we studied how different thresholds affect the true positive and true negative rates, we found that the curves change abruptly in the interval (0.391, 0.394).
Although in absolute terms the value only drops from 0.99 to 0.97, which is not a large change, the rate of change is very large. Normally the curve changes smoothly; an abrupt change means something unusual is going on, and the cause of that anomaly is very hard to track down. In other words, there is an unstable region, and predictions falling inside it are in fact unreliable. To be safe, we therefore discard this interval: only results greater than 0.394 are judged positive, results less than 0.391 are judged negative, and results between 0.391 and 0.394 are left undetermined. Experiments show that this treatment improves the model's accuracy in application.
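A minimal sketch of this decision rule (the thresholds are the ones quoted above; scores stands for the model's raw sigmoid outputs and is not part of the original code):

def decide(scores, low=0.391, high=0.394):
    # map raw sigmoid outputs to a label, leaving the unstable interval undecided
    labels = []
    for s in scores:
        if s > high:
            labels.append('positive')
        elif s < low:
            labels.append('negative')
        else:
            labels.append('undetermined')
    return labels

print(decide([0.10, 0.392, 0.80]))  # ['negative', 'undetermined', 'positive']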
A brief summary
This article has grown long, and it only sketches the ideas behind applying deep learning to text sentiment classification and their practical use; much of it is in general terms. I did not set out to write a deep learning tutorial, only to point out the key steps, or at least the ones I consider critical. There are many good tutorials on deep learning; reading the English papers is best, and among Chinese sources the blog http://blog.csdn.net/itplus is quite good, so I will not presume to add to it here.
Below are my corpus and code. Readers may wonder why I share these "private assets". The reason is simple: this is not my profession. Data mining is just a hobby for me, a hobby in mathematics and Python, so I do not have to worry about others getting ahead of me.
Corpus download: sentiment.zip
http://kexue.fm/usr/uploads/2015/08/646864264.zip
Collected comment data: sum.zip
http://kexue.fm/usr/uploads/2015/09/829078856.zip
Code for building the LSTM text sentiment classifier:
# Note: this is the author's original script and it targets the Keras API of 2015;
# interfaces such as nb_epoch, class_mode and LSTM(input_dim, output_dim)
# have changed in later Keras versions.
from __future__ import absolute_import  # use Python 3.x behaviour
from __future__ import print_function

import pandas as pd  # import pandas
import numpy as np  # import numpy
import jieba  # import the jieba word segmenter

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU

neg = pd.read_excel('neg.xls', header=None, index=None)
pos = pd.read_excel('pos.xls', header=None, index=None)  # finished reading the training corpus
pos['mark'] = 1
neg['mark'] = 0  # attach labels to the training corpus
pn = pd.concat([pos, neg], ignore_index=True)  # merge the corpora
neglen = len(neg)
poslen = len(pos)  # count the corpus sizes

cw = lambda x: list(jieba.cut(x))  # define the word segmentation function
pn['words'] = pn[0].apply(cw)

comment = pd.read_excel('sum.xls')  # read in the comment data
#comment = pd.read_csv('a.csv', encoding='utf-8')
comment = comment[comment['rateContent'].notnull()]  # keep only non-empty comments
comment['words'] = comment['rateContent'].apply(cw)  # segment the comments

d2v_train = pd.concat([pn['words'], comment['words']], ignore_index=True)

w = []  # gather all the words together
for i in d2v_train:
    w.extend(i)

dict = pd.DataFrame(pd.Series(w).value_counts())  # count how often each word appears
del w, d2v_train
dict['id'] = list(range(1, len(dict) + 1))

get_sent = lambda x: list(dict['id'][x])  # map each word to its integer id
pn['sent'] = pn['words'].apply(get_sent)  # this step is slow

maxlen = 50

print("Pad sequences (samples x time)")
pn['sent'] = list(sequence.pad_sequences(pn['sent'], maxlen=maxlen))

x = np.array(list(pn['sent']))[::2]  # training set
y = np.array(list(pn['mark']))[::2]
xt = np.array(list(pn['sent']))[1::2]  # test set
yt = np.array(list(pn['mark']))[1::2]
xa = np.array(list(pn['sent']))  # full set
ya = np.array(list(pn['mark']))

print('Build model...')
model = Sequential()
model.add(Embedding(len(dict) + 1, 256))
model.add(LSTM(256, 128))  # try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(128, 1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")

model.fit(x, y, batch_size=16, nb_epoch=10)  # training takes several hours

classes = model.predict_classes(xt)
acc = np_utils.accuracy(classes, yt)
print('Test accuracy:', acc)