Word2vec is an open-source toolkit released by Google in 2013 that turns words into vectors. Its principles are explained in detail in "A Detailed Explanation of the Mathematical Principles in Word2vec (I): Contents and Preface".
In simple terms:
To perform sentiment analysis on an article or a passage of text, there are several approaches:
1. Simply score words as positive or negative sentiment, e.g. "good" counts as +1 and "bad" as -1
2. Use a bag-of-words model, in which each word is treated as independent; the drawback is that contextual links are ignored
3. Use Word2vec, which takes context into account
The last method compresses the data scale while capturing contextual information. Word2vec actually comprises two different methods: Continuous Bag of Words (CBOW) and Skip-gram. The goal of CBOW is to predict the probability of the current word given its context; Skip-gram does just the opposite, predicting the probability of the context given the current word. Both methods use artificial neural networks as their classification algorithm. Initially, each word is assigned a random N-dimensional vector; after training with the CBOW or Skip-gram method, the algorithm obtains an optimal vector for each word.
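A minimal sketch of how the two methods are selected in practice, using the gensim library (assuming gensim 4.x, where the dimension parameter is vector_size; the toy sentences are illustrative, not from the original article):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of already-segmented tokens.
sentences = [
    ["the", "movie", "was", "really", "good"],
    ["the", "movie", "was", "really", "bad"],
    ["i", "liked", "the", "good", "acting"],
]

# sg=0 selects CBOW: predict the current word from its context.
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# sg=1 selects Skip-gram: predict the context from the current word.
sg_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Each word now has an N-dimensional vector learned by the network.
print(cbow_model.wv["good"][:5])
print(sg_model.wv.most_similar("good", topn=3))
```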
Reference
Source document: <http://www.open-open.com/lib/view/open1444351655682.html>
Among its examples is a sentiment analysis of emoji tweets: 40,000 tweets are labeled as either optimistic or pessimistic, converted into 300-dimensional vectors with Word2vec, and a logistic regression classifier is trained on an 80/20 train/test split.
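A hedged sketch of that kind of pipeline (the tokenized tweets and labels below are toy placeholders, and averaging word vectors per tweet is one common choice, not necessarily the exact method used in the source document):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tweets = [["great", "day"], ["awful", "weather"],
          ["really", "great", "movie"], ["really", "awful", "service"]]
labels = [1, 0, 1, 0]  # 1 = optimistic, 0 = pessimistic

# Train 300-dimensional word vectors on the tweet tokens.
w2v = Word2Vec(tweets, vector_size=300, min_count=1, sg=1)

def tweet_vector(tokens):
    """Average the word vectors of a tweet into one document vector."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.array([tweet_vector(t) for t in tweets])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

# Logistic regression on the 80/20 split.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```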
The general workflow for using Word2vec is therefore as follows. First, gather a large volume of text, such as Baidu Baike, Wikipedia, or news articles, and write it into a plain-text (TXT) file.
Second, use a word segmentation tool to segment the text.
Third, train Word2vec on the segmentation result to learn word vectors in an unsupervised way (see the sketch below).
Therefore, the larger the text corpus, the more reasonable and interpretable the resulting word vectors will be.
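A minimal sketch of these three steps, assuming jieba for Chinese word segmentation and gensim for the unsupervised training; the file name corpus.txt and the query word are placeholders, not paths or data from the original article:

```python
import jieba
from gensim.models import Word2Vec

# Step 1: a large plain-text corpus, one document or sentence per line.
# Step 2: segment each line into words.
with open("corpus.txt", encoding="utf-8") as f:
    segmented = [list(jieba.cut(line.strip())) for line in f if line.strip()]

# Step 3: train word vectors on the segmented text (unsupervised).
model = Word2Vec(segmented, vector_size=200, window=5, min_count=5, workers=4)
model.save("word2vec.model")

# Example query; the word must appear in the training vocabulary.
print(model.wv.most_similar("北京", topn=5))
```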
Examples:
1. Use the ANSJ word segmentation tool and Word2vec to train on news data:
http://www.ppvke.com/Blog/archives/44422
2. Take a shortcut by using the Chinese Wikipedia text:
A good write-up on training Chinese word vectors: http://www.cnblogs.com/Darwin2000/p/5786984.html
Another:
http://download.csdn.net/download/eastmount/9434889