Main content: Building on Google's word2vec, the article considers how to vectorize a whole article (document), borrowing an idea from the Chinese restaurant process, a stochastic process.
Chinese restaurant process: the restaurant has an unlimited number of tables, and each table can seat an unlimited number of customers. The first customer opens a table and sits down. When customer n+1 arrives, n customers are already seated: with probability 1/(n+1) the newcomer opens a new table and sits alone, and otherwise (total probability n/(n+1)) joins one of the occupied tables, choosing a table that already seats k customers with probability k/(n+1). See Wikipedia for details (http://en.wikipedia.org/wiki/Chinese_restaurant_process).
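The seating rule above can be sketched in a few lines of Python. This is only a minimal illustration of the standard process with no extra parameters; the function name `chinese_restaurant_process` and the `seed` argument are my own choices for the example.

```python
import random

def chinese_restaurant_process(num_customers, seed=None):
    """Simulate the seating rule and return the size of each table."""
    rng = random.Random(seed)
    table_sizes = []
    for n in range(num_customers):  # n customers are already seated
        # Open a new table with probability 1/(n+1).
        # For the first customer (n == 0) this always happens.
        if rng.random() < 1.0 / (n + 1):
            table_sizes.append(1)
        else:
            # Otherwise join an occupied table, chosen with probability
            # proportional to the number of customers already sitting there.
            k = rng.choices(range(len(table_sizes)), weights=table_sizes)[0]
            table_sizes[k] += 1
    return table_sizes

if __name__ == "__main__":
    print(chinese_restaurant_process(20, seed=1))
```

Large tables tend to attract even more customers, which is why the process is a natural fit for clustering when the number of clusters is not known in advance.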
The author's main ideas:
1. Google's word2vec can express a word as a vector, so how do you express an article made up of words? One of the simplest ways is to add up, dimension by dimension, the vectors of all the words in the article to form a new vector. (You can also take the average, or a weighted average.) But this probably loses the semantic information of the article.
2. The author's idea is to cluster the words in the article and pick one cluster as the representative that expresses the article's main semantic content. That cluster is itself expressed as a word vector, so the final representation of the article is a vector.
3. There are many clustering methods, such as k-means, hierarchical clustering, and so on. However, these methods depend on a "geometric distance" between word vectors, such as Euclidean distance. word2vec vectors work well with cosine similarity; how well they behave under a geometric distance has yet to be verified. So the methods above are risky.
4. In the end the author uses the Chinese restaurant process to do the word clustering, with the original stochastic process modified to fit the need.
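To make points 1 and 3 concrete, here is a small sketch of the sum and average representations and of the two measures. The two-dimensional vectors are made-up toy values, not real word2vec output, and all function names are hypothetical helpers for this example.

```python
import math

def add_vectors(vectors):
    """Dimension-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def mean_vector(vectors):
    """Dimension-wise average of a list of equal-length vectors."""
    return [x / len(vectors) for x in add_vectors(vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

if __name__ == "__main__":
    # Toy "word vectors" for a three-word article (made-up numbers).
    words = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
    doc_sum = add_vectors(words)
    doc_mean = mean_vector(words)
    # The sum and the mean point in the same direction, so cosine
    # similarity treats them as identical; Euclidean distance does not.
    print(cosine_similarity(doc_sum, doc_mean))
    print(euclidean_distance(doc_sum, doc_mean))
```

Cosine similarity ignores vector length, so a summed vector can be compared with a single word vector directly. A distance-based method would see the same summed vector drift further away as more words are added.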
The author's practice:
Step 1: On the first word of the article, create a cluster containing that word and record the cluster's vector, which at this point is just the word's vector. (More precisely, what is recorded is the dimension-wise sum of the vectors of all the words in the cluster. If you think of the cluster as a doc, that sum is how the doc is expressed.)
Step 2: When word n+1 arrives, suppose the previous n words have been grouped into C clusters. Compute the cosine similarity between the current word and each of the C clusters (as described in Step 1, each cluster's vector is the sum of the vectors of all the words in it), and record the most similar cluster C_i and the corresponding similarity sim_i. From here there are two different approaches.
Step 3-1: With probability 1/(n+1), create a new cluster for the word; otherwise merge word n+1 into cluster C_i and update the representation vector of C_i. This approach forms more clusters, each with fewer words. In the end the author did not adopt it and used the more complex approach below instead.
Step 3-2: Check whether sim_i > 1/(n+1). If so, merge word n+1 into cluster C_i and update the representation vector of C_i. Otherwise, create a new cluster with probability 1/(n+1), and with probability n/(n+1) still merge word n+1 into C_i. According to the author, this approach produces fewer clusters, each with more words, and it is the method the author finally adopted.
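The steps can be sketched as follows. This is my own reading of the notes rather than the original author's code: n is taken to be the number of words already processed, the function name `crp_cluster` and the flag `use_similarity_rule` are hypothetical, and the input is assumed to be a list of plain Python lists standing in for word2vec vectors.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def crp_cluster(word_vectors, use_similarity_rule=True, seed=None):
    """Cluster word vectors in order of appearance.

    use_similarity_rule=False follows Step 3-1, True follows Step 3-2.
    Returns (members, sums): word indices per cluster and the summed
    vector of each cluster.
    """
    rng = random.Random(seed)
    members, sums = [], []
    for n, vec in enumerate(word_vectors):  # n words already processed
        if not sums:
            # Step 1: the first word opens the first cluster.
            members.append([n])
            sums.append(list(vec))
            continue
        # Step 2: find the most similar existing cluster.
        sims = [cosine_similarity(vec, s) for s in sums]
        best = max(range(len(sims)), key=sims.__getitem__)
        p_new = 1.0 / (n + 1)
        if use_similarity_rule and sims[best] > p_new:
            open_new = False                  # Step 3-2: similar enough, merge
        else:
            open_new = rng.random() < p_new   # new cluster with prob. 1/(n+1)
        if open_new:
            members.append([n])
            sums.append(list(vec))
        else:
            members[best].append(n)
            sums[best] = [s + x for s, x in zip(sums[best], vec)]
    return members, sums

if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # made-up vectors
    print(crp_cluster(toy, seed=0))
```

Because 1/(n+1) shrinks as more words arrive, both the similarity threshold and the chance of opening a new cluster drop over the course of the article, so later words are more and more likely to be absorbed into existing clusters.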
How do you choose the cluster that represents the doc? In the article, the author simply picks it by eye. Yikes! However, the author also offers some ideas for automatic selection:
1. IDF: the greater the IDF of the words in a cluster, the stronger the association between that cluster's words and the article.
2. POS: ...
3. Title: use the same method to generate a representation vector for each cluster in the doc and a representation vector for the title, then compute the association between each cluster and the title. The stronger the association, the better the cluster represents the article.
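Ideas 1 and 3 could look roughly like the sketch below. The notes do not say how the IDF values of a cluster should be combined or how "association" with the title is measured, so the average IDF and the cosine similarity used here are my assumptions, and the names `pick_by_title` and `mean_idf` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_by_title(cluster_vectors, title_word_vectors):
    """Return (index of the cluster closest to the title, all scores).

    The title vector is the dimension-wise sum of its word vectors,
    the same representation used for the clusters.
    """
    title_vector = [sum(column) for column in zip(*title_word_vectors)]
    scores = [cosine_similarity(c, title_vector) for c in cluster_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def mean_idf(cluster_words, idf):
    """Average IDF of a cluster's words (assumption: unknown words count as 0.0)."""
    return sum(idf.get(w, 0.0) for w in cluster_words) / len(cluster_words)

if __name__ == "__main__":
    clusters = [[2.0, 0.1], [0.2, 3.0]]   # made-up summed cluster vectors
    title = [[0.0, 1.0], [0.1, 0.9]]      # made-up title word vectors
    print(pick_by_title(clusters, title))
    print(mean_idf(["apple", "the"], {"apple": 4.0, "the": 0.1}))
```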
Attached: link to the article "From Word2vec to doc2vec: an approach driven by Chinese restaurant process": http://eng.kifi.com/from-word2vec-to-doc2vec-an-approach-driven-by-chinese-restaurant-process/
If you repost this, please credit the source: http://blog.csdn.net/xceman1997/article/details/46277129
Finish.
"Doc2vec" study notes: from Word2vec to doc2vec:an approach driven by Chinese restaurant process