"Doc2vec" study notes: from Word2vec to doc2vec:an approach driven by Chinese restaurant process

Last Update:2015-05-30 Source: Internet

Author: User



Main content: Building on Google's word2vec, these notes look at how to vectorize a whole article (document), borrowing from the Chinese restaurant process, a well-known stochastic process.

Chinese restaurant process: the restaurant has an unlimited number of tables, and each table can seat an unlimited number of people. The first customer opens a table and sits down. When customer n+1 arrives, he joins one of the already occupied tables with total probability n/(n+1) (a table that already seats k people is chosen with probability k/(n+1)), and opens a new table to sit alone with probability 1/(n+1). See Wikipedia for details (http://en.wikipedia.org/wiki/Chinese_restaurant_process).
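The seating rule above can be sketched as a small simulation. This is only a minimal sketch, assuming the standard process with concentration parameter 1; the function name simulate_crp and the seed parameter are illustrative and do not come from the original article.

```python
import random

def simulate_crp(num_customers, seed=0):
    """Simulate a Chinese restaurant process and return the table sizes."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(num_customers):  # n customers are already seated
        # Draw r uniformly from {0, ..., n}; r == n happens with
        # probability 1/(n+1) and means "open a new table".
        r = rng.randrange(n + 1)
        if r == n:
            tables.append(1)
            continue
        # Otherwise r points at one of the n seated customers, so table k
        # is joined with probability tables[k] / (n + 1).
        for k, size in enumerate(tables):
            if r < size:
                tables[k] += 1
                break
            r -= size
    return tables

if __name__ == "__main__":
    print(simulate_crp(20))
```

Running it with different seeds shows the typical behaviour: a few large tables and a tail of small ones.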
The author's main ideas:
1. Google's word2vec represents a word as a vector, but how should an article made up of words be represented? The simplest way is to add up the vectors of all the words in the article, dimension by dimension, to form a new vector (taking the average or a weighted average are variants). This, however, is likely to lose semantic information about the article.
2. The author's idea is to cluster the words in the article, pick one cluster as the representative of the article's main semantic content, and use that cluster's vector (built from the word vectors in the cluster) as the vector for the article.
3. There are many clustering methods, such as k-means and hierarchical clustering. However, these methods typically depend on a geometric distance between word vectors, such as Euclidean distance. word2vec vectors are known to work well with cosine similarity; how well they behave under geometric distance measures has yet to be verified, so these methods are risky.
4. The author ends up clustering the words with a Chinese restaurant process, with the original stochastic process modified to fit the task.
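The baseline in item 1 and the cosine measure in item 3 can be written down in a few lines. This is a minimal sketch in plain Python; the helper names and the toy two-dimensional vectors are made up for illustration and are not real word2vec output.

```python
import math

def sum_vectors(vectors):
    """Add vectors dimension by dimension."""
    return [sum(dims) for dims in zip(*vectors)]

def mean_vector(vectors):
    """Average vectors dimension by dimension."""
    return [x / len(vectors) for x in sum_vectors(vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    # Toy "word vectors" for a three-word document (hypothetical values).
    doc = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
    print(sum_vectors(doc), mean_vector(doc))
    print(cosine_similarity(doc[0], doc[1]))
```

Note that the sum and the mean point in the same direction, so they are indistinguishable under cosine similarity; they differ only in length.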
The author's procedure:
Step 1: When the first word of the article arrives, create a cluster containing this word and remember its word vector. (What is actually stored is the sum, dimension by dimension, of the vectors of all the words in the cluster. If you think of the cluster as a doc, this sum is how that doc is represented.)
Step 2: When word n+1 arrives, suppose the previous n words have been grouped into C clusters. Compute the cosine similarity between the current word's vector and each of the C clusters (as described in Step 1, a cluster's vector is the sum of the vectors of all the words in it), and record the most similar cluster C_i and the corresponding similarity sim_i. From here there are two possible approaches:
Step 3-1: With probability 1/(n+1), create a new cluster for word n+1; otherwise merge word n+1 into cluster C_i and update the representation vector of C_i. This approach tends to form more clusters, each with fewer words. In the end the author did not adopt it and used the more complex approach below instead.
Step 3-2: Check whether sim_i > 1/(n+1). If so, merge word n+1 into cluster C_i and update the representation vector of C_i. Otherwise, create a new cluster with probability 1/(n+1), and with probability n/(n+1) still merge word n+1 into C_i. According to the author, this approach produces fewer clusters, each with more words, and it is the method the author finally adopted.
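The steps above can be sketched roughly as follows. This is a reading of the notes, not the author's actual code: the function crp_cluster, its parameters, and the flag that switches between Step 3-1 and Step 3-2 are assumptions made for illustration.

```python
import math
import random

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def crp_cluster(word_vectors, use_similarity_threshold=True, seed=0):
    """Cluster word vectors one at a time, CRP style.

    use_similarity_threshold=True follows Step 3-2, False follows Step 3-1.
    Returns (assignments, cluster_sums).
    """
    rng = random.Random(seed)
    cluster_sums = []  # per-cluster sum of member vectors (Step 1)
    assignments = []   # cluster index chosen for each word
    for n, vec in enumerate(word_vectors):  # n words already clustered
        if not cluster_sums:
            cluster_sums.append(list(vec))
            assignments.append(0)
            continue
        # Step 2: find the most similar existing cluster.
        sims = [cosine_similarity(vec, s) for s in cluster_sums]
        best = max(range(len(sims)), key=sims.__getitem__)
        p_new = 1.0 / (n + 1)
        if use_similarity_threshold and sims[best] > p_new:
            open_new = False                  # Step 3-2: similar enough, merge
        else:
            open_new = rng.random() < p_new   # new cluster with prob. 1/(n+1)
        if open_new:
            cluster_sums.append(list(vec))
            assignments.append(len(cluster_sums) - 1)
        else:
            cluster_sums[best] = [a + b for a, b in zip(cluster_sums[best], vec)]
            assignments.append(best)
    return assignments, cluster_sums

if __name__ == "__main__":
    toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]  # hypothetical vectors
    print(crp_cluster(toy))
```

Because 1/(n+1) shrinks quickly as n grows, the threshold in Step 3-2 is very permissive for later words, which is consistent with the observation that this variant yields fewer, larger clusters.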
How is the cluster that represents the doc chosen? In the article, the author simply picks it by eye (!). The author does, however, offer some ideas for automatic selection: 1. IDF: the larger the IDF of the words in a cluster, the more strongly that cluster's words are associated with the article. 2. POS: ... 3. Use the same method to generate a representation vector for each cluster in the doc and a representation vector for the title, then compute the association between each cluster and the title; the stronger the association, the better the cluster represents the article.
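As an illustration of the IDF idea only, here is a minimal sketch. The article does not say how the IDF values of a cluster's words should be combined, so scoring a cluster by its mean IDF is an assumption, as are the function names and the toy corpus.

```python
import math

def inverse_document_frequency(corpus):
    """corpus: list of documents, each a list of tokens. Returns {word: idf}."""
    num_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(num_docs / df) for w, df in doc_freq.items()}

def pick_representative_cluster(clusters, idf):
    """clusters: list of word lists. Returns the index of the cluster with
    the highest mean IDF (mean is an assumed aggregation, not from the article)."""
    def score(words):
        return sum(idf.get(w, 0.0) for w in words) / len(words)
    return max(range(len(clusters)), key=lambda i: score(clusters[i]))

if __name__ == "__main__":
    corpus = [["the", "cat"], ["the", "dog"], ["the", "vector"]]  # toy corpus
    idf = inverse_document_frequency(corpus)
    print(pick_representative_cluster([["the"], ["cat", "vector"]], idf))
```

A cluster made of very common words scores close to zero here, so it is unlikely to be picked as the representative.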

Attached: link to the article "From word2vec to doc2vec: an approach driven by Chinese restaurant process": http://eng.kifi.com/from-word2vec-to-doc2vec-an-approach-driven-by-chinese-restaurant-process/

If reproduced please specify the source: http://blog.csdn.net/xceman1997/article/details/46277129

Finish.

