1. Introduction to "sentence vectors"
Word2vec provides high-quality word vectors and performs well in some tasks.
For how word2vec works, refer to the following papers:
https://arxiv.org/pdf/1310.4546.pdf
https://arxiv.org/pdf/1301.3781.pdf
For how to train word2vec with the third-party library gensim, refer to this blog post:
http://blog.csdn.net/john_xyz/article/details/54706807
Although word2vec provides high-quality word vectors, there is still no effective way to combine them into a high-quality document vector. How can a sentence, paragraph, or document be projected into a vector space in a way that preserves its semantics? In the past, people often used the following methods:
Bag of words
LDA
Average word vectors
TF-IDF-weighted word vectors
Bag of words has the following disadvantages: 1. word order is not taken into account; 2. the semantic information of words is ignored. As a result, it performs poorly on short texts and only moderately well on long texts, and it is usually used as a baseline in research.
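To make the word-order problem concrete, here is a minimal sketch of a bag-of-words count vector. The `bag_of_words` helper and the four-word vocabulary are made up for this illustration and are not taken from any library.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Return a count vector for `sentence` over a fixed `vocabulary`."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Toy vocabulary, chosen only for this illustration.
vocabulary = ["dog", "bites", "man", "cat"]

v1 = bag_of_words("dog bites man", vocabulary)
v2 = bag_of_words("man bites dog", vocabulary)

print(v1)        # [1, 1, 1, 0]
print(v1 == v2)  # True: the two sentences get the same vector
```

The two sentences mean different things, yet their vectors are identical, and "dog" and "cat" occupy unrelated dimensions even though they are semantically close.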
Average word vectors simply takes the mean of all the word vectors in a sentence. It is a simple and effective method, but its disadvantage is that word order is not taken into account.
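The averaging step can be sketched in a few lines. The 3-dimensional `embeddings` table below is hypothetical toy data standing in for trained word2vec vectors, and `average_word_vectors` is a helper name I made up for the example.

```python
def average_word_vectors(tokens, embeddings):
    """Average the vectors of all tokens that have an entry in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has a known word vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings; trained word2vec vectors are much
# higher-dimensional.
embeddings = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 0.0, 0.0],
}

s1 = average_word_vectors(["dog", "bites", "man"], embeddings)
s2 = average_word_vectors(["man", "bites", "dog"], embeddings)

print(s1)        # [1.0, 1.0, 1.0]
print(s1 == s2)  # True: reordering the words does not change the mean
```

Out-of-vocabulary tokens are skipped here; that is one possible design choice, not the only one.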
TF-IDF-weighted word vectors is a common method for computing a sentence embedding: it takes the sum of all the word vectors in a sentence, weighted by each word's TF-IDF weight. Compared with simply taking the mean of all the word vectors, the TF-IDF weighting gives the more important words in a sentence a larger share. Its disadvantage is again that word order is not taken into account.
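A minimal sketch of the TF-IDF weighting follows, under a few assumptions of mine: IDF is the unsmoothed variant log(N / df), the weighted sum is divided by the total weight so the result is a weighted average, and the corpus, the 2-dimensional embeddings, and both helper names are toy values invented for the example.

```python
import math
from collections import Counter


def inverse_document_frequency(corpus):
    """Map each word to log(N / df); df = number of documents containing it."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_weighted_vector(tokens, embeddings, idf):
    """TF-IDF-weighted average of the word vectors in one sentence."""
    term_freq = Counter(t for t in tokens if t in embeddings and t in idf)
    weights = {t: (n / len(tokens)) * idf[t] for t, n in term_freq.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all TF-IDF weights are zero for this sentence")
    dim = len(next(iter(embeddings.values())))
    return [
        sum(w * embeddings[t][i] for t, w in weights.items()) / total
        for i in range(dim)
    ]


# Toy corpus: "the" occurs in every document, so its IDF is log(3/3) = 0.
corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]
# Hypothetical 2-dimensional embeddings, for illustration only.
embeddings = {"the": [5.0, 5.0], "dog": [1.0, 0.0], "barks": [0.0, 1.0]}

idf = inverse_document_frequency(corpus)
sentence_vector = tfidf_weighted_vector(["the", "dog", "barks"], embeddings, idf)
print(sentence_vector)
```

A plain mean of the three vectors would be [2.0, 2.0], dominated by "the". With the TF-IDF weights, "the" contributes nothing, and the rarer word "barks" pulls the result further toward its own vector than "dog" does.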
The LDA model calculates the topic distribution of a document or sentence. It is often used for text classification tasks. I will write an article later to introduce the essential differences between the LDA model and doc2vec.
---------------------
Johnson0722
Source: csdn
Original: 79208564
Copyright Disclaimer: This article is an original article by the blogger. For more information, see the blog post link!
How to use vectors to represent a document or sentence