Word2vec: A Deep Representation Model for Text


Introduction

Word2vec is an efficient tool released by Google in 2013 for representing words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content to vector operations in a k-dimensional vector space, where similarity in the vector space can stand in for the semantic similarity of the text. The word vectors that Word2vec outputs support many NLP tasks, such as clustering, finding synonyms, and part-of-speech analysis. If we change perspective and treat words as features, then Word2vec maps those features into a k-dimensional vector space, which can yield a deeper feature representation of text data.

Word2vec uses a distributed representation for its word vectors. Distributed representation was first proposed by Hinton in 1986 [4]. The basic idea is to train each word into a k-dimensional real-valued vector (k is a hyperparameter of the model) and to judge the semantic similarity between words by the distance between their vectors (for example, cosine similarity or Euclidean distance). Word2vec uses a three-layer neural network: input layer, hidden layer, and output layer. A core technique is frequency-based Huffman coding, which makes the hidden-layer activations for words with similar content largely consistent; the more frequent a word, the shorter its code and the fewer inner nodes it needs to activate, which effectively reduces the computational complexity. One reason Word2vec is so popular is its efficiency: Mikolov notes in paper [2] that an optimized single-machine version can train on hundreds of billions of words in a day.
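To make the "distance between word vectors" idea concrete, here is a minimal sketch of cosine similarity between toy word vectors. The vectors and words below are invented for illustration; a real model learns k-dimensional vectors (k in the hundreds) from a corpus.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: 1.0 = same direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional vectors (real models typically use k = 100-1000).
vec_king  = np.array([0.9, 0.1, 0.4])
vec_queen = np.array([0.8, 0.2, 0.5])
vec_apple = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(vec_king, vec_queen))  # high: semantically close
print(cosine_similarity(vec_king, vec_apple))  # lower: semantically distant
```

Semantically related words end up with vectors pointing in similar directions, so their cosine similarity is close to 1.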

The three-layer neural network is nominally trained as a language model, but in doing so it also obtains a representation of each word in a vector space, and this side effect is the true goal of Word2vec.

Compared with classical methods such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), Word2vec exploits the context of each word, so the semantic information it captures is richer.

Sample Experiment

A Word2vec system is deployed on the server; you can try it out:

cd /home/liwei/word2vec/trunk
./demo-analogy.sh          # interesting properties of the word vectors (try: apple red mango / Paris France Italy)
./demo-phrases.sh          # vector representation of larger pieces of text using the word2phrase tool
./demo-phrase-accuracy.sh  # measure the quality of the word vectors
./demo-classes.sh          # word clustering
./distance GoogleNews-vectors-negative300.bin    # pre-trained word and phrase vectors
./distance freebase-vectors-skipgram1000-en.bin  # pre-trained entity vectors with Freebase naming

See the official website for detailed usage.

Model analysis

A traditional statistical language model is a probability distribution over the basic units of a language (usually sentences); it is a generative model of the language. A general language model can be written as a product of conditional probabilities, one per word:

p(s) = p(w_1^T) = p(w_1, w_2, …, w_T) = ∏_t p(w_t | context)
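As a toy illustration of this chain-rule product, suppose we already had the per-word conditional probabilities for a short sentence (the numbers below are invented for the example):

```python
import math

# Hypothetical conditional probabilities p(w_t | context) for "the cat sat".
cond_probs = {
    "p(the)":           0.20,
    "p(cat | the)":     0.05,
    "p(sat | the cat)": 0.30,
}

# The sentence probability is the product of the conditionals.
p_sentence = math.prod(cond_probs.values())
print(p_sentence)  # 0.2 * 0.05 * 0.3 = 0.003
```

A language model's job is to supply each of those conditional factors; Word2vec's network is one way to parameterize them.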

Word2vec uses a log-bilinear language model. One variant is CBOW (the continuous bag-of-words model), which predicts the current word w_t from its surrounding context:

p(w_t | context) = p(w_t | w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k})
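The context in this formula is just the words in a symmetric window of size k around position t. A minimal sketch of extracting such a window from a tokenized sentence (the sentence and window size are made up for the example):

```python
def cbow_context(tokens, t, k):
    """Return the words in a symmetric window of size k around position t,
    excluding the center word itself (truncated at sentence boundaries)."""
    left = tokens[max(0, t - k):t]
    right = tokens[t + 1:t + 1 + k]
    return left + right

sentence = ["the", "quick", "brown", "fox", "jumps"]
# Predict "brown" (t = 2) from a window of k = 2 words on each side.
print(cbow_context(sentence, 2, 2))  # ['the', 'quick', 'fox', 'jumps']
```

CBOW then sums (or averages) the vectors of these context words to form the hidden layer, as described below.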

CBOW can be computed with the hierarchical softmax algorithm, which uses Huffman coding: each word w can be reached from the root node of the tree along a unique path, and that path forms the word's code. Let n(w, j) be the j-th node on this path and L(w) the length of the path, with j counted from 1, so that n(w, 1) = root and n(w, L(w)) = w. For the j-th node, the label defined by hierarchical softmax is 1 - code[j].
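The key property of the Huffman coding step is that frequent words get short codes, and hence short paths through the tree. A minimal sketch of Huffman code construction over a toy vocabulary (the words and frequencies are invented for the example):

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codes from word frequencies.
    More frequent words receive shorter codes (shorter tree paths)."""
    # Heap entries: (frequency, tie-breaker, subtree); a subtree is either
    # a word (leaf) or a (left, right) pair of subtrees.
    heap = [(f, i, w) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1

    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):       # inner node: recurse both branches
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:                             # leaf: record the finished code
            codes[node] = code
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"the": 50, "cat": 10, "sat": 8, "mat": 5, "xylophone": 1})
print(codes)  # "the" gets a 1-bit code, rare "xylophone" a 4-bit code
```

In hierarchical softmax, the code length is exactly the number of binary decisions needed to predict the word, which is why frequent words are cheap to compute.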

Taking a window of suitable size as context, the input layer reads the words in the window and sums their vectors (k-dimensional, randomly initialized) to form a hidden layer of k nodes. The output layer is a huge binary tree whose leaf nodes represent all the words in the corpus (if the corpus contains V distinct words, the tree has |V| leaves). This binary tree is built as a Huffman tree, so each word at a leaf has a globally unique code such as "010011"; say the left subtree is 1 and the right subtree is 0. Each node of the hidden layer has a connecting edge to each inner node of the binary tree, so every inner node has k incoming edges, and each edge carries a weight.

A word w_t in the corpus corresponds to a leaf of the binary tree, so it has a binary code, say "010011". In the training phase, given a context from which we want to predict w_t, we traverse the tree from the root, and the goal at each inner node is to predict the corresponding bit of the word's code. That is, for the given context, we want to maximize the probability of the predicted word's binary code. Concretely, for the code "010011" we hope that at the root node the probability of bit = 1, obtained by a logistic function of the hidden vector and the node's parameter vector, is close to 0; at the second level, close to 1; and so on down the path. Multiplying these per-node probabilities gives the probability p(w_t) of the target word under the current network; the residual for the current sample is 1 - p(w_t), so gradient descent can be used to train all the parameter values. Note that the resulting probability is automatically normalized over the binary codes of the words.
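The per-path probability computation described above can be sketched in a few lines. Everything here is a toy: the dimensionality, the random "trained" vectors, and the code bits are all invented to show the mechanics.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: probability of taking the bit = 1 branch."""
    return 1.0 / (1.0 + np.exp(-x))

k = 4                                    # toy vector dimensionality
rng = np.random.default_rng(0)

# Hidden layer h: the sum of the context word vectors (CBOW input).
context_vectors = rng.normal(size=(4, k))
h = context_vectors.sum(axis=0)

# One parameter vector per inner node on the target word's Huffman path.
inner_nodes = rng.normal(size=(3, k))    # a path of length 3
code = [0, 1, 0]                         # target word's Huffman code bits

# p(w_t | context) = product of one binary decision per inner node.
p = 1.0
for theta, bit in zip(inner_nodes, code):
    q = sigmoid(np.dot(h, theta))        # P(bit = 1) at this node
    p *= q if bit == 1 else (1.0 - q)
print(p)                                 # probability of the full code
```

Training adjusts both h (via the context word vectors) and the inner-node vectors so that this product grows for observed (context, word) pairs.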

Hierarchical softmax uses Huffman coding to construct the binary tree; in effect it approximates a multi-class problem with a series of binary classifications. For example, if we used all words directly as outputs, "orange" and "car" would be mixed together in one flat decision. Instead, given the context of w_t, the model first decides whether w_t is a noun, then whether it is a food name, then whether it is a fruit, and finally whether it is "orange".

During training, the model assigns each of these abstract intermediate nodes a suitable vector that represents all of the child nodes beneath it. Because the real words share the vectors of these abstract nodes, the hierarchical softmax method is not equivalent to the original problem, but the approximation causes no significant loss in performance while greatly increasing the scale of problems the model can solve.

Without the binary tree, we would compute the probability of each output directly from the hidden layer (the traditional softmax), which requires scoring every one of the |V| words, a process with time complexity O(|V|). With a binary tree (the Huffman tree in Word2vec), the complexity drops to O(log2(|V|)), a huge speedup.

Official resources

Word2vec Homepage

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

Theoretical materials

Deep Learning Combat Word2vec

Deep Learning in NLP (i) Word vectors and language models

Word2vec Fool Anatomy

Practical information

Use Chinese data to run Google Open source project Word2vec

Word breaker tool ANSJ (instance)

Investigation of Word2vec in event mining

With a model trained by Word2vec, queries can be given good semantic representations even when they share no literal words. For example, "SWAT shoots down 5 thugs in 15 seconds", "railway station incident", and the "3.1 Kunming incident" are strongly related semantically, which the traditional TF-IDF method cannot capture.

In medical projects, short texts such as diagnostic reports and examination reports are common, so word2vec may achieve a good semantic characterization there.

If it can be combined with the dental hospital's corpus, word-similarity results like these would be meaningful, and the traditional TF-IDF feature representation could even be mapped into the new vector space.

