Deep Learning Word2vec Notes: The Algorithm
Statement:
This article is adapted from a blog post at HTTP://WWW.TUICOOL.COM/ARTICLES/FMUYAMF; please bear with any mistakes it may contain.
Objective
When reading up on Word2vec, people are usually pointed to the same few papers, and those papers do not systematically explain the specific principles and algorithms behind Word2vec. So swaiiow dared to put together these notes, hoping to help readers understand the basic principles of Word2vec as quickly as possible and avoid wasting time.
Of course, if you already know the material, just skim through.
1. Network structure and usage of CBOW with hierarchical softmax
Word2vec comes in two model types, and each can be trained with one of two strategies, giving four combinations in total. Let us start with the most common one. Its network structure looks like the figure below.
The first (topmost) layer can be called the input layer. The input is the word vectors of several words (a word vector represents a word as a vector; more on this later). The middle layer can be called the hidden layer; it is the sum of the several input word vectors. Note that this is a vector sum, so the result is again a single vector.
The third layer, the binary tree inside the box, can be called the output layer. The hidden-layer node is connected to every non-leaf node of the binary tree (too many lines to draw them all). This binary tree is a Huffman tree. Each non-leaf node also carries a vector, but that vector does not represent a word; it represents a class of words. Each leaf node carries a word vector; for simplicity they are all written as w with no subscript. Note also that the input word vectors are themselves leaf nodes of this Huffman tree. Of course, the input words and the word finally output are usually not the same word — in fact they basically never are — but the input words usually have a semantic relationship with the output word.
Note too that the leaf nodes of this Huffman tree represent all the words in the corpus: each leaf node corresponds to exactly one word, with no repetition.
The job of this network structure is to accomplish one thing: to judge whether a sentence is natural language. How? With probability: compute the joint probability of the sentence, i.e. the product of the probabilities of the "word combinations" it contains. If that probability is low, the sentence is probably not natural language; if it is high, it is a normal sentence. This is in fact the goal of a language model. The "word combinations" mentioned here are really the probabilities of a word appearing together with its context; the common case is to multiply, for each word, the probability of that word given all the words before it, as described later.
For the network above, once training is done, given a sentence s consisting of the words w_1, w_2, w_3, ..., w_T, the probability that the sentence is natural language can be computed as
p(s) = p(w_1, w_2, ..., w_T) = Π_{i=1..T} p(w_i | Context_i)
Context_i denotes the context of the word w_i, i.e. a few words before and after it. How many words are taken on each side (call it c) is generally random — typically a random number from 1 to 5 — meaning the c words before and the c words after w_i are used when computing the probability of w_i appearing. For example, the sentence "Everyone likes to eat delicious apples" has 6 words in total; suppose that for the word "eat" the random c comes out as 2, then the context of "eat" is "Everyone", "likes", "delicious" and "the", four words in total, and the order of these four words may be scrambled. That is one characteristic of Word2vec.
The computation with the above network is best shown with an example. Suppose we want the conditional probability of the word "eat" given the four context words "Everyone", "likes", "delicious" and "the", and suppose that in the Huffman tree "eat" sits at the rightmost leaf node, so that there are two non-leaf nodes on the path from the root to it. Call the vector of the root node A, call the vector of the root's right child B, and call the sum of the word vectors of "Everyone", "likes", "delicious" and "the" C. Then
p(eat | Context) = (1 - σ(A·C)) · (1 - σ(B·C))
where σ(x) = 1/(1 + exp(-x)) is the sigmoid function.
Note that if the word "eat" were instead the leaf just to the right of the left child of the non-leaf node B (call that left child E), i.e. the middle one of the three leaves on the right of the figure, we would have
p(eat | Context) = (1 - σ(A·C)) · σ(B·C) · (1 - σ(E·C))
Compute such a probability for every word of the sentence and multiply them together to get the joint probability. If that probability is above a certain threshold, the sentence is considered normal; otherwise it is not natural language and is ruled out.
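To make this concrete, here is a minimal numeric sketch of the two-node example above; all vectors are made-up three-dimensional toys, and the branch directions simply follow the assumed position of "eat" at the rightmost leaf.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

A = np.array([0.2, -0.1, 0.4])   # vector of the root node (assumed)
B = np.array([-0.3, 0.5, 0.1])   # vector of the root's right child (assumed)
C = np.array([0.1, 0.3, -0.2])   # sum of the four context word vectors (assumed)

# "eat" sits at the rightmost leaf, so both steps on its path go right:
p_eat_given_context = (1 - sigmoid(A @ C)) * (1 - sigmoid(B @ C))
print(p_eat_given_context)
```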
This description of the neural network is admittedly dull, because the protagonist is not this probability. The most important things in the network are the vectors attached to the leaf nodes of the Huffman tree in the output layer. Those vectors are the word vectors, which were introduced in another blog post and are a very good thing.
How these word vectors are obtained is the more important process, and the most important part of the whole word2vec algorithm; it will be introduced carefully later.
2. Optimization objective and solution
2.1 Computing the conditional probability from the Huffman tree
As mentioned above, the goal of the language model is to judge whether a sentence is normal. To do that, many conditional probabilities have to be computed and then multiplied into a joint probability. This raises the question of how to compute each conditional probability. There are many ways, some of which are introduced in later chapters. Word2vec computes this conditional probability with the energy function of a neural network, because in an energy model the role of the energy function is to turn the state of the network into a probability representation, as mentioned in another blog post on RBMs (see Hinton's papers for the details). One great benefit of the energy model is its ability to fit any distribution in the exponential family. So if these conditional probabilities are assumed to follow some exponential-family distribution, they can be fitted with an energy model. In short, word2vec assumes that the conditional probability can be represented by an energy model.
Since it is an energy model, an energy function is needed, and Word2vec defines a very simple one:
E(A, C) = -(A·C)
where A can be taken to be the word vector of a word, and C is the sum of the word vectors of its context (a vector sum), so C basically stands for the context; the dot in the middle denotes the inner product of the two vectors.
Then, according to the energy model (the model assumes the temperature is always 1, so the energy function has no denominator there), the probability of the word A appearing given the context vector C can be written as
P(A|C) = exp(-E(A, C)) / Σ_{v=1..V} exp(-E(w_v, C))    (2.1.2)
where V is the number of words in the corpus. This definition means: given that the context C appears, the probability that the word in the middle is A. To compute this probability, the energy of every word in the corpus must be computed once; the ratio of word A's exponentiated (negative) energy to the total is then the probability of A. This way of computing probabilities is standard in energy models, and it is used, in a slightly different form, in the paper "Hierarchical Probabilistic Neural Network Language Model".
This probability is really not easy to compute: to get the probability of a single word, the energy of every word in the vocabulary under that context must be computed, exponentiated, and summed.
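A small sketch of what formula (2.1.2) costs, with a toy vocabulary and made-up vectors; the point is simply that a single probability requires touching all V words once.

```python
import numpy as np

def p_word_given_context(W, C, a):
    scores = np.exp(W @ C)           # exp(-E(w_v, C)) = exp(w_v . C) for every word
    return scores[a] / scores.sum()  # one full O(V) pass per probability

V, m = 10000, 50
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, m))   # toy word vectors, one row per word
C = rng.normal(scale=0.01, size=m)        # toy summed context vector
print(p_word_given_context(W, C, a=42))
```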
This is where the scientists come in. Suppose all the words in the corpus are split into two classes, call them class G and class H, half each, and suppose the word A belongs to class G. Then the following holds:
P(A|C) = P(A|G, C) · P(G|C)    (2.1.3)
The meaning of this formula is clear: the probability of the word A given the context C equals the product of two probabilities — the probability that, given the context C, the word to appear is a class-G word, times the probability that the word A appears given that the word is a class-G word and the context is C.
The paper "Hierarchical Probabilistic Neural Network Language Model" proves the general statement:
P(Y = y | X = x) = P(Y = y | D = d(y), X = x) · P(D = d(y) | X = x)
where d is a mapping function that maps each element of Y to the word class in D that contains it; the proof is given there.
Equation (2.1.3) shows that the probability of the word A appearing in the context C can be computed by first splitting the words of the corpus into two clusters, which saves computation. Here is how. Suppose the cluster centers of G and H are also represented by vectors G and H; then P(G|C) in formula (2.1.3) can be computed as
P(G|C) = exp(-E(G, C)) / (exp(-E(G, C)) + exp(-E(H, C))) = 1 / (1 + exp((H - G)·C))
That is, P(G|C) does not involve the individual words inside the clusters at all — the cluster centers are enough. With a single word-vector-like vector F = H - G, P(G|C) = 1/(1 + exp(F·C)) can be computed, so this step is very cheap. Now look at the other factor.
Since class G contains only V/2 words, computing the denominator for P(A|G, C) only requires the energies of those V/2 words. That already saves half of the computation, but the scientists are greedy, so we keep saving. How? Split the class-G words again into two clusters GG and GH, with A in GH; then
P(A|G, C) = P(A|GH, G, C) · P(GH|G, C)
and likewise
P(GH|G, C) = exp(-E(GH, C)) / (exp(-E(GG, C)) + exp(-E(GH, C))) = 1 / (1 + exp((GG - GH)·C))
where GG - GH can again be represented by a single word-vector-like vector FF = GG - GH. We also have
P(A|C) = P(A|GH, G, C) · P(GH|G, C) · P(G|C)
Continue by assuming that only two words are left in the cluster GH; split it once more into two clusters GHG and GHH, where the cluster GHG contains only the single word A. Then P(A|C) can be computed as
P(A|C) = P(A|GHG, GH, G, C) · P(GHG|GH, G, C) · P(GH|G, C) · P(G|C)
where P(A|GHG, GH, G, C) is 1, because GHG contains only that one word. Substituting this in gives
P(A|C) = P(GHG|GH, G, C) · P(GH|G, C) · P(G|C)
that is,
P(A|C) = [1/(1 + exp(FFF·C))] · [1/(1 + exp(FF·C))] · [1/(1 + exp(F·C))]
with FFF = GHH - GHG, FF = GG - GH and F = H - G as before. So P(A|C) only requires evaluating the energy of three word-vector-like vectors against the context C, which really does save a great deal of computation compared with the original formula.
Relating this to the Huffman tree above: suppose G means going right and H means going left; then A is the second leaf node from the right, i.e. the middle one of the three w's on the right of the figure, and F, FF, FFF are the vectors of the three non-leaf nodes on the path to that leaf.
But a word sometimes requires going left and sometimes going right. That is, at the root node, sometimes we want P(G|C), for which F = H - G is used, and sometimes we want P(H|C), for which F = G - H would be needed. Only if F has a single value at each node can the non-leaf node be represented directly by one word-vector-like vector. Scientists are not so easily defeated: F is simply always taken to be H - G, which gives
P(H|C) = 1/(1 + exp(-F·C)) = σ(F·C)
and P(G|C) = 1 - P(H|C) = 1 - σ(F·C).
In this way every non-leaf node can be represented by a single word-vector-like vector.
By now it should be clear why P(A|C) is computed the way it is. The other cases — words sitting at other leaf positions — follow exactly the same reasoning.
Summing up, the conditional probability can be computed with the following formula:
p(w | Context) = Π_{k=1..K_w} p(d_k | q_k, Context) = Π_{k=1..K_w} σ(q_k·C)^{1-d_k} · (1 - σ(q_k·C))^{d_k}
where C denotes the vector obtained by adding up the context word vectors, the q_k are the non-leaf nodes on the path from the root to the leaf of w, and d_k is the code, which can also be viewed as a classification: since every non-leaf node of the Huffman tree has exactly two children, we can set d_k = 0 when w is in the left subtree of that node and d_k = 1 otherwise. In this way each word is represented by a Huffman code, and with the d_k in the formula above, the whole p(w|Context) can be computed from a few non-leaf nodes of the Huffman tree and the Huffman code of the word w.
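A sketch of this summary formula with made-up node vectors and an assumed Huffman code: d_k = 0 contributes a factor σ(q_k·C) and d_k = 1 a factor 1 - σ(q_k·C).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_w_given_context(node_vecs, code, C):
    # node_vecs: the q_k on w's path; code: the Huffman code d_k of w
    p = 1.0
    for q, d in zip(node_vecs, code):
        s = sigmoid(q @ C)
        p *= s if d == 0 else (1.0 - s)
    return p

rng = np.random.default_rng(1)
path = [rng.normal(size=4) for _ in range(3)]   # made-up non-leaf node vectors
print(p_w_given_context(path, code=[1, 0, 1], C=rng.normal(size=4)))
```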
It is important to understand the above, because next comes how to train.
2.2 Objective function
Suppose the corpus consists of S sentences (their order does not matter) and contains V words in total. The likelihood function is constructed as
L = Π_{j=1..S} Π_{i_j=1..T_j} p(w_{i_j} | Context_{i_j})    (2.2.1)
where T_j denotes the number of words in the j-th sentence. Maximum likelihood is to be done over the whole corpus. The log-likelihood then looks like
ℓ = (1/V) Σ_{j=1..S} Σ_{i_j=1..T_j} log p(w_{i_j} | Context_{i_j})    (2.2.2)
With the 1/V factor, some people also call this log-likelihood a cross-entropy; the author does not understand that in detail and will not introduce it. Without the 1/V it is just the ordinary maximum-likelihood form.
Interested readers can extend this to a document-level formulation; it is not introduced here.
For Word2vec, however, the likelihood above is changed to the following form:
L = Π_{j=1..S} Π_{i_j=1..T_j} Π_{k=1..K_{i_j}} σ(q_k·C_{i_j})^{1-d_k} · (1 - σ(q_k·C_{i_j}))^{d_k}
where C_{i_j} denotes the sum of the context word vectors of the i_j-th word. The log-likelihood is then
L(θ) = Σ_{j=1..S} Σ_{i_j=1..T_j} Σ_{k=1..K_{i_j}} [ (1 - d_k)·log σ(q_k·C_{i_j}) + d_k·log(1 - σ(q_k·C_{i_j})) ]
There is no 1/V here.
This should look familiar: it is very much like the probability output of a two-class logistic regression. And that is indeed how Word2vec thinks of it. Going left in the Huffman tree — the case d_k = 0 — is regarded as the positive class, and going right as the negative class (here "positive" and "negative" merely name the two classes). So whenever a context C leads to a step into a left subtree, we have one positive-class sample; otherwise one negative-class sample. The probability that each sample belongs to the positive class can be computed with the parameters above, namely σ(q_k·C); if the step goes right, 1 - σ(q_k·C) is used instead. Note that each word produces several samples: starting from the root of the Huffman tree, every non-leaf node on the path produces one sample, and the sample's label (the positive/negative mark) comes from the Huffman code — as said before, going left means d_k = 0, so the label is naturally 1 - d_k.
At this point, the Huffman code has become an important quantity.
Things are much clearer now. The L(θ) above is a log-likelihood, and the negative log-likelihood F = -L(θ) is the objective function to be minimized.
2.3 Solution
The solution uses SGD; the blog post "Online learning algorithm FTRL" described some of the conditions for the SGD algorithm. Concretely, the samples are iterated over one at a time, and each sample affects only the parameters related to it; parameters unrelated to it are not touched. For the objective above, the negative log-likelihood contributed by the i_j-th word of the j-th sentence is
F_{i_j} = -Σ_{k=1..K_{i_j}} [ (1 - d_k)·log σ(q_k·C_{i_j}) + d_k·log(1 - σ(q_k·C_{i_j})) ]
and the negative log-likelihood of this sample at the k_{i_j}-th non-leaf node on its path is
F = -[ (1 - d_{k_{i_j}})·log σ(q_{k_{i_j}}·C_{i_j}) + d_{k_{i_j}}·log(1 - σ(q_{k_{i_j}}·C_{i_j})) ]
Now compute the gradients. Note that the parameters include both q_{k_{i_j}} and C_{i_j}, and both gradients are used during training. Note also that the derivative of log σ(x) is 1 - σ(x) and the derivative of log(1 - σ(x)) is -σ(x), so
∂F/∂q_{k_{i_j}} = -(1 - d_{k_{i_j}} - σ(q_{k_{i_j}}·C_{i_j}))·C_{i_j}
and
∂F/∂C_{i_j} = -(1 - d_{k_{i_j}} - σ(q_{k_{i_j}}·C_{i_j}))·q_{k_{i_j}}
(F_q and F_c are just shorthands for these two gradients.) With the gradients, each parameter can be updated iteratively, e.g.
q_{k_{i_j}}^{(n+1)} = q_{k_{i_j}}^{(n)} + η·(1 - d_{k_{i_j}} - σ(q_{k_{i_j}}^{(n)}·C_{i_j}))·C_{i_j}
where η is the learning rate. The word vector of each input word can also be updated:
w^{(n+1)} = w^{(n)} - η·Σ_{k=1..K_{i_j}} ∂F_k/∂C_{i_j}, for every word w in Context_{i_j}
Note that this second update applies to every input word: if 4 words were input, all four are updated in this way. The second update is admittedly not easy to justify, because it means the gradient with respect to the context, accumulated over all non-leaf nodes on the path, is added to every word of the context — which looks like error back-propagation in a BP neural network.
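Here is a sketch of one such SGD step along the lines just described (the names syn0, syn1 and neu1e are borrowed from the description of the C code; the learning rate and all sizes are assumptions): the node vectors on the path are updated immediately, while the error for the context is accumulated and then added to every input word vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_step(syn0, syn1, context_ids, path_ids, code, lr=0.025):
    # syn0: (V, m) input word vectors; syn1: (V-1, m) non-leaf node vectors
    C = syn0[context_ids].sum(axis=0)   # hidden layer: sum of the context vectors
    neu1e = np.zeros_like(C)            # accumulated error for the input words
    for node, d in zip(path_ids, code):
        f = sigmoid(syn1[node] @ C)
        g = lr * (1 - d - f)            # lr * (1 - d_k - sigma(q_k . C))
        neu1e += g * syn1[node]         # gradient passed back to the context
        syn1[node] += g * C             # update the non-leaf node vector q_k
    syn0[context_ids] += neu1e          # every input word gets the same update

rng = np.random.default_rng(2)
syn0 = rng.normal(scale=0.01, size=(10, 4))
syn1 = np.zeros((9, 4))
cbow_hs_step(syn0, syn1, context_ids=[1, 3, 5], path_ids=[0, 2], code=[1, 0])
```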
Paper "Hierarchical probabilistic neural Network Language Model" and "three New graphical Models for statistical Language Modellin "G" seems to be such a kind of explanation, people are the context of a number of words connected to the end of a vector, the long vector has a gradient, or a large v*m matrix (M is the dimension of the word vector), the matrix each element has a gradient, these gradients naturally include the gradient of the input word.
If someone finds out the explanation for this procedure, please let us know.
2.4 Tricks in the code
As above, c is how many words are taken on each side of the current word. In the code, c is a number from 0 to window-1, generated at random for each word, where window is a user-supplied variable with default value 5. What the code actually does is slightly different: it first generates a number b between 0 and window-1, and the window for the word being trained (say word i) then runs from word i-window+b to word i+window-b. Note that each word gets a different c, generated at random.
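A sketch of this dynamic-window trick under the stated assumptions (the function name and the clipping at sentence boundaries are mine):

```python
import random

def context_indices(i, sentence_len, window=5, rng=random):
    b = rng.randrange(window)                 # a number from 0 to window-1
    lo = max(0, i - (window - b))
    hi = min(sentence_len, i + (window - b) + 1)
    return [j for j in range(lo, hi) if j != i]

random.seed(3)
print(context_indices(i=4, sentence_len=10))
```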
Readers who have looked at the code will find that q_{k_{i_j}} is stored in the matrix syn1 and C_{i_j} in neu1; the word vectors of the leaf nodes are stored in syn0 and accessed by index.
In the core code, vocab[word].code[d] stands for d_{k_{i_j}}; the rest is the iterative process, and the code is very concise.
Line 419 of the code computes C_{i_j}; lines 425-428 compute f, i.e. σ(q_{k_{i_j}}·C_{i_j}); line 432 accumulates the error for C_{i_j}; and line 434 performs the update of q_{k_{i_j}}^{(n+1)}.
Note that every input word is then updated, and the size of that update is the error accumulated in line 432.
3. Network structure and usage of CBOW with negative sampling
3.1 Network structure and usage
The network structure is as follows
As in section 2, the hidden layer in the middle is the sum of the context word vectors. In addition there is a matrix R, a temporary matrix used during training that connects the hidden layer with all the output nodes; this matrix is not used once the trained network is put to use, which is a point the author has not figured out. Each output node represents one word vector.
As in the example of section 2, such a probability must be computed, but here there is a much simpler method: randomly draw c words from the corpus — say c = 3 and the drawn words are D, E and F — and let the word vector of "eat" be A. Then the probability of the word "eat" is computed with a formula of the form
p(eat | Context) = σ(A·C) · (1 - σ(D·C)) · (1 - σ(E·C)) · (1 - σ(F·C))
where C is again the sum of the context word vectors.
As in section 2, compute this for every word of the sentence and multiply to get the joint probability; if it exceeds a certain threshold the sentence is considered normal, otherwise it is not natural language and is ruled out.
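A sketch of the sampled score used in this example, with made-up vectors and three assumed negative words: the word that should be there is scored with σ and each randomly drawn word with 1 - σ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_score(target_vec, negative_vecs, C):
    p = sigmoid(target_vec @ C)          # the word that should appear here
    for n in negative_vecs:
        p *= 1.0 - sigmoid(n @ C)        # the randomly drawn words
    return p

rng = np.random.default_rng(4)
A = rng.normal(size=4)                          # word vector of "eat" (made up)
negs = [rng.normal(size=4) for _ in range(3)]   # vectors of the drawn words d, e, f
C = rng.normal(size=4)                          # summed context vector (made up)
print(neg_score(A, negs, C))
```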
This only illustrates how the network is used; what really matters is always the word vectors.
4. Optimization objective and solution for CBOW with negative sampling
4.1 The meaning of the sampling method and the objective function
Why sample at all? The purpose is the same as that of the Huffman tree in section 2: to save computation, namely the computation of the probability P(A|C) in formula (2.1.2), which is genuinely hard to compute. The paper "Distributed Representations of Words and Phrases and their Compositionality" mentions a method called NCE as a replacement for the hierarchical softmax method above (i.e. the Huffman-tree method). But since Word2vec only cares about learning high-quality word vectors, it uses a simplified version of NCE called NEG, whose essence is to use, for the i_j-th word w_{i_j} of the j-th sentence, the expression
log σ(w_{i_j}·C_{i_j}) + Σ_{k=1..K} E_{w_k ~ P_v(w)} [ log σ(-w_k·C_{i_j}) ]
where the subscript of the expectation E means that w_k is drawn according to a certain distribution, and P_v(w) here denotes the word-frequency distribution.
The second term of this expression asks for K expectations, each being an expectation over words that are not expected to appear in this context. Here a particularly big shortcut is taken: each expectation is estimated with just one drawn sample. Of course, if the whole corpus is traversed, this effectively amounts to running the experiment many times, because every draw is made according to the word-frequency distribution; with enough draws, the overall estimate is still close to the true expectation. This can be understood via the Monte-Carlo sampling used to compute gradients in the RBM blog post.
Judging from the code, the expectation is indeed estimated in just this way, and all the formulas simplify to the following form:
log σ(w_{i_j}·C_{i_j}) + Σ_{k=1..K} log σ(-w_k·C_{i_j})
Using this to replace the terms in (2.2.2) (dropping the 1/V) gives the objective function of CBOW with negative sampling. This objective is very similar to that of logistic regression, with w_{i_j} as a positive-class sample and each w_k as a negative-class sample.
To unify the notation, give positive-class samples the label 1 and negative-class samples the label 0; the negative log-likelihood of each sample then becomes
F_w = -[ label·log σ(w·C_{i_j}) + (1 - label)·log(1 - σ(w·C_{i_j})) ]
4.2 Solution for CBOW with negative sampling
The solution again uses SGD. For a word w_{i_j}, the word itself is one positive-class sample, and K negative-class samples are drawn for it, so each word contributes K+1 samples during training and K+1 SGD steps are performed.
The two gradients of each sample are
∂F_w/∂w = -(label - σ(w·C_{i_j}))·C_{i_j}
and
∂F_w/∂C_{i_j} = -(label - σ(w·C_{i_j}))·w
That the two gradients have such similar forms is very convenient; the iteration can then begin. There is one strange thing in the code: it allocates an extra V×m matrix for the network, used to hold the vectors of the drawn negative words — each row is a word vector, one row for every word in the vocabulary. After a word is drawn, the iteration (i.e. the gradient computation) uses the corresponding row of this matrix, and the matrix itself is updated: the first gradient above actually updates the row of the matrix. On the whole, however, this matrix seems to be discarded after training, because the word vector that is finally updated and output each time is only one — the word w_{i_j} that keeps appearing in the formulas, which is also an input word vector (in the figure, the word vectors at the bottom of the output layer). Of course, this matrix can be regarded as the connection matrix between the hidden layer and the output layer.
Here w denotes the word vector corresponding to each sample, including w_{i_j} and the drawn negative words; note that what the first gradient updates is the corresponding row of the connection matrix R. The update formula for the input word vectors is then
w_input^{(n+1)} = w_input^{(n)} - η·Σ over the K+1 samples of ∂F_w/∂C_{i_j}, for every word in the context
Note that each gradient value is computed with the corresponding row of the connection matrix R, which apparently avoids updating the output-layer word vectors every time — perhaps out of fear of making a mess of them.
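A sketch of one CBOW-with-negative-sampling step following the description above (the names syn1neg and neu1e mirror the code's; everything else is an assumption): rows of the R matrix are updated per sample, and the accumulated error finally updates every input word vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_neg_step(syn0, syn1neg, context_ids, target, negatives, lr=0.025):
    C = syn0[context_ids].sum(axis=0)
    neu1e = np.zeros_like(C)
    for w, label in [(target, 1)] + [(n, 0) for n in negatives]:
        f = sigmoid(syn1neg[w] @ C)
        g = lr * (label - f)             # logistic-regression-style gradient
        neu1e += g * syn1neg[w]          # accumulate the error for the inputs
        syn1neg[w] += g * C              # update the row of R for this sample
    syn0[context_ids] += neu1e           # finally update the input word vectors

rng = np.random.default_rng(5)
syn0 = rng.normal(scale=0.01, size=(10, 4))
syn1neg = np.zeros((10, 4))
cbow_neg_step(syn0, syn1neg, context_ids=[2, 4], target=7, negatives=[1, 9, 3])
```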
4.3 Tricks in the code for CBOW with negative sampling
The random numbers are generated in-house: the code implements its own random-number routine.
Negative sampling is performed for every word; if the current word itself (the word with label 1) is drawn, the sampling exits early, so this step can finish ahead of time — a somewhat puzzling trick.
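A simplified sketch of drawing negatives according to the word-frequency distribution P_v(w). The original code builds an integer lookup table and uses its own random-number routine, so this is only an assumed equivalent; draws that hit the current word are simply rejected here.

```python
import bisect
import random

def build_cumulative(freqs):
    cum, total = [], 0.0
    for f in freqs:
        total += f
        cum.append(total)
    return cum

def draw_negatives(cum, current, k, rng=random):
    negs = []
    while len(negs) < k:
        r = rng.uniform(0.0, cum[-1])
        w = bisect.bisect_left(cum, r)   # word index drawn according to frequency
        if w == current:                 # never keep the current (label-1) word
            continue
        negs.append(w)
    return negs

random.seed(6)
cum = build_cumulative([5, 3, 9, 1, 2])  # toy word frequencies
print(draw_negatives(cum, current=2, k=3))
```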
The matrix R is called syn1neg in the code, C_{i_j} is neu1 as before, and the word vectors of the leaf nodes are again in syn0, accessed by index.
Lines 442-446 are the sampling code, the author's own module; the variable label is the same label as above; f holds σ(w·C_{i_j}); syn1neg holds the rows of the matrix R; and neu1e again accumulates the error until the sampling loop ends, after which the word vectors of the input layer are updated.
The update of the input layer is the same as in section 2: all the input words are updated.
5. Optimization objective and solution for Skip-gram with hierarchical softmax
5.1 Network structure and usage
The network structure is as follows
Here w_i is the current word; the word w_i is connected directly to the Huffman tree, and there is no hidden layer. The way the network is used is still similar to CBOW with hierarchical softmax.
When judging whether the sentence "Everyone likes to eat delicious apples" is natural language, the procedure is the same. Take the word "eat" and again draw c = 2 at random; for "eat" there is now more to compute — four probabilities in total: P(Everyone|eat), P(likes|eat), P(delicious|eat) and P(the|eat). To compute P(Everyone|eat), the binary tree in the figure above is used. Suppose the word "Everyone" sits at the leftmost leaf under the right child of the Huffman root, i.e. the third leaf node counting from the right in the figure, and suppose the three non-leaf nodes on the path from the root to that leaf have vectors A, B and C (from top to bottom); let the word vector of "eat" be D. Then the probability P(Everyone|eat) can be computed as
P(Everyone | eat) = (1 - σ(A·D)) · σ(B·D) · σ(C·D)
Compute P(likes|eat), P(delicious|eat) and P(the|eat) in the same way, then multiply the four probabilities to get the context probability of the word "eat". Note that this is the probability for one word only.
Compute such a probability for every word of the sentence and multiply them all together to obtain the probability that the sentence is natural language. If this probability is greater than a certain threshold the sentence is considered normal; otherwise it is not natural language and is ruled out.
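A sketch of this skip-gram usage with made-up vectors, paths and codes: each context word's probability is a product over the non-leaf nodes on that word's own Huffman path, always scored against the input vector D of "eat".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_context_word(node_vecs, code, D):
    p = 1.0
    for q, d in zip(node_vecs, code):
        s = sigmoid(q @ D)
        p *= s if d == 0 else (1.0 - s)
    return p

rng = np.random.default_rng(7)
D = rng.normal(size=4)                   # word vector of "eat" (made up)
p_position = 1.0
for _ in range(4):                       # "Everyone", "likes", "delicious", "the"
    path = [rng.normal(size=4) for _ in range(3)]         # made-up q_k
    code = [int(c) for c in rng.integers(0, 2, size=3)]   # made-up Huffman code d_k
    p_position *= p_context_word(path, code, D)
print(p_position)
```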
Again, this only illustrates how the network is used; what really matters is always the word vectors.
5.2 Objective function
Assume the corpus is a sequence of S sentences (their order does not matter) with V words in total; the likelihood function is constructed as
L = Π_{j=1..S} Π_{i_j=1..T_j} Π_{-c_{i_j} ≤ u_{i_j} ≤ c_{i_j}, u_{i_j} ≠ 0} p(w_{u_{i_j}+i_j} | w_{i_j})    (5.2.1)
where T_j denotes the number of words in the j-th sentence and w_{u_{i_j}+i_j} denotes one of the words around the word w_{i_j}; note that c_{i_j} is different for each w_{i_j}. Maximum likelihood is done over the whole corpus. The log-likelihood then looks like
ℓ = (1/V) Σ_{j=1..S} Σ_{i_j=1..T_j} Σ_{-c_{i_j} ≤ u_{i_j} ≤ c_{i_j}, u_{i_j} ≠ 0} log p(w_{u_{i_j}+i_j} | w_{i_j})    (5.2.2)
where V again means the total number of words in the corpus. With the 1/V the whole objective can also be called a cross-entropy, but that is not interesting here and the factor is generally dropped.
This involves computing probabilities of the form p(w|I); compared with before, the condition has changed — the conditioning word is now the word under examination (the input), and the word whose conditional probability is computed is a context word. The method of computation, however, has not changed: the Huffman-tree method introduced earlier applies in exactly the same way,
p(w | I) = Π_{k=1..K_w} σ(q_k·I)^{1-d_k} · (1 - σ(q_k·I))^{d_k}
where I denotes the input word (the word "eat" in the example of 5.1), w then stands for "Everyone" in that example, q_k denotes a non-leaf node on the path from the root to the leaf where the word "Everyone" sits, and d_k is the code — again a classification: d_k = 0 when w is in the left subtree of node q_k, and d_k = 1 otherwise.
Using this to replace p(w_{u_{i_j}+i_j} | w_{i_j}) in the likelihood (5.2.1) above, with every variable in its proper place, gives the total likelihood function.
Taking the logarithm of this expression gives
log p(w | I) = Σ_{k=1..K_w} [ (1 - d_k)·log σ(q_k·I) + d_k·log(1 - σ(q_k·I)) ]
Using this to replace the corresponding term in (5.2.2) gives the total log-likelihood, i.e. the objective function; what remains is how to solve it.
Note that when computing the context probability of each word ("eat" in the example), several conditional probabilities have to be computed (P(Everyone|eat), P(likes|eat), P(delicious|eat) and P(the|eat)), and each of these conditional probabilities in turn requires walking past several non-leaf nodes of the Huffman tree, each of which contributes one term. So each non-leaf node visited behaves, within the total log-likelihood, like one sample of a logistic regression; for ease of description, the language of samples and labels is used for these things.
As in ordinary logistic regression, each time a non-leaf node is visited, if the path goes left the label is defined to be 1 — a positive sample; otherwise the label is 0 — a negative sample. So label = 1 - d_k, and every non-leaf node visited contributes one sample to the overall problem.
5.3 Solution
The solution again selects SGD. The contribution of each sample to the total objective function is
F = -[ (1 - d_k)·log σ(q_k·I) + d_k·log(1 - σ(q_k·I)) ]
Compute the gradients:
∂F/∂q_k = -(1 - d_k - σ(q_k·I))·I
and
∂F/∂I = -(1 - d_k - σ(q_k·I))·q_k
Update:
q_k^{(n+1)} = q_k^{(n)} + η·(1 - d_k - σ(q_k^{(n)}·I))·I
I^{(n+1)} = I^{(n)} + η·Σ_{k=1..K_w} (1 - d_k - σ(q_k·I))·q_k
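A sketch of one skip-gram-with-hierarchical-softmax step for a single (input word, context word) pair, under the same assumptions as the CBOW sketch earlier: the node vectors on the context word's path are updated immediately, and the accumulated error (neu1e) then updates the single input word vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_hs_step(syn0, syn1, input_id, path_ids, code, lr=0.025):
    I = syn0[input_id]                   # vector of the input word
    neu1e = np.zeros_like(I)             # error accumulated over the whole path
    for node, d in zip(path_ids, code):
        f = sigmoid(syn1[node] @ I)
        g = lr * (1 - d - f)
        neu1e += g * syn1[node]
        syn1[node] += g * I
    syn0[input_id] += neu1e              # only this one word vector is updated

rng = np.random.default_rng(8)
syn0 = rng.normal(scale=0.01, size=(10, 4))
syn1 = np.zeros((9, 4))
skipgram_hs_step(syn0, syn1, input_id=6, path_ids=[0, 3, 4], code=[1, 1, 0])
```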
5.4 Tricks in the code
The input word I is updated only after the error over the whole Huffman path has been accumulated; this error is stored in the neu1e array.
Lines 480-483 compute the value σ(q_k·I) and save it in f; vocab[word].code[d] holds the value of d_k; neu1e accumulates the error over all non-leaf nodes on the path from the root to the leaf.
The error propagation here is much simpler: each update touches only one word vector, the one word involved in p(w|I).
6. Optimization objective and solution for Skip-gram with negative sampling
This will only be described briefly; the author has not examined it very closely.
6.1 Network structure and usage instructions
The network structure is as follows
The usage will not be repeated; it is the same sampling-based usage as before.
As in section 4, there is a matrix R, a temporary matrix used during training that connects the input word (there is no hidden layer here) with all the output nodes; this matrix is not used once the trained network is put to use, which again is a point the author has not figured out. Each output node represents one word vector.
As in the example of section 5, a probability such as p(Everyone|eat) must be computed, and here the computation is much simpler: randomly draw c words from the corpus — say c = 3 and the drawn words are D, E and F — and let the word vector of "eat" be A. Then the probability of the word "Everyone" (with vector W) is computed with a formula of the form
p(Everyone | eat) = σ(W·A) · (1 - σ(D·A)) · (1 - σ(E·A)) · (1 - σ(F·A))
Also as in section 5, for each word of the sentence the probabilities with its few context words are computed (P(Everyone|eat), P(likes|eat), P(delicious|eat) and P(the|eat)) and multiplied to give the probability for the word "eat"; the probabilities of all words are computed and then multiplied to give the probability that the whole sentence is natural language. The rest is as above.
6.2 Objective function and solution
The likelihood function is the same as (5.2.1) and the log-likelihood the same as (5.2.2), but log p(w|I) is computed differently. The paper "Distributed Representations of Words and Phrases and their Compositionality" considers that log p(w|I) can be replaced by
log σ(w·I) + Σ_{k=1..K} E_{w_k ~ P_v(w)} [ log σ(-w_k·I) ]
The concept of positive and negative samples can be set up exactly as in section 4. To unify the notation, positive samples get label 1 and negative samples label 0, and the negative log-likelihood of each sample becomes
F_w = -[ label·log σ(w·I) + (1 - label)·log(1 - σ(w·I)) ]
The gradients are
∂F_w/∂w = -(label - σ(w·I))·I
and
∂F_w/∂I = -(label - σ(w·I))·w
Update:
w^{(n+1)} = w^{(n)} + η·(label - σ(w^{(n)}·I))·I
I^{(n+1)} = I^{(n)} + η·Σ over the K+1 samples of (label - σ(w·I))·w
6.3 Code
Negative sampling is again performed for each word, and if the current word (the word with label 1) is drawn, the sampling exits early.
Lines 493-502 are the sampling code; lines 505-508 compute σ(w·I) and store it in f; syn1neg holds the rows of the matrix R; neu1e still accumulates the error until a round of sampling is finished, and the word vector of the input layer is then updated.
The update of the input layer is the same as before.
7. Summary
Judging from the code, Word2vec's author Mikolov is a rather pragmatic person: whatever method works well in practice is the one used, without agonising over strict theoretical proofs. The tricks in the code are also very practical and can be borrowed for use elsewhere.
Acknowledgements
Thanks to several Google researchers for their selflessly published material.
Thanks to the material in many bloggers' posts, including @peghoty, who is Pigoti in the deep learning study group.
References
[1] http://techblog.youdao.com/?p=915 Deep Learning in practice: Word2vec, NetEase Youdao (PDF)
[2] http://www.zhihu.com/question/21661274/answer/19331979 @Yang Chao's answer to the question "Some understanding of Word2vec"
[3] http://xiaoquanzi.net/?p=156 Hisen's blog post
[4] Hierarchical Probabilistic Neural Network Language Model. Frederic Morin and Yoshua Bengio.
[5] Distributed Representations of Words and Phrases and their Compositionality. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean.
[6] A Neural Probabilistic Language Model. Y. Bengio, R. Ducharme, P. Vincent.
[7] Linguistic Regularities in Continuous Space Word Representations. Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig.
[8] Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.