Word vectors:
Whether it is a sentence or a whole article, the word is the most basic building block.
How can a computer make use of these words?
The key problem is how to convert a word into a vector.
In a two-dimensional embedding space, had, has, and have mean the same thing, so their vectors should lie close together.
Likewise, need and help end up near the same location.
Nearness in the space expresses sameness and relatedness.
Consider the following example:
Which words are closest to frog? Its synonyms.
If we model two different languages, their embedding spaces also turn out to be very close in shape,
so the word vectors we construct are not tied to the language itself; they are modeled from semantic context (the logic of the surrounding words).
Neural Network model:
The input word vectors are concatenated end-to-end (the projection layer) and passed forward through the neural network, whose parameters are optimized during training.
The input word vectors themselves are parameters that also need to be optimized.
Training sample: the vectors of the preceding n-1 words, assuming each word vector has size m.
Projection layer: one concatenated vector of size (n-1)×m.
Output:
given that context, the probability that the next word is exactly the i-th word of the dictionary.
Normalization: a softmax turns the output scores into a probability distribution over the dictionary.
The real goal of the whole exercise is to obtain the word vector of every word.
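Below is a minimal sketch of that forward pass (the dimensions n, m, the hidden size, and the example word ids are illustrative assumptions, not values from these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, m, h = 5000, 4, 100, 64          # vocab size, n-gram order, word-vector size, hidden units

C = rng.normal(size=(V, m))            # word-vector table: these are trained parameters too
H = rng.normal(size=((n - 1) * m, h))  # projection -> hidden weights
U = rng.normal(size=(h, V))            # hidden -> output weights

context_ids = [12, 7, 301]             # the previous n-1 = 3 words (hypothetical ids)
x = C[context_ids].reshape(-1)         # projection layer: concatenation of size (n-1)*m
z = np.tanh(x @ H)                     # hidden layer
scores = z @ U                         # one score per dictionary word
probs = np.exp(scores - scores.max())  # softmax normalization...
probs /= probs.sum()                   # ...probs[i] = p(next word = i-th dictionary word)
```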
Advantages of the neural network model:
S1 = "I went to internet café today" appears 1000 times in the corpus.
S2 = "I went to the internet café today" appears 10 times.
Under an N-gram model: P(S1) >> P(S2).
Under the neural network model: P(S1) ≈ P(S2).
To the neural network, similar sentences and similar words are effectively the same thing:
as long as one of them appears in the corpus, the probability of the similar sentences rises correspondingly.
Hierarchical Softmax (layered softmax):
CBOW: predicts the current word from its context.
Skip-gram: predicts the context from the current word.
CBOW:
CBOW is short for continuous bag-of-words model; it predicts the probability of the current word from its context words.
Given the context, we want the probability of the target word w appearing to be as large as possible.
First, we need to get to know a structure called the Huffman tree.
Huffman Tree
The cost of a tree equals each weight multiplied by its path length, summed over the leaves, so the largest weight should sit in the shallowest position. In word2vec we take word frequency (probability) as the weight.
Each internal node then performs a binary classification, so the softmax judgment is stratified into a chain of yes/no decisions about whether the target word lies below, with the most important (most frequent) words placed in the 1st position, the 2nd position, and so on.
The construction flow of a Huffman tree: repeatedly merge the two lowest-weight nodes under a new parent until a single tree remains (see the code sketch after the coding example below).
Using Huffman Tree Coding:
a:111
c:110
b:10
d:0
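A runnable sketch of this construction (the frequencies 4, 2, 1, 1 are chosen so the resulting codes match the example above; 0/1 labels may swap depending on how left and right children are assigned):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    # Build a Huffman tree from {symbol: weight}; higher-weight symbols
    # end up closer to the root and therefore get shorter codes.
    tiebreak = count()  # unique counter so equal weights never compare node objects
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # take the two lightest nodes...
        w2, _, right = heapq.heappop(heap)  # ...and merge them under a new parent
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: recurse into children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: record the accumulated code
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

# Word frequency serves as the weight, as in word2vec:
print(huffman_codes({"d": 4, "b": 2, "c": 1, "a": 1}))
# -> {'d': '0', 'b': '10', 'c': '110', 'a': '111'}
```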
At each node in the Huffman tree, how do we decide which direction to take? (That is, how do we make the binary decision?)
We use a familiar tool: logistic regression.
The sigmoid function: σ(x) = 1 / (1 + e^(-x)).
It maps any numeric input to an output in (0, 1), which can then be classified as "go left" or "go right".
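A tiny sketch of that decision rule (the score value is an assumed example, e.g. the dot product of the projection vector with a node's parameter vector):

```python
import numpy as np

def sigmoid(x):
    # Squash any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

score = 0.8                        # assumed example input
go_right = sigmoid(score) > 0.5    # above 0.5: classify "right"; below: "left"
print(sigmoid(score), go_right)    # ~0.69, True
```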
Now back to the CBOW model described earlier.
The input layer takes the word vectors of the context words. When training the CBOW model, the word vectors are really just a byproduct, or, to be exact, parameters of the CBOW model. At the start of training the word vectors hold random values, and they are updated as training progresses.
The projection layer sums them up; the so-called summation is simply element-wise vector addition.
The output layer outputs the most probable word w. Because the vocabulary of the corpus has a fixed size |C|, the above process can be regarded as a multi-class classification problem: given the features, pick one of the |C| categories.
If the word we finally need to obtain is football, then the process is:
start at the root of the Huffman tree and make a binary (sigmoid) decision at each internal node along the path down to the leaf football; p(football | context) is the product of the probabilities of those decisions.
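A hedged sketch of that product-of-sigmoids computation (the function name, the 0/1 convention for the code bits, and the inputs are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(x_w, path_thetas, code_bits):
    # x_w:         projection vector (the sum of the context word vectors)
    # path_thetas: one parameter vector per internal node on the root-to-leaf path
    # code_bits:   the target word's Huffman code bits along that same path
    p = 1.0
    for theta, d in zip(path_thetas, code_bits):
        s = sigmoid(x_w @ theta)        # score of one branch at this node
        p *= s if d == 1 else 1.0 - s   # take the branch the code bit dictates
    return p                            # p(target word | context)
```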
How to solve it:
Objective function:
the bigger, the better.
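Reconstructed in the standard word2vec hierarchical-softmax form (the notation x_w, θ, d_j and the path length l^w follow that formulation, not these notes), the objective is

L = Σ_{w∈C} log p(w | Context(w)),
p(w | Context(w)) = Π_{j=2}^{l^w} [σ(x_wᵀθ_{j-1})]^(1-d_j) · [1 - σ(x_wᵀθ_{j-1})]^(d_j),

where x_w is the summed context vector, θ_{j-1} is the parameter vector of the j-th node on w's path, and d_j ∈ {0,1} is the j-th bit of w's Huffman code.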
Finding its maximum is a gradient ascent problem.
Because the projection vector is linear in the context word vectors (it is just their sum), the gradient computed for the summed vector can be applied to each individual context word vector.
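One hedged sketch of a single CBOW + hierarchical-softmax gradient-ascent step (the learning rate, the names, and the 1-d labeling convention are assumptions following the common word2vec formulation, not code from these notes):

```python
import numpy as np

def cbow_hs_step(context_vecs, path_thetas, code_bits, lr=0.025):
    x = np.sum(context_vecs, axis=0)    # projection layer: plain vector sum
    grad_x = np.zeros_like(x)
    for theta, d in zip(path_thetas, code_bits):
        s = 1.0 / (1.0 + np.exp(-x @ theta))
        g = lr * (1 - d - s)            # gradient of log p w.r.t. the score x·theta
        grad_x += g * theta             # accumulate the gradient w.r.t. x ...
        theta += g * x                  # ... and update this node's parameters in place
    for v in context_vecs:              # the same accumulated x-gradient is applied
        v += grad_x                     # to every context word vector
```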
Skip-gram:
One more problem needs considering: if the corpus is very large, then even with a Huffman tree putting common words near the top, there are many rare words deep in the tree, so the computational cost still becomes very large.
One solution is called negative sampling:
we want to maximize the likelihood that the predictions come out right.
The product over all the terms means that every word in the corpus should be predicted correctly.
The quantity we want is the same as before, only described by another method: the former goes through the Huffman tree, while negative sampling divides a line into intervals weighted by word frequency and draws negative words from those intervals.
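A hedged sketch of that interval sampler (the 0.75 power is the frequency weighting used in the word2vec paper; the table size, example frequencies, and function names are illustrative):

```python
import random

def build_sampling_table(freqs, power=0.75, table_size=100_000):
    # Cut a line into intervals proportional to freq**power, then
    # discretize it into a lookup table so each draw is O(1).
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    table = []
    for w, wt in weights.items():
        table.extend([w] * max(1, round(table_size * wt / total)))
    return table

def sample_negatives(table, target, k=5):
    # Draw k negative words, skipping the positive target word.
    negs = []
    while len(negs) < k:
        w = random.choice(table)
        if w != target:
            negs.append(w)
    return negs

table = build_sampling_table({"the": 1000, "frog": 10, "football": 5})
print(sample_negatives(table, target="frog"))
```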
Finally, the word vectors are updated with the same kind of gradient step.