002 - Word vectors, neural network models, CBOW, Huffman trees, negative sampling

Word vectors:

Whether in a passage or a whole article, words are the most basic constituent units.

How can we make these words usable by a computer?

The key question is how to convert a word into a vector.

In a two-dimensional space, for example, "had", "has", and "have" mean the same thing, so their vectors should be close together.

"need" and "help" should likewise sit very close to the same location.

Closeness in the vector space is what expresses sameness and relatedness.

Consider the following example:

Which words are closest to "frog"? Its synonyms and other closely related words.

For two different languages, the modeled vector spaces also turn out to be very close in structure.

So it can be said that the constructed word vectors are not tied to the language itself; they are modeled from the semantic environment (the logic of the context).
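
As a concrete illustration of "closeness", here is a minimal sketch that measures the similarity of word vectors with cosine similarity (the 2-D vectors below are made up purely for demonstration):

```python
import numpy as np

# Toy 2-D word vectors; the values are invented for illustration only.
vectors = {
    "had":  np.array([0.90, 0.10]),
    "has":  np.array([0.88, 0.12]),
    "have": np.array([0.91, 0.09]),
    "frog": np.array([0.10, 0.95]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; values near 1 mean 'very close'."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(vectors["had"], vectors["has"]))   # near 1: same meaning
print(cosine_similarity(vectors["had"], vectors["frog"]))  # much smaller
```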

Neural Network model:

The input word vectors are concatenated end to end to form the projection layer, and the parameters are optimized as the signal is passed through the neural network.

The input word vectors themselves also need to be optimized; they are parameters of the model.

Training sample: the vectors of the preceding n-1 words, assuming each word vector has size m.

Projection layer: one large vector of size (n-1) * m.

Output:

Given the context, the output is the probability that the next word is exactly the i-th word in the dictionary.

Normalization:

The output scores are normalized with softmax so that they form a probability distribution over the dictionary; the ultimate goal is to solve for the word vector of every word.
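
A minimal sketch of this forward pass (all names and sizes are illustrative assumptions, not taken from the original article): concatenate the n-1 context vectors into the projection layer, apply one hidden layer, and normalize the output scores with softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, n, h = 1000, 50, 4, 128          # vocab size, vector dim, n-gram order, hidden size
C = rng.normal(size=(V, m))            # word-vector table (trained along with the rest)
H = rng.normal(size=(h, (n - 1) * m))  # projection -> hidden weights
U = rng.normal(size=(V, h))            # hidden -> output weights

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_probs(context_ids):
    """P(next word | previous n-1 words) for every word in the dictionary."""
    x = np.concatenate([C[i] for i in context_ids])  # the (n-1)*m projection vector
    hidden = np.tanh(H @ x)
    return softmax(U @ hidden)         # normalized over all V words

p = next_word_probs([3, 17, 42])       # any n-1 = 3 word ids
print(p.shape, p.sum())                # (1000,) ~1.0
```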

Advantages of neural networks:

S1 = "I went to the internet café today" appears 1,000 times in the corpus.
S2 = "I went to the net bar today" appears 10 times.
(The two sentences mean the same thing but use different words.)


For an N-gram model: P(S1) >> P(S2),
while the neural network model computes P(S1) ≈ P(S2).

To a neural network, similar sentences and similar words look like the same thing.

As long as one of them appears in the corpus, the probability of the similar sentences rises correspondingly.

Hierarchical Softmax (layered softmax):

CBOW: predicts the current word from its context.

Skip-gram: predicts the context from the current word.
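
For orientation, this is how the two variants are typically selected in the widely used gensim library (gensim 4.x parameter names; the toy corpus below is made up):

```python
from gensim.models import Word2Vec

sentences = [["i", "went", "to", "the", "internet", "cafe", "today"],
             ["frogs", "and", "toads", "are", "amphibians"]]

# sg=0 selects CBOW, sg=1 selects skip-gram;
# hs=1 enables hierarchical softmax, negative=k enables k negative samples.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                sg=0, hs=1, negative=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                    sg=1, hs=0, negative=5)

print(cbow.wv.most_similar("frogs", topn=2))
```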

CBOW:

CBOW is the abbreviation of Continuous Bag-of-Words Model; it predicts the probability of the current word's occurrence based on the words in its context.

Given the context, we want the probability of the target word w appearing to be as large as possible.

Before going further, we first need to understand a structure called the Huffman tree.

Huffman Tree:

The cost of a coding tree is each weight multiplied by its path length, summed over all leaves, so the node with the largest weight should be placed first (closest to the root). In word2vec, we can take the word frequency (probability) as the weight.

Each internal node then performs a binary classification, so the softmax can be evaluated layer by layer: at every node we judge whether or not the target word lies below it, and the most important (most frequent) words are placed in the 1st position, 2nd position, and so on, nearest the root.

The construction flow of a Huffman tree: repeatedly take the two nodes with the smallest weights and merge them under a new parent node, until only one tree remains. (A code sketch follows the coding table below.)

Using Huffman tree coding:

a: 111
c: 110
b: 10
d: 0
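
A minimal sketch of this construction (the frequencies below are made up, chosen so that they reproduce exactly the codes in the table above):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman tree from {symbol: weight} and return {symbol: code}.

    Repeatedly merge the two lowest-weight nodes; frequent symbols end up
    near the root and therefore receive short codes.
    """
    tie = count()  # tie-breaker so the heap never compares node payloads
    heap = [(w, next(tie), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (a, b)))

    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):    # internal node: recurse into both children
            walk(node[0], code + "1")  # lower-weight branch labeled 1 here
            walk(node[1], code + "0")
        else:
            codes[node] = code or "0"
    walk(heap[0][2], "")
    return codes

# Made-up frequencies; 'd' is most frequent, so it gets the shortest code.
print(huffman_codes({"a": 1, "c": 2, "b": 4, "d": 8}))
# {'a': '111', 'c': '110', 'b': '10', 'd': '0'}
```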

Within the Huffman tree, how do we decide which direction to take at each node?

We can use familiar knowledge: logistic regression.

The sigmoid function maps any numeric input to an output in the range 0~1, which we can then use to classify the step as "go left" or "go right".
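
A minimal sketch of one such node decision (theta is a hypothetical parameter vector attached to the node; all values are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Map any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x_w = np.array([0.2, -0.5, 0.1])    # summed context vector (illustrative values)
theta = np.array([0.7, 0.3, -0.4])  # this internal node's parameter vector

p = sigmoid(x_w @ theta)            # probability of taking one of the two branches
direction = "left" if p >= 0.5 else "right"
print(p, direction)
```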

Now, back to the CBOW model described above.

The input layer is the word vectors of the context words. When training the CBOW model, the word vectors are actually a byproduct; to be exact, they are parameters of the CBOW model. At the beginning of training, the word vectors take random values, and they are updated as training progresses.

The projection layer sums them up; this so-called summation is simply vector addition.

The output layer outputs the most probable w. Because the vocabulary of the corpus is a fixed size |C|, the above process can actually be viewed as a multi-class classification problem: given the features, pick one out of |C| categories.

If the word we finally need to obtain is "football", then the process walks the Huffman path from the root down to the leaf for "football", making one binary (sigmoid) decision at each internal node. (The original article illustrated this path with a figure.)

How to solve it:

Objective function: the likelihood of the target words given their contexts.

The bigger the better.

Finding its maximum value is therefore a gradient ascent problem.
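
Written out (following the standard word2vec formulation; the notation is assumed, not from the original text): x_w is the sum of the context word vectors, l^w the path length, d_j^w the j-th bit of w's Huffman code, and theta_{j-1}^w the parameter vector of the j-th internal node on w's path:

```latex
\mathcal{L} = \sum_{w \in \mathcal{C}} \log p\bigl(w \mid \mathrm{Context}(w)\bigr),
\quad
p\bigl(w \mid \mathrm{Context}(w)\bigr) =
\prod_{j=2}^{l^{w}}
  \sigma\!\bigl(\mathbf{x}_{w}^{\top}\theta^{w}_{j-1}\bigr)^{1-d^{w}_{j}}
  \cdot
  \bigl(1-\sigma(\mathbf{x}_{w}^{\top}\theta^{w}_{j-1})\bigr)^{d^{w}_{j}}
```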

Because the objective depends on the context only through the summed vector x_w, which is linear in each context word's vector, the gradient update computed for x_w can be applied to every one of those word vectors.
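
A minimal sketch of one CBOW training step with hierarchical softmax, under that update rule (array shapes and the learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_step(C, node_theta, context_ids, path, code, lr=0.025):
    """One CBOW step with hierarchical softmax (illustrative sketch).

    C          : word-vector table, shape (V, m)
    node_theta : parameter vectors of the tree's internal nodes, shape (N, m)
    path       : internal-node ids on the root-to-leaf path of the target word
    code       : the target word's Huffman code bits (0/1), one per path node
    """
    x_w = C[context_ids].sum(axis=0)   # projection layer: plain vector sum
    e = np.zeros_like(x_w)             # accumulated gradient w.r.t. x_w
    for node, d in zip(path, code):
        q = sigmoid(x_w @ node_theta[node])
        g = lr * (1 - d - q)           # gradient factor at this node
        e += g * node_theta[node]
        node_theta[node] += g * x_w    # update this node's parameters
    C[context_ids] += e                # the SAME update applied to every context vector
    return C, node_theta

# Tiny made-up example: 5 words, 3-dim vectors, 4 internal nodes.
rng = np.random.default_rng(0)
C = rng.normal(scale=0.1, size=(5, 3))
theta = rng.normal(scale=0.1, size=(4, 3))
cbow_hs_step(C, theta, context_ids=[0, 2], path=[0, 1], code=[1, 0])
```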

Skip-gram:

One more problem needs to be considered: if the corpus is very large, then even with a Huffman tree, where the common words sit near the top, there are many rare words far down in the tree, and the computation for them becomes very expensive.

One solution is called negative sampling:

We want to maximize the probability that the prediction is correct.

Taking the product over all the terms means that every word should be predicted correctly at the same time.

The value we want is the same; it is just described by another method: the former goes through the Huffman tree, while negative sampling divides an interval into segments proportional to word frequency and draws negative words from it.
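
A minimal sketch of the negative sampling objective for one (context, target) pair (the freq**0.75 flattening power follows the word2vec convention; all values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sampling table: the unit interval is split in proportion to freq**0.75.
freqs = np.array([10.0, 5.0, 2.0, 1.0, 1.0])
probs = freqs**0.75 / (freqs**0.75).sum()

def neg_sampling_loglik(x_w, target, out_vecs, k=3):
    """Log-likelihood for one pair with k sampled negative words."""
    # (A real implementation would resample any draw equal to the target.)
    negatives = rng.choice(len(probs), size=k, p=probs)
    ll = np.log(sigmoid(x_w @ out_vecs[target]))       # the positive word
    for u in negatives:
        ll += np.log(sigmoid(-x_w @ out_vecs[u]))      # pushed away: 1 - sigma(x)
    return ll

out_vecs = rng.normal(scale=0.1, size=(5, 3))  # output-side word vectors
x_w = rng.normal(scale=0.1, size=3)            # summed context vector
print(neg_sampling_loglik(x_w, target=0, out_vecs=out_vecs))
```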

Finally, the word vectors are updated in the same way as before.
