Reference: http://licstar.net/archives/328
A comparative study of word vectors. Origin: one-hot representation and PCA.
Preface: Why is NLP harder than other pattern recognition problems?
Licstar's article begins by noting that language (words, sentences, documents, and so on) is an abstract entity produced by high-level human cognition, while speech and images are low-level raw input signals.
Speech and image data need no special encoding: they come with a natural ordering and local correlation, and numerically similar values correspond to similar features. Language is not so lucky.
The popular one-hot representation, for example, is a rather poor encoding; the data it produces is far less expressive than raw image or speech signals.
Compare this with tabular statistical data. Why are data-mining models so simple? Because statistical data is built by hand: its feature dimension is extremely low, since the human brain, that formidable device, has already distilled it into ultra-condensed features.
That is why data mining needs neither deep learning nor feature extraction, and indeed cannot use them well. Try running a ten-plus-layer neural network on typical "big data" tables and see how far you get.
Problem: Word order is lost
In NLP, representing a sentence looks simple. Take "CV loves NLP": we only need a vocabulary covering all the words.
Then "CV loves NLP" can be expressed as a binary code such as [0,1,0,0,0,1,0,0,1]: a word that appears gets a 1, a word that does not gets a 0.
This is the famous one-hot representation, a feature encoding that can handle many NLP tasks. But is it really satisfactory?
Here comes the first question: don't "NLP loves CV" and "CV loves NLP" collapse into exactly the same vector? This is the first fatal problem: word order.
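As a quick illustration (a minimal sketch; the vocabulary and sentences below are made up for this example, not taken from the original post), the bag-of-words style encoding maps both orderings to exactly the same vector:

```python
# Minimal sketch of the bag-of-words (multi-hot) encoding described above.
# Vocabulary and sentences are illustrative only.
vocab = ["CV", "loves", "NLP", "deep", "learning", "is", "fun"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def encode(sentence):
    """Return a 0/1 vector marking which vocabulary words appear in the sentence."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        if word in word_to_index:
            vec[word_to_index[word]] = 1
    return vec

print(encode("CV loves NLP"))  # [1, 1, 1, 0, 0, 0, 0]
print(encode("NLP loves CV"))  # identical vector: word order is lost
```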
Problem: The dimension is too high
A typical vocabulary has about 10^5 words, so if we keep this binary encoding, the dimension of a sentence is also 10^5.
For comparison, an AlexNet input image has 256*256 = 65536 dimensions and already keeps a GPU busy for a long time; at 10^5 dimensions you are basically done for.
In fact, most of those 10^5 dimensions are wasted, and the genuinely useful features are buried among them.
This shows that the feature dimension of the one-hot representation is far too high and needs dimensionality reduction. And that is still not its worst flaw.
Bengio pointed out in A Neural Probabilistic Language Model (2003) that with such a high dimension, every learning step is forced to modify a large fraction of the parameters.
The result is a butterfly effect: parameters that were already quite good can be thrown into disarray by one small propagated error.
The traditional MLP makes exactly this mistake: its fully connected neurons over a 1-D input each control too many parameters, which is bad for learning sparse features.
A CNN, by contrast, lets each neuron see only a local receptive field of the 2-D input, which helps disentangle sparse features.
Problem: The correlation between words
The famous N-gram model was popularized by Dr. Wu Jun's book The Beauty of Mathematics.
Within a sentence, the probability that word $w_t$ appears depends on the $n$ words before it: $P(w_t \mid w_{t-1}, w_{t-2}, \dots, w_{t-n})$.
The Beauty of Mathematics does not mention word vectors, of course; the early N-gram model was used to judge how plausible a sentence is.
That is, multiply the conditional probabilities of all the words; whichever sentence has the larger product is the more credible one: $\max \prod_{t=1}^{T} P(w_t \mid w_{t-1}, w_{t-2}, \dots, w_{t-n})$.
The simplest way to estimate $P(w_t \mid w_{t-1}, \dots, w_{t-n})$ is from word-frequency counts.
The trouble is that low-frequency combinations get probabilities that are tiny or even zero, so the model is not smooth.
Hence the early Katz backoff method, a hand-crafted correction that smooths the model; by today's standards it looks slightly clumsy,
because we now have a much stronger statistical model with adaptive perception: the neural network.
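To make the counting approach and its zero-probability problem concrete, here is a minimal sketch (the toy corpus and function name are made up for illustration):

```python
# Maximum-likelihood bigram estimation by raw frequency counting (no smoothing),
# illustrating why unseen word pairs get probability zero.
from collections import Counter

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(word, prev):
    """P(word | prev) estimated directly from counts."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("like", "i"))       # 2/3: seen often enough, a reasonable estimate
print(p_bigram("flying", "like"))  # 0.0: never co-occurred, so any sentence containing it scores 0
```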
Research: word-frequency statistical model and "dimensionality reduction" (example from the Stanford cs224d Deep Learning for NLP course)
• I like deep learning.
• I like NLP.
• I enjoy flying.
Suppose these three sentences form our corpus, and that we care about the correlation between words.
We build a discrete co-occurrence matrix by counting, with a context window of 1 word on each side.
| Counts | I | like | enjoy | deep | learning | NLP | flying | . |
|---|---|---|---|---|---|---|---|---|
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
Applying singular value decomposition (SVD) to this matrix yields three matrices: $U$ ($n \times r$), $S$ ($r \times r$), and $V^{T}$ ($r \times m$).
Traditional PCA reduces dimension via eigenvalue decomposition, which is more cumbersome.
SVD can do the job just as well: to reduce dimension, keep the $U$ matrix, where $n$ is the number of data points and $r$ is the new, reduced dimension.
The Python code (reconstructed below as a minimal numpy/matplotlib sketch following the cs224d example, since the original snippet is missing here) is roughly as follows:
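```python
# Reconstruction of the missing snippet, following the cs224d example:
# SVD of the co-occurrence matrix above, then plot the first two columns of U.
import numpy as np
import matplotlib.pyplot as plt

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Each word is now represented by the first two coordinates of its row in U.
for i, word in enumerate(words):
    plt.text(U[i, 0], U[i, 1], word)
plt.xlim(U[:, 0].min() - 0.5, U[:, 0].max() + 0.5)
plt.ylim(U[:, 1].min() - 0.5, U[:, 1].max() + 0.5)
plt.show()
```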
Plotting the first two columns of $U$ gives a picture roughly like the figure in the original post (not reproduced here).
Even after cutting the dimension down to just 2, words that are more or less semantically related end up close together.
This suggests that the essential features of word vectors can be captured in a fairly low-dimensional space.
Research: word-vector model and the "neural network"
The first proposal to use neural networks for this in NLP came from the Chinese researcher Xu Wei (formerly at Facebook, now at the Baidu IDL research institute), who suggested training a 2-gram model with a neural network.
The model that formally trains an N-gram this way was presented by Bengio in 2001 and 2003, in the paper A Neural Probabilistic Language Model mentioned above.
Its structure is a simple MLP plus softmax regression, already with a bit of today's deep-learning flavor. (Early MLPs did not use a softmax output layer.)
Bengio called the trained word vectors a distributed representation, in contrast to the one-hot representation.
In the input layer, each word is mapped to a fixed low-dimensional continuous vector of dimension $m$ (around 300).
When the model reaches the $i$-th word of a sentence, the vectors of the previous $n$ words are concatenated into an input vector of dimension $n \times m$.
In the hidden layer, this input is mapped into a higher-level space and passed through a sigmoid activation.
The output layer has size $|V|$, where $V$ is the whole vocabulary (typically around $10^5$ words).
Objective function: $\arg\max_{\mathrm{vec},\,W,\,b} \prod_{t=1}^{T} P(w_t \mid w_{t-1}, w_{t-2}, \dots, w_{t-n})$, where $T$ is the number of words in the sentence.
That is, predict the current word from the previous $n$ words so that its softmax probability is as large as possible: $P(w_t \mid w_{t-1}, \dots, w_{t-n}) = \frac{e^{W_t x + b_t}}{\sum_{i=1}^{|V|} e^{W_i x + b_i}}$.
This means training the word-vector parameters of the input layer, the weights and biases of the hidden layer, and the weights and biases of the softmax output.
Unlike a traditional NN, the model also has linear direct connections (direct edges) from the input layer to the output layer, on top of the trainable word vectors.
The reason is the familiar weakness of backpropagation, the vanishing-gradient problem: by the time the error has passed back through the hidden layer to the input layer, much of the gradient has been lost, which slows training; the direct edges were added to speed it up.
Bengio remarked in the paper that these edges did not add anything useful in themselves, yet this very direct edge is where word2vec was later born.
The $P(w_t \mid w_{t-1}, \dots, w_{t-n})$ learned by the neural network is smooth by construction, fully in line with the adaptive-perception principle advocated by Hinton.
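For concreteness, here is a minimal numpy sketch of one forward pass of this architecture (sizes and parameter names are made up; a real model has $|V| \approx 10^5$ and is trained with backpropagation):

```python
# Forward pass of a Bengio-style neural language model: embedding lookup,
# concatenation, one nonlinear hidden layer, direct input->output edges, softmax.
import numpy as np

V, m, n, h = 10, 5, 3, 8            # vocab size, embedding dim, context length, hidden units
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))                        # trainable word-vector lookup table
H, d = rng.normal(size=(h, n * m)), np.zeros(h)    # hidden layer weights and bias
U, b = rng.normal(size=(V, h)), np.zeros(V)        # hidden -> output
W = rng.normal(size=(V, n * m))                    # Bengio's direct (linear) edges

def forward(context_ids):
    """P(w_t | previous n words) as a softmax over the whole vocabulary."""
    x = C[context_ids].reshape(-1)          # concatenate the n previous word vectors
    hidden = np.tanh(H @ x + d)             # nonlinear hidden layer (the paper uses tanh)
    scores = U @ hidden + W @ x + b         # includes the direct connections
    e = np.exp(scores - scores.max())
    return e / e.sum()

probs = forward([1, 4, 7])
print(probs.shape, probs.sum())             # (10,) and approximately 1.0
```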
Viewpoint: Why does predicting the next word train a word vector at all?
In fact, how accurately the next word is predicted has little to do with how well the word vectors turn out. Licstar's article introduces the SENNA model of Collobert & Weston.
Collobert and Weston are young researchers in NLP and neural computation, and Jason Weston was also invited to give a guest lecture at cs224d on his Memory Networks.
In the SENNA model, the objective is no longer to predict the next word; instead a single output neuron scores positive and negative samples of the next word, turning the task into a kind of regression.
The network is simply trained to give high scores to reasonable sentences and low scores to unreasonable ones. Even so, it still produces excellent word vectors, which here are merely a by-product.
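Here is a rough sketch of that ranking-style objective (all names, sizes, and tensors are illustrative placeholders, not the actual SENNA code):

```python
# One network scores a real text window and a corrupted one (centre word replaced);
# a hinge loss pushes the real window's score above the corrupted one's by a margin.
import numpy as np

rng = np.random.default_rng(0)
m, n, h = 5, 3, 8                                   # embedding dim, window size, hidden units
w1, w2 = rng.normal(size=(h, n * m)), rng.normal(size=h)

def score(window_vecs):
    """Single-output scorer: concatenated word vectors -> hidden layer -> one scalar."""
    x = window_vecs.reshape(-1)
    return float(w2 @ np.tanh(w1 @ x))

real_window = rng.normal(size=(n, m))               # word vectors of a window from the corpus
corrupted = real_window.copy()
corrupted[n // 2] = rng.normal(size=m)              # swap the centre word for a random one

loss = max(0.0, 1.0 - score(real_window) + score(corrupted))
print(loss)                                         # zero once real windows outscore fakes by 1
```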
So what do word vectors really depend on? The answer is context. As long as the model is trained with context, good word vectors will emerge on their own. For example:
• I like CV
• I like NLP
When training on "I like NLP", the error correction applied to NLP is nearly the same as that applied to CV, because they share the same context.
In this way the two syntactically similar words, CV and NLP, are pulled together without any supervision.
In his Deep Learning for NLP Lecture 4, Richard Socher notes that training word vectors resembles pre-training in deep learning:
the word vectors themselves can be seen as a kind of PCA, and a PCA that can learn on its own is essentially an RBM or an autoencoder (see the popular-science article linked in the original post).
To explain why this should be treated as pre-training rather than done inside the actual classification or regression model, he gives the following example:
• Setup: train a binary classifier; corpus: movie reviews; task: sentiment analysis of the reviews.
• Situation: "TV" and "telly" appear in the negative class of the training set, while "television" appears in the negative class of the test set (presumably the reviewer wants to complain that the movie is as trashy as a TV show).
In the accompanying figure from the lecture (not reproduced here), the left panel shows test results when the word vectors are pre-trained without supervision and then fine-tuned with supervision, while the right panel shows direct training.
Although "television" never took part in the classification training, its pre-trained word vector lies close to those of "telly" and "TV", so it is classified correctly with ease.
This is why the word-vector approach is counted as part of the deep learning camp.
Research: word2vec, a linear model that learns only word vectors
Richard Socher also mentions in his Deep Learning for NLP Lecture 4 that another reason to train word vectors separately is that the vocabulary $|V|$ is too large to compute over inside downstream NLP tasks.
Word-vector training is in fact a pre-training stage, and its most unusual property is that the input itself can be trained.
For ordinary fixed-input pattern recognition, bare linear models (logistic and softmax regression, with no hidden layer) were abandoned long ago.
The reason is that, apart from some tabular statistical data, very little data has a linear structure, so hidden layers (or support vectors) must be added to handle nonlinear data.
But with word vectors the input is adjustable: if I use a linear model, errors are inevitable, and those errors are backpropagated all the way into the input as well.
The input data is therefore forced to reshape itself into something linearly structured. Bengio did not anticipate this at the time, because back then everyone assumed that nonlinear models learn better parameters.
Tomas Mikolov, a young researcher at Google, noticed this and removed the hidden layer from Bengio's model, producing word vectors with a great deal of linear structure.
Hence the famous piece of magic: $vec(\mathrm{King}) - vec(\mathrm{Man}) + vec(\mathrm{Woman}) \approx vec(\mathrm{Queen})$.
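With a set of pretrained vectors available, this relation can be checked in a couple of lines. The sketch below is an assumption on my part: it relies on gensim's dataset downloader and uses a standard pretrained GloVe set as a stand-in for word2vec-style vectors (any gensim KeyedVectors model exposes the same call):

```python
# Check king - man + woman ~ queen on pretrained vectors (downloads data on first run).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # stand-in pretrained vectors; name assumed available
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected at or near the top of the returned list.
```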
What happens if we feed these linear word vectors into a highly nonlinear neural network? It turns out to be a very good thing.
Even inside a neural network, the basic computation is just multiplying the input by a weight matrix and adding a bias; there is no need to force the input itself into some elaborately nonlinear form.
Linear inputs are fine; the nonlinear part should be left to the neural network itself, which gives both speed and accuracy.
The overall pipeline is roughly: linear word-vector pre-training, then nonlinear neural-network pre-training, then neural-network fine-tuning.
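A small sketch of how that pipeline is typically wired up (using PyTorch as an assumed framework; the post does not prescribe one): pretrained word vectors initialize the embedding layer, and the whole network is then fine-tuned on the supervised task.

```python
# Pretrained (linear) word vectors feed a nonlinear network, then everything is fine-tuned.
import torch
import torch.nn as nn

pretrained = torch.randn(10_000, 300)        # stand-in for vectors learned by word2vec

model = nn.Sequential(
    nn.Embedding.from_pretrained(pretrained, freeze=False),  # unfrozen: fine-tuned with the rest
    nn.Flatten(),                             # concatenate the word vectors of each sentence
    nn.Linear(300 * 5, 128), nn.ReLU(),       # the nonlinear part is left to the network
    nn.Linear(128, 2),                        # e.g. a binary sentiment classifier
)

batch = torch.randint(0, 10_000, (4, 5))      # 4 "sentences" of 5 word ids each
print(model(batch).shape)                     # torch.Size([4, 2])
```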
With this, deep learning officially opened fire on NLP!