Wunda "Deep learning" fifth course (2) Natural language processing and word embedding


2.1 Word representation

(1) Representing words with one-hot vectors has two major shortcomings. First, with a vocabulary of 10,000 words, each word needs a 10,000-dimensional vector in which only one element is 1 and all the others are 0, which is very redundant and wastes storage. Second, the inner product of any two one-hot vectors is 0 (they are orthogonal), so the representation cannot express any relationship between words: orange and apple, or queen and king, should be closely related, but the one-hot dictionary above cannot show this.

(2) A featurized representation instead describes each word with features such as gender, royalty, age, color, size, food, and so on. A word that strongly matches a feature takes a value of large magnitude on that dimension, and a word that is unrelated to it takes a value close to 0. Suppose 300 features are used; then each word can be represented by a 300-dimensional vector (which removes the redundancy), and similar words (such as apple and orange) take very similar values on most features (i.e., high similarity), so the representation can capture the relationships between words.
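As a minimal sketch of the difference (the feature values below are invented for illustration, not taken from the course), compare inner products under the two representations:

    import numpy as np

    vocab_size = 10000

    def one_hot(index, size=vocab_size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    # One-hot vectors: any two different words are orthogonal.
    apple_oh, orange_oh = one_hot(1), one_hot(2)
    print(np.dot(apple_oh, orange_oh))   # 0.0 -- says nothing about similarity

    # Featurized vectors: toy 4-dimensional values along
    # features like [fruit, royalty, gender, size].
    apple  = np.array([0.95, 0.01, 0.00, 0.30])
    orange = np.array([0.97, 0.00, 0.01, 0.35])
    king   = np.array([0.01, 0.93, 0.70, 0.60])
    print(np.dot(apple, orange))  # large -- apple and orange look similar
    print(np.dot(apple, king))    # small -- apple and king do not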

(3) In practice, the features mentioned above are learned by the network; nobody can say exactly what each learned dimension represents.

2.2 Using word embeddings

(1) Steps for using word embeddings to do transfer learning:

1. First learn the word embeddings from a large text corpus (a word embedding represents a word with a set of features rather than a one-hot vector), or simply download embeddings that someone else has already trained.

2. Then, depending on how much data your new task has, decide whether to adjust the word embeddings: if the data set is small, use the pre-trained embeddings directly; with more data, the embeddings can be fine-tuned.

(2) Word embeddings are very similar to face encodings. One small difference is that word embeddings work over a fixed vocabulary, whereas a face-encoding system may be given a completely unseen face and still has to encode it.

2.3 Properties of word embeddings

(1) Consider the question: if man corresponds to woman, what does king correspond to? The idea is that the vector for man minus the vector for woman should be approximately equal to the vector for king minus the vector of the unknown word: e_man - e_woman ≈ e_king - e_w.

Moving terms to one side, the problem becomes finding the word w that maximizes the similarity sim(e_w, e_king - e_man + e_woman).

(2) The similarity function used is cosine similarity (the cosine of the angle between the two vectors); a value close to 1 means the vectors are more similar: sim(u, v) = (u · v) / (||u|| ||v||).
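A small numpy sketch of the analogy search, assuming word_to_vec is a dict mapping words to their embedding vectors (the function names here are illustrative, not from the course):

    import numpy as np

    def cosine_similarity(u, v):
        # Cosine of the angle between u and v; close to 1 means very similar.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def complete_analogy(a, b, c, word_to_vec):
        # Find the word w maximizing sim(e_w, e_c - e_a + e_b),
        # i.e. "a is to b as c is to w".
        target = word_to_vec[c] - word_to_vec[a] + word_to_vec[b]
        best_word, best_sim = None, -np.inf
        for w, e_w in word_to_vec.items():
            if w in (a, b, c):
                continue
            sim = cosine_similarity(e_w, target)
            if sim > best_sim:
                best_word, best_sim = w, sim
        return best_word

    # e.g. complete_analogy("man", "woman", "king", word_to_vec) -> "queen"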

2.4 Embedding matrices

(1) If the vocabulary size is 10,000 and each word is represented by 300 features, the embedding matrix E is a 300 x 10,000 matrix. Multiplying E by the one-hot vector of a word produces the 300-dimensional representation of that word.

(2) Because a one-hot vector has only a single element equal to 1, this multiplication simply picks out the corresponding column of the embedding matrix. In practice the full matrix multiplication is never performed; the column is looked up directly. For example, Keras has an Embedding layer that efficiently retrieves the needed column from the embedding matrix instead of doing a slow, wasteful matrix-vector product.
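A quick numpy check of this equivalence (the sizes are the ones used above; the random matrix simply stands in for a learned E):

    import numpy as np

    n_features, vocab_size = 300, 10000
    E = np.random.randn(n_features, vocab_size)   # stand-in for a learned embedding matrix

    j = 6257                      # index of some word, e.g. "orange"
    o_j = np.zeros(vocab_size)
    o_j[j] = 1.0

    # Multiplying by the one-hot vector is the same as selecting column j.
    assert np.allclose(E @ o_j, E[:, j])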

2.5 Learning word embeddings

(1) Example task: predict the next word in "I want a glass of orange ___". Each context word's one-hot vector is fed into an embedding layer, which multiplies the embedding matrix by the one-hot vector to produce that word's embedding; the embeddings are then fed into the rest of the network, which outputs a prediction. By training this entire network, the embedding matrix is learned as well.

(2) Because the next layer is fully connected, the input dimension must be fixed, so a fixed-size context is used: for example, the 4 words before the target position (the embedding layer then outputs 4 x 300 = 1200 values as input to the next layer), or 4 words on each side of the target. It turns out that if the goal is only to learn the embedding matrix, using just the previous word as the context already works well. A sketch of the fixed-window model follows this paragraph.
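A minimal Keras sketch of this fixed-window language model, assuming a vocabulary of 10,000 words, 300-dimensional embeddings, and a 4-word context (the hidden-layer size and optimizer are illustrative choices, not from the course):

    import numpy as np
    from tensorflow.keras import layers, models

    vocab_size, emb_dim = 10000, 300

    model = models.Sequential([
        # Maps each of the 4 context word indices to a 300-dim embedding.
        layers.Embedding(input_dim=vocab_size, output_dim=emb_dim),
        layers.Flatten(),                       # 4 * 300 = 1200 inputs to the dense layer
        layers.Dense(128, activation="relu"),   # hidden layer size is an arbitrary choice
        layers.Dense(vocab_size, activation="softmax"),  # probability of the next word
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # x: integer indices of the 4 context words, y: index of the target word.
    x = np.array([[12, 543, 77, 6257]])
    y = np.array([901])
    model.fit(x, y, epochs=1, verbose=0)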

2.6 Word2vec

(1) The skip-gram model first samples a context word c, and then samples a target word from within, say, plus or minus 10 words of c. For example, in the sentence "I want a glass of orange juice to go along with my cereal", suppose the chosen context c is orange; a word is then picked at random from within 10 words of orange, say glass. That word becomes the label y, and orange is the input x.

(2) Skip-gram can be understood as taking a word and asking which words are likely to appear near it. This turns the text into a supervised learning problem whose real purpose is to learn the embedding matrix.
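A minimal sketch of how (context, target) training pairs might be generated under these rules (the whitespace tokenizer and the window size are illustrative assumptions):

    import random

    def skip_gram_pairs(sentence, window=10, pairs_per_context=1):
        """Sample (context, target) pairs: the target is drawn uniformly
        from within `window` words of the context position."""
        words = sentence.lower().split()
        pairs = []
        for i, context in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            candidates = [j for j in range(lo, hi) if j != i]
            for _ in range(pairs_per_context):
                j = random.choice(candidates)
                pairs.append((context, words[j]))   # (x, y)
        return pairs

    print(skip_gram_pairs("I want a glass of orange juice to go along with my cereal"))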

(3) The network maps the context word c to its embedding e_c = E * o_c and feeds it to a softmax unit that outputs the probability of each possible target word t: p(t | c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c), where theta_t is the parameter vector of output word t. The cost is the usual cross-entropy loss L(y_hat, y) = -sum_i y_i log y_hat_i, where y is the one-hot vector of the target word that was randomly sampled near the context orange.

(4) The problem with this algorithm is the computational cost: the sum in the softmax denominator runs over the whole vocabulary, which may contain millions of words, so it is very slow. Two solutions are the hierarchical softmax classifier and negative sampling.

(5) The idea of the hierarchical softmax (taking 10,000 words as an example) is that the first classifier tells you whether the target word is in the first 5,000 or the last 5,000 words, the next classifier narrows it to a block of 2,500, and so on, so the computational cost scales with the logarithm of the vocabulary size rather than linearly. In addition, the tree is usually not balanced: common words are placed at shallow positions, while uncommon words sit deeper in the tree.

2.7 Negative sampling

(1) Negative sampling turns the softmax above into a set of binary classification problems, which greatly reduces the computational cost.

(2) The procedure is: select a context word, then pick a target word from within its window (e.g. 10 words); this pair gets the label 1. Then randomly select k words from the dictionary as negative samples with label 0 (it does not matter if a randomly chosen word happens to fall inside the context window).

(3) Each sample pair is fed to a binary classifier whose output is a single sigmoid unit (essentially the one output of the 10,000-way softmax that corresponds to that target word), so the computation per pair is very cheap compared with the previous softmax with 10,000 outputs: P(y = 1 | c, t) = sigmoid(theta_t^T e_c). On each iteration only the k + 1 classifiers involved (one positive pair and k negative pairs) are updated.

(4) The negative samples are drawn according to P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4), where f(w_i) is the observed frequency of word w_i in the corpus. This heuristic sits between sampling by raw frequency and sampling uniformly.
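A sketch of negative-sample generation and the binary classifier it feeds, under the assumptions above (word_freq is a hypothetical dict of corpus counts; k and the example frequencies are illustrative):

    import numpy as np

    def sampling_distribution(word_freq):
        """P(w) proportional to f(w)**(3/4), the word2vec heuristic."""
        words = list(word_freq)
        p = np.array([word_freq[w] for w in words], dtype=float) ** 0.75
        return words, p / p.sum()

    def make_examples(context, target, words, p, k=5):
        """One positive (context, target, 1) plus k random negatives with label 0."""
        examples = [(context, target, 1)]
        negatives = np.random.choice(words, size=k, p=p)
        examples += [(context, w, 0) for w in negatives]
        return examples

    def predict(theta_t, e_c):
        """Binary classifier: P(y=1 | c, t) = sigmoid(theta_t . e_c)."""
        return 1.0 / (1.0 + np.exp(-np.dot(theta_t, e_c)))

    word_freq = {"the": 500, "orange": 20, "juice": 15, "glass": 12, "cereal": 8}
    words, p = sampling_distribution(word_freq)
    print(make_examples("orange", "juice", words, p, k=3))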

2.8 GloVe word vectors

(1) X_ij denotes the number of times word i appears in the context of word j, so i and j here play the roles of target and context. If the context window is symmetric, X_ij can be taken to equal X_ji.

(2) The objective is to minimize sum over i, j of f(X_ij) * (theta_i^T e_j + b_i + b_j' - log X_ij)^2, where f(X_ij) is a weighting function with the convention that f(X_ij) = 0 when X_ij = 0 (so that, with 0 * log 0 = 0, pairs that never co-occur contribute nothing), and f also prevents very frequent word pairs from dominating.
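A numpy sketch of one term of this objective, just to make the notation concrete (the shape of the weighting function, with x_max = 100 and alpha = 3/4, is the standard GloVe choice and an assumption here, not stated in the course notes above):

    import numpy as np

    def f_weight(x, x_max=100.0, alpha=0.75):
        # Weighting: 0 when x == 0, grows with x, capped at 1 for frequent pairs.
        return 0.0 if x == 0 else min((x / x_max) ** alpha, 1.0)

    def glove_term(theta_i, e_j, b_i, b_j, x_ij):
        """Contribution of one (i, j) pair to the GloVe objective."""
        if x_ij == 0:
            return 0.0                       # convention: f(0) * (...)**2 = 0
        diff = np.dot(theta_i, e_j) + b_i + b_j - np.log(x_ij)
        return f_weight(x_ij) * diff ** 2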

2.9 Sentiment classification

(1) Setup: the input x is a piece of text and the output y is the sentiment to predict, for example a restaurant review's rating on a scale of 1 to 5.

(2) It works well to simply download an embedding matrix that someone else has trained on a very large data set and use it directly.

(3) Method one uses the following network: each word's one-hot vector is multiplied by the embedding matrix to form its word embedding, the embeddings of all the words are averaged, and the average is fed into a softmax to get the final rating.

(4) The problem with this method is that it ignores word order. In the sentence "Completely lacking in good taste, good service, and good ambiance.", the word good appears so many times that averaging predicts a high score, even though the review is actually negative. The second method therefore uses a recurrent neural network (RNN), treating the task as a many-to-one problem, which gives a much better sentiment analysis system.
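A minimal Keras sketch of the many-to-one RNN version (the vocabulary size, embedding dimension, use of an LSTM cell, and hidden size are all illustrative choices; the Embedding layer could be initialized with a downloaded pre-trained matrix and frozen):

    from tensorflow.keras import layers, models

    vocab_size, emb_dim, num_classes = 10000, 300, 5

    model = models.Sequential([
        layers.Embedding(input_dim=vocab_size, output_dim=emb_dim),  # optionally pre-trained, trainable=False
        layers.LSTM(128),                                  # many-to-one: only the last hidden state is kept
        layers.Dense(num_classes, activation="softmax"),   # probability of each rating (1-5)
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # x_train: padded sequences of word indices, y_train: integer ratings 0..4
    # model.fit(x_train, y_train, epochs=..., validation_data=(x_dev, y_dev))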

2.10 Debiasing word embeddings

(1) A system is biased if, for example, it maps father to doctor but mother to nurse.

(2) Take eliminating gender bias as an example; the main steps are as follows. The first step is to identify the bias direction: take several gendered pairs (such as he/she, boy/girl), compute differences like e_he - e_she and e_girl - e_boy, and average them to obtain the bias axis.

(3) The second step is the neutralize step: words that should be gender-neutral are projected onto the subspace orthogonal to the bias axis, removing their component along the bias direction.

(4) The third step is the equalize step: a pair of words that differ only in gender (such as grandmother and grandfather, which are generally not symmetric about the axis) is moved so that the two words become symmetric about the non-bias axis; after this step grandmother and grandfather are equidistant from neutral words such as babysitter.
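A numpy sketch of the neutralize and (simplified) equalize steps, assuming g is the bias axis obtained in the first step; the full version used in the course also rescales the equalized pair to unit norm, which is omitted here:

    import numpy as np

    def project(v, g):
        # Component of v along the bias direction g.
        return (np.dot(v, g) / np.dot(g, g)) * g

    def neutralize(e, g):
        # Remove the bias component so the word lies in the non-bias subspace.
        return e - project(e, g)

    def equalize(e1, e2, g):
        # Make a gendered pair symmetric about the non-bias subspace:
        # both keep the same orthogonal part, with opposite bias components.
        mu = (e1 + e2) / 2
        mu_orth = mu - project(mu, g)
        return mu_orth + project(e1 - mu, g), mu_orth + project(e2 - mu, g)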

(5) One final point: only a small number of words have an intrinsically defined gender (he, she, father, mother, and so on). A binary classifier is trained to decide whether a word has a definite gender; all the other words are then passed through the steps above, which resolves the bias. The same approach applies to other kinds of bias.

Wunda "Deep learning" fifth course (2) Natural language processing and word embedding

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.