Word Vectors and Deep Learning


This tutorial comes from the introductory guide to the deep learning framework PaddlePaddle. I have not modified the original theoretical sections; I only added my own applied examples below to make them easier to understand.

Word Vectors

The source code for this tutorial lives in the Book/word2vec directory. Before using it for the first time, please refer to the PaddlePaddle installation tutorial; for more information, see the video class that accompanies this tutorial.

Background

In this chapter we introduce the vector representation of words, also known as word embeddings. Word vectors are a common tool in natural language processing and a basic technology behind Internet services such as search engines, advertising systems, and recommendation systems.

In these Internet services, we often need to estimate how related two words or two pieces of text are. To make such comparisons, we usually first have to represent words in a form a computer can process. The most natural choice is probably the vector space model:
each word is represented as a real-valued vector (a one-hot vector) whose length equals the dictionary size, with each dimension corresponding to one word in the dictionary; the dimension corresponding to the word itself has value 1 and all other elements are 0.

One-hot vectors, though natural, are of limited use. For example, in an Internet advertising system, suppose the user's query is "Mother's Day" and one advertiser's keyword is "carnation". Common sense tells us these two words are related: people usually give their mothers a bouquet of carnations on Mother's Day. Yet any distance metric between the two one-hot vectors, whether Euclidean distance or cosine similarity, treats the two words as unrelated, because the vectors are orthogonal. The root cause of this counter-intuitive conclusion is that each word by itself carries too little information; two words alone are simply not enough to decide whether they are related. To compute relatedness accurately, we need more information, namely knowledge summarized from large amounts of data by machine learning methods.
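To make the orthogonality point concrete, here is a minimal sketch with a toy four-word dictionary; the vocabulary and the one_hot helper are made up for illustration and are not part of the tutorial's code:

    import numpy as np

    # Toy dictionary; a real dictionary would contain the whole vocabulary.
    vocab = ["mother's day", "carnation", "game", "japan"]

    def one_hot(word):
        # Length-|V| vector with a 1 at the word's position and 0 elsewhere.
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    a, b = one_hot("mother's day"), one_hot("carnation")
    # Any two distinct one-hot vectors are orthogonal, so their cosine similarity is 0.
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))   # 0.0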

In the field of machine learning, all kinds of "knowledge" are represented by various models, and word embedding models are one of them. A word embedding model maps a one-hot vector to a lower-dimensional real-valued vector (an embedding vector), for example embedding(Mother's Day) = [0.3, 4.2, -1.5, ...] and embedding(carnation) = [0.2, 5.6, -2.3, ...]. In this real-vector representation, we hope that the vectors of two words that are close in meaning (or usage) are "more alike", so that, for instance, the cosine similarity of the vectors of "Mother's Day" and "carnation" is no longer zero.
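A word embedding model can be thought of as a lookup table: multiplying a word's one-hot vector by an embedding matrix selects that word's dense row. The following is a minimal sketch; the first two rows reuse the example values from the text (truncated to three dimensions), and the remaining rows are made up:

    import numpy as np

    vocab = ["mother's day", "carnation", "game", "japan"]
    # Hypothetical 3-dimensional embedding matrix: row i is the embedding of word i.
    E = np.array([
        [0.3, 4.2, -1.5],   # mother's day (values from the text, truncated)
        [0.2, 5.6, -2.3],   # carnation    (values from the text, truncated)
        [1.0, -0.5, 0.7],   # game         (made up)
        [-0.8, 0.1, 1.2],   # japan        (made up)
    ])

    def embed(word):
        # One-hot vector times E simply picks out the word's row.
        return np.eye(len(vocab))[vocab.index(word)] @ E

    a, b = embed("mother's day"), embed("carnation")
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))   # ~0.998, no longer zero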

Word vector models can be probabilistic models, co-occurrence matrix models, or neural network models. Before neural networks were used to obtain word vectors, the traditional method was to count a word co-occurrence matrix X. X is a |V| × |V| matrix, where X_{ij} is the number of times the i-th and j-th words of the vocabulary V appear together across the whole corpus, and |V| is the vocabulary size. Applying a matrix factorization to X (such as singular value decomposition [5]) yields a matrix U that is taken as the word vectors of all the words:

X = USV^T
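As a concrete toy illustration of this factorization, the sketch below builds a tiny symmetric co-occurrence matrix and keeps the leading columns of U as low-dimensional word vectors; the vocabulary and counts are invented for illustration only:

    import numpy as np

    vocab = ["mother", "day", "carnation", "game"]
    # Toy |V| x |V| co-occurrence matrix: X[i, j] = how often words i and j co-occur.
    X = np.array([
        [0., 8., 3., 0.],
        [8., 0., 2., 1.],
        [3., 2., 0., 0.],
        [0., 1., 0., 0.],
    ])

    # X = U S V^T; the first k columns of U, scaled by the singular values, serve as k-dim word vectors.
    U, S, Vt = np.linalg.svd(X)
    k = 2
    word_vectors = U[:, :k] * S[:k]
    for word, vec in zip(vocab, word_vectors):
        print(word, vec)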

But there are many problems with this traditional approach:

1. Because many word pairs never co-occur, the matrix is extremely sparse, so extra processing of word frequencies is needed to obtain a good factorization;

2. The matrix is very large and its dimensionality is very high (typically on the order of 10^6 × 10^6);

3. Stop words (such as "although", "a", ...) must be removed by hand, otherwise these very frequent words also degrade the factorization.

Neural-network-based models do not need to compute and store a huge table over the whole corpus; they obtain word vectors by learning semantic information, and so they avoid the problems above. In this chapter we present the details of training word vectors with neural networks and show how to train a word vector model with PaddlePaddle.

Results Demonstration

After the word vectors in this chapter have been trained, we can use the t-SNE [4] data visualization algorithm to draw a two-dimensional projection of the word features (as shown in the figure below). The plot shows that semantically related words (such as a, the, these; big, huge) end up close to each other in the projection, while semantically unrelated words (such as say, business; decision, japan) end up far apart.


Figure 1. Two-dimensional projection of word vectors
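A plot like Figure 1 could be produced with, for example, scikit-learn's t-SNE implementation. The sketch below is only an assumed illustration of how this might be done: the embeddings are random placeholders standing in for the trained vectors, and the variable names are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Placeholder data: in practice these would be the trained word vectors and their words.
    words = ["a", "the", "these", "big", "huge", "say", "business", "decision", "japan"]
    embeddings = np.random.default_rng(0).normal(size=(len(words), 32))

    # Project the word vectors onto 2 dimensions and label each point with its word.
    proj = TSNE(n_components=2, perplexity=5, init="random", random_state=0).fit_transform(embeddings)
    plt.scatter(proj[:, 0], proj[:, 1])
    for (x, y), w in zip(proj, words):
        plt.annotate(w, (x, y))
    plt.show()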

On the other hand, we know that the cosine similarity of two vectors lies in the range [-1, 1]: two identical vectors have cosine similarity 1, two mutually perpendicular vectors have cosine similarity 0, and two vectors pointing in exactly opposite directions have cosine similarity -1; in other words, the more related two vectors are, the larger their cosine similarity. So we can also compute the cosine similarity of two word vectors:

please input two words: big huge
similarity: 0.899180685161

please input two words: from company
similarity: -0.0997506977351
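The numbers above are simply cosine similarities between the trained vectors of the two input words. Below is a minimal sketch of such a query, assuming the embeddings have already been loaded into a Python dict; the vectors are made-up placeholders, not trained values from the tutorial:

    import numpy as np

    def cosine_similarity(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Placeholder embedding table: word -> vector (invented values for illustration).
    embedding = {
        "big":     np.array([0.8, 1.2, -0.3]),
        "huge":    np.array([0.9, 1.1, -0.2]),
        "from":    np.array([-0.5, 0.1, 0.9]),
        "company": np.array([0.4, -0.7, 0.2]),
    }

    w1, w2 = input("please input two words: ").split()
    print("similarity:", cosine_similarity(embedding[w1], embedding[w2]))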

The above results can be obtained by running calculate_dis.py, which loads the dictionary words and the corresponding trained features; we will describe its usage in detail in the section on applying the model.

Model Overview

Here we introduce three models for training word vectors: the N-gram model, the CBOW model, and the Skip-gram model. Their central idea is to obtain the probability of a word appearing from its context. For the N-gram model, we first introduce the concept of a language model, and in the training section that follows we implement it with PaddlePaddle. The latter two models are the best-known neural word vector models of recent years, developed by Tomas Mikolov at Google [3]; although they are very light and simple, they train remarkably well.

Language Model

Before introducing the word vector model, we first introduce a concept: language model.
A language model models the joint probability function P(w_1, ..., w_T) of a sentence, where w_i is the i-th word in the sentence. The goal of the language model is to assign high probabilities to meaningful sentences and small probabilities to meaningless ones.
Such models are used in many fields, such as machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition, all of which need the probability of a continuous word sequence. Take information retrieval as an example: when you search for "how long is a football bame" (bame is a medical term), the search engine asks whether you meant "how long is a football game", because the language model computes a very low probability for "how long is a football bame", and among the words similar to "bame" that you might have mistyped, "game" gives the sentence the highest probability.

For the target probability P(w_1, ..., w_T) of a language model, if we assume that every word in the text is independent, the joint probability of the whole sentence is the product of the probabilities of its words, that is, P(w_1, ..., w_T) = P(w_1) P(w_2) ... P(w_T). However, the probability of each word in a sentence depends strongly on the preceding words, so language models usually use the conditional factorization P(w_1, ..., w_T) = P(w_1) P(w_2 | w_1) ... P(w_T | w_1, ..., w_{T-1}).
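As a toy illustration of these factorizations, the sketch below scores a sentence with a made-up bigram model, i.e. it approximates P(w_t | w_1, ..., w_{t-1}) by P(w_t | w_{t-1}); all probabilities are invented for illustration, not learned from data:

    import math

    # Invented bigram probabilities P(current | previous); "<s>" marks the sentence start.
    bigram = {
        ("<s>", "how"): 0.02, ("how", "long"): 0.10, ("long", "is"): 0.30,
        ("is", "a"): 0.20, ("a", "football"): 0.005, ("football", "game"): 0.40,
    }

    def sentence_log_prob(words, floor=1e-8):
        # log P(w_1, ..., w_T) ~ sum_t log P(w_t | w_{t-1}), with a crude floor for unseen bigrams.
        padded = ["<s>"] + words
        return sum(math.log(bigram.get(pair, floor)) for pair in zip(padded, padded[1:]))

    print(sentence_log_prob("how long is a football game".split()))
    print(sentence_log_prob("how long is a football bame".split()))   # much lower: unseen bigram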
