How do word vectors (distributed representations) work?

Original: http://www.zhihu.com/question/21714667

4 answers

To apply machine learning algorithms to natural language, the language usually has to be mathematized first, and the word vector is one way to mathematize the words in a language.

The simplest kind of word vector is the one-hot representation: a word is represented by a very long vector whose length is the size of the dictionary; exactly one component is 1 and all the others are 0, and the position of the 1 is the word's position in the dictionary. This representation has two drawbacks: (1) it is prone to the curse of dimensionality, especially when used in deep learning algorithms; (2) it cannot describe the similarity between words (the term for this seems to be the "lexical gap").
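As a concrete illustration, here is a minimal sketch of the one-hot representation; the toy dictionary is made up:

```python
# Minimal sketch of the one-hot representation; the toy dictionary is made up.
vocab = ["apple", "banana", "car", "dog"]            # the "dictionary"
word_to_index = {w: i for i, w in enumerate(vocab)}  # word -> position

def one_hot(word):
    """Vector as long as the dictionary: a single 1 at the word's position."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("car"))  # [0, 0, 1, 0]
```

With a realistic dictionary of, say, 100,000 words, each vector has 100,000 components, and every pair of distinct words is equally far apart, which is exactly the two drawbacks above.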

The other kind is the distributed representation you mentioned. It was first proposed by Hinton in 1986 and overcomes the drawbacks of the one-hot representation. The basic idea is:
Through training, map each word in the language to a fixed-length short vector (where "short" is of course relative to the "long" of the one-hot representation). All these vectors together form a word vector space, in which each vector is a point. By introducing a "distance" on this space, the (lexical, semantic) similarity between words can be judged from the distance between their vectors.

To understand this idea better, here is a popular example. Suppose there are N different points on a two-dimensional plane. Given one of them, how do we find the point on the plane closest to it? First, set up a Cartesian coordinate system, in which each point corresponds uniquely to a coordinate (x, y); then compute the Euclidean distance from the given point to each of the other N-1 points; finally, the point with the smallest distance is the one we are looking for.
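A minimal sketch of this nearest-point search (the point names and coordinates are made up for illustration):

```python
import math

# Hypothetical points on the plane; names and coordinates are made up.
points = {"A": (1.0, 2.0), "B": (1.2, 2.1), "C": (5.0, 0.5)}

def nearest(query, points):
    """Return the point closest (by Euclidean distance) to the query point."""
    qx, qy = points[query]
    best, best_d = None, float("inf")
    for name, (x, y) in points.items():
        if name == query:
            continue                    # skip the query point itself
        d = math.hypot(x - qx, y - qy)  # Euclidean distance
        if d < best_d:
            best, best_d = name, d
    return best, best_d

print(nearest("A", points))  # ('B', 0.223...)
```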

In this example, the coordinates (x, y) play the role of the word vector: they mathematically quantify the position of a point on the plane. Once the coordinate system is set up, obtaining a point's coordinates is easy. In NLP tasks, however, obtaining word vectors is much more complicated, and word vectors are not unique: their quality depends on the training corpus, the training algorithm, the vector length, and other factors.

One way to generate word vectors is to use a neural network. The word vectors are then usually bundled with a language model, i.e. the two are obtained together from training. The idea of using neural networks to train language models was first proposed by Xu Wei of Baidu IDL (Institute of Deep Learning). The classic paper in this area is Bengio's A Neural Probabilistic Language Model, published in JMLR in 2003, followed by a series of related work, including word2vec from Tomas Mikolov's team at Google (word2vec: a tool for computing continuous distributed representations of words).

A recent application of word vectors that I have learned about is in machine translation. A report ("new breakthroughs in machine translation") says:
Google's Tomas Mikolov team has developed a technique for automatically generating dictionaries and phrase tables that can convert one language into another. The technique uses data mining to build a structural model of each of the two languages and then maps one onto the other. The set of relationships between the words of a language forms a "language space", which can be characterized mathematically as a set of vectors. The vector spaces of different languages share many commonalities, so as long as a mapping from one vector space to the other can be realized, translation becomes possible. The technique works very well: the reported accuracy on English-Spanish translation is up to 90%.
I read that paper (http://arxiv.org/pdf/1309.4168.pdf). Its introduction gives an example of how the algorithm works, which I think helps in understanding what word vectors do. It goes as follows:
Consider two languages, English and Spanish, and obtain their word vector spaces E and S, respectively, by training. Take five words from English: one, two, three, four, five, and let their word vectors in E be v1, v2, v3, v4, v5. For ease of visualization, use principal component analysis (PCA) to reduce the dimensionality, obtaining the corresponding two-dimensional vectors u1, u2, u3, u4, u5, and plot the five points on a plane, as shown in the figure on the left. Similarly, take the Spanish words uno, dos, tres, cuatro, cinco (corresponding to one, two, three, four, five), let their word vectors in S be s1, s2, s3, s4, s5, reduce them with PCA to the two-dimensional vectors t1, t2, t3, t4, t5, and plot them on a plane (possibly after an appropriate rotation), as shown in the figure on the right:
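A sketch of the two steps just described, under stated assumptions: the "trained" vectors below are random placeholders rather than real embeddings, PCA is done via SVD, and the cross-language map is a linear transformation fitted by least squares (the paper learns such a translation matrix by gradient descent; least squares is my substitute here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "trained" vectors for one..five (rows of E) and uno..cinco
# (rows of S); real word2vec embeddings would be used in practice.  S is
# built as a roughly linear image of E, mirroring the paper's observation.
E = rng.normal(size=(5, 50))
M = rng.normal(size=(50, 50))
S = E @ M + 0.01 * rng.normal(size=(5, 50))

def pca_2d(X):
    """Project the rows of X onto their top-2 principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

U = pca_2d(E)  # u1..u5: the 2-D points plotted on the left
T = pca_2d(S)  # t1..t5: the 2-D points plotted on the right

# Fit a linear map W with E @ W ~= S from the five seed pairs.
W, *_ = np.linalg.lstsq(E, S, rcond=None)
print(np.allclose(E @ W, S, atol=0.05))  # True: the mapping recovers S
```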
Comparing the left and right figures, it is easy to see that the five words occupy roughly the same relative positions in the two vector spaces. This shows that the vector spaces of the two different languages have a similar structure, which further supports the reasonableness of using distance in a word vector space to characterize the similarity between words.

Jian Li (PhD, Institute of Software, Chinese Academy of Sciences): "Distributed" means that an individual is represented by several coding units rather than by a single coding unit, i.e. the individual is distributed over several coding units; this is mainly in contrast to one-hot encoding, in which an individual is represented by a single coding unit. One can refer to
Deep learning for signal and information processing
"Distributed representation:a representation of the observed data in such a-to that they is
Modeled as being generated by the interactions of many hidden factors. A particular factor learned from configurations to other factors can often generalize well. Distributed representations form the basis of deep learning "published on 2013-12-30 3 reviews thank youLi Eta,Machine Learning, optimization, Comput ...Shangliang,Bernkastel,forgetthisuser agree Test instructions is actually a bit confusing. "The word vector (distributed representation) Working principle is what" both distributed representation also have the word vector, in fact, separate to see better.

Distributed representation: broadly, any representation that is not one-hot is a distributed representation; it is a whole class of learned representations, not a single method.

Word vector: any vector used to represent a word can be called a word vector. Among methods of this kind word2vec is the most famous, but in fact a one-hot representation can also serve as a word vector.
Yang Chao (programmer): Since the question asks what the working principle of the word vector (distributed representation) is, it is presumably not asking about the principles of the algorithms used to obtain such vectors.

Let's give a popular example.
When modern people see the two words "BMW" and "Mercedes-Benz", their first reaction is mostly "cars". But if you showed these words to the ancients, they would certainly not think of cars.
Why? Because the ancients had no relevant knowledge and could only understand the two words literally, character by character: <Bao, Ma> ("treasure, horse") and <Ben, Chi> ("run, gallop").
To a computer, what it sees is likewise just the literal form; the two strings are unrelated (although given "BMW" (宝马) and "sword" (宝剑), it could notice that the two words look a bit alike, since they share a character).
How do we get the computer to relate these two words? That is what statistical learning does. We have a lot of text resources to draw on, and a computer can use algorithms to learn the relationships between words from those resources, much as a human who hears every day that "this car is a BMW" and "that car is a Mercedes-Benz" eventually learns that both things are cars. But "BMW" in some contexts is not necessarily a car: in a novel, for example, someone may vault onto a "BMW", where the word refers to the animal, a fine horse.

We can introduce a vector representation of words over a set of descriptive dimensions, for example:
<car, luxury, animal, action, food>

Statistical learning methods can learn such a representation for each word. They might, for example, learn:
BMW = <0.5, 0.2, 0.2, 0.0, 0.1>
Mercedes = <0.7, 0.2, 0.0, 0.1, 0.0>

In this way, two words that are literally unrelated become linked together.
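With the toy vectors above, a standard similarity measure already links the two brand names. Cosine similarity is my choice here; the answer itself does not name a specific measure:

```python
import math

# The toy vectors from above; dimensions are <car, luxury, animal, action, food>.
bmw      = [0.5, 0.2, 0.2, 0.0, 0.1]
mercedes = [0.7, 0.2, 0.0, 0.1, 0.0]

def cosine(a, b):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(round(cosine(bmw, mercedes), 2))  # 0.91 -- high, mostly from the "car" dimension
```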


As for how such representations are learned, two common approaches are:
Counting co-occurrences (e.g. LDA, a Bayesian probabilistic model).
Learning from similar contexts (e.g. word2vec, or neural networks generally); a minimal run of this approach is sketched below.
Beyond this point it gets tricky.
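A minimal sketch of the context-based approach, assuming the gensim library (version 4.x parameter names); the tiny corpus is made up, so the resulting vectors are only illustrative:

```python
# Minimal word2vec sketch; assumes gensim is installed (4.x API).
from gensim.models import Word2Vec

# A made-up toy corpus; real training needs far more text.
sentences = [
    ["this", "car", "is", "a", "bmw"],
    ["that", "car", "is", "a", "mercedes"],
    ["the", "bmw", "is", "a", "luxury", "car"],
    ["the", "mercedes", "is", "a", "luxury", "car"],
]

model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1, epochs=200)

print(model.wv["bmw"])                         # the learned 10-dimensional vector
print(model.wv.similarity("bmw", "mercedes"))  # nearby, since their contexts match
```

With a realistically large corpus, words that occur in similar contexts, like "bmw" and "mercedes" above, end up with nearby vectors.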
