How to produce a good word vector

Source: Internet
Author: User


Word vectors (word embeddings) are familiar to anyone who works on NLP tasks, and researchers have proposed many models for producing them, along with useful tools for everyone to use. When using these tools, the training data, parameters, model choice, and so on all affect the resulting word vectors, so knowing how to produce good word vectors matters in practice. Dr. Lai Siwei of the Institute of Automation, Chinese Academy of Sciences, studied this question in detail. This blog post is my reading notes on his published paper "How to Generate a Good Word Embedding?".

1 Word representation techniques

In his doctoral dissertation, Dr. Lai surveys the main existing word representation techniques; I introduce them briefly here.

1.1 One-hot representation (the early, traditional technique)

1.2 Distributed representation (in contrast to the one-hot representation, this is based on the distributional hypothesis [i.e., words that appear in similar contexts have similar semantics]. Information is spread across the dimensions of a dense, low-dimensional vector, which captures syntactic and semantic features.)
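To make the contrast concrete, here is a minimal sketch with a toy vocabulary and hand-picked dense vectors (purely illustrative, not trained): one-hot vectors make every pair of distinct words equally dissimilar, while a distributed representation can encode similarity.

```python
import math

vocab = ["king", "queen", "man", "woman"]

# One-hot: a sparse vector as long as the vocabulary, with a single 1;
# any two distinct words have cosine similarity 0.
def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

# Distributed: dense, low-dimensional vectors. Here dimension 0 loosely
# encodes "royalty" and dimension 1 "gender" (hand-crafted for illustration).
dense = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One-hot vectors carry no similarity information ...
print(cosine(one_hot("king"), one_hot("queen")))   # 0.0
# ... while dense vectors can: "king" is closer to "queen" than to "woman".
print(cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["woman"]))  # True
```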

    • Matrix-based distributed representation

    • Clustering-based distributed representation

The relationship between a word and its context is built by means of clustering. Representative model: Brown clustering.

    • Neural-network-based distributed representation (this is the main approach we will look at; below are a few representative models)

Neural network language model (NNLM)

Log-bilinear language model (LBL)

C&W model

Continuous bag-of-words (CBOW)
Skip-gram (SG)

The last two models are implemented together in the word2vec tool.
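The difference between the two word2vec models shows up in how training pairs are generated: CBOW predicts the target word from its surrounding context, Skip-gram predicts each context word from the target. A minimal sketch (toy corpus, window size 1; the real tool adds subsampling, negative sampling, hierarchical softmax, etc.):

```python
def cbow_pairs(tokens, window=1):
    # CBOW: predict the target word from the (bag of) surrounding context words.
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    # Skip-gram: predict each context word from the target word.
    pairs = []
    for i, target in enumerate(tokens):
        for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, c))
    return pairs

tokens = "the cat sat".split()
print(cbow_pairs(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```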

Order model

The CBOW model above sums the context word vectors directly at the input layer, so it ignores the order of the words preceding the target; Dr. Lai therefore replaced the direct summation with sequential concatenation of the context word vectors to preserve word-order information.
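The change can be sketched as follows: the input layer either sums the context vectors (order-insensitive) or concatenates them in sequence (order-preserving). Toy vectors, illustrative only:

```python
ctx = {"a": [1.0, 0.0], "b": [0.0, 1.0]}

def sum_input(words):
    # CBOW-style input: elementwise sum; "a b" and "b a" become identical.
    out = [0.0, 0.0]
    for w in words:
        out = [x + y for x, y in zip(out, ctx[w])]
    return out

def concat_input(words):
    # Order model: concatenating in sequence keeps position information.
    out = []
    for w in words:
        out.extend(ctx[w])
    return out

print(sum_input(["a", "b"]) == sum_input(["b", "a"]))        # True: order lost
print(concat_input(["a", "b"]) == concat_input(["b", "a"]))  # False: order kept
```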

Theoretical comparison of the models

2 Experimental comparison and analysis of various models

The experiments were organized around the following questions:

    • How do we choose the right model?
    • What are the effects of the size and domain of the training corpus?
    • How do we choose the training parameters?
      • Number of iterations
      • Word vector dimensionality

Evaluation tasks

Linguistic properties of word vectors

  • Lexical relatedness (WS): WordSim353 dataset; word pairs are scored for semantic similarity. Evaluated by the Pearson coefficient.
  • Synonym detection (TFL): TOEFL dataset; multiple-choice synonym questions. Evaluated by accuracy.
  • Word-level semantic analogy (SEM): about 9,000 questions, e.g. queen − king + man = woman. Evaluated by accuracy.
  • Word-level syntactic analogy (SYN): about 10,000 questions, e.g. dancing − dance + predict = predicting. Evaluated by accuracy.
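The analogy tasks are typically scored by vector arithmetic plus a nearest-neighbor search under cosine similarity, excluding the three query words. A minimal sketch with hand-crafted toy vectors (illustrative, not trained):

```python
import math

# Toy 2-d embeddings: dimension 0 loosely encodes gender, dimension 1 royalty.
vecs = {
    "king":  [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "child": [0.0, -1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def analogy(a, b, c):
    # Answer "a - b + c = ?" by nearest cosine neighbor, excluding the query words.
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

print(analogy("queen", "king", "man"))  # woman
```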

Word vectors as Features

    • Text classification with averaged word vectors (AVG): IMDB dataset, logistic regression classifier. Evaluated by accuracy.
    • Named entity recognition (NER): CoNLL03 dataset; word vectors are used as an additional feature in an existing system. Evaluated by F1.
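The AVG setup can be sketched as follows: each document is represented by the mean of its word vectors, and that single dense feature vector is fed to a classifier. Toy 2-d embeddings, illustrative only; the paper uses IMDB with pre-trained vectors:

```python
def avg_feature(tokens, vecs, dim):
    # Mean of the word vectors of the in-vocabulary tokens;
    # a zero vector for documents with no known words.
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Toy embeddings: dimension 0 loosely encodes sentiment polarity.
vecs = {"great": [1.0, 0.2], "awful": [-1.0, 0.1], "movie": [0.0, 0.5]}

feat = avg_feature("great movie".split(), vecs, dim=2)
print(feat)  # [0.5, 0.35]
```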

Word vectors as initial values of neural network models

    • Convolutional text classification (CNN): Stanford Sentiment Treebank dataset; the word vectors are fine-tuned during training. Evaluated by accuracy.
    • Part-of-speech tagging (POS): Wall Street Journal dataset, using the neural network proposed by Collobert et al. Evaluated by accuracy.

Experimental results (red text marks the blogger's own summary; black text marks the paper's conclusions)

Model comparison

    • For tasks that evaluate linguistic properties, models that predict the target word from its context work better than the C&W model, which jointly scores the target word together with its context.
    • For practical natural language processing tasks, the differences between models are small; a simple model is fine.
    • Simple models generally perform better on small corpora, while complex models need larger corpora to show their advantage.

Corpus influence

    • Within the same domain, a larger corpus generally gives better results.
    • An in-domain corpus clearly helps tasks in a similar domain, but a mismatched domain can even hurt performance.
    • For the natural language tasks, an in-domain corpus of 10M words performs noticeably worse, but beyond 100M words, further enlarging the corpus makes little difference to task results.

Trade-offs in scale and field

    • The domain purity of the corpus matters more than its size. (Especially when the in-domain corpus for the task is small, mixing in a large amount of out-of-domain text can have a strongly negative effect.)

Parameter selection

Number of iterations

    • It is not appropriate to choose the number of iterations from the word-vector training loss.
    • If conditions permit, use validation-set performance on the target task as the criterion.
    • Watching the trend of the task's performance metric, for simple tasks you can stop at the performance peak.
    • With the default parameters of the word2vec demo, 15~25 iterations are usually enough.
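The advice above, stopping on the target task's validation performance rather than on training loss, can be sketched as an early-stopping loop. The `train_epoch`/`evaluate` callables and the patience value are hypothetical stand-ins, not part of any real tool's API:

```python
def train_until_plateau(train_epoch, evaluate, max_iters=25, patience=3):
    """Run training epochs, stopping when the validation metric of the
    target task (not the embedding training loss) stops improving."""
    best_score, best_iter, waited = float("-inf"), 0, 0
    for it in range(1, max_iters + 1):
        train_epoch()             # one pass over the corpus
        score = evaluate()        # e.g. accuracy on the target task's dev set
        if score > best_score:
            best_score, best_iter, waited = score, it, 0
        else:
            waited += 1
            if waited >= patience:
                break             # validation metric has plateaued
    return best_iter, best_score

# Simulated validation curve: rises, then flattens and declines.
curve = iter([0.60, 0.70, 0.74, 0.75, 0.75, 0.74, 0.73, 0.73])
print(train_until_plateau(lambda: None, lambda: next(curve)))  # (4, 0.75)
```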

Word vector dimensionality

    • For tasks that analyze the linguistic properties of word vectors, the larger the dimensionality, the better.
    • For improving natural language processing tasks, word vectors of modest dimensionality are usually good enough. (Here I think this only holds for some tasks, but the trend is consistent: as the dimensionality grows, the performance curve first rises and then flattens, or even declines.)

3 Summary

  1. Choose a suitable model. Compared with simple models, complex models only have an advantage on larger corpora. (I typically use the SG model in the word2vec tool.)
  2. Choose a corpus in a suitable domain; given that, the larger the corpus the better. Training on a large corpus generally improves word vector quality, and an in-domain corpus clearly improves performance on tasks in the same domain. (The training corpus should not be too small; in general, use an in-domain corpus on the order of 100M words.)
  3. When training, it is best to decide the stopping condition using a validation set of the specific task, or to choose a similar task as the indicator; do not use the word-vector training loss. (For the iteration count, I generally choose 10~25 passes depending on corpus size.)
  4. The word vector dimensionality should generally not be chosen too small; in particular, when measuring the linguistic properties of word vectors, larger dimensionality works better. (Run an experiment on the target task and choose the dimensionality by trading off performance against experiment time.)

Main references

[1] Lai S, Liu K, Xu L, et al. How to Generate a Good Word Embedding? arXiv preprint arXiv:1507.05523, 2015.

[2] Lai Siwei. Research on Semantic Vector Representation of Words and Documents Based on Neural Networks. PhD dissertation, Institute of Automation, Chinese Academy of Sciences.
