How to produce a good word vector

Source: Internet
Author: User


Word vectors (word embeddings) are familiar to anyone who works on NLP tasks, and researchers have proposed many models for producing them, along with useful tools for everyone to use. When using these tools, the training data, parameters, model choice, and so on all affect the resulting word vectors, so knowing how to produce good word vectors matters in practice. Dr. Lai Siwei of the Institute of Automation, Chinese Academy of Sciences, studied this question in detail. This blog post is my reading notes on his published paper "How to Generate a Good Word Embedding?".

1 Word representation techniques

In his doctoral dissertation, Dr. Lai surveys the main existing word representation techniques; I introduce them briefly here.

1.1 One-hot representation (the early, traditional technique)

1.2 Distributed representation (in contrast to the one-hot representation, this is based on the distributional hypothesis [i.e., words that appear in similar contexts have similar semantics]. Information is spread across the dimensions of a dense, low-dimensional vector, which captures syntactic and semantic features.)
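To make the contrast concrete, here is a minimal sketch with a toy vocabulary and hand-picked dense vectors (purely illustrative, not trained): one-hot vectors make every pair of distinct words equally dissimilar, while a distributed representation can encode similarity.

```python
import math

vocab = ["king", "queen", "man", "woman"]

# One-hot: a sparse vector as long as the vocabulary, with a single 1;
# any two distinct words have cosine similarity 0.
def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

# Distributed: dense, low-dimensional vectors. Here dimension 0 loosely
# encodes "royalty" and dimension 1 "gender" (hand-crafted for illustration).
dense = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One-hot vectors carry no similarity information ...
print(cosine(one_hot("king"), one_hot("queen")))   # 0.0
# ... while dense vectors can: "king" is closer to "queen" than to "woman".
print(cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["woman"]))  # True
```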

    • Matrix-based distributed representation

    • Clustering-based distributed representation

The relationship between a word and its context is built by means of clustering. Representative model: Brown clustering.

    • Neural-network-based distributed representation (this is the main approach we will look at; below are a few representative models)

Neural network language model (NNLM)

Log-bilinear language model (LBL)

C&W model

Continuous bag-of-words (CBOW)
Skip-gram (SG)

The last two models are implemented together in the word2vec tool.
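The difference between the two word2vec models shows up in how training pairs are generated: CBOW predicts the target word from its surrounding context, Skip-gram predicts each context word from the target. A minimal sketch (toy corpus, window size 1; the real tool adds subsampling, negative sampling, hierarchical softmax, etc.):

```python
def cbow_pairs(tokens, window=1):
    # CBOW: predict the target word from the (bag of) surrounding context words.
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    # Skip-gram: predict each context word from the target word.
    pairs = []
    for i, target in enumerate(tokens):
        for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, c))
    return pairs

tokens = "the cat sat".split()
print(cbow_pairs(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```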

Order model

The CBOW model above sums the context word vectors directly at the input layer, so it ignores the order of the words preceding the target; Dr. Lai therefore replaced the direct summation with sequential concatenation of the context word vectors to preserve word-order information.
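The change can be sketched as follows: the input layer either sums the context vectors (order-insensitive) or concatenates them in sequence (order-preserving). Toy vectors, illustrative only:

```python
ctx = {"a": [1.0, 0.0], "b": [0.0, 1.0]}

def sum_input(words):
    # CBOW-style input: elementwise sum; "a b" and "b a" become identical.
    out = [0.0, 0.0]
    for w in words:
        out = [x + y for x, y in zip(out, ctx[w])]
    return out

def concat_input(words):
    # Order model: concatenating in sequence keeps position information.
    out = []
    for w in words:
        out.extend(ctx[w])
    return out

print(sum_input(["a", "b"]) == sum_input(["b", "a"]))        # True: order lost
print(concat_input(["a", "b"]) == concat_input(["b", "a"]))  # False: order kept
```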

Theoretical comparison of the models

2 Experimental comparison and analysis of various models

The experiments were organized around the following questions:

    • How do we choose the right model?
    • What are the effects of the size and domain of the training corpus?
    • How do we choose the training parameters?
      • Number of iterations
      • Word vector dimensionality

Evaluation tasks

Linguistic properties of word vectors

  • Lexical relatedness (WS): WordSim353 dataset; word pairs are scored for semantic similarity. Evaluated by the Pearson coefficient.
  • Synonym detection (TFL): TOEFL dataset; multiple-choice synonym questions. Evaluated by accuracy.
  • Word-level semantic analogy (SEM): about 9,000 questions, e.g. queen − king + man = woman. Evaluated by accuracy.
  • Word-level syntactic analogy (SYN): about 10,000 questions, e.g. dancing − dance + predict = predicting. Evaluated by accuracy.
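The analogy tasks are typically scored by vector arithmetic plus a nearest-neighbor search under cosine similarity, excluding the three query words. A minimal sketch with hand-crafted toy vectors (illustrative, not trained):

```python
import math

# Toy 2-d embeddings: dimension 0 loosely encodes gender, dimension 1 royalty.
vecs = {
    "king":  [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "child": [0.0, -1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def analogy(a, b, c):
    # Answer "a - b + c = ?" by nearest cosine neighbor, excluding the query words.
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

print(analogy("queen", "king", "man"))  # woman
```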

Word vectors as Features

    • Text classification with averaged word vectors (AVG): IMDB dataset, logistic regression classifier. Evaluated by accuracy.
    • Named entity recognition (NER): CoNLL03 dataset; word vectors are used as an additional feature in an existing system. Evaluated by F1.
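The AVG setup can be sketched as follows: each document is represented by the mean of its word vectors, and that single dense feature vector is fed to a classifier. Toy 2-d embeddings, illustrative only; the paper uses IMDB with pre-trained vectors:

```python
def avg_feature(tokens, vecs, dim):
    # Mean of the word vectors of the in-vocabulary tokens;
    # a zero vector for documents with no known words.
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Toy embeddings: dimension 0 loosely encodes sentiment polarity.
vecs = {"great": [1.0, 0.2], "awful": [-1.0, 0.1], "movie": [0.0, 0.5]}

feat = avg_feature("great movie".split(), vecs, dim=2)
print(feat)  # [0.5, 0.35]
```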

Word vectors as initial values of neural network models

    • Convolutional text classification (CNN): Stanford Sentiment Treebank dataset; the word vectors are fine-tuned during training. Evaluated by accuracy.
    • Part-of-speech tagging (POS): Wall Street Journal dataset, using the neural network proposed by Collobert et al. Evaluated by accuracy.

Experimental results (red text marks the blogger's own summary; black text marks the paper's conclusions)

Model comparison

    • For tasks that evaluate linguistic properties, models that predict the target word from its context work better than the C&W model, which jointly scores the target word together with its context.
    • For practical natural language processing tasks, the differences between models are small; a simple model is fine.
    • Simple models generally perform better on small corpora, while complex models need larger corpora to show their advantage.

Corpus influence

    • Within the same domain, a larger corpus generally gives better results.
    • An in-domain corpus clearly helps tasks in a similar domain, but a mismatched domain can even hurt performance.
    • For the natural language tasks, an in-domain corpus of 10M words performs noticeably worse, but beyond 100M words, further enlarging the corpus makes little difference to task results.

Trade-offs in scale and field

    • The domain purity of the corpus matters more than its size. (Especially when the in-domain corpus for the task is small, mixing in a large amount of out-of-domain text can have a strongly negative effect.)

Parameter selection

Number of iterations

    • It is not appropriate to choose the number of iterations from the word-vector training loss.
    • If conditions permit, use validation-set performance on the target task as the criterion.
    • Watching the trend of the task's performance metric, for simple tasks you can stop at the performance peak.
    • With the default parameters of the word2vec demo, 15~25 iterations are usually enough.
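The advice above, stopping on the target task's validation performance rather than on training loss, can be sketched as an early-stopping loop. The `train_epoch`/`evaluate` callables and the patience value are hypothetical stand-ins, not part of any real tool's API:

```python
def train_until_plateau(train_epoch, evaluate, max_iters=25, patience=3):
    """Run training epochs, stopping when the validation metric of the
    target task (not the embedding training loss) stops improving."""
    best_score, best_iter, waited = float("-inf"), 0, 0
    for it in range(1, max_iters + 1):
        train_epoch()             # one pass over the corpus
        score = evaluate()        # e.g. accuracy on the target task's dev set
        if score > best_score:
            best_score, best_iter, waited = score, it, 0
        else:
            waited += 1
            if waited >= patience:
                break             # validation metric has plateaued
    return best_iter, best_score

# Simulated validation curve: rises, then flattens and declines.
curve = iter([0.60, 0.70, 0.74, 0.75, 0.75, 0.74, 0.73, 0.73])
print(train_until_plateau(lambda: None, lambda: next(curve)))  # (4, 0.75)
```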

Word vector dimensionality

    • For tasks that analyze the linguistic properties of word vectors, the larger the dimensionality, the better.
    • For improving natural language processing tasks, word vectors of modest dimensionality are usually good enough. (Here I think this only holds for some tasks, but the trend is consistent: as the dimensionality grows, the performance curve first rises and then flattens, or even declines.)

3 Summary

  1. Choose a suitable model. Compared with simple models, complex models only have an advantage on larger corpora. (I typically use the SG model in the word2vec tool.)
  2. Choose a corpus in a suitable domain; given that, the larger the corpus the better. Training on a large corpus generally improves word vector quality, and an in-domain corpus clearly improves performance on tasks in the same domain. (The training corpus should not be too small; in general, use an in-domain corpus on the order of 100M words.)
  3. When training, it is best to decide the stopping condition using a validation set of the specific task, or to choose a similar task as the indicator; do not use the word-vector training loss. (For the iteration count, I generally choose 10~25 passes depending on corpus size.)
  4. The word vector dimensionality should generally not be chosen too small; in particular, when measuring the linguistic properties of word vectors, larger dimensionality works better. (Run an experiment on the target task and choose the dimensionality by trading off performance against experiment time.)

Main references

[1] Lai S, Liu K, Xu L, et al. How to Generate a Good Word Embedding? arXiv preprint arXiv:1507.05523, 2015.

[2] Lai Siwei. Research on Semantic Vector Representation of Words and Documents Based on Neural Networks. PhD dissertation, Institute of Automation, Chinese Academy of Sciences.
