NLP︱Word2vec (word vectors) in R: lessons learned (disambiguation, word-vector additivity)

Source: Internet
Author: User

Because R is relatively inefficient, natural language processing work in R suffers. Improving both the efficiency and the accuracy of word vectors is one of the more pressing problems in the current software environment.

I think several questions remain open:

1. How can large-scale corpora be processed more efficiently in the R environment?

2. How can the accuracy of word vectors be improved, and how can their quality be measured?

3. Which applications of word vectors are worth developing?

4. How can semantic ambiguity be eliminated?

5. How can word vectors make the leap from "words" to "phrases"?


When reprinting, please credit the source and the author (Matt). You are welcome to explore natural language processing together ~


——————————————————————————————————————————————————————


Which word2vec packages are available in R?


R still has relatively few word-vector packages, and most of them are incomplete. The first one I found was the Tm.word2vec package written by Li Shi:

Text mining and deep learning︱an implementation of word2vec in R

The Tm.word2vec package contains very little: only one of its functions is really useful. Li Shi therefore also published a word2vec function on GitHub, but that function is not particularly convenient to call.


An expert abroad then built on Li Shi's word2vec function and developed a package of his own, wordVectors (about 10 million words, 4 threads, roughly 20 minutes). This package is excellent: it keeps all the advantages of Li Shi's function (multi-threaded training, custom dimensions, custom models) and adds ways to read the output file, disambiguation, word clouds, word similarity, and so on.


I recently found two more: text2vec and rword2vec. Of the two, text2vec is now my main focus:

Heavyweight︱R+NLP: introduction to the text2vec package (GloVe word vectors, LDA topic model, various distance calculations, etc.)


——————————————————————————————————————————————————————


First, how can large-scale corpora be processed more efficiently in the R environment?


There are two angles: choosing the training parameters and optimizing the training speed.


1. Training Parameters


Choosing the training parameters well is the key to efficiency. Some rules of thumb (partly drawn from the Bridge Flow blog):
    1. window: 5 to 8. I use 8 and it works well. Around 5 generally suits CBOW, and around 10 suits skip-gram;
    2. For the other parameters, the following can serve as a reference:

· Architecture: skip-gram (slower, better for rare words) vs CBOW (fast)

· Training algorithm: hierarchical softmax (better for rare words) vs negative sampling (better for frequent words and low-dimensional vectors)

· Sub-sampling of frequent words: can improve both accuracy and speed (useful range 1e-3 to 1e-5)

· Context (window) size: usually around 10 for skip-gram, around 5 for CBOW
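
To make the sub-sampling threshold more concrete, here is a minimal sketch in Python (used only for illustration; this is not code from any R package). It computes the discard probability given in the original word2vec paper, P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's share of all tokens and t is the threshold. The function name and the sample frequencies are made up for the example, and real implementations may use a slightly different formula.

```python
import math

def discard_probability(word_freq: float, threshold: float = 1e-4) -> float:
    """Probability of dropping one occurrence of a word during training.

    word_freq is the word's share of all tokens in the corpus (0..1).
    Uses the formula from the word2vec paper and never goes below zero.
    """
    if word_freq <= 0:
        return 0.0
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))

# Hypothetical relative frequencies, for illustration only.
for word, freq in [("the", 0.05), ("river", 0.0005), ("ambiguity", 0.00001)]:
    print(f"{word:10s} freq={freq:<8g} discard={discard_probability(freq):.3f}")
```

Very frequent words are dropped most of the time, while words rarer than the threshold are always kept. This is why the setting can speed up training without hurting rare words.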



2. Optimizing Training Speed


(partly drawn from the Bridge Flow blog)

  1. Choose the CBOW model. In my experience CBOW trains much faster than skip-gram, and its results are no worse; if anything they seem slightly better;

  2. Set the number of threads to match the number of CPU cores;

  3. About 5 iterations are usually enough;
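
The three tips above can be collected into one settings table before training. The sketch below is in Python for illustration only. The keys (`architecture`, `threads`, `iterations`) are hypothetical names, not the arguments of any particular R package, so map them to whatever your training function expects.

```python
import os

def suggested_settings() -> dict:
    """Collect the speed-oriented choices described above."""
    return {
        "architecture": "cbow",          # tip 1: CBOW trains faster than skip-gram
        "threads": os.cpu_count() or 1,  # tip 2: match the logical core count (1 if unknown)
        "iterations": 5,                 # tip 3: about 5 passes over the corpus
    }

print(suggested_settings())
```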


——————————————————————————————————————————————————————


Second, how can the accuracy of word vectors be improved, and how can their quality be measured?


1. Dimensions. In general, the more dimensions the better (300 dimensions works relatively well), though there are exceptions;

2. Size and quality of the training data. The larger the training set the better, with coverage as broad and quality as high as possible.

3. Parameter settings. Typically window, iter, and the choice of architecture matter most.


——————————————————————————————————————————————————————


Third, which applications of word vectors are worth developing?


1. Additivity of word vectors


Word vectors have one property with great potential: they can be added and subtracted. Two examples:

vector(Paris) - vector(France) + vector(Italy) ≈ vector(Rome)

vector(King) - vector(man) + vector(woman) ≈ vector(Queen)

Roughly speaking, the "woman" counterpart of King is approximately Queen. Why subtract man? Because the "man" component interferes with King, so it is removed.
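
The arithmetic can be tried out on a toy example. The sketch below is in Python for illustration only. The three-dimensional vectors are invented so that the dimensions roughly mean "royalty", "male", and "female". A trained model would have hundreds of dimensions with no such clean reading, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(vectors, a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented toy vectors: (royalty, male, female).
toy_vectors = {
    "king":  (0.9, 0.9, 0.1),
    "queen": (0.9, 0.1, 0.9),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.1, 0.5, 0.5),
}

print(analogy(toy_vectors, "king", "man", "woman"))
```

With these toy vectors the result is "queen". Excluding the input words matters in practice, because the nearest neighbor of the target vector is often one of the inputs themselves.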


2. Disambiguation


The King - man step above is one way to eliminate ambiguity. It relies on linear algebra: after King - man, the "man" layer of meaning has been removed.

However, ambiguous words first have to be identified at scale, and that still needs further study.


3. Word Clustering


Clustering can surface words related to or derived from a given word, and it is also useful when looking for words on the same topic.
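
As a rough illustration of grouping words by their vectors, the Python sketch below uses a simple greedy threshold rule in place of a real clustering algorithm such as k-means, purely to keep the example short. The two-dimensional vectors, the threshold, and the function names are all invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def greedy_clusters(vectors, threshold=0.9):
    """Group words by similarity to each cluster's first (seed) word.

    A word joins the first cluster whose seed is at least `threshold`
    cosine-similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:  # no existing cluster was close enough
            clusters.append([word])
    return clusters

# Invented toy vectors: roughly (water-related, finance-related).
toy_vectors = {
    "river":  (0.90, 0.10),
    "lake":   (0.80, 0.20),
    "money":  (0.10, 0.90),
    "stream": (0.85, 0.15),
    "loan":   (0.20, 0.80),
}

print(greedy_clusters(toy_vectors))
```

The result depends on the order of the words and on the threshold, so for real vocabularies a proper clustering method is the better choice.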



4. Combining word vectors into phrases (word2phrase)


Building phrases from word vectors involves two questions:

(1) How should words be linked together? (Reference paper)

(2) Once linked, how should the combined phrase be represented? By averaging.

For example, to treat "China River" as a single phrase, represent it by the average of the "China" and "river" vectors, and then use that vector to find its nearest-neighbor words.
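
The averaging step can be sketched as follows, again in Python for illustration only. The vectors are invented toy values, and `phrase_vector` and `nearest` are hypothetical helper names rather than functions from any package.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def phrase_vector(vectors, words):
    """Element-wise average of the vectors of the given words."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

def nearest(vectors, query, n=2, exclude=()):
    """The n words most cosine-similar to `query`, skipping `exclude`."""
    candidates = [w for w in vectors if w not in exclude]
    candidates.sort(key=lambda w: cosine(vectors[w], query), reverse=True)
    return candidates[:n]

# Invented toy vectors, for illustration only.
toy_vectors = {
    "china":   (0.9, 0.1, 0.2),
    "river":   (0.1, 0.9, 0.2),
    "yangtze": (0.6, 0.6, 0.1),
    "lake":    (0.1, 0.8, 0.3),
    "stock":   (0.1, 0.1, 0.9),
}

phrase = phrase_vector(toy_vectors, ["china", "river"])
print(nearest(toy_vectors, phrase, n=2, exclude=("china", "river")))
```

With these toy values the phrase vector lies between its two components, so a word related to both ("yangtze") ranks ahead of a word related to only one of them ("lake").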


5. Synonym property


Besides additivity, word vectors have another useful property: proximity. Near-synonyms end up close together, although the quality of the vectors depends on the quality of the training corpus. Whether antonyms can be told apart from synonyms is also a topic worth studying.



