Word2vec Introduction
Word2vec official website: https://code.google.com/p/word2vec/
- Word2vec is an open-source tool from Google that computes the distance between words based on an input corpus.
- It converts words into vector form, reducing the processing of text content to vector computations in a vector space; similarity measured in that space is then used to represent semantic similarity between texts.
- Word2vec computes the cosine similarity between word vectors, which ranges from -1 to 1; the closer the value is to 1, the more related the two words are.
- Word vectors: words are denoted with a distributed representation, often called a "word representation" or "word embedding."
In short: word vector representations place related or similar words closer together.
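The cosine measure mentioned above can be sketched in a few lines. The `cosine_similarity` helper and the toy vectors below are illustrative, not part of gensim:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors"
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Word2vec ranks candidate words by exactly this measure over their learned vectors.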
Specific Use
Collecting Corpus
This article uses an English corpus from the Internet: http://mattmahoney.net/dc/text8.zip
Training log: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s
The corpus is UTF-8 encoded and stored as a single very long line of space-separated words.
Note:
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
(Important things are said three times.)
Training on too small a corpus does not produce meaningful results.
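To see why a one-line corpus still works for training, here is a minimal sketch of a text8-style streaming reader. `OneLineCorpus` and the sentence length are illustrative stand-ins for gensim's `word2vec.Text8Corpus`, which does the same job more carefully:

```python
class OneLineCorpus(object):
    """Stream a text8-style file (one long line of space-separated words)
    as fixed-length "sentences", so training never needs the whole file
    tokenized in memory at once. Simplified stand-in for gensim's
    word2vec.Text8Corpus."""

    def __init__(self, path, max_sentence_length=10000):
        self.path = path
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        with open(self.path) as f:
            words = f.read().split()  # text8 has no newlines or punctuation
        for i in range(0, len(words), self.max_sentence_length):
            yield words[i:i + self.max_sentence_length]

# Demo on a tiny file standing in for text8
with open("tiny_text8.txt", "w") as f:
    f.write("anarchism originated as a term of abuse first used against early")

for sentence in OneLineCorpus("tiny_text8.txt", max_sentence_length=4):
    print(sentence)  # three chunks of at most 4 words each
```

gensim's trainer only requires an iterable of token lists, which is exactly what this yields.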
Word2vec Use
This article uses Python with the gensim module.
On Windows 7, installing gensim into a plain Python setup can be troublesome, so Anaconda is recommended; see "Python Development: Anaconda" and "Installing Gensim on Win7".
Straight to the code:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Function: test gensim's word2vec
Time: May 21, 2016 18:07:50
"""
from gensim.models import word2vec
import logging

# Main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u"C:\\Users\\lenovo\\Desktop\\word2vec experiment\\text8")  # load the corpus
model = word2vec.Word2Vec(sentences, size=200)  # train the skip-gram model; default window=5
# (the vector size in the source was garbled; 200 is a common choice)

# Compute the similarity/relatedness of two words
y1 = model.similarity("woman", "man")
print u"Similarity between woman and man:", y1
print "--------\n"

# Compute the list of words most related to a given word
y2 = model.most_similar("good", topn=20)  # the 20 most related
print u"Words most related to good:\n"
for item in y2:
    print item[0], item[1]
print "--------\n"

# Find correspondences (analogies)
print "'boy' is to 'father' as 'girl' is to ...?\n"
y3 = model.most_similar(["girl", "father"], ["boy"], topn=3)
for item in y3:
    print item[0], item[1]
print "--------\n"

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print "'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)
print "--------\n"

# Find the word that does not belong
y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
print u"Odd one out:", y4
print "--------\n"

# Save the model for reuse
model.save("text8.model")
# The corresponding loading method:
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format the C tool can parse
model.save_word2vec_format("text8.model.bin", binary=True)
# The corresponding loading method:
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)

if __name__ == "__main__":
    pass
```
Run Results
Similarity between woman and man: 0.685955257368
--------
Words most related to good:
bad 0.739628911018
poor 0.563425064087
luck 0.525990724564
fun 0.520761489868
quick 0.518206238747
really 0.491045713425
practical 0.479608744383
helpful 0.478456377983
love 0.477012127638
simple 0.475951403379
useful 0.474674522877
reasonable 0.473541408777
safe 0.473105460405
you 0.47159832716
courage 0.470109701157
dangerous 0.469624102116
happy 0.468672126532
wrong 0.467448621988
easy 0.467320919037
sick 0.466005086899
--------
'boy' is to 'father' as 'girl' is to ...?
mother 0.770967006683
wife 0.718966007233
grandmother 0.700566351414
--------
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------
Odd one out: cereal
--------
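The analogy results above come from simple vector arithmetic. Below is a minimal sketch of what a `most_similar(positive, negative)` query does, using hypothetical 2-d toy vectors rather than real word2vec output (gensim's implementation works the same way in principle, but on learned high-dimensional vectors):

```python
import math

def unit(v):
    # Scale a vector to length 1
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def most_similar(vocab, positive, negative, topn=3):
    """Average the unit vectors of the positive words, subtract those of
    the negative words, then rank the remaining words by cosine
    similarity to the resulting query vector."""
    dim = len(next(iter(vocab.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vocab[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vocab[w]))]
    query = unit(query)
    scores = []
    for w, v in vocab.items():
        if w in positive or w in negative:
            continue  # never return the query words themselves
        scores.append((w, sum(a * b for a, b in zip(query, unit(v)))))
    scores.sort(key=lambda t: t[1], reverse=True)
    return scores[:topn]

# Hypothetical 2-d vectors with a roughly constant "gender" offset
vocab = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.2],
    "queen": [3.0, 1.2],
    "apple": [-2.0, 0.5],
}
result = most_similar(vocab, positive=["woman", "king"], negative=["man"], topn=1)
print(result[0][0])  # 'queen' ranks first
```

This is why `model.most_similar(['girl', 'father'], ['boy'])` returns "mother": the offset from boy to father approximately matches the offset from girl to mother.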
References
Deep learning: Using Word2vec and Gensim:
http://www.open-open.com/lib/view/open1420687622546.html
"Using Gensim in Python": word2vec word-vector processing of an English corpus