[Python gensim] Using Word2vec word vectors to process an English corpus


Word2vec Introduction

Word2vec official website: https://code.google.com/p/word2vec/

    • Word2vec is an open-source tool from Google that computes the distance between words based on an input word set.
    • It converts words into vector form, so that processing text content reduces to vector computations in a vector space; similarity in the vector space then stands in for the semantic similarity of the text.
    • Word2vec reports the cosine value between word vectors; the similarities shown here fall between 0 and 1, and the larger the value, the more closely related the two words. (Cosine similarity in general ranges from -1 to 1.)
    • Word vectors: words denoted by a distributed representation, often referred to as a "word representation" or "word embedding."

In short: word-vector representations place related or similar words closer together.
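As a minimal sketch of what "closer" means here, the cosine similarity between two vectors can be computed with NumPy. The three-dimensional toy vectors below are invented for illustration only; real Word2vec embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (made up for illustration only)
woman = np.array([0.2, 0.8, 0.1])
man   = np.array([0.3, 0.7, 0.2])
car   = np.array([0.9, 0.1, 0.5])

print(cosine_similarity(woman, man))  # close to 1: related words
print(cosine_similarity(woman, car))  # smaller: less related
```

A word compared with itself gives a cosine of exactly 1, which is why values near 1 indicate near-synonyms.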

Collecting a Corpus

This article uses:
An English corpus from the Internet: http://mattmahoney.net/dc/text8.zip
Corpus training log: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s

The corpus is UTF-8 encoded and stored as a single, very long line ... as follows:
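gensim's Text8Corpus handles this single-line format by splitting it into whitespace-delimited lowercase words. A tiny stand-in for the format (the sample string below merely imitates the opening style of text8, for illustration only):

```python
# A tiny stand-in for the single-line text8 format (illustrative only).
raw = "anarchism originated as a term of abuse first used against"
tokens = raw.split()  # whitespace-delimited lowercase words, like text8
print(len(tokens), tokens[:3])
```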

Note:
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
Important things are worth saying three times.
Running on a corpus that is too small does not produce meaningful results.

Using Word2vec

The examples below use Python with the gensim module.
Under Windows 7, gensim is not easy to install on a plain Python setup, so Anaconda is recommended; see the earlier posts "Python Development with Anaconda" and "Installing Gensim on Win7".
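For reference, a typical install looks like the commands below (package name only; exact channels and versions are environment-dependent, and older gensim releases may need a compiler toolchain on Windows):

```shell
# With plain pip:
pip install gensim

# Or inside an Anaconda environment, which ships prebuilt binaries:
conda install -c conda-forge gensim
```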

Straight to the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Function: test gensim usage.  Date: May 21, 2016 18:07:50"""
from gensim.models import word2vec
import logging

# Main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the corpus (path as used in the original article)
sentences = word2vec.Text8Corpus(u"C:\\users\\lenovo\\desktop\\word2vec experiment\\text8")

# Train the skip-gram model; default window=5.
# (The dimensionality was garbled in the source; 200 is a typical value.
#  In gensim >= 4.0 the parameter is called vector_size instead of size.)
model = word2vec.Word2Vec(sentences, size=200)

# Compute the similarity/relatedness of two words
# (in gensim >= 4.0: model.wv.similarity)
y1 = model.similarity("woman", "man")
print(u"similarity of woman and man: %s" % y1)
print("--------\n")

# Compute the list of words most related to a word
y2 = model.most_similar("good", topn=20)  # the 20 most related words
print(u"words most related to good:\n")
for item in y2:
    print("%s %s" % (item[0], item[1]))
print("--------\n")

# Find a correspondence (analogy)
print("'boy' is to 'father' as 'girl' is to ...?\n")
y3 = model.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in y3:
    print("%s %s" % (item[0], item[1]))
print("--------\n")

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
print("--------\n")

# Find the word that does not belong with the others
y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
print(u"odd word out: %s" % y4)
print("--------\n")

# Save the model for later reuse
model.save("text8.model")
# Corresponding loading method:
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format the original C tool can parse
# (in gensim >= 4.0: model.wv.save_word2vec_format)
model.save_word2vec_format("text8.model.bin", binary=True)
# Corresponding loading method:
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)

if __name__ == "__main__":
    pass
Run Results
similarity of woman and man: 0.685955257368
--------
words most related to good:
bad 0.739628911018
poor 0.563425064087
luck 0.525990724564
fun 0.520761489868
quick 0.518206238747
really 0.491045713425
practical 0.479608744383
helpful 0.478456377983
love 0.477012127638
simple 0.475951403379
useful 0.474674522877
reasonable 0.473541408777
safe 0.473105460405
you 0.47159832716
courage 0.470109701157
dangerous 0.469624102116
happy 0.468672126532
wrong 0.467448621988
easy 0.467320919037
sick 0.466005086899
--------
'boy' is to 'father' as 'girl' is to ...?
mother 0.770967006683
wife 0.718966007233
grandmother 0.700566351414
--------
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------
odd word out: cereal
--------
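The analogy lookups above rest on vector arithmetic: most_similar(positive=['girl', 'father'], negative=['boy']) effectively searches for the word whose vector lies closest to father - boy + girl. A hand-made sketch of that idea (the toy vectors are invented for illustration, not real text8 embeddings):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors (illustrative only, not real text8 embeddings).
vecs = {
    "boy":    np.array([1.0, 0.0, 0.1]),
    "father": np.array([1.0, 1.0, 0.1]),
    "girl":   np.array([0.0, 0.0, 1.0]),
    "mother": np.array([0.0, 1.0, 1.0]),
    "car":    np.array([0.5, 0.2, 0.0]),
}

# father - boy + girl: the "child -> parent" offset applied to "girl"
query = vecs["father"] - vecs["boy"] + vecs["girl"]

# Rank the remaining vocabulary by cosine similarity to the query vector
best = max((w for w in vecs if w not in ("father", "boy", "girl")),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # "mother" wins in this toy setup
```

Because word offsets like "child -> parent" or singular -> plural tend to be roughly constant directions in the trained space, the same arithmetic recovers "he is to his as she is to her" and the other analogies shown above.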
