Word2vec Introduction
Word2vec official website: https://code.google.com/p/word2vec/
- Word2vec is an open-source tool from Google that computes the distance between words based on an input corpus.
- It converts words into vector form, reducing the processing of text content to vector computations in a vector space; similarity measured in that space is then used to represent semantic similarity between texts.
- Word2vec computes the cosine similarity between word vectors, which ranges from -1 to 1; the closer the value is to 1, the more related the two words are.
- Word vectors: words are denoted with a distributed representation, often called a "word representation" or "word embedding."
In short: word vector representations place related or similar words closer together.
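The cosine measure mentioned above can be sketched in a few lines. The `cosine_similarity` helper and the toy vectors below are illustrative, not part of gensim:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors"
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Word2vec ranks candidate words by exactly this measure over their learned vectors.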
Specific Use
Collecting Corpus
This article uses an English corpus from the Internet: http://mattmahoney.net/dc/text8.zip
Training log: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s
The corpus is UTF-8 encoded and stored as a single very long line of space-separated words.
Note:
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
In theory, the larger the corpus, the better.
(Important things are said three times.)
Training on too small a corpus does not produce meaningful results.
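To see why a one-line corpus still works for training, here is a minimal sketch of a text8-style streaming reader. `OneLineCorpus` and the sentence length are illustrative stand-ins for gensim's `word2vec.Text8Corpus`, which does the same job more carefully:

```python
class OneLineCorpus(object):
    """Stream a text8-style file (one long line of space-separated words)
    as fixed-length "sentences", so training never needs the whole file
    tokenized in memory at once. Simplified stand-in for gensim's
    word2vec.Text8Corpus."""

    def __init__(self, path, max_sentence_length=10000):
        self.path = path
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        with open(self.path) as f:
            words = f.read().split()  # text8 has no newlines or punctuation
        for i in range(0, len(words), self.max_sentence_length):
            yield words[i:i + self.max_sentence_length]

# Demo on a tiny file standing in for text8
with open("tiny_text8.txt", "w") as f:
    f.write("anarchism originated as a term of abuse first used against early")

for sentence in OneLineCorpus("tiny_text8.txt", max_sentence_length=4):
    print(sentence)  # three chunks of at most 4 words each
```

gensim's trainer only requires an iterable of token lists, which is exactly what this yields.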
Word2vec Use
This article uses Python with the gensim module.
On Windows 7, installing gensim into a plain Python setup can be troublesome, so Anaconda is recommended; see "Python Development: Anaconda" and "Installing Gensim on Win7".
Straight to the code:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Function: test gensim's word2vec
Time: May 21, 2016 18:07:50
"""
from gensim.models import word2vec
import logging

# Main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u"C:\\Users\\lenovo\\Desktop\\word2vec experiment\\text8")  # load the corpus
model = word2vec.Word2Vec(sentences, size=200)  # train the skip-gram model; default window=5
# (the vector size in the source was garbled; 200 is a common choice)

# Compute the similarity/relatedness of two words
y1 = model.similarity("woman", "man")
print u"Similarity between woman and man:", y1
print "--------\n"

# Compute the list of words most related to a given word
y2 = model.most_similar("good", topn=20)  # the 20 most related
print u"Words most related to good:\n"
for item in y2:
    print item[0], item[1]
print "--------\n"

# Find correspondences (analogies)
print "'boy' is to 'father' as 'girl' is to ...?\n"
y3 = model.most_similar(["girl", "father"], ["boy"], topn=3)
for item in y3:
    print item[0], item[1]
print "--------\n"

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print "'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)
print "--------\n"

# Find the word that does not belong
y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
print u"Odd one out:", y4
print "--------\n"

# Save the model for reuse
model.save("text8.model")
# The corresponding loading method:
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format the C tool can parse
model.save_word2vec_format("text8.model.bin", binary=True)
# The corresponding loading method:
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)

if __name__ == "__main__":
    pass
```
Run Results
Similarity between woman and man: 0.685955257368
--------
Words most related to good:
bad 0.739628911018
poor 0.563425064087
luck 0.525990724564
fun 0.520761489868
quick 0.518206238747
really 0.491045713425
practical 0.479608744383
helpful 0.478456377983
love 0.477012127638
simple 0.475951403379
useful 0.474674522877
reasonable 0.473541408777
safe 0.473105460405
you 0.47159832716
courage 0.470109701157
dangerous 0.469624102116
happy 0.468672126532
wrong 0.467448621988
easy 0.467320919037
sick 0.466005086899
--------
'boy' is to 'father' as 'girl' is to ...?
mother 0.770967006683
wife 0.718966007233
grandmother 0.700566351414
--------
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------
Odd one out: cereal
--------
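The analogy results above come from simple vector arithmetic. Below is a minimal sketch of what a `most_similar(positive, negative)` query does, using hypothetical 2-d toy vectors rather than real word2vec output (gensim's implementation works the same way in principle, but on learned high-dimensional vectors):

```python
import math

def unit(v):
    # Scale a vector to length 1
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def most_similar(vocab, positive, negative, topn=3):
    """Average the unit vectors of the positive words, subtract those of
    the negative words, then rank the remaining words by cosine
    similarity to the resulting query vector."""
    dim = len(next(iter(vocab.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vocab[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vocab[w]))]
    query = unit(query)
    scores = []
    for w, v in vocab.items():
        if w in positive or w in negative:
            continue  # never return the query words themselves
        scores.append((w, sum(a * b for a, b in zip(query, unit(v)))))
    scores.sort(key=lambda t: t[1], reverse=True)
    return scores[:topn]

# Hypothetical 2-d vectors with a roughly constant "gender" offset
vocab = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.2],
    "queen": [3.0, 1.2],
    "apple": [-2.0, 0.5],
}
result = most_similar(vocab, positive=["woman", "king"], negative=["man"], topn=1)
print(result[0][0])  # 'queen' ranks first
```

This is why `model.most_similar(['girl', 'father'], ['boy'])` returns "mother": the offset from boy to father approximately matches the offset from girl to mother.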
References
Deep learning: Using Word2vec and Gensim:
http://www.open-open.com/lib/view/open1420687622546.html
"Using Gensim in Python": word2vec word-vector processing of an English corpus