Python Version of Word2vec -- Gensim Learning Notes: The Similarity Measure of Chinese Words


Objective

Related content link, the first part of this series: Google Word2vec Learning Notes
Yesterday I finally tried Google's own word2vec source code and spent a long time training data, only to find that it apparently cannot be used directly from Python. So I searched the internet for a word2vec implementation that Python can use, and found Gensim.

Gensim (accessing the site from within China may require a proxy/VPN):
http://radimrehurek.com/gensim/models/word2vec.html

Installation

Gensim has some dependencies; first make sure the following are installed:

Python >= 2.6. Tested with versions 2.6, 2.7, 3.3, 3.4 and 3.5. Support for Python 2.5 was discontinued starting with Gensim 0.10.0; if you must use Python 2.5, install Gensim 0.9.1.
NumPy >= 1.3. Tested with versions 1.9.0, 1.7.1, 1.7.0, 1.6.2, 1.6.1rc2, 1.5.0rc1, 1.4.0, 1.3.0, 1.3.0rc2.
SciPy >= 0.7. Tested with versions 0.14.0, 0.12.0, 0.11.0, 0.10.1, 0.9.0, 0.8.0, 0.8.0b1, 0.7.1, 0.7.0.
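If you are not sure which versions are already on your machine, a quick check like the following (a small sketch I am adding here, not part of the original post) prints the installed versions:

import numpy, scipy
print 'numpy:', numpy.__version__
print 'scipy:', scipy.__version__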

One point deserves special attention: make sure your system has a C compiler, otherwise training will be very slow. In fact, you can first compile and test Google's official C version, and then install Gensim; Gensim's word2vec is consistent with the official code.

The installation guide on the website offers two options: easy_install or pip. Note that both may require sudo for elevated permissions.

easy_install -U gensim

or (this one is modified from the official site's command; it worked fine in my tests):

pip install --upgrade --ignore-installed six gensim

I used the second method. If the dependencies above are not yet installed, you can install them directly with pip or easy_install after setting up Python and the related tools.

For model training, if Cython is not installed, multithreaded training is not possible and speed suffers considerably, so install Cython as well:

pip install cython
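To confirm that the compiled (C/Cython) training routines are actually being used, you can check gensim's FAST_VERSION flag. The sketch below is my own addition, assuming gensim is already installed:

from gensim.models import word2vec
# FAST_VERSION is -1 when the optimized extension is missing and training
# falls back to the slow pure-Python code path; any value >= 0 means the
# fast compiled routines (and effective multithreaded training) are available.
print word2vec.FAST_VERSION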

1. Training the model:
If all installation and configuration work is done, you can start using Gensim. The corpus here is the corpus-seg.txt file that was already word-segmented in my previous blog post. After training the model, save it to a file so it can be loaded directly next time.

Blog link: Google Word2vec Learning Notes

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from gensim.models import Word2Vec
import logging, gensim, os

class TextLoader(object):
    def __init__(self):
        pass

    def __iter__(self):
        input = open('corpus-seg.txt', 'r')
        line = str(input.readline())
        counter = 0
        while line is not None and len(line) > 4:
            # print line
            segments = line.split(' ')  # the corpus is already segmented, tokens separated by spaces
            yield segments
            line = str(input.readline())

sentences = TextLoader()
model = gensim.models.Word2Vec(sentences, workers=8)
model.save('word2vector2.model')
print 'ok'
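The training call above only overrides workers=8 and otherwise relies on gensim's defaults. For reference, here is a hedged sketch of the commonly tuned parameters; the values are illustrative assumptions, not the settings used in this post:

model = gensim.models.Word2Vec(
    sentences,
    size=100,      # dimensionality of the word vectors
    window=5,      # context window size
    min_count=5,   # ignore words appearing fewer than 5 times
    sg=0,          # 0 = CBOW, 1 = skip-gram
    workers=8)     # training threads; real parallelism needs Cython installed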

Here the file is loaded with the custom TextLoader code above; of course, you could also use gensim's LineSentence (see the sketch below). The reason for writing the code above is that if your file format is more unusual, you can refer to it and adapt the processing yourself.
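For a whitespace-separated, one-sentence-per-line file like corpus-seg.txt, gensim's built-in LineSentence does the same job. A minimal sketch (my addition, assuming the same file path as above):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# each line of corpus-seg.txt is one sentence, tokens separated by whitespace
sentences = LineSentence('corpus-seg.txt')
model = Word2Vec(sentences, workers=8)
model.save('word2vector2.model')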

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from gensim.models import Word2Vec
import logging, gensim, os

# Load the model
model = Word2Vec.load('word2vector.model')

# Compare the similarity of two words; the higher the value, the more similar they are.
# (The word strings below were Chinese in the original corpus; English glosses are used here.)
print 'Similarity between "Tangshan" and "China": ' + str(model.similarity('Tangshan', 'China'))
print 'Similarity between "China" and "Motherland": ' + str(model.similarity('Motherland', 'China'))
print 'Similarity between "China" and "China": ' + str(model.similarity('China', 'China'))

# Constrain the query with positive and negative example words
result = model.most_similar(positive=['China', 'city'], negative=['students'])
print 'Words close to "China" and "city" but not close to "students":'
for item in result:
    print '   "' + item[0] + '"  similarity: ' + str(item[1])

result = model.most_similar(positive=['man', 'rights'], negative=['woman'])
print 'Words close to "man" and "rights" but not close to "woman":'
for item in result:
    print '   "' + item[0] + '"  similarity: ' + str(item[1])

result = model.most_similar(positive=['woman', 'law'], negative=['man'])
print 'Words close to "woman" and "law" but not close to "man":'
for item in result:
    print '   "' + item[0] + '"  similarity: ' + str(item[1])

# Find the word that does not belong in a group of words
print 'Teacher, student, class, principal -- which does not match? word2vec says: ' + model.doesnt_match('teacher student class principal'.split())
print 'Car, train, bike, camera -- which does not match? word2vec says: ' + model.doesnt_match('car train bike camera'.split())
print 'Rice, white, blue, green, red -- which does not match? word2vec says: ' + model.doesnt_match('rice white blue green red'.split())

# Look at the vector of a specific word directly
print 'The feature vector of "China" is:'
print model['China']
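One thing to watch out for when querying: if a word does not appear in the training vocabulary, gensim raises a KeyError. A small defensive sketch (my addition; the word strings are placeholders):

try:
    print model.similarity('Tangshan', 'China')
except KeyError as e:
    # the word was filtered out (e.g. by min_count) or never appeared in the corpus
    print 'word not in vocabulary:', e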

Here is the output of one of my runs:

/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/mebiuw/Documents/doing/bot/word2vector/model_loader.py
Similarity between "Tangshan" and "China": 0.1720725224
Similarity between "China" and "Motherland": 0.456236474841
Similarity between "China" and "China": 1.0
Words close to "China" and "city" but not close to "students":
   "global"                similarity: 0.60819453001
   "Asia"                  similarity: 0.588450014591
   "China"                 similarity: 0.545840501785
   "the world"             similarity: 0.540009200573
   "famous city"           similarity: 0.518879711628
   "Silicon Valley"        similarity: 0.517688155174
   "Yangtze River Delta"   similarity: 0.512072384357
   "domestic"              similarity: 0.511703968048
   "national"              similarity: 0.507433652878
   "international"         similarity: 0.505781650543
Words close to "man" and "rights" but not close to "woman":
   "benefits"              similarity: 0.67150759697
   "privacy"               similarity: 0.666741013527
   "suffrage"              similarity: 0.626420497894
   "property rights"       similarity: 0.617758154869
   "benefits"              similarity: 0.610122740269
   "obligation"            similarity: 0.608267366886
   "dignity"               similarity: 0.605125784874
   "inheritance"           similarity: 0.603345394135
   "law"                   similarity: 0.596215546131
   "priority"              similarity: 0.59428691864
Words close to "woman" and "law" but not close to "man":
   "labor law"             similarity: 0.652353703976
   "justice"               similarity: 0.652238130569
   "marriage law"          similarity: 0.631354928017
   "civil law"             similarity: 0.624598622322
   "regulations"           similarity: 0.623348236084
   "criminal law"          similarity: 0.611774325371
   "international law"     similarity: 0.608191132545
   "litigation"            similarity: 0.607495307922
   "reach"                 similarity: 0.599701464176
   "force"                 similarity: 0.597045660019
Teacher, student, class, principal -- which does not match? word2vec says: class
Car, train, bike, camera -- which does not match? word2vec says: camera
Rice, white, blue, green, red -- which does not match? word2vec says: rice
The feature vector of "China" is:
[-0.08299727 -3.58397388 -0.55335367  1.4152931   3.94189262 -2.03232622
  1.31824613 -1.75067747 -1.66100371 -1.70273054 -3.47409034  2.70463562
 -0.87696695 -2.53364205 -2.12181163 -7.60758495 -0.6421982   2.9187181
  1.38164878 -0.05457138  1.02129567  1.64029694  0.21894537 -0.82295948
  3.30296516 -0.65931851  1.39501953  0.71423614  2.0213325   2.97903037
  1.46234405 -0.30748805  2.45258284 -0.51123774 -1.84140313 -0.92091084
 -4.28990364  4.0552578  -2.01020265  0.85769647 -4.6681509  -2.88254309
 -1.80714786  0.52874494  3.31922817  0.43049669 -3.03839922 -1.20092583
  2.75143361  0.99246925  0.41537657 -0.78819919  1.28469515  0.12056304
 -4.54702759 -1.36031103  0.35673267 -0.36477017 -3.63630986 -0.21103215
  2.16747832 -0.47925043 -0.63043374 -2.25911093 -1.47486925  4.2380085
 -0.22334123  3.2125628   0.91901672  0.66508955 -2.80306172  3.42943978
  2.26001453  5.24837303 -4.0164156  -3.28324246  4.40493822 -0.14068756
 -4.31880903  1.98531461  0.2576215  -2.69446373  0.59171939 -0.48250189
 -0.67274201  1.96152794 -2.83031917  0.54468328  2.57930231 -1.44152164
 -0.61808151  1.03311574 -3.48526216 -2.35903311 -3.9816277  -0.93071622
  2.77195001  1.8912288  -3.45096016  4.93347549]

Process finished with exit code 0

For now, these notes only cover installation and basic usage; more work will be described in subsequent blog posts.

Problems you may encounter:

ValueError: numpy.dtype has the wrong size, try recompiling:
http://stackoverflow.com/questions/17709641/valueerror-numpy-dtype-has-the-wrong-size-try-recompiling

Resources

1. Official tutorial: http://radimrehurek.com/gensim/models/word2vec.html
1.1 Translation of the official tutorial: http://blog.csdn.net/Star_Bob/article/details/47808499
2. Word2vec word vectors for an English corpus: http://blog.csdn.net/churximi/article/details/51472203
3. Google Word2vec Learning Notes: http://blog.csdn.net/mebiuw/article/details/52295138
