A recent NLP project of mine required Word2vec (W2V) for semantic-similarity calculations. The purpose of this article is to walk through configuring the Gensim environment on Windows and running a demo of training and testing.
Word2vec is a natural language processing (NLP) tool released by Google several years ago that maps natural language to numeric vector representations that computers can work with efficiently. The Gensim library used in this article provides a Python implementation of W2V, making its functionality available in a Python environment.
To run the W2V sample, four things need to be done: 1. Configure the environment. 2. Prepare the corpus material. 3. Train the model. 4. Test the trained model. We will implement these one by one below. My operating environment is Win10 + Python 3.5.2.
1. Configure the Environment
First install Python, then use Python's pip package manager to install the gensim package. This part is straightforward, so it is not covered in detail here.
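For reference, a minimal install sketch (assuming Python and pip are already on the PATH; hanziconv and pynlpir are the packages used by the scripts later in this article, and the final command, which downloads the NLPIR license data, is per the PyNLPIR documentation):

pip install gensim
pip install hanziconv
pip install pynlpir
pynlpir update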
2. Corpus Material Related Operations
In NLP, the quality and quantity of the corpus matter as much as the accuracy and speed of the algorithm; both are important factors in the performance and user experience of the final system.
To ensure the integrity of the corpus, this article takes Wikipedia's Chinese corpus as the example (download address: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2). Two kinds of operation are needed on the raw corpus obtained from the Internet: preprocessing the corpus data, and word segmentation.
Preprocessing: the data obtained from the Internet is not in the text format we need, so it has to be preprocessed (here, that means extracting it into plain text and then converting traditional Chinese characters to simplified ones). A concrete code example follows; the comments in the code explain each step. The code is adapted from another user's code base, with thanks to its author.
Tips
1. It is easy to run into encoding errors here: you must pass encoding="utf-8" when opening the output file. The reason is that on Windows, open() defaults to the system locale encoding (typically GBK on a Chinese-locale system), so writing characters outside that encoding raises UnicodeEncodeError.
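A quick way to confirm the default that causes this (a minimal sketch, separate from the scripts below):

import locale
# On a Chinese-locale Windows system this typically prints 'cp936' (GBK).
# open() without an explicit encoding argument falls back to this value,
# so writing characters outside GBK raises UnicodeEncodeError.
print(locale.getpreferredencoding(False))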
# Preprocess the Wikipedia dump (convert the data format and simplify traditional Chinese characters)
from __future__ import print_function
from hanziconv import HanziConv
import logging
import os.path
import six
import sys
from gensim.corpora import WikiCorpus

# Constant section: modify dir_read and the file names as needed
dir_read = "resource\\"
orignal_name = "zhwiki-latest-pages-articles.xml.bz2"
result_name = "wiki.chs.text"

# Top-level scope section
if __name__ == '__main__':
    # Set up the logging module to report progress
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # Append the input dump and the output file to the argument list
    sys.argv.extend([dir_read + orignal_name, dir_read + result_name])
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.chs.txt")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    # Use WikiCorpus to convert the corpus to the corresponding format
    output = open(outp, 'w', encoding="utf-8")  # the default encoding on Windows is GBK
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            temp_string = b' '.join(text).decode("utf-8")
            # Use the HanziConv module to convert traditional characters to simplified
            temp_string = HanziConv.toSimplified(temp_string)
            output.write(temp_string + '\n')
        else:
            temp_string = space.join(text)
            temp_string = HanziConv.toSimplified(temp_string)
            output.write(temp_string + "\n")
        i += 1
        if i % 1000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished saved " + str(i) + " articles")
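Assuming the script is saved as process_wiki.py (the name used in its own usage message), running `python process_wiki.py` from the directory containing the resource folder produces wiki.chs.text, with one plain-text, simplified-Chinese article per line.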
Word segmentation: I use the NLPIR segmenter (through the PyNLPIR wrapper) to split the text into words; see the official documentation at http://pynlpir.readthedocs.io/en/latest/installation.html. Installation is not difficult, so it is not covered here. The example code follows; the idea is to drop the whitespace tokens already in the corpus and append a space after each segmented word.
# -*- coding: utf-8 -*-
# Word segmentation with pynlpir
import pynlpir

# Constant section: modify dir_read and the file names as needed
dir_read = ".\\resource\\"
orignal_file = "material.text"
result_file = "splitword.text"

# Top-level scope section
if __name__ == "__main__":
    f_ori = open(dir_read + orignal_file, "r", encoding="utf-8")
    f_result = open(dir_read + result_file, "w+", encoding="utf-8")
    pynlpir.open()  # load the segmenter
    seq = 0
    while True:
        if seq % 1000 == 0:
            print("Divided", seq, "rows of data")
        seq += 1
        temp_string = f_ori.readline()
        if temp_string == "":
            break
        try:
            # Use the segmenter; pos_tagging=False returns plain tokens
            temp_split = pynlpir.segment(temp_string, pos_tagging=False)
            for temp_split_element in temp_split:
                if temp_split_element == " ":
                    # skip whitespace tokens left over from the original corpus
                    continue
                else:
                    f_result.write(temp_split_element + " ")
        except UnicodeDecodeError:
            print("Problem with corpus")
    print("Finish")
3. Model Training
The model training here is very simple; the core code is just two lines. The full script is as follows.
# -*- coding: utf-8 -*-
# Train the word vector model
import logging
import os
import sys
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Constant section
dir_read = ".\\resource\\"
inp_file = "splitword.text"
outp_model = "medical.model.text"
outp_vector = "medical.vector.text"

# Top-level scope section
if __name__ == '__main__':
    sys.argv.extend([dir_read + inp_file, dir_read + outp_model, dir_read + outp_vector])
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # Check and process input arguments
    if len(sys.argv) < 4:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # The two core lines: stream the segmented corpus and train the model
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
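The hyperparameters above (size=400, window=5, min_count=5) follow the original script and the gensim API of its time. Note that gensim 4.x renamed the size argument to vector_size, so on a current gensim the equivalent minimal sketch is:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Same training call under the gensim 4.x API; the input path is per the script above.
model = Word2Vec(LineSentence(".\\resource\\splitword.text"),
                 vector_size=400, window=5, min_count=5, workers=4)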
4. Model Run Test
After training on the corpus completes, the model file and the vector file configured above are produced. The following script loads the saved model and runs a similarity query against it.
# -*- coding: utf-8 -*-
# Run the model test
import gensim

# Constant section
dir_read = "resource\\"
model_name = "wiki.chs.model.text"

# Top-level scope section
if __name__ == "__main__":
    model = gensim.models.Word2Vec.load(dir_read + model_name)
    result = model.most_similar("Henan")  # words most similar to the query term
    for e in result:
        print(e[0], e[1])
The test run prints the query term's ten most similar words (the default topn), each with its cosine similarity score.
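Beyond most_similar, the trained vectors support other common queries. A small sketch (word_a, word_b, and word_c are placeholders for any tokens present in the vocabulary; on gensim 3.x and later these calls live on model.wv):

# cosine similarity between two in-vocabulary words
print(model.wv.similarity("word_a", "word_b"))
# analogy-style query: words closest to vector("word_a") - vector("word_b") + vector("word_c")
print(model.wv.most_similar(positive=["word_a", "word_c"], negative=["word_b"], topn=5))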