Implementing Word2Vec Model Training and Testing on Windows 10 Using Gensim


A recent NLP project required using Word2vec (W2V) to compute semantic similarity. The purpose of this article is to set up the Gensim environment on Windows and implement demo training and testing.
Word2vec is a natural language processing (NLP) framework released by Google a few years ago that maps natural language into numerical forms that computers can work with efficiently. The Gensim library used in this article can be understood as a Python implementation of W2V: it provides the W2V functionality in a Python environment.
Running the W2V sample involves four main tasks: 1. Configure the environment. 2. Prepare the corpus. 3. Train the model. 4. Test the model. We will implement these one by one. My operating environment is Win10 + Python 3.5.2.

1. Configure the Environment

First, install Python, and then use the pip package management tool to install the Gensim package (the pynlpir and hanziconv packages used in later steps can be installed the same way). This part is relatively simple, so I will not go into detail; a typical installation is sketched below.
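For reference, installing with pip looks roughly like this (pynlpir also ships an update command that fetches its licence file; run it once after installation):

pip install gensim
pip install hanziconv
pip install pynlpir
pynlpir update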

2. Corpus Material Related Operations

In NLP, the quality and quantity of the corpus matter as much as the accuracy and speed of the algorithm; both are important factors in the performance and user experience of the system.
To ensure the integrity of the corpus, this article uses the Chinese Wikipedia corpus as an example (download address: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2). A raw corpus obtained from the Internet needs two kinds of operations: preprocessing of the corpus data, and word segmentation.
Preprocessing: the data obtained from the Internet is not in the text format we need, so we must preprocess it (here, that means extracting it into plain text and then converting traditional Chinese characters to simplified ones). A concrete code example follows; the comments in the code should make it readable. The code was adapted from another user's code base; thanks to that user for providing it.
Tips
1. It is easy to run into encoding errors here: you must pass encoding="utf-8" when opening the output file. I am not yet sure exactly why (most likely because open() on Windows defaults to the system locale encoding, such as GBK, which cannot represent every character in the corpus), but I will add an explanation later.
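A minimal illustration of the point (the file name is just an example):

# Without an explicit encoding, open() on Windows falls back to the
# locale default (often GBK on Chinese systems), which cannot represent
# every character in the corpus and leads to encoding errors.
output = open("wiki.chs.text", "w", encoding="utf-8")
output.write(u"some simplified-Chinese text\n")
output.close()

The full preprocessing script follows.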

# -*- coding: utf-8 -*-
# Preprocess the Wikipedia dump (convert the data format and
# convert traditional Chinese characters to simplified ones)
from __future__ import print_function

import logging
import os.path
import sys

import six
from hanziconv import HanziConv
from gensim.corpora import WikiCorpus

# Constant section: modify the file names and path here as needed
dir_read = "resource\\"
orignal_name = "zhwiki-latest-pages-articles.xml.bz2"
result_name = "wiki.chs.text"

# Top-level scope section
if __name__ == '__main__':
    # Set up the logging module so progress is reported while running
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # Append the paths of the file to preprocess and the output file
    sys.argv.extend([dir_read + orignal_name, dir_read + result_name])
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.chs.txt")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    # Use WikiCorpus to convert the corpus to the corresponding format;
    # the default encoding on Windows is GBK, so force utf-8
    output = open(outp, 'w', encoding="utf-8")
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            temp_string = b' '.join(text).decode("utf-8")
        else:
            temp_string = space.join(text)
        # Use the HanziConv module to simplify the traditional-Chinese content
        temp_string = HanziConv.toSimplified(temp_string)
        output.write(temp_string + '\n')
        i += 1
        if i % 1000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished saved " + str(i) + " articles")
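Note that converting the full zhwiki dump takes a considerable amount of time; the script logs its progress every 1000 articles, and the resulting text file stores one article per line.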

Word segmentation: I use NLPIR (via the pynlpir package) to implement the segmentation; for the official documentation, see http://pynlpir.readthedocs.io/en/latest/installation.html. Installation is not difficult, so I will not say much about it here. The example code is as follows; the idea is to drop the spaces already present in the corpus and append a space after each segmented word.

# -*- coding: utf-8 -*-
# Segment the preprocessed corpus with NLPIR

import pynlpir

# Constant section: modify the file names and path here as needed
dir_read = ".\\resource\\"
orignal_file = "material.text"
result_file = "splitword.text"

# Top-level scope section
if __name__ == "__main__":
    f_ori = open(dir_read + orignal_file, "r", encoding="utf-8")
    f_result = open(dir_read + result_file, "w+", encoding="utf-8")
    pynlpir.open()  # load the word segmenter
    seq = 0
    while True:
        if seq % 1000 == 0:
            print("divided", seq, "rows of data")
        seq += 1
        temp_string = f_ori.readline()
        if temp_string == "":  # readline() returns "" at end of file
            break
        try:
            # Segment the line; pos_tagging=False returns plain tokens
            temp_split = pynlpir.segment(temp_string, pos_tagging=False)
            for temp_split_element in temp_split:
                if temp_split_element == " ":
                    continue  # drop the spaces already present in the corpus
                else:
                    f_result.write(temp_split_element + " ")
        except UnicodeDecodeError:
            print("problem with corpus")
    f_ori.close()
    f_result.close()
    print("finish")
3. Model Training

The model training itself is very simple; the core is just two lines of code. The full script is as follows.

# -*- coding: utf-8 -*-
# Train the word vector model

import logging
import os
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Constant part
dir_read = ".\\resource\\"
inp_file = "splitword.text"
outp_model = "medical.model.text"
outp_vector = "medical.vector.text"

# Top-level scope section
if __name__ == '__main__':
    sys.argv.extend([dir_read + inp_file, dir_read + outp_model, dir_read + outp_vector])
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # Check and process input arguments
    if len(sys.argv) < 4:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # Core of the training: build the model from the segmented corpus
    # (400-dimensional vectors, context window of 5, ignore words that
    # appear fewer than 5 times, one worker per CPU core)
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
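The script saves the result twice: outp_model holds the full model, which can be reloaded and trained further, while outp_vector holds only the word vectors in the plain-text word2vec format. As a sketch, the vector file can be reloaded on its own with KeyedVectors (the file path comes from the constants above; the query word is illustrative):

from gensim.models import KeyedVectors

# Load just the word vectors written by save_word2vec_format
wv = KeyedVectors.load_word2vec_format(".\\resource\\medical.vector.text", binary=False)
print(wv.most_similar("Henan"))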

4. Model Run Test

After training on the corpus, the model files named above are produced. The following code loads a saved model and runs a similarity query against it.

# -*- coding: utf-8 -*-
# Run the model test

import gensim

# Constant part
dir_read = "resource\\"
model_name = "wiki.chs.model.text"

# Top-level scope part
if __name__ == "__main__":
    model = gensim.models.Word2Vec.load(dir_read + model_name)
    # Query the words most similar to the given word
    result = model.most_similar("Henan")
    for e in result:
        print(e[0], e[1])

The test output lists each similar word alongside its similarity score.
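Beyond most_similar, the trained vectors also support pairwise similarity and analogy-style queries. A brief sketch (the query words are illustrative and must exist in the model's vocabulary; with a Chinese corpus the tokens would be the Chinese words themselves):

# Cosine similarity between two words
print(model.wv.similarity("Henan", "Hebei"))

# Analogy query: "Henan" is to "Zhengzhou" as "Hebei" is to ...?
print(model.wv.most_similar(positive=["Zhengzhou", "Hebei"], negative=["Henan"], topn=3))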
