Choice of word segmentation tool:
There are many Chinese word segmentation tools, such as jieba, THULAC, and SnowNLP. This article uses jieba under Python 3. jieba was chosen because it is relatively simple, easy to learn and use, and gives good segmentation results.
Preparation before segmentation:
- The Chinese document to be segmented
- A document in which to store the segmentation results
- A Chinese stop-word document (many stop-word lists can be found online)
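A stop-word document of the kind listed above typically holds one word per line. A minimal sketch of loading such a file into a set (the file name and its three-word contents here are purely illustrative; the author's actual file is named `Chinsesstoptxt.txt`):

```python
from pathlib import Path

# Create a tiny illustrative stop-word file, one word per line
path = Path("stopwords_demo.txt")
path.write_text("的\n了\n是\n", encoding="utf-8")

# Load it into a set: strip whitespace, skip blank lines;
# a set gives O(1) membership tests during filtering
stopwords = {line.strip()
             for line in path.read_text(encoding="utf-8").splitlines()
             if line.strip()}
print(stopwords)
```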
Presentation of the segmentation results:
- The Chinese document before stop-word removal and segmentation
- The result document after stop-word removal and segmentation
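The before/after contrast described above can be sketched on a single line; the token list and the tiny stop-word set below are made up for illustration (in practice the tokens come from jieba and the stop words from the stop-word document):

```python
# Illustrative stop-word set and a line already segmented into tokens
stopwords = {"的", "是", "了"}
segmented = ["这", "是", "一个", "简单", "的", "例子"]

# Keep tokens that are neither stop words nor tab characters,
# then join them with spaces, as in the result document
filtered = [w for w in segmented if w not in stopwords and w != "\t"]
line_out = " ".join(filtered)
print(line_out)  # 这 一个 简单 例子
```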
Code implementation of segmentation and stop-word removal:

```python
import jieba

# Build the stop-word list
def stopwordslist():
    stopwords = [line.strip() for line in open('Chinsesstoptxt.txt', encoding='UTF-8').readlines()]
    return stopwords

# Segment a sentence into Chinese words
def seg_depart(sentence):
    # Segment each line of the document
    print("Segmenting...")
    sentence_depart = jieba.cut(sentence.strip())
    # Build the stop-word list
    stopwords = stopwordslist()
    # The output is collected in outstr
    outstr = ''
    # Remove the stop words
    for word in sentence_depart:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

# Paths of the input and output documents
filename = "Init.txt"
outfilename = "out.txt"
inputs = open(filename, 'r', encoding='UTF-8')
outputs = open(outfilename, 'w', encoding='UTF-8')

# Write the results to out.txt
for line in inputs:
    line_seg = seg_depart(line)
    outputs.write(line_seg + '\n')
    print("-------------------Segmenting and removing stop words-----------")
outputs.close()
inputs.close()
print("Stop-word removal and segmentation finished!")
```
Python uses jieba to implement Chinese document segmentation and stop-word removal