1. Extract the following files:

training data: icwb2-data/training/pku_training.utf8
test data: icwb2-data/testing/pku_test.utf8
gold-standard segmentation: icwb2-data/gold/pku_test_gold.utf8
scoring tool: icwb2-data/scripts/score

2. Algorithm description

The algorithm is the simplest forward maximum matching (FMM): build a dictionary from the training data, then scan the test data from left to right, at each position splitting off the longest dictionary word that matches, until the sentence ends. For example, if the dictionary contains both 北京 and 北京大学, FMM always takes the longer 北京大学 when both match; a minimal sketch of this bare matching loop follows below.

Note: this is the initial algorithm, and the code can be kept within about 60 lines. Testing later showed that numbers were not handled well, so number-specific processing was added.
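Before the full program, here is a minimal, self-contained sketch of that bare matching loop (the dictionary and sentence are toy data made up for illustration; the real program below additionally indexes words by their first character and adds the number rules):

# -*- coding: utf-8 -*-
# Minimal forward maximum matching: at each position take the longest
# dictionary word that matches, otherwise emit a single character.
def fmm(words, sentence):
    result, start = [], 0
    while start < len(sentence):
        match = sentence[start]  # fall back to a single character
        for w in words:
            if sentence.startswith(w, start) and len(w) > len(match):
                match = w        # keep the longest match so far
        result.append(match)
        start += len(match)
    return result

print u'/'.join(fmm([u'北京', u'北京大学', u'大学'], u'北京大学的大学生'))
# prints: 北京大学/的/大学/生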
3. Source code and comments

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: minix
# Date:   2013-03-20
# Email:  minix007@foxmail.com
import codecs
import sys

# Special symbols handled by hand-written rules
numMath = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9']
numMath_suffix = [u'.', u'%', u'亿', u'万', u'千', u'百', u'十', u'个']
numCn = [u'一', u'二', u'三', u'四', u'五', u'六', u'七', u'八', u'九', u'〇', u'零']
numCn_suffix_date = [u'年', u'月', u'日']
numCn_suffix_unit = [u'亿', u'万', u'千', u'百', u'十', u'个']
special_char = [u'(', u')']

def proc_num_math(line, start):
    """Handle a run of Arabic numerals (and their suffixes) in a sentence."""
    oldstart = start
    # Guard against running off the end of the text
    while start < len(line) and (line[start] in numMath or line[start] in numMath_suffix):
        start = start + 1
    if start < len(line) and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart

def proc_num_cn(line, start):
    """Handle a run of Chinese numerals (and their unit suffixes) in a sentence."""
    oldstart = start
    while start < len(line) and (line[start] in numCn or line[start] in numCn_suffix_unit):
        start = start + 1
    if start < len(line) and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart

def rules(line, start):
    """Dispatch to the special number rules."""
    if line[start] in numMath:
        return proc_num_math(line, start)
    elif line[start] in numCn:
        return proc_num_cn(line, start)

def genDict(path):
    """Build the dictionary from the segmented training file."""
    f = codecs.open(path, 'r', 'utf-8')
    contents = f.read()
    f.close()
    contents = contents.replace(u'\r', u'')
    contents = contents.replace(u'\n', u'')
    # The training file separates words with spaces
    mydict = contents.split(u' ')
    # Drop duplicates from the word list
    newdict = list(set(mydict))
    newdict.remove(u'')
    # Build the dictionary: key is a word's first character,
    # value is the list of all words starting with that character
    truedict = {}
    for item in newdict:
        if len(item) > 0 and item[0] in truedict:
            value = truedict[item[0]]
            value.append(item)
            truedict[item[0]] = value
        else:
            truedict[item[0]] = [item]
    return truedict

def print_unicode_list(uni_list):
    for item in uni_list:
        print item,

def divideWords(mydict, sentence):
    """Scan the sentence from left to right and split off the longest
    matching word at each position, until the sentence is exhausted."""
    ruleChar = []
    ruleChar.extend(numCn)
    ruleChar.extend(numMath)
    result = []
    start = 0
    senlen = len(sentence)
    while start < senlen:
        curword = sentence[start]
        maxlen = 1
        # First check whether one of the special rules matches
        if curword in numCn or curword in numMath:
            maxlen = rules(sentence, start)
        # Then search the dictionary for the longest matching word
        if curword in mydict:
            words = mydict[curword]
            for item in words:
                itemlen = len(item)
                if sentence[start:start + itemlen] == item and itemlen > maxlen:
                    maxlen = itemlen
        result.append(sentence[start:start + maxlen])
        start = start + maxlen
    return result

def main():
    args = sys.argv[1:]
    if len(args) < 3:
        print 'Usage: python dw.py dict_path test_path result_path'
        exit(-1)
    dict_path = args[0]
    test_path = args[1]
    result_path = args[2]
    dicts = genDict(dict_path)
    fr = codecs.open(test_path, 'r', 'utf-8')
    test = fr.read()
    result = divideWords(dicts, test)
    fr.close()
    fw = codecs.open(result_path, 'w', 'utf-8')
    for item in result:
        fw.write(item + ' ')
    fw.close()

if __name__ == "__main__":
    main()
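To see the added number handling in action, here is a quick check; toy_dict and the sentence are made-up examples, and it assumes the listing has been saved as dw.py (the file name used in step 4):

# -*- coding: utf-8 -*-
# The number rules group a digit run with its date suffix before the
# dictionary lookup runs, so '1998年' survives as a single token.
from dw import divideWords

# genDict builds this shape: first character -> words starting with it
toy_dict = {u'北': [u'北京', u'北京大学'], u'大': [u'大学']}
print u'/'.join(divideWords(toy_dict, u'1998年北京大学生'))
# prints: 1998年/北京大学/生

Because the match is greedy, 北京大学 beats 北京 even though both are in the dictionary; this greediness is also the main source of FMM's mistakes.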
4. Test and score. Run dw.py on the training data and the test data to generate a result file, then pass the training data, the gold-standard segmentation, and the generated result to score to compute the scores. Use tail to view the overall summary in the last rows of the score file; score.utf8 also contains a large number of line-by-line comparisons, which are useful for finding where your segmentation falls short. Note: the whole test was done on Ubuntu.

$ python dw.py pku_training.utf8 pku_test.utf8 pku_result.utf8
$ perl score pku_training.utf8 pku_test_gold.utf8 pku_result.utf8 > score.utf8
$ tail -22 score.utf8
INSERTIONS:    0
DELETIONS:    0
SUBSTITUTIONS:    0
NCHANGE:    0
NTRUTH:    27
NTEST:    27
TRUE WORDS RECALL:    1.000
TEST WORDS PRECISION:    1.000
=== SUMMARY:
=== TOTAL INSERTIONS:    4623
=== TOTAL DELETIONS:    1740
=== TOTAL SUBSTITUTIONS:    6650
=== TOTAL NCHANGE:    13013
=== TOTAL TRUE WORD COUNT:    104372
=== TOTAL TEST WORD COUNT:    107255
=== TOTAL TRUE WORDS RECALL:    0.920
=== TOTAL TEST WORDS PRECISION:    0.895
=== F MEASURE:    0.907
=== OOV Rate:    0.940
=== OOV Recall Rate:    0.917
=== IV Recall Rate:    0.966

The first eight lines are the per-line statistics of the last test sentence; the summary follows. The F measure is the harmonic mean of the total recall and precision: 2 × 0.920 × 0.895 / (0.920 + 0.895) ≈ 0.907 (a quick consistency check of the whole summary follows the closing remarks).

The dictionary-based FMM algorithm is a very basic segmentation algorithm. It does not perform particularly well, but it is simple enough and easy to get started with. As I learn more, I may implement other word segmentation algorithms in Python. Another lesson: when reading a book, try to implement what you read as much as possible. It gives you enough enthusiasm to focus on every detail of the theory, and the reading never feels boring.
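The promised consistency check (this is my own reading of the score output, not part of the original write-up: it assumes a word counts as correct only when it is segmented exactly as in the gold file):

# -*- coding: utf-8 -*-
# Re-derive score's summary figures from its raw totals above.
ins, dels, subs = 4623, 1740, 6650
ntruth, ntest = 104372, 107255

correct = ntruth - dels - subs   # 95982; ntest - ins - subs gives the same
recall = float(correct) / ntruth                            # 0.920
precision = float(correct) / ntest                          # 0.895
f_measure = 2 * precision * recall / (precision + recall)   # 0.907
nchange = ins + dels + subs                                 # 13013
print 'R=%.3f P=%.3f F=%.3f NCHANGE=%d' % (recall, precision, f_measure, nchange)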