1. Extract the following files:

training data: icwb2-data/training/pku_training.utf8
test data: icwb2-data/testing/pku_test.utf8
gold-standard segmentation: icwb2-data/gold/pku_test_gold.utf8
scoring tool: icwb2-data/scripts/score

2. Algorithm description

The algorithm is the simplest forward maximum matching (FMM): build a dictionary from the training data, then scan the test data from left to right, at each position splitting off the longest dictionary word that matches, until the sentence ends. For example, if the dictionary contains both 北京 and 北京大学, FMM always takes the longer 北京大学 when both match; a minimal sketch of this bare matching loop follows below.

Note: this is the initial algorithm, and the code can be kept within about 60 lines. Testing later showed that numbers were not handled well, so number-specific processing was added.
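Before the full program, here is a minimal, self-contained sketch of that bare matching loop (the dictionary and sentence are toy data made up for illustration; the real program below additionally indexes words by their first character and adds the number rules):

# -*- coding: utf-8 -*-
# Minimal forward maximum matching: at each position take the longest
# dictionary word that matches, otherwise emit a single character.
def fmm(words, sentence):
    result, start = [], 0
    while start < len(sentence):
        match = sentence[start]  # fall back to a single character
        for w in words:
            if sentence.startswith(w, start) and len(w) > len(match):
                match = w        # keep the longest match so far
        result.append(match)
        start += len(match)
    return result

print u'/'.join(fmm([u'北京', u'北京大学', u'大学'], u'北京大学的大学生'))
# prints: 北京大学/的/大学/生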
3. Source code and comments

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: minix
# Date:   2013-03-20
# Email:  minix007@foxmail.com
import codecs
import sys

# Special symbols handled by hand-written rules
numMath = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9']
numMath_suffix = [u'.', u'%', u'亿', u'万', u'千', u'百', u'十', u'个']
numCn = [u'一', u'二', u'三', u'四', u'五', u'六', u'七', u'八', u'九', u'〇', u'零']
numCn_suffix_date = [u'年', u'月', u'日']
numCn_suffix_unit = [u'亿', u'万', u'千', u'百', u'十', u'个']
special_char = [u'(', u')']

def proc_num_math(line, start):
    """Handle a run of Arabic numerals (and their suffixes) in a sentence."""
    oldstart = start
    # Guard against running off the end of the text
    while start < len(line) and (line[start] in numMath or line[start] in numMath_suffix):
        start = start + 1
    if start < len(line) and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart

def proc_num_cn(line, start):
    """Handle a run of Chinese numerals (and their unit suffixes) in a sentence."""
    oldstart = start
    while start < len(line) and (line[start] in numCn or line[start] in numCn_suffix_unit):
        start = start + 1
    if start < len(line) and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart

def rules(line, start):
    """Dispatch to the special number rules."""
    if line[start] in numMath:
        return proc_num_math(line, start)
    elif line[start] in numCn:
        return proc_num_cn(line, start)

def genDict(path):
    """Build the dictionary from the segmented training file."""
    f = codecs.open(path, 'r', 'utf-8')
    contents = f.read()
    f.close()
    contents = contents.replace(u'\r', u'')
    contents = contents.replace(u'\n', u'')
    # The training file separates words with spaces
    mydict = contents.split(u' ')
    # Drop duplicates from the word list
    newdict = list(set(mydict))
    newdict.remove(u'')
    # Build the dictionary: key is a word's first character,
    # value is the list of all words starting with that character
    truedict = {}
    for item in newdict:
        if len(item) > 0 and item[0] in truedict:
            value = truedict[item[0]]
            value.append(item)
            truedict[item[0]] = value
        else:
            truedict[item[0]] = [item]
    return truedict

def print_unicode_list(uni_list):
    for item in uni_list:
        print item,

def divideWords(mydict, sentence):
    """Scan the sentence from left to right and split off the longest
    matching word at each position, until the sentence is exhausted."""
    ruleChar = []
    ruleChar.extend(numCn)
    ruleChar.extend(numMath)
    result = []
    start = 0
    senlen = len(sentence)
    while start < senlen:
        curword = sentence[start]
        maxlen = 1
        # First check whether one of the special rules matches
        if curword in numCn or curword in numMath:
            maxlen = rules(sentence, start)
        # Then search the dictionary for the longest matching word
        if curword in mydict:
            words = mydict[curword]
            for item in words:
                itemlen = len(item)
                if sentence[start:start + itemlen] == item and itemlen > maxlen:
                    maxlen = itemlen
        result.append(sentence[start:start + maxlen])
        start = start + maxlen
    return result

def main():
    args = sys.argv[1:]
    if len(args) < 3:
        print 'Usage: python dw.py dict_path test_path result_path'
        exit(-1)
    dict_path = args[0]
    test_path = args[1]
    result_path = args[2]
    dicts = genDict(dict_path)
    fr = codecs.open(test_path, 'r', 'utf-8')
    test = fr.read()
    result = divideWords(dicts, test)
    fr.close()
    fw = codecs.open(result_path, 'w', 'utf-8')
    for item in result:
        fw.write(item + ' ')
    fw.close()

if __name__ == "__main__":
    main()
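To see the added number handling in action, here is a quick check; toy_dict and the sentence are made-up examples, and it assumes the listing has been saved as dw.py (the file name used in step 4):

# -*- coding: utf-8 -*-
# The number rules group a digit run with its date suffix before the
# dictionary lookup runs, so '1998年' survives as a single token.
from dw import divideWords

# genDict builds this shape: first character -> words starting with it
toy_dict = {u'北': [u'北京', u'北京大学'], u'大': [u'大学']}
print u'/'.join(divideWords(toy_dict, u'1998年北京大学生'))
# prints: 1998年/北京大学/生

Because the match is greedy, 北京大学 beats 北京 even though both are in the dictionary; this greediness is also the main source of FMM's mistakes.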
4. Test and score. Run dw.py on the training data and the test data to generate a result file, then pass the training data, the gold-standard segmentation, and the generated result to score to compute the scores. Use tail to view the overall summary in the last rows of the score file; score.utf8 also contains a large number of line-by-line comparisons, which are useful for finding where your segmentation falls short. Note: the whole test was done on Ubuntu.

$ python dw.py pku_training.utf8 pku_test.utf8 pku_result.utf8
$ perl score pku_training.utf8 pku_test_gold.utf8 pku_result.utf8 > score.utf8
$ tail -22 score.utf8
INSERTIONS:    0
DELETIONS:    0
SUBSTITUTIONS:    0
NCHANGE:    0
NTRUTH:    27
NTEST:    27
TRUE WORDS RECALL:    1.000
TEST WORDS PRECISION:    1.000
=== SUMMARY:
=== TOTAL INSERTIONS:    4623
=== TOTAL DELETIONS:    1740
=== TOTAL SUBSTITUTIONS:    6650
=== TOTAL NCHANGE:    13013
=== TOTAL TRUE WORD COUNT:    104372
=== TOTAL TEST WORD COUNT:    107255
=== TOTAL TRUE WORDS RECALL:    0.920
=== TOTAL TEST WORDS PRECISION:    0.895
=== F MEASURE:    0.907
=== OOV Rate:    0.940
=== OOV Recall Rate:    0.917
=== IV Recall Rate:    0.966

The first eight lines are the per-line statistics of the last test sentence; the summary follows. The F measure is the harmonic mean of the total recall and precision: 2 × 0.920 × 0.895 / (0.920 + 0.895) ≈ 0.907 (a quick consistency check of the whole summary follows the closing remarks).

The dictionary-based FMM algorithm is a very basic segmentation algorithm. It does not perform particularly well, but it is simple enough and easy to get started with. As I learn more, I may implement other word segmentation algorithms in Python. Another lesson: when reading a book, try to implement what you read as much as possible. It gives you enough enthusiasm to focus on every detail of the theory, and the reading never feels boring.
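The promised consistency check (this is my own reading of the score output, not part of the original write-up: it assumes a word counts as correct only when it is segmented exactly as in the gold file):

# -*- coding: utf-8 -*-
# Re-derive score's summary figures from its raw totals above.
ins, dels, subs = 4623, 1740, 6650
ntruth, ntest = 104372, 107255

correct = ntruth - dels - subs   # 95982; ntest - ins - subs gives the same
recall = float(correct) / ntruth                            # 0.920
precision = float(correct) / ntest                          # 0.895
f_measure = 2 * precision * recall / (precision + recall)   # 0.907
nchange = ins + dels + subs                                 # 13013
print 'R=%.3f P=%.3f F=%.3f NCHANGE=%d' % (recall, precision, f_measure, nchange)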