Maximum matching method: Maximum matching is a dictionary-based approach. Take the length of the longest word in the dictionary as the initial window size, cut that many characters from the string, and look the candidate up in the dictionary. (To make the lookup faster, you can also split the dictionary into several dictionaries by word length and look each candidate up in the dictionary for its length.) For example, if the longest word in the dictionary is "中华人民共和国" (People's Republic of China), which has 7 Chinese characters, maximum matching starts with a 7-character window. The candidate is then shortened one character at a time, and each shorter candidate is looked up in the corresponding dictionary.
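The idea of keeping one dictionary per word length can be sketched as follows. This is only a minimal illustration: the helper names build_length_index and in_dictionary and the toy word list are assumptions made for this example, not part of the original code.

```python
from collections import defaultdict


def build_length_index(words):
    """Group dictionary words by character count: {length: set of words}."""
    index = defaultdict(set)
    for word in words:
        index[len(word)].add(word)
    return index


def in_dictionary(candidate, index):
    """Look the candidate up only among words of the same length."""
    return candidate in index.get(len(candidate), ())


# Hypothetical toy dictionary, for illustration only.
words = ["中华人民共和国", "我们", "在野", "生动", "园", "玩"]
index = build_length_index(words)

# The longest word length gives the initial window size.
max_len = max(index)
```

With this toy dictionary, max_len is 7, so matching would start with a 7-character window. A single Python set already gives fast lookups, so the per-length split mainly mirrors the description above.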
The following uses the sentence "我们在野生动物园玩" (We are playing in the safari park) to explain the forward and inverse maximum matching methods in detail:
1. Forward maximum matching method:
Forward matching takes candidates from 7 characters down to 1, dropping one character from the end each time, until the dictionary has a hit or only 1 character is left.
1st round scan:
1st time: "我们在野生动物", look up in the 7-character dictionary, no match
2nd time: "我们在野生动", look up in the 6-character dictionary, no match
......
6th time: "我们" (we), look up in the 2-character dictionary, match
The scan stops and the 1st word output is "我们" (we). Remove the 1st word from the sentence and start the 2nd round scan, namely:
2nd round scan:
1st time: "在野生动物园玩", look up in the 7-character dictionary, no match
2nd time: "在野生动物园", look up in the 6-character dictionary, no match
......
6th time: "在野" (in opposition), look up in the 2-character dictionary, match
The scan stops and the 2nd word output is "在野" (in opposition). Remove the 2nd word from the sentence and start the 3rd round scan, namely:
3rd round scan:
1st time: "生动物园玩", look up in the 5-character dictionary, no match
2nd time: "生动物园", look up in the 4-character dictionary, no match
3rd time: "生动物", look up in the 3-character dictionary, no match
4th time: "生动" (vivid), look up in the 2-character dictionary, match
The scan stops and the 3rd word output is "生动" (vivid). Start the 4th round scan, namely:
4th round scan:
1st time: "物园玩", look up in the 3-character dictionary, no match
2nd time: "物园", look up in the 2-character dictionary, no match
3rd time: "物" (thing), look up in the 1-character dictionary, no match
The scan stops and the 4th word output is "物" (thing). The count of non-dictionary words increases by 1. Start the 5th round scan, namely:
5th round scan:
1st time: "园玩", look up in the 2-character dictionary, no match
2nd time: "园" (garden), look up in the 1-character dictionary, match
The scan stops and the 5th word output is "园" (garden). The count of single-character dictionary words increases by 1. Start the 6th round scan, namely:
6th round scan:
1st time: "玩" (play), look up in the 1-character dictionary, match
The scan stops and the 6th word output is "玩" (play). The count of single-character dictionary words increases by 1, and the whole scan ends.
With the forward maximum matching method, the final segmentation result is: "我们/在野/生动/物/园/玩" (we / in opposition / vivid / thing / garden / play)
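The round-by-round scan above can be condensed into one short function. The following is a minimal sketch, not the author's implementation: the function name forward_max_match and the toy dictionary are assumptions chosen so that the hits and misses follow the walkthrough, and a single set stands in for the per-length dictionaries.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text greedily from left to right.

    Returns (words, unknown), where unknown counts the single characters
    that were output without a dictionary hit.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words, unknown = [], 0
    pos = 0
    while pos < len(text):
        # Start with the longest candidate that still fits in the text.
        size = min(max_len, len(text) - pos)
        # Shrink by one character until a hit or a single character is left.
        while size > 1 and text[pos:pos + size] not in dictionary:
            size -= 1
        word = text[pos:pos + size]
        if word not in dictionary:
            unknown += 1  # non-dictionary single character
        words.append(word)
        pos += size
    return words, unknown


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"中华人民共和国", "我们", "在野", "生动", "园", "玩"}
words, unknown = forward_max_match("我们在野生动物园玩", toy_dict)
print("/".join(words))  # 我们/在野/生动/物/园/玩
print(unknown)          # 1
```

Here the only non-dictionary output is "物", so the non-dictionary word count is 1, as in the 4th round of the walkthrough.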
2. Python Code implementation
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 08:57:56 2018

@author: Lenovo
"""

test_file = 'train/train.txt'  # training corpus, used as the dictionary
test_file2 = 'test/test.txt'  # test corpus
test_file3 = 'test_sc/test_sc_zhengxiang.txt'  # generated results


def get_dic(test_file):
    """Read the whitespace-separated corpus and return its distinct words."""
    with open(test_file, 'r', encoding='utf-8') as f:
        file_content = f.read().split()
    # A set gives constant-time membership tests during matching.
    return set(file_content)


dic = get_dic(test_file)


def readfile(test_file2):
    max_length = 5  # longest candidate tried at each position

    with open(test_file2, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    with open(test_file3, 'w', encoding='utf-8') as h:
        for line in lines:  # forward maximum matching, one line at a time
            my_list = []
            while len(line) > 0:
                tryword = line[0:max_length]
                # Drop one character from the end until the dictionary hits
                # or a single character is left.
                while tryword not in dic:
                    if len(tryword) == 1:
                        break
                    tryword = tryword[0:len(tryword) - 1]
                my_list.append(tryword)
                line = line[len(tryword):]

            for t in my_list:  # write the segmentation result to the output file
                if t == '\n':
                    h.write('\n')
                else:
                    h.write(t + " ")


readfile(test_file2)
3. For the training corpus and the test corpus, see the attachment.
Chinese word segmentation: Python implementation of the forward maximum matching algorithm