Chinese word segmentation: forward maximum matching algorithm, Python implementation

Source: Internet
Author: User

Maximum matching method: Maximum matching is dictionary-based. The length of the longest word in the dictionary is used as the initial size of the scan window, and the candidate string of that size is looked up in the dictionary. (To speed up the lookup, you can also split the dictionary into several dictionaries by word length and look up each candidate only in the dictionary for its length.) For example, if the longest word in the dictionary is "中华人民共和国" (People's Republic of China), which has 7 Chinese characters, matching starts with a 7-character window. The window is then shortened one character at a time, and each shorter candidate is looked up in the corresponding dictionary.
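The idea of one dictionary per word length can be sketched as follows. This is a minimal illustration, not the author's code: the function name build_length_buckets and the toy word list are assumptions made up for the example.

```python
from collections import defaultdict


def build_length_buckets(words):
    """Group dictionary words by character count (hypothetical helper)."""
    buckets = defaultdict(set)
    for word in words:
        buckets[len(word)].add(word)
    return dict(buckets)


# Toy dictionary, assumed for illustration only.
words = ["中华人民共和国", "我们", "在野", "生动", "野生", "动物园", "园", "玩"]
buckets = build_length_buckets(words)

# The longest word length becomes the starting window size.
max_len = max(buckets)
print(max_len)  # 7

# A 2-character candidate is looked up only in the 2-character bucket.
print("我们" in buckets.get(2, set()))  # True
```

Each lookup then touches only the words of the candidate's own length, which is what the parenthetical remark about multiple dictionaries describes.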

The following uses the sentence "我们在野生动物园玩" (We are playing in the safari park) to explain the forward maximum matching method in detail:

1. Forward maximum matching method:

The forward method shrinks the window from 7 characters down to 1, dropping one character from the end each time, until the dictionary contains the candidate or only 1 character is left.

1st attempt: "我们在野生动物", look up in the 7-character dictionary, not found

2nd attempt: "我们在野生动", look up in the 6-character dictionary, not found

...

6th attempt: "我们", look up in the 2-character dictionary, found

The scan stops and the 1st word output is "我们" (we). That word is removed from the front of the sentence and the 2nd round of scanning begins:
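A single round of the scan can be sketched like this. It assumes the example sentence is "我们在野生动物园玩" and uses a made-up toy dictionary; longest_prefix_match is a hypothetical name, and the returned list of attempts exists only to make the steps visible.

```python
def longest_prefix_match(text, dictionary, max_len):
    """Run one round: return the chosen word and every candidate tried."""
    attempts = []
    # Start with the largest window that fits, then shrink by one character.
    for size in range(min(max_len, len(text)), 0, -1):
        candidate = text[:size]
        attempts.append(candidate)
        if candidate in dictionary:
            return candidate, attempts
    # No candidate matched: fall back to the first single character.
    return text[:1], attempts


# Toy dictionary, assumed for illustration only.
toy_dictionary = {"我们", "在野", "生动", "野生", "动物园", "园", "玩"}
word, attempts = longest_prefix_match("我们在野生动物园玩", toy_dictionary, 7)
print(word)           # 我们
print(len(attempts))  # 6
```

The six candidates tried correspond to the window sizes 7 down to 2 in the first round.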

2nd round scan:

1st attempt: "在野生动物园玩", look up in the 7-character dictionary, not found

2nd attempt: "在野生动物园", look up in the 6-character dictionary, not found

...

6th attempt: "在野", look up in the 2-character dictionary, found

The scan stops and the 2nd word output is "在野" (in opposition). That word is removed and the 3rd round of scanning begins:

3rd round scan:

1st attempt: "生动物园玩", look up in the 5-character dictionary, not found

2nd attempt: "生动物园", look up in the 4-character dictionary, not found

3rd attempt: "生动物", look up in the 3-character dictionary, not found

4th attempt: "生动", look up in the 2-character dictionary, found

The scan stops and the 3rd word output is "生动" (vivid). The 4th round of scanning begins:

4th round scan:

1st attempt: "物园玩", look up in the 3-character dictionary, not found

2nd attempt: "物园", look up in the 2-character dictionary, not found

3rd attempt: "物", look up in the 1-character dictionary, not found

The scan stops and the 4th word output is the single character "物" (thing). The count of non-dictionary words is incremented by 1, and the 5th round of scanning begins:

5th round scan:

1st attempt: "园玩", look up in the 2-character dictionary, not found

2nd attempt: "园", look up in the 1-character dictionary, found

The scan stops and the 5th word output is "园" (garden). The count of single-character dictionary words is incremented by 1, and the 6th round of scanning begins:

6th round scan:

1st attempt: "玩", look up in the 1-character dictionary, found

The scan stops and the 6th word output is "玩" (play). The count of single-character dictionary words is incremented by 1, and the whole scan ends.
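The counters mentioned in the walkthrough can be kept with a small tally like the one below. The text does not say what the counts are used for; comparing them between forward and reverse segmentations is one plausible use, but that is an assumption here. The function name tally, the key names, and the toy dictionary are all hypothetical.

```python
def tally(tokens, dictionary):
    """Count non-dictionary tokens and single-character dictionary tokens."""
    counts = {
        "non_dictionary": 0,
        "single_char_dictionary": 0,
        "multi_char_dictionary": 0,
    }
    for token in tokens:
        if token not in dictionary:
            counts["non_dictionary"] += 1
        elif len(token) == 1:
            counts["single_char_dictionary"] += 1
        else:
            counts["multi_char_dictionary"] += 1
    return counts


# Toy dictionary and the six words produced by the rounds above (assumed).
toy_dictionary = {"我们", "在野", "生动", "野生", "动物园", "园", "玩"}
tokens = ["我们", "在野", "生动", "物", "园", "玩"]
counts = tally(tokens, toy_dictionary)
print(counts["non_dictionary"])          # 1
print(counts["single_char_dictionary"])  # 2
```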

With the forward maximum matching method, the final segmentation result is: "我们/在野/生动/物/园/玩"
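Putting the rounds together, the whole procedure might look like the sketch below. It is an illustration under assumptions, not the listing from this article: forward_max_match is a hypothetical name, the toy dictionary is made up, and the example sentence is taken to be "我们在野生动物园玩".

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text greedily from left to right, longest candidate first."""
    if max_len is None:
        # Default window: the length of the longest dictionary word.
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted so the scan keeps moving.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


# Toy dictionary, assumed for illustration only.
toy_dictionary = {
    "中华人民共和国", "我们", "在野", "生动", "野生",
    "动物园", "野生动物园", "园", "玩",
}
result = forward_max_match("我们在野生动物园玩", toy_dictionary)
print("/".join(result))  # 我们/在野/生动/物/园/玩
```

Note that "野生动物园" is in the toy dictionary yet never appears in the output: once the greedy scan has committed to "在野", the remaining text no longer starts with that word.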

2. Python code implementation

    # -*- coding: utf-8 -*-
    """
    Created on Thu Jul 08:57:56 2018

    @author: Lenovo
    """

    test_file = 'train/train.txt'  # training corpus
    test_file2 = 'test/test.txt'  # test corpus
    test_file3 = 'test_sc/test_sc_zhengxiang.txt'  # generated results


    def get_dic(test_file):
        # Read the corpus and return its distinct whitespace-separated words.
        with open(test_file, 'r', encoding='utf-8') as f:
            file_content = f.read().split()
        return set(file_content)  # a set keeps each "in" lookup fast


    dic = get_dic(test_file)


    def readfile(test_file2):
        max_length = 5  # longest candidate tried, in characters

        with open(test_file2, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        with open(test_file3, 'w', encoding='utf-8') as h:
            for line in lines:  # forward maximum matching on each line
                my_list = []
                while len(line) > 0:
                    tryword = line[0:max_length]
                    # Drop the last character until the dictionary has the candidate.
                    while tryword not in dic:
                        if len(tryword) == 1:
                            break
                        tryword = tryword[0:len(tryword) - 1]
                    my_list.append(tryword)
                    line = line[len(tryword):]

                for t in my_list:  # write the segmentation result to the output file
                    if t == '\n':
                        h.write('\n')
                    else:
                        h.write(t + "  ")


    readfile(test_file2)
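The listing depends on corpus files that are not included here. To try the same loop without them, one option is to feed it in-memory streams, as in this sketch. The names segment_line and segment_stream, the sample training text, and the sample input line are all assumptions made for the example; unlike the listing, it strips the newline before matching and joins words with two spaces instead of appending them.

```python
import io


def segment_line(line, dictionary, max_length=5):
    """Forward maximum matching on one line, mirroring the listing's loop."""
    tokens = []
    while line:
        tryword = line[:max_length]
        # Shorten the candidate until it is a dictionary word or one character.
        while tryword not in dictionary and len(tryword) > 1:
            tryword = tryword[:-1]
        tokens.append(tryword)
        line = line[len(tryword):]
    return tokens


def segment_stream(reader, writer, dictionary, max_length=5):
    """Segment every line from reader and write one result line each."""
    for line in reader:
        tokens = segment_line(line.rstrip("\n"), dictionary, max_length)
        writer.write("  ".join(tokens) + "\n")


# Hypothetical stand-ins for the training and test corpus files.
training_text = "我们 在野 生动 野生 动物园 园 玩"
dictionary = set(training_text.split())
source = io.StringIO("我们在野生动物园玩\n")
output = io.StringIO()
segment_stream(source, output, dictionary)
print(output.getvalue(), end="")
```

Swapping the StringIO objects for files opened with encoding='utf-8' should give the file-based behavior of the listing.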

3. For the training corpus and the test corpus, see the attachment.
