Chinese word segmentation: a Python implementation of the forward maximum matching algorithm

Last Update:2018-08-01 Source: Internet

Author: User

Maximum matching method: Maximum matching is dictionary-based. The length of the longest word in the dictionary is used as the number of characters in the first scan string, and that string is looked up in the dictionary. (To make the lookup more efficient, you can also build several dictionaries grouped by word length and look each candidate up in the dictionary for its length.) For example, if the longest word in the dictionary is "中华人民共和国" (People's Republic of China), 7 Chinese characters in total, then matching starts with a 7-character candidate. The candidate is then shortened one character at a time, and each shorter candidate is looked up in the corresponding dictionary.
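The idea of one dictionary per word length can be sketched as follows. This is a minimal illustration, not the article's code: the helper name `build_length_buckets` and the toy word list are assumptions made up for the example.

```python
from collections import defaultdict


def build_length_buckets(words):
    """Group dictionary words by character count: {length: set of words}."""
    buckets = defaultdict(set)
    for word in words:
        if word:  # skip empty strings
            buckets[len(word)].add(word)
    return dict(buckets)


# Hypothetical toy dictionary, for illustration only.
toy_words = ["中华人民共和国", "我们", "在野", "生动", "园", "玩"]
buckets = build_length_buckets(toy_words)

# The longest dictionary word sets the size of the first scan window.
max_len = max(buckets)

print(max_len)  # 7
```

A candidate of length n then only needs to be checked against `buckets.get(n, set())`, which keeps each lookup small.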

The following uses the sentence "我们在野生动物园玩" ("We are playing in the safari park") to walk through the forward maximum matching method in detail:

1. Forward maximum matching method:

Forward matching takes candidates from the left end of the text. The candidate starts at 7 characters and loses one character from its right end on each attempt, until the dictionary matches or only 1 character is left.
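A single round of this shrinking scan might look like the sketch below. The function name `match_once`, the toy dictionary, and the Chinese sample sentence (taken to be the original behind "We are playing in the safari park") are assumptions for illustration.

```python
def match_once(text, words, max_len):
    """Return (word, in_dictionary, attempts) for the start of text.

    attempts lists every candidate tried, longest first.
    """
    attempts = []
    # Start with the longest window the remaining text allows.
    for size in range(min(max_len, len(text)), 0, -1):
        candidate = text[:size]
        attempts.append(candidate)
        if candidate in words:
            return candidate, True, attempts
    # Nothing matched, not even one character: emit the first character.
    return text[:1], False, attempts


# Hypothetical toy dictionary, for illustration only.
toy_words = {"中华人民共和国", "我们", "在野", "生动", "园", "玩"}
word, found, attempts = match_once("我们在野生动物园玩", toy_words, max_len=7)
print(word, found, len(attempts))  # 我们 True 6
```

The six attempts correspond to window sizes 7 down to 2, with the match found at 2 characters.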

1st attempt: "我们在野生动物", scan the 7-character dictionary, no match

2nd attempt: "我们在野生动", scan the 6-character dictionary, no match

...

6th attempt: "我们", scan the 2-character dictionary, match

Scanning stops. The 1st output word is "我们" (we). Remove it from the input and start the 2nd round of scanning on the remaining text, namely:

2nd round Scan:

1st attempt: "在野生动物园玩", scan the 7-character dictionary, no match

2nd attempt: "在野生动物园", scan the 6-character dictionary, no match

...

6th attempt: "在野", scan the 2-character dictionary, match

Scanning stops. The 2nd output word is "在野" (in opposition). Remove it from the input and start the 3rd round of scanning, namely:

3rd round Scan:

1st attempt: "生动物园玩", scan the 5-character dictionary, no match

2nd attempt: "生动物园", scan the 4-character dictionary, no match

3rd attempt: "生动物", scan the 3-character dictionary, no match

4th attempt: "生动", scan the 2-character dictionary, match

Scanning stops. The 3rd output word is "生动" (vivid). Start the 4th round of scanning, namely:

4th round Scan:

1st attempt: "物园玩", scan the 3-character dictionary, no match

2nd attempt: "物园", scan the 2-character dictionary, no match

3rd attempt: "物", scan the 1-character dictionary, no match

Scanning stops. The 4th output word is "物" (thing), and the count of non-dictionary words is increased by 1. Start the 5th round of scanning, namely:

5th round Scan:

1st attempt: "园玩", scan the 2-character dictionary, no match

2nd attempt: "园", scan the 1-character dictionary, match

Scanning stops. The 5th output word is "园" (garden), and the count of single-character dictionary words is increased by 1. Start the 6th round of scanning, namely:

6th round Scan:

1st attempt: "玩", scan the 1-character dictionary, match

Scanning stops. The 6th output word is "玩" (play), the count of single-character dictionary words is increased by 1, and the overall scan ends.

With forward maximum matching, the final segmentation result is: "我们/在野/生动/物/园/玩" (we / in opposition / vivid / thing / garden / play).
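The six rounds above can be reproduced with a short sketch. The function name `forward_max_match`, the `stats` counters, and the toy dictionary are assumptions chosen so that the run mirrors the walkthrough; the Chinese sentence is taken to be the original behind "We are playing in the safari park".

```python
def forward_max_match(text, words, max_len):
    """Segment text left to right, always taking the longest dictionary hit."""
    result = []
    stats = {"non_dictionary": 0, "single_char": 0}
    pos = 0
    while pos < len(text):
        size = min(max_len, len(text) - pos)
        # Shrink the window from the right until it matches or is 1 character.
        while size > 1 and text[pos:pos + size] not in words:
            size -= 1
        token = text[pos:pos + size]
        if token not in words:
            stats["non_dictionary"] += 1  # unmatched single character
        elif size == 1:
            stats["single_char"] += 1     # single-character dictionary word
        result.append(token)
        pos += size
    return result, stats


# Hypothetical toy dictionary, for illustration only.
toy_words = {"中华人民共和国", "我们", "在野", "生动", "园", "玩"}
tokens, stats = forward_max_match("我们在野生动物园玩", toy_words, max_len=7)
print("/".join(tokens))  # 我们/在野/生动/物/园/玩
print(stats)             # {'non_dictionary': 1, 'single_char': 2}
```

Because the scan is greedy from the left, taking "在野" in the 2nd round consumes the character that "野生" would need, so later rounds cannot recover the intended reading. This is a general limitation of forward matching rather than something specific to this sketch.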

2. Python Code implementation

    # -*- coding: utf-8 -*-
    """
    Created on Thu Jul 08:57:56 2018

    @author: Lenovo
    """

    test_file = 'Train/train.txt'  # training corpus
    test_file2 = 'Test/test.txt'  # test corpus
    test_file3 = 'Test_sc/test_sc_zhengxiang.txt'  # generated results


    def get_dic(test_file):
        """Read the training corpus and return its distinct words."""
        with open(test_file, 'r', encoding='utf-8') as f:
            file_content = f.read().split()  # words are whitespace-separated
        # A set removes duplicates and makes the "in" test below fast.
        return set(file_content)


    dic = get_dic(test_file)


    def readfile(test_file2):
        """Segment every line of the test corpus and write the result."""
        with open(test_file2, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        with open(test_file3, 'w', encoding='utf-8') as h:
            for line in lines:  # forward maximum matching on each line
                max_length = 5  # longest candidate, in characters
                my_list = []
                while len(line) > 0:
                    tryword = line[0:max_length]
                    # Drop the last character until the dictionary matches
                    # or a single character is left.
                    while tryword not in dic:
                        if len(tryword) == 1:
                            break
                        tryword = tryword[0:len(tryword) - 1]
                    my_list.append(tryword)
                    line = line[len(tryword):]

                for t in my_list:  # write the segmented words to the output file
                    if t == '\n':
                        h.write('\n')
                    else:
                        h.write(t + "  ")


    readfile(test_file2)
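The listing reads its dictionary from a file and uses a fixed window of 5 characters. As a hedged variant, the window could instead be derived from the longest dictionary word, as the description of the method suggests. The sketch below works on in-memory strings so it can be run without the corpus files; `build_dictionary`, `segment_line`, and the two-line sample corpus are hypothetical stand-ins, not part of the original code.

```python
def build_dictionary(corpus_text):
    """Collect the distinct whitespace-separated tokens of a segmented corpus."""
    return set(corpus_text.split())


def segment_line(line, words, max_len):
    """Forward maximum matching over one line; returns the list of tokens."""
    tokens = []
    while line:
        candidate = line[:max_len]
        # Drop the last character until the candidate matches or is 1 character.
        while len(candidate) > 1 and candidate not in words:
            candidate = candidate[:-1]
        tokens.append(candidate)
        line = line[len(candidate):]
    return tokens


# Hypothetical stand-in for the training file: words separated by spaces.
corpus_text = "我们 在 野生 动物园 玩\n我们 在 公园 玩\n"
words = build_dictionary(corpus_text)
max_len = max(len(w) for w in words)  # 3 here, instead of a fixed 5

print(segment_line("我们在野生动物园玩", words, max_len))
# ['我们', '在', '野生', '动物园', '玩']
```

With this richer toy dictionary the same sentence is segmented as intended, which shows how strongly the result depends on what the training corpus contains.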

3. For the training corpus and the test corpus, see the attachment.
