This article describes a Python implementation of the FMM (forward maximum matching) algorithm for Chinese word segmentation, shared here for reference. The details are as follows:
The basic idea of FMM is greedy. Starting at the current position, take the longest candidate of n characters, where n is the maximum word length. If that candidate appears in the dictionary, accept it as a word; if not, try the first n-1 characters, and so on down to a single character. Once a word has been accepted, continue searching from the position immediately after it until the end of the sentence is reached.
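The greedy search described above can be sketched in a few lines. This is a minimal illustration in Python 3, not the article's own listing: the function name `fmm_segment`, the parameter `max_word_length`, and the toy dictionary are assumptions chosen for the example, and a single character is accepted as a token whenever no longer match exists.

```python
def fmm_segment(text, dictionary, max_word_length=4):
    """Greedy forward maximum matching over a set of known words."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        longest = min(max_word_length, len(text) - i)
        for n in range(longest, 0, -1):
            candidate = text[i:i + n]
            # A single character is accepted even when it is not in the dictionary.
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


words = {"好的", "真的"}  # hypothetical toy dictionary
print(fmm_segment("好的是真的", words))  # ['好的', '是', '真的']
```

Because the longest candidate is always tried first, a dictionary containing both "ab" and "abc" yields "abc" for the input "abcd", followed by the leftover "d".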
Import re def preprocess (sentence,edcode= "Utf-8"): sentence = Sentence.decode (edcode) sentence=re.sub (U "[. ,,! ......! "<>\" '::? \?, \| "" " ;] "," ", sentence) return sentence def FMM (Sentence,diction,result = [],maxwordlength = 4,edcode=" Utf-8 "): i = 0 sen Tence = preprocess (sentence,edcode) length = len (sentence) while I < length: # Find the ASCII word tempi =i tok=sentence[i:i+1] While Re.search ("[0-9a-za-z\-\+#@_\.] {1} ", Tok) <>none:i= i+1 tok=sentence[i:i+1] if I-tempi>0:result.append (sentence[tempi:i ].lower (). Encode (Edcode)) # find Chinese Word left = Len (sentence[i:]) If left = = 1: "" "Go to 4 step Over the FMM, "" "" "should we add the last one? Yes, if not blank "" "If Sentence[i:] <>" ": Result.append (Sentence[i:].encode (Edcode)) return Result m = min (left,maxwordlength) for J in Xrange (m,0,-1): Leftword = Sentence[i:j+i].encode (Edcode) # print Leftword.decode (edcode) if LookUp (leftword,diction): # Find the Left word in dictionary # It ' s the right one i = J+i result.append (leftword) break Elif J = = 1: "" "Only One word, add to result, if not blank "" "If Leftword.decode (edcode) <>" ": Result.append (LEFTW ORD) i = i+1 else:continue return result def LookUp (word,dictionary): If Dictionary.has_key (word): return True return False def convertgbktoutf (sentence): Return Sentence.decode (' GBK '). Encode (' Utf-8 ') dic tions = {} dictions["ab"] = 1 dictions["CD"] = 2 dictions["abc"] = 1 dictions["ss"] = 1 Dictions[convertgbktoutf ("good")] = 1 Dictions[convertgbktoutf ("true")] = 1 sentence = "Asdfa Okay, is that so? vasdiw Yes really daf DASFIW asid yes? "s = FMM (Convertgbktoutf (sentence), dictions) for I in S:print i.decode (" utf-8 ") test = open (" Test.txt "," R ") for line In test:s = FMM (Covertgbktoutf (line), dictions) foR i in S:print i.decode ("Utf-8")
Running the script prints one token per line:
asdfa
好的 (good)
是 (is)
这 (this)
样 (kind)
吗 (question particle)
vasdiw
呀 (exclamation particle)
真的 (really)
daf
dasfiw
asid
是 (is)
吗 (question particle)
?
Hopefully this article will help you with Python programming.