To my understanding, the simplest word segmentation procedure is to first split the Chinese text into its smallest units, Chinese characters, and then look for words in a dictionary, merging the characters into a sequence of words according to the leftmost-longest principle (very much in the spirit of regular expressions). This should be the fastest approach: it only splits and merges according to the given data, without having to weigh grammatical features (parts of speech such as nouns, verbs, and measure words; syntactic roles such as subject, predicate, object, attributive, adverbial, and complement) or the frequency of words in context.
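As a minimal sketch of this idea (a toy dictionary of my own choosing, written in modern Python 3, while the original program further below targets Python 2):

def segment(text, dictionary):
    # greedy leftmost-longest merge: try the longest candidate first,
    # shrink the window until a dictionary word (or a single char) fits
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i+n]
            if n == 1 or candidate in dictionary:
                words.append(candidate)  # unmatched single chars pass through
                i += n
                break
    return words

print(segment(u"我爱正则表达式", {u"正则", u"表达式", u"正则表达式"}))
# ['我', '爱', '正则表达式']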
For splitting the source text, I borrowed an idea from "Statistical Chinese characters/English words" and use the regular expression r"(?x) (?: [\w-]+ | [\x80-\xff]{3})" to match: [\w-]+ captures an English word, while [\x80-\xff]{3} captures the three bytes of one UTF-8-encoded Chinese character.
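A quick check of that pattern (a sketch in Python 3 operating on UTF-8 bytes, as the original Python 2 code does; the sample string is mine):

import re

# [\w-]+ grabs an English word; [\x80-\xff]{3} grabs the three bytes
# of one UTF-8-encoded Chinese character
pattern = re.compile(rb"(?x) (?: [\w-]+ | [\x80-\xff]{3} )")

text = u"Python正则test".encode("utf-8")
print([t.decode("utf-8") for t in pattern.findall(text)])
# ['Python', '正', '则', 'test']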
As for the dictionary, I use CC-CEDICT, for three reasons: it has no copyright problems; it is fast; and Chrome uses it too (double-click a Chinese sentence in Chrome and it automatically selects the Chinese word under the cursor, rather than a single character or the whole line).
Next is how to segment. After some thought, I found that the principle of a search tree can be used; for the principle, see the article "Trie in Python". The concrete method is to read the word list into memory word by word and build a trie, then scan the target text character by character: as long as the characters read so far can still be extended to a word in the trie, keep searching; otherwise stop, and treat what has been matched as one vocabulary unit.
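The trie itself is just nested dicts, as in this small sketch (hypothetical words; the real entries come from words.txt, and the '' end-of-word marker mirrors the ref[''] = 1 trick in the listing below):

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[''] = 1  # end-of-word marker
    return root

trie = build_trie([u"中国", u"中国人", u"人民"])
# {'中': {'国': {'': 1, '人': {'': 1}}}, '人': {'民': {'': 1}}}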
In theory this algorithm is relatively fast (no benchmark yet), for three reasons: the trie structure is essentially a hash table of hash tables, trading space for time, so lookup is O(1) per character; the word list is only about 800 KB, so it loads easily and does not take much memory; and the slowest part of the algorithm is the phase that loads the list into the trie, after which speed is no longer affected.
As for extensibility, however, new words can only be added to words.txt by hand; there is no machine learning.
Source
The complete program (including the word list I processed) is on GitHub; if you are interested, have a play with it. The main program is listed here:
The code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# author: Rex
# blog: http://iregex.org
# filename: nlp.py
# created: 2010-09-26 19:15

import re
import sys

# an English word ([\w-]+) or the three bytes of one UTF-8 Chinese character
regex = re.compile(r"(?x) (?: [\w-]+ | [\x80-\xff]{3})")

def init_wordslist(fn="./words.txt"):
    f = open(fn)
    lines = sorted(f.readlines())
    f.close()
    return lines

def words_2_trie(wordslist):
    # build a nested-dict trie; '' marks the end of a complete word
    d = {}
    for word in wordslist:
        ref = d
        chars = regex.findall(word)
        for char in chars:
            ref[char] = ref.has_key(char) and ref[char] or {}
            ref = ref[char]
        ref[''] = 1
    return d

def search_in_trie(chars, trie):
    # walk the text along the trie, printing '*' at each word boundary
    ref = trie
    index = 0
    for char in chars:
        if ref.has_key(char):
            print char,
            ref = ref[char]
            index += 1
        else:
            if index == 0:
                # current character starts no dictionary word: emit it alone
                index = 1
                print char,
            print '*',
            try:
                chars = chars[index:]
                search_in_trie(chars, trie)
            except:
                # guard against hitting the recursion limit on long texts
                pass
            break

def main():
    # init
    words = init_wordslist()
    trie = words_2_trie(words)

    # read content
    fn = sys.argv[1]
    string = open(fn).read()
    chars = regex.findall(string)

    # do the job
    search_in_trie(chars, trie)

if __name__ == '__main__':
    main()
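Assuming a words.txt sits next to the script (one word per line), a run would look like python nlp.py test.txt under Python 2; the script prints the tokens of the text with a * at each word boundary.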
Local test

The test text is as follows:
Only a woman's low voice was heard in reply. Green Bamboo Weng said: "Aunt, please look, this music score is a little odd." The woman murmured again, and the sound of the qin rose; she tuned the strings, stopped for a while, as if replacing a broken string, tuned them again, and then began to play. At first it was the same as what Green Bamboo Weng had played, but later it climbed higher and higher, yet the melody crossed the perilous passages as if they were level ground, lifting the heavy as if it were light, turning upward without effort. Fox Chong was pleasantly surprised, vaguely remembering it as the melody he had heard Couling play that night. The tune was sometimes impassioned, sometimes gentle and elegant; although Fox Chong knew nothing of music theory, he felt that what this old woman played, though the same melody as Couling's, differed greatly in spirit. Her rendition was calm and balanced, and listening one felt only the beauty of the music, without the blood-boiling agitation of Couling's playing. After she had played for a long time, the rhythm gradually slowed, as if the music were drifting away, as though the player had walked dozens of feet away, and then to several miles away, until it could barely be heard.
Rational patriotism
Sex experience
I love regular expressions.
Please pay attention to the three lines at the end.
Take a look at the results of the program's processing (a * indicates the separation between words):
(Most of the output was lost in this copy: the Chinese words were stripped, leaving only the * separators. The recoverable tail, covering the three test lines at the end, reads: Reason * patriotism * sex * experience * I * love * regular * expression)
The goals for this segmentation module (which runs on Google App Engine) were:
1. Practical: it satisfies the word-segmentation needs of most online articles.
2. Fast: the segmentation process does not throw DeadlineExceededError.
3. Low memory footprint: instances are not forcibly killed for exceeding the memory limit after every run.
The original idea was to keep the sorted word list in a list object and use the bisect library for fast searching. Since bisect is implemented in C by default, matching is very fast; but a list object holding the whole word list uses far too much memory and loads very slowly, which makes it completely unsuitable for Google App Engine.
The solution is to store the words of each length together in one str object, and use the same bisect library to search within those strings.
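The post does not show that code, but here is a minimal sketch of one way to read the description (Python 3; bisect can binary-search any object that supports indexing, so a concatenated string of fixed-width words never has to be expanded into a list):

import bisect

class FixedWidthWords:
    """View a string of concatenated same-length words as a sorted sequence."""
    def __init__(self, blob, width):
        self.blob, self.width = blob, width
    def __len__(self):
        return len(self.blob) // self.width
    def __getitem__(self, i):
        return self.blob[i * self.width:(i + 1) * self.width]

def build_tables(words):
    # one big sorted str per word length, instead of one big list
    by_len = {}
    for w in sorted(words):
        by_len.setdefault(len(w), []).append(w)
    return {n: FixedWidthWords("".join(ws), n) for n, ws in by_len.items()}

def contains(tables, word):
    seq = tables.get(len(word))
    if seq is None:
        return False
    i = bisect.bisect_left(seq, word)
    return i < len(seq) and seq[i] == word

tables = build_tables([u"中国", u"人民", u"中国人"])
print(contains(tables, u"中国"), contains(tables, u"中人"))  # True False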
Python Chinese word segmentation: the FMM algorithm
The simplest idea behind the FMM (forward maximum matching) algorithm is greedy: take the first n characters; if that n-character string appears in the dictionary, good; if not, try n-1 characters, and so on. Once an n-character word is found in the dictionary, continue searching from position n+1, until the sentence ends. For example, with a maximum word length of 4, the window first tries 4 characters, then 3, then 2, and finally falls back to a single character.
The code is as follows:

# -*- coding: utf-8 -*-
import re

def preprocess(sentence, edcode="utf-8"):
    # strip common Chinese/English punctuation before matching
    sentence = sentence.decode(edcode)
    sentence = re.sub(u"[。,,!……!《》<>\"'::??、|“”‘';]", "", sentence)
    return sentence

def FMM(sentence, diction, result=None, maxwordlength=4, edcode="utf-8"):
    # use None rather than a mutable [] default so results do not
    # accumulate across calls
    if result is None:
        result = []
    i = 0
    sentence = preprocess(sentence, edcode)
    length = len(sentence)
    while i < length:
        # find an ASCII word (letters, digits and a few symbols)
        tempi = i
        tok = sentence[i:i+1]
        while re.search(u"[0-9A-Za-z+#@_.-]", tok) is not None:
            i = i + 1
            tok = sentence[i:i+1]
        if i - tempi > 0:
            result.append(sentence[tempi:i].lower().encode(edcode))
        # find a Chinese word
        left = len(sentence[i:])
        if left == 1:
            # last character: add it to the result if it is not blank
            if sentence[i:] != u" ":
                result.append(sentence[i:].encode(edcode))
            return result
        m = min(left, maxwordlength)
        for j in xrange(m, 0, -1):
            leftword = sentence[i:j+i].encode(edcode)
            if LookUp(leftword, diction):
                # the left word is in the dictionary: it is the right one
                i = j + i
                result.append(leftword)
                break
            elif j == 1:
                # window is down to one character: keep it if not blank
                if leftword.decode(edcode) != u" ":
                    result.append(leftword)
                i = i + 1
            else:
                continue
    return result

def LookUp(word, dictionary):
    if dictionary.has_key(word):
        return True
    return False

def ConvertGBKtoUTF(sentence):
    return sentence.decode('gbk').encode('utf-8')
Test code:
The code is as follows:

dictions = {}
dictions["ab"] = 1
dictions["cd"] = 2
dictions["abc"] = 1
dictions["ss"] = 1
# the original post used Chinese strings as these two keys; only their
# English glosses ("good", "really") survive in this translation
dictions[ConvertGBKtoUTF("good")] = 1
dictions[ConvertGBKtoUTF("really")] = 1
sentence = "asdfa good is this? vasdiw daf dasfiw asid is it?"
s = FMM(ConvertGBKtoUTF(sentence), dictions)
for i in s:
    print i.decode("utf-8")

Text-file test code:

test = open("test.txt", "r")
for line in test:
    s = FMM(ConvertGBKtoUTF(line), dictions)
    for i in s:
        print i.decode("utf-8")