1. Since no Chinese corpus is available, the example from <Beautiful Data> is reproduced here with an English string. The idea is the same (e.g. segmenting 'finallylast' into ['finally', 'last']).
2. Word segmentation module (Mysegment.py)
import operator

def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first] + segment(rem) for first, rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first) <= L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key, count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229  # number of tokens in the corpus
Pw = Pdist(datafile(r'C:\Python26\Myngrams\count_1w.txt'), N, avoid_long_words)
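One caveat about the listing above: segment recurses on every suffix of the input without caching, so long strings re-solve the same subproblems exponentially many times. Norvig's original code in <Beautiful Data> wraps segment in a @memo decorator; a minimal Python 3 sketch of the same idea, using functools.lru_cache and a toy corpus with made-up counts (COUNTS and the unknown-word penalty here are illustrative, not the real data):

```python
from functools import lru_cache

# Toy unigram counts standing in for the real count file (hypothetical values).
COUNTS = {'finally': 1000, 'last': 800, 'fin': 50, 'ally': 40}
N = float(sum(COUNTS.values()))

def Pw(w):
    "Unigram probability, with a crude constant penalty for unseen words."
    return COUNTS.get(w, 0.0001) / N

def Pwords(words):
    "Naive Bayes probability of a word sequence."
    p = 1.0
    for w in words:
        p *= Pw(w)
    return p

@lru_cache(maxsize=None)
def segment(text, L=20):
    "Best segmentation of text; memoized so each suffix is solved once."
    if not text:
        return ()
    # Tuples (not lists) so results are hashable and cacheable.
    candidates = ((text[:i+1],) + segment(text[i+1:])
                  for i in range(min(len(text), L)))
    return max(candidates, key=Pwords)

print(segment('finallylast'))  # → ('finally', 'last')
```

With memoization the number of distinct subproblems equals the number of suffixes, so the running time drops from exponential to roughly O(len(text) * L).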
Note: add an empty __init__.py in the Myngrams directory so it can be imported as a package.
3. Verify

>>> from Myngrams import Mysegment
>>> Mysegment.segment('finallylast')
['finally', 'last']
>>> Mysegment.segment('unregardedsitdown')
['un', 'regarded', 'sitdown']
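Whether the segmenter keeps an unseen string whole or splits it comes down to arithmetic: the avoid_long_words estimate decays tenfold per letter, and it competes against the product of the parts' unigram probabilities. A quick check with hypothetical numbers for 'sit' and 'down' (illustrative only; the real outcome depends on the actual counts in the data file):

```python
N = 1024908267229.0  # token count used in the module above

def avoid_long_words(key, N):
    "Unknown-word estimate from the module: decays tenfold per letter."
    return 10. / (N * 10 ** len(key))

# Probability of keeping the unseen 7-letter string 'sitdown' whole:
p_whole = avoid_long_words('sitdown', N)

# Hypothetical unigram probabilities for 'sit' and 'down'
# (made-up values, not real corpus counts):
p_sit, p_down = 1e-5, 1e-4
p_split = p_sit * p_down

# With these numbers the split is many orders of magnitude more likely,
# so the behavior seen in practice hinges on the real corpus counts.
print(p_whole < p_split)
```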
The training corpus does not contain the word 'unregarded'. Treating 'sitdown' as a single word rather than scoring it as P(sit)·P(down) is incorrect; this motivates the next topic.

Bigram word segmentation