Chinese text read from ordinary files, such as .txt files, is typically GBK-encoded. I am not sure exactly what the encoding is after calling decode('gbk'), but the result should be a unicode string.
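As a rough illustration of what decoding does, here is a minimal sketch. It assumes Python 3, where bytes.decode('gbk') returns a str of Unicode code points (in Python 2, which the rest of this post uses, the result would be a unicode object). The sample sentence is made up.

```python
# Python 3 sketch: GBK bytes versus the decoded text.
raw = "今天天气很好。".encode("gbk")   # stands in for bytes read from a GBK file

text = raw.decode("gbk")               # correct codec: one item per character
garbled = raw.decode("utf-8", errors="replace")  # wrong codec: replacement junk

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
print(text == garbled)
```

Decoding with the wrong codec is one way to end up with garbled sentences, which is why the codec has to match the file.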
I don't know whether there is a better way to handle this, but Chinese word segmentation has to start with sentence splitting. I first tried string.maketrans() and re.sub() to convert the delimiters to spaces, but neither had any effect, probably because of encoding problems. So I fell back on a brute-force method: read the text one character at a time and close the current clause whenever a delimiter is found. Also, if the input is GBK-encoded, it must be decoded with decode('gbk'); it cannot be treated as UTF-8, nor as gb2312. Otherwise the split sentences come out garbled, and I don't know why. The splitting function is below:
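    def findtok(cutlist, char):
        # Not shown in the original post; assumed to be a plain membership test.
        return char in cutlist

    def cut(cutlist, lines):
        l = []      # finished pieces: text runs and delimiters, in order
        line = []   # characters of the run currently being collected
        for i in lines:
            if findtok(cutlist, i):
                l.append("".join(line))   # close the current run
                l.append(i)               # keep the delimiter as its own item
                line = []
            else:
                line.append(i)
        return l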
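To see what this kind of loop returns, here is a small sketch of the same idea. It assumes Python 3 rather than the Python 2 used in this post, and the delimiter string and sample text are made up for illustration.

```python
def cut(cutlist, chars):
    """Split characters into runs, keeping each delimiter as its own item."""
    pieces = []
    current = []
    for ch in chars:
        if ch in cutlist:
            pieces.append("".join(current))  # may be empty or whitespace-only
            pieces.append(ch)
            current = []
        else:
            current.append(ch)
    return pieces

result = cut("。！？\n", "你好！ 今天天气很好。\n")
print(result)

# Drop empty and whitespace-only items, as the calling code has to do.
cleaned = [p.strip() for p in result if p.strip()]
print(cleaned)
```

The raw result contains an empty string and a piece with a leading space, so some filtering with strip() is needed before the pieces are usable.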
Then read the file line by line and split each line into sentences. The list returned by cut() includes the punctuation marks, each stored as a separate item, and it may also contain whitespace.
    # Delimiters: punctuation plus line breaks. The full-width Chinese marks were
    # lost in this copy of the post, so this set is a reconstruction; edit it as needed.
    # Decoding the literal assumes the script is saved as GBK with a matching coding declaration.
    cutlist = "。，、；：？！……《》“”‘’（）【】｛｝.,;:?!<>\"'|/()[]{}~-+%\n\r".decode('gbk')
    for lines in file(inputfilename):
        l = cut(list(cutlist), list(lines.decode('gbk')))
        for line in l:
            if line.strip() != "":  # it may contain spaces
                li = line.strip().split()
                for sentence in li:
                    print "SE:", sentence
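For comparison, the file-reading step might look like this in Python 3. This is only a sketch under assumptions: open() is given encoding="gbk" so lines arrive already decoded, and the function name sentences_from_file, the delimiter set, and the temporary sample file are all hypothetical.

```python
import os
import tempfile

CUTLIST = set("。！？；，\n\r")  # hypothetical delimiter set; extend as needed

def sentences_from_file(path, cutlist=CUTLIST, encoding="gbk"):
    """Yield non-empty text runs and delimiters from a text file, in order."""
    with open(path, encoding=encoding) as handle:  # decoding happens here
        for line in handle:
            buf = ""
            for ch in line + "\n":  # extra newline flushes a trailing run
                if ch in cutlist:
                    for piece in (buf, ch):
                        if piece.strip():
                            yield piece.strip()
                    buf = ""
                else:
                    buf += ch

# Write a small GBK file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="gbk") as handle:
        handle.write("你好，世界！\n今天天气很好。")
    result = list(sentences_from_file(path))

print(result)
```

Letting open() decode the file avoids calling decode('gbk') on every line by hand.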
Pay attention to the formatting (indentation) when copying the code. In any case, Chinese sentence splitting finally works. You can remove entries from cutlist or add new ones as needed.
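As for the re.sub() attempt mentioned earlier, one possible explanation (a guess, not something confirmed here) is that the pattern was applied to undecoded GBK bytes instead of decoded text. A minimal sketch of the replacement working on decoded text, assuming Python 3 and a made-up delimiter set and helper name:

```python
import re

# Hypothetical delimiter set for illustration.
DELIMS = "。！？；，、.!?;,"

def delimiters_to_spaces(text):
    """Replace every delimiter character in text with a single space."""
    pattern = "[" + re.escape(DELIMS) + "]"
    return re.sub(pattern, " ", text)

raw = "你好，世界！今天天气很好。".encode("gbk")
text = raw.decode("gbk")   # decode first, so the pattern matches whole characters
parts = delimiters_to_spaces(text).split()
print(parts)
```

Unlike the character-by-character method, this version discards the punctuation instead of keeping it as separate items.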