How to split Chinese text into clauses in Python

Source: Internet
Author: User

Chinese text read from ordinary files, such as .txt files, is usually encoded in GBK. I was not sure what the text becomes after decode('gbk'), but the result should be a Unicode string.
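To make the encoding question concrete, here is a minimal sketch in Python 3 (the code in this article is Python 2, where decode returns a unicode object; in Python 3 the result is a str). The sample text is made up for illustration and does not come from the original post.

```python
# Sample text is a made-up illustration.
text = "你好，世界。"

raw = text.encode("gbk")       # bytes, as they would sit in a GBK-encoded .txt file
decoded = raw.decode("gbk")    # back to a Unicode string

print(type(raw).__name__)      # bytes
print(type(decoded).__name__)  # str
print(len(raw), len(decoded))  # GBK uses two bytes for each of these characters
```

Iterating over raw yields single bytes, while iterating over decoded yields whole characters, which is what a character-by-character splitter needs.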

I don't know whether there is a better way to handle this. In any case, Chinese word segmentation has to start with sentence segmentation. I first tried string.maketrans() and re.sub(), but neither managed to convert the punctuation marks to spaces, probably because of encoding problems. So I fell back on a crude method for splitting clauses: read the text one character at a time and cut a clause whenever a delimiter is found. Also, if the text is GBK-encoded, you must call decode('gbk') first; you cannot treat it as UTF-8 or decode GBK text as gb2312 instead. Otherwise the sentences you split will be garbled. I don't know why, although a likely reason is that GBK stores each Chinese character as two bytes, so working on undecoded bytes can cut a character in half. The function is below:

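    def findtok(cutlist, char):
        # Not shown in the original post; assumed to test whether
        # char is one of the delimiters in cutlist.
        return char in cutlist

    def cut(cutlist, lines):
        l = []      # finished pieces: clause, delimiter, clause, ...
        line = []   # characters of the clause being built

        for i in lines:
            if findtok(cutlist, i):
                l.append("".join(line))  # close the current clause
                l.append(i)              # store the delimiter separately
                line = []
            else:
                line.append(i)
        if line:
            # keep trailing text that is not followed by a delimiter
            l.append("".join(line))
        return l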
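To see what this function returns, the sketch below restates it in Python 3 and runs it on a made-up sentence. The findtok helper is not shown in the original post, so it is assumed here to be a simple membership test; the flush of trailing text after the loop is also an addition for illustration.

```python
def findtok(cutlist, char):
    # Assumption: findtok is not shown in the post; treat it as a membership test.
    return char in cutlist

def cut(cutlist, chars):
    pieces, current = [], []
    for ch in chars:
        if findtok(cutlist, ch):
            pieces.append("".join(current))  # clause collected so far
            pieces.append(ch)                # delimiter kept as its own item
            current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))      # text after the last delimiter
    return pieces

result = cut(set("。！？"), "下雨了！？带伞。")
print(result)
```

Two adjacent delimiters produce an empty string between them, so callers need to skip empty or whitespace-only pieces.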

 

Then read the file line by line and split each line into sentences. The result returned by the function above includes the punctuation marks, which are stored as separate items, and it may also contain whitespace.

    # Python 2. This source file is assumed to be saved as GBK, hence the decode.
    # The full-width Chinese punctuation in the original list was flattened to
    # ASCII when the post was copied; add the marks you need to this string.
    cutlist = "[.,!<>\"':?/|;]{}()~-+%\n\r".decode('gbk')
    for lines in open(inputfilename):
        l = cut(list(cutlist), list(lines.decode('gbk')))
        for line in l:
            if line.strip() != "":  # a piece may contain only whitespace
                li = line.strip().split()
                for sentence in li:
                    print "SE:", sentence
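For readers on Python 3, the same read-and-split loop might look like the sketch below. It is not the author's code: the delimiter set, the function names split_line and sentences_from_file, and the temporary demo file are all assumptions for illustration. Unlike the loop above, it drops the delimiters instead of storing them as separate items.

```python
import os
import tempfile

DELIMITERS = set("。！？；\n\r")  # assumption: a small illustrative delimiter set

def split_line(line, delimiters=DELIMITERS):
    """Return the non-empty text pieces between delimiters in one line."""
    pieces, current = [], []
    for ch in line:
        if ch in delimiters:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    # str.split() with no arguments also discards empty and whitespace-only pieces
    return [word for piece in pieces for word in piece.split()]

def sentences_from_file(path, encoding="gbk"):
    """Read a text file in the given encoding and split every line."""
    result = []
    with open(path, encoding=encoding) as handle:  # open() decodes for us
        for line in handle:
            result.extend(split_line(line))
    return result

# Demo with a temporary GBK file holding made-up content.
with tempfile.NamedTemporaryFile("w", encoding="gbk", suffix=".txt", delete=False) as tmp:
    tmp.write("今天下雨。你带伞了吗？\n没有！")
    demo_path = tmp.name

demo_sentences = sentences_from_file(demo_path)
os.remove(demo_path)
for sentence in demo_sentences:
    print("SE:", sentence)
```

Passing encoding="gbk" to open() replaces the explicit decode('gbk') call, so each line is already a Unicode string when it reaches the splitter.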

Pay attention to the formatting of the code. In any case, the Chinese clause splitting finally works. You can trim or extend cutlist as needed.
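The text earlier mentions that re.sub() did not seem to work. On decoded (Unicode) text a character class does match Chinese punctuation, so a regular-expression variant is possible. The sketch below is an alternative technique in Python 3, not the author's method; the delimiter set and the function names are assumptions for illustration.

```python
import re

# Assumption: an illustrative delimiter set; extend it the same way as cutlist.
DELIMS = "。！？；，,.!?;"
pattern = re.compile("([" + re.escape(DELIMS) + "])")

def split_keep_delims(text):
    """Split at delimiters, keeping each delimiter as its own item."""
    # The capturing group makes re.split return the delimiters as well.
    return [piece for piece in pattern.split(text) if piece.strip()]

def split_drop_delims(text):
    """Replace every delimiter with a space, then split on whitespace."""
    return pattern.sub(" ", text).split()

sample = "今天下雨。你带伞了吗？没有"
print(split_keep_delims(sample))
print(split_drop_delims(sample))
```

Both helpers expect text that has already been decoded; applying the same pattern to raw GBK bytes would not match the multi-byte punctuation.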

 

 
