Counting the Characters in the Records of the Grand Historian (Shiji)

Last Update:2016-03-09 Source: Internet

Author: User


I hadn't read anything in a long while, and I suddenly felt like reading some history to class myself up a bit. So I went online and downloaded the original text of the Records of the Grand Historian (Shiji).

I didn't know whether the download was complete, so on a whim I decided to count the total number of characters and see whether the text was roughly all there. No sooner said than done. Plenty of languages would do the job: C#, PHP, Java, C++, Python, VB. Which one? I had only recently started using Python and it still felt fresh, so Python it was. I did not expect to fall straight into a pit. (If I were given the chance to choose again, I would pick Java; I had no idea that handling Chinese in Python 2.7 would be such hard work.)

Having settled on Python, I had to pick a version. My computer has both 3.5 and 2.7 installed. To be honest, I had used 3.5 only once, when a web crawler needed it. I mostly use 2.7, often for small scientific calculations, so 2.7 it was (one more step into the pit).

I started writing code and stumbled at the very first step: I couldn't even list the files in the folder. What was going on? A quick search on Baidu explained it: the path cannot contain Chinese characters as they are, and a path that does contain them has to be converted to Unicode first. I put up with it; after all, I had run into the same thing when writing web code that downloaded files with Chinese names. So I changed the code by adding unicode(originpath, "utf-8"), and it ran without errors.

Now that the files could be listed, I looped over each of them. No problems there; they could all be read. The next step was to decide whether each character is a punctuation mark or a Chinese character. Another search showed that, once the text is Unicode, a Chinese character can be identified as follows:

# Check whether a character is a Chinese character
def is_chinese(uchar):
    if uchar >= u'\u4E00' and uchar <= u'\u9FA5':
        return True
    else:
        return False
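
As a rough illustration of how this check gets used for counting, here is a minimal sketch written for Python 3, where every str is already Unicode (the article itself uses Python 2.7). The helper name count_chinese and the sample sentence are made up for the example. Note that the range U+4E00 to U+9FA5 covers the common CJK ideographs only, so punctuation, Latin letters and rarer characters from the extension blocks are not counted.

```python
def is_chinese(uchar):
    """Return True if uchar lies in the range U+4E00..U+9FA5."""
    return "\u4e00" <= uchar <= "\u9fa5"


def count_chinese(text):
    """Count the characters of text that pass is_chinese (hypothetical helper)."""
    return sum(1 for ch in text if is_chinese(ch))


# Sample sentence: 8 ideographs plus 2 full-width punctuation marks.
sample = "太史公曰：先人有言。"
print(count_chinese(sample))  # prints 8; the punctuation is skipped
```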

That function decides whether a character is a Chinese character. I added it to the program. To keep a running count I also needed a global variable, and global variables in Python are a little odd in that they cannot be initialized at the point where they are declared. So I refined the code further and tested it, only to find that the output was garbled. Where had I gone wrong? The string f is read straight from the file and contains Chinese, so it has to be converted to Unicode before it is processed. I quickly changed the code.

This time there were no errors, but no characters were printed either. I was printing one character and then breaking out of the loop, so what was happening? I tried printing five characters instead, and that worked. Very strange: did this mean the first character was not Chinese? I checked the file, and the first character was clearly a Chinese character. So what was the reason? Could it be the damned BOM header? It was indeed the BOM playing tricks. I quickly added code to strip the BOM header:

# Strip the BOM header
import codecs

def cut_bom(f):
    if f[:3] == codecs.BOM_UTF8:
        f = f[3:]
    return f
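
To see why the first character looked wrong, the following Python 3 sketch builds a small byte string with a UTF-8 BOM in front; the sample text is invented for the example. Decoded as plain UTF-8, the BOM survives as the invisible character U+FEFF, which is not a Chinese character. Stripping the three BOM bytes first, or decoding with the standard utf-8-sig codec, removes it.

```python
import codecs


def cut_bom(data):
    """Drop a leading UTF-8 BOM from a byte string, if there is one."""
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]
    return data


# Roughly what the start of a BOM-prefixed file looks like (sample text is made up).
raw = codecs.BOM_UTF8 + "史记".encode("utf-8")

decoded = raw.decode("utf-8")
print(decoded[0] == "\ufeff")                  # True: the BOM is still there as U+FEFF
print(cut_bom(raw).decode("utf-8") == "史记")  # True: BOM bytes removed before decoding
print(raw.decode("utf-8-sig") == "史记")       # True: the codec strips the BOM itself
```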

With the BOM header removed, the characters finally showed up. With all of that taken care of, only the final counting step was left, and it worked perfectly: the total came to a little over 510,000 Chinese characters. I looked up the description of the Records on Baidu to check, and my count was off by a little over 10,000 characters. Close enough. After all, the Records have been abridged in places, and later generations added a good deal of material too. These 500,000-odd characters are plenty for me to read.

Counting Chinese characters is not hard. In Java I estimate it would have taken 10 minutes to write; with Python most of the time went on dealing with garbled text, though I now basically understand the principle behind it. For a deeper understanding, it is still worth reading up on the correct use of Unicode in Python.
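
Putting the steps together, the whole count might look like the following Python 3 sketch. It is not the author's original 2.7 script; the function names and the assumption that every file in the folder is UTF-8 text are made for illustration. Each function returns its count, which avoids the need for a global variable.

```python
import codecs
import os


def is_chinese(ch):
    """Return True if ch lies in the range U+4E00..U+9FA5."""
    return "\u4e00" <= ch <= "\u9fa5"


def count_chinese_in_file(path):
    """Read one file as bytes, strip a UTF-8 BOM if present, count ideographs."""
    with open(path, "rb") as handle:
        data = handle.read()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    text = data.decode("utf-8")  # assumption: the file is UTF-8 encoded
    return sum(1 for ch in text if is_chinese(ch))


def count_chinese_in_dir(dir_path):
    """Sum the counts of all regular files directly inside dir_path."""
    total = 0
    for name in sorted(os.listdir(dir_path)):
        path = os.path.join(dir_path, name)
        if os.path.isfile(path):
            total += count_chinese_in_file(path)
    return total
```

Calling count_chinese_in_dir on the folder that holds the downloaded text files and printing the result would give the grand total.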

From WizNote (Wiz)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
