Counting the characters in the Records of the Grand Historian (Shiji)

Source: Internet
Author: User

It has been a long time since I read anything, and I suddenly wanted to read some history to make myself look a bit more cultured. So I downloaded the original text of the Records of the Grand Historian (Shiji) from the Internet.

I wasn't sure whether the download was complete, so on a whim I decided to count the total number of characters and see whether the text was more or less all there. No sooner said than done. Plenty of languages would do: C#, PHP, Java, C++, Python, VB. Which one? I had only recently picked up Python and it still felt fresh, so Python it was. I did not expect to fall straight into a pit. (If I could choose again I would pick Java; I had no idea that handling Chinese text in Python 2.7 would be so laborious.)

Having settled on Python, I had to pick a version. The computer had both 3.5 and 2.7 installed. To be honest, I had used 3.5 only once, for a web crawler. I normally use 2.7, often for small scientific calculations, so 2.7 it was (one more step into the pit).

I started writing code and stumbled at the very first step: all I wanted was to list the files in the directory, and even that raised an error. I hurriedly asked Baidu and learned that a path containing Chinese characters cannot be passed as is; it has to be converted to Unicode first. I put up with it, since I had met the same problem when handling downloads of files with Chinese names in web code. After wrapping the path as unicode(originpath, "utf-8"), the error went away and the files could be listed. Next I looped over the files and read each one; that looked fine, and every file could be read.

Then I needed to tell whether each character is a Chinese character or punctuation. Another question for Baidu: once the text is Unicode, a Chinese character can be identified as follows:
    # Check whether a character is a Chinese character
    def is_chinese(uchar):
        if uchar >= u'\u4E00' and uchar <= u'\u9FA5':
            return True
        else:
            return False
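As an illustration of how the check above can drive a count, here is a minimal sketch. It is written for Python 3, where every str is already Unicode, rather than the Python 2.7 used in this post, and the helper name count_chinese is a hypothetical choice, not taken from the original program.

```python
def is_chinese(uchar):
    """Return True if uchar lies in the range U+4E00..U+9FA5 used above."""
    return "\u4e00" <= uchar <= "\u9fa5"


def count_chinese(text):
    """Count the characters of text that pass is_chinese (hypothetical helper)."""
    return sum(1 for ch in text if is_chinese(ch))


# Made-up sample: Latin letters, digits and punctuation are not counted.
sample = "太史公曰:Shiji, 130篇。"
print(count_chinese(sample))
```

Punctuation such as the full-width colon and full stop falls outside the range, so only the Chinese characters themselves contribute to the total.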
That is the function for checking whether a character is Chinese. I added it to the program. To keep the count I also needed a global variable, and global variables in Python are a little odd: they cannot be declared and initialized in one statement. I improved the code accordingly and tested it, only to get garbled output. What was wrong? f is the string read directly from the file, and it contains Chinese, so it must be converted to Unicode before it is processed. I quickly changed the code.

This time there was no error, but no characters were printed either. I was printing one character and then breaking out of the loop, so I tried printing five characters instead, and that worked. Very strange: did this mean the first character was not Chinese? A quick look at the file showed that it did start with a Chinese character. So what was the reason? Could it be that damned BOM header?

Indeed, the BOM header was the culprit. I quickly added code to strip it:
    import codecs

    # Strip the UTF-8 BOM header (f must be the raw byte string, before decoding)
    def cut_bom(f):
        if f[:3] == codecs.BOM_UTF8:
            f = f[3:]
        return f
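To see why the first character did not look like a Chinese character, the sketch below builds a small UTF-8 byte string with a BOM by hand. It assumes Python 3, and the sample text is made up for illustration. Decoding with "utf-8" keeps the BOM as the invisible character U+FEFF, while the standard "utf-8-sig" codec drops it, which is an alternative to slicing off the first three bytes.

```python
import codecs

# Hypothetical file content: UTF-8 BOM followed by two Chinese characters.
raw = codecs.BOM_UTF8 + "史记".encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # the codec strips the BOM

print(repr(with_bom[0]))     # '\ufeff'
print(repr(without_bom[0]))  # '史'
```

U+FEFF lies outside the U+4E00..U+9FA5 range, so a loop that inspects only the first character and then breaks would print nothing.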
With the BOM header removed, only the last step remained: the count itself. It worked perfectly, and the final tally came to a little over 510,000 Chinese characters. I looked up the length of the Shiji on Baidu and compared: off by ten thousand or so characters, basically about right. After all, the Shiji has been cut and edited over the centuries, and later generations added a good deal too. These 500,000 characters are enough for me to read.

Counting Chinese characters is not hard. In Java I reckon it would have taken ten minutes; with Python most of the time went into dealing with garbled text, though I now basically understand the principle. It is still worth reading up on the correct use of Unicode in Python for a deeper understanding.
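Putting the steps of this post together, a rough sketch of the whole count might look like the following. It is not the original program: it assumes Python 3, UTF-8 text files sitting directly inside one directory, and hypothetical names (count_file, count_directory). Opening each file with encoding="utf-8-sig" handles both the decoding and the BOM.

```python
import os


def is_chinese(ch):
    """Same range check as in the post: U+4E00..U+9FA5."""
    return "\u4e00" <= ch <= "\u9fa5"


def count_file(path):
    """Count Chinese characters in one UTF-8 file, ignoring a leading BOM."""
    with open(path, encoding="utf-8-sig") as handle:
        return sum(1 for ch in handle.read() if is_chinese(ch))


def count_directory(directory):
    """Sum the counts of all regular files directly inside directory."""
    total = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            total += count_file(path)
    return total
```

Calling count_directory on the folder that holds the downloaded text returns a single total, and because the count is passed back as a return value no global variable is needed.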





