Counting the Characters in the Records of the Grand Historian (Shiji)

Source: Internet
Author: User


I haven't studied in a long time, and I suddenly wanted to read some history to improve myself. So I downloaded the full text of the Records of the Grand Historian (Shiji) from the Internet.

I didn't know whether the download was a complete set, so on a whim I decided to count the total number of characters to see whether it was basically complete. Plenty of languages would do: C#, PHP, Java, C++, Python, VB. Which one? I had only recently started getting used to Python and it still felt fresh, so I picked Python, and promptly fell into a trap. (If God gave me another chance, I would definitely choose Java. I did not expect Python 2.7 to have such a hard time with Chinese characters.)

First, pick a version. My computer has both Python 3.5 and Python 2.7 installed. To be honest, I have used 3.5 only once, when a crawler needed it; I normally use 2.7 for data work and small scientific calculations. So 2.7 it was, taking things one step at a time.

Then I started writing code, and fell flat at the very first step. My goodness, what was the problem? All I wanted was to check whether listing the files worked. I hurried off to ask Baidu and learned that a path containing Chinese characters cannot be passed as is; it has to be converted to unicode. Fair enough: I had run into the same thing in web development, when handling downloads of files with Chinese names. So I changed the code, and this time there was no error. All I added was unicode(originpath, "UTF-8"), and the files could be enumerated.

Next, I traversed the files one by one. That looked fine, and everything could be read. The next step is to decide whether each character is a symbol or a Chinese character. If you know Unicode, you can use the following function to determine whether a character is Chinese:
    # Determine whether a character is a Chinese character
    def is_chinese(uchar):
        # True when uchar lies in the range U+4E00..U+9FA5
        if uchar >= u'\u4E00' and uchar <= u'\u9FA5':
            return True
        else:
            return False
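
The enumeration and reading steps described earlier, combined with this check, might look like the following in Python 3. This is only a sketch and not the author's original Python 2.7 code: the helper name read_all_files, the temporary folder, and the sample file are all made up for illustration. In Python 3 every str is already Unicode, so the unicode(originpath, "UTF-8") conversion is not needed there.

```python
import os
import tempfile


def is_chinese(uchar):
    """Return True if uchar lies in the U+4E00..U+9FA5 range used in the article."""
    return '\u4e00' <= uchar <= '\u9fa5'


def read_all_files(origin_path):
    """Yield (file name, decoded text) for each regular file directly under origin_path."""
    for name in sorted(os.listdir(origin_path)):
        full_path = os.path.join(origin_path, name)
        if os.path.isfile(full_path):
            # Assumption: the downloaded files are UTF-8 encoded.
            with open(full_path, encoding='utf-8') as handle:
                yield name, handle.read()


# Hypothetical demo data: a temporary folder with a Chinese name and one small file.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, '史记')
    os.mkdir(folder)
    with open(os.path.join(folder, '本纪.txt'), 'w', encoding='utf-8') as handle:
        handle.write('黄帝者，少典之子。')
    results = list(read_all_files(folder))

name, text = results[0]
# One flag per character: punctuation such as '，' and '。' falls outside the range.
flags = [is_chinese(ch) for ch in text]
print(len(text), sum(flags))
```

The full-width comma and full stop are counted as symbols, so only the seven ideographs in the sample pass the check.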
The function above determines whether a character is Chinese. Next, the program needs a global variable to hold the running character count. Global variables in Python are a bit odd and cannot be initialized at the point where they are declared, so I reworked the code further.

I tested the code and the output was garbled. Where was the mistake? f is a string read directly from the file, and it contains Chinese characters, so it must be converted to Unicode before processing. I changed the code again.

My God. No error this time, but no Chinese characters were printed either. I had been printing a single character as a test. Why nothing? I changed it to five characters and tried again. The output was strange. Did it mean that the first character was not Chinese? I looked at the file: it starts with Chinese characters. So why? Could it be that hateful BOM header?
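
That suspicion can be checked on a small scale. The sketch below, written for Python 3 with made-up sample bytes, shows how a UTF-8 BOM survives ordinary decoding as an invisible first character (U+FEFF), which is why a test on the first character reports "not Chinese". The 'utf-8-sig' codec used at the end is a standard-library alternative, not something the original post mentions.

```python
import codecs


def is_chinese(uchar):
    """Return True if uchar lies in the U+4E00..U+9FA5 range used in the article."""
    return '\u4e00' <= uchar <= '\u9fa5'


# Hypothetical file content: a UTF-8 BOM followed by four Chinese characters.
raw = codecs.BOM_UTF8 + '太史公曰'.encode('utf-8')

# Plain 'utf-8' decoding keeps the BOM as the first character, U+FEFF.
with_bom = raw.decode('utf-8')
first_is_chinese = is_chinese(with_bom[0])

# The 'utf-8-sig' codec drops a leading BOM while decoding.
without_bom = raw.decode('utf-8-sig')

print(len(with_bom), first_is_chinese, len(without_bom))
```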
    import codecs

    # Remove the BOM header
    def cut_bom(f):
        # f is the raw byte string read from the file
        if f[:3] == codecs.BOM_UTF8:
            f = f[3:]
        return f
With the function above, the BOM header is removed. After that, all that remains is the final tally. Perfect. In the end, the count comes to more than 510,000 Chinese characters. I looked up how Baidu describes the Records of the Grand Historian, and the figure it gives is basically in line with mine. After all, the Records have been cut and revised over time, and later generations added a good deal of material. These 510,000 characters are enough for me to read.

Counting Chinese characters is not difficult. In Java I estimate I would have finished in 10 minutes. The principle was clear from the start; with Python, the time went into dealing with garbled text. I should read up on Unicode in Python to understand it more deeply.
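
Putting the pieces together, the final tally could be sketched as follows in Python 3. This is an illustration under stated assumptions, not the author's script: count_chinese is a hypothetical helper, the two byte strings stand in for the contents of downloaded files, and the count is returned from a function instead of being kept in a global variable.

```python
import codecs


def is_chinese(uchar):
    """Return True if uchar lies in the U+4E00..U+9FA5 range used in the article."""
    return '\u4e00' <= uchar <= '\u9fa5'


def cut_bom(data):
    """Strip a leading UTF-8 BOM from raw bytes, as in the article's helper."""
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]
    return data


def count_chinese(raw_chunks):
    """Return the total number of Chinese characters across raw UTF-8 byte strings."""
    total = 0
    for raw in raw_chunks:
        # Remove the BOM first, then decode so each element is one character.
        text = cut_bom(raw).decode('utf-8')
        total += sum(1 for ch in text if is_chinese(ch))
    return total


# Hypothetical stand-ins for the bytes read from two downloaded files.
chunks = [
    codecs.BOM_UTF8 + '黄帝者，少典之子。'.encode('utf-8'),
    '太史公曰：学者多称五帝。'.encode('utf-8'),
]
total = count_chinese(chunks)
print(total)  # prints 17: punctuation is not counted
```

Returning the total from a function sidesteps the global-variable awkwardness altogether; a module-level counter updated through a global statement would work as well.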

    From Weizhi note (Wiz)
