Use BeautifulSoup to crawl "0daydown" website information (2) -- solving the character encoding problem

Source: Internet
Author: User

The program in the previous article crawled the latest 10 pages of 0daydown and printed the results directly to the console. While improving the code, I decided to write the output to a TXT file instead, and that is where the problem appeared.

At first, my code was as follows:

#-*- coding: utf-8 -*-
#-----------------------------------
# version: 0.1
# note:    Crawl the latest 10 pages of 0daydown and print the results
#-----------------------------------
# version: 0.2
# note:    Based on v0.1, write the output to a specified TXT file
#-----------------------------------
import urllib.request
import sys
import locale
from bs4 import BeautifulSoup

print(locale.getdefaultlocale())

old = sys.stdout                # save the default system output
fp = open("test1.txt", 'w')
#fp = open("test1.txt", 'w', encoding="utf-8")   # use utf-8 as the file encoding
sys.stdout = fp                 # redirect output to the file

for i in range(1, 11):
    url = "http://www.0daydown.com/page/" + str(i)    # append the page number to the URL
    page = urllib.request.urlopen(url)
    soup_packtpage = BeautifulSoup(page)
    page.close()

    num = "The Page of: " + str(i)    # mark the page the current resources belong to
    print(num)
    print("#" * 40)

    # use find_all to find all resources published on the current page
    for article in soup_packtpage.find_all('article', class_="excerpt"):
        print("Category:".ljust(20), end='')
        print(article.header.a.next)                                   # category
        print("Title:".ljust(20), end='')
        print(article.h2.string)                                       # title
        print("Published_time:".ljust(19), end='')
        print(article.p.find('i', class_="icon-time icon12").next)     # published time
        print("Note:", end='')
        print(article.p.find_next_sibling().string)                    # note
        print('-' * 50)

fp.close()
sys.stdout = old    # restore the default system output
print("Done!")
input()             # wait for input so the console does not close immediately

Running the script, however, raised an error. The traceback is as follows:

Traceback (most recent call last):
  File "E:\codefile\Soup\0daydown - 0.2.py", line 37, in <module>
    print(article.p.find_next_sibling().string)    #note
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 117: illegal multibyte sequence

The message makes the nature of the problem clear: it is a Unicode encoding error, and the GBK codec cannot encode the character \xa0. I had read many articles and documents about character encoding before, but only half understood them.
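The error is easy to reproduce in isolation. The sketch below is my own minimal example, not part of the original program; it shows that UTF-8 can encode U+00A0 while GBK cannot:

```python
# Minimal reproduction of the error (works in any Python 3 interpreter).
# U+00A0 is the non-breaking space that HTML writes as &nbsp;.
ch = '\xa0'

print(ch.encode('utf-8'))      # UTF-8 can represent it: b'\xc2\xa0'

try:
    ch.encode('gbk')           # GBK has no code point for U+00A0...
except UnicodeEncodeError as e:
    print(e)                   # ...so this raises the same error as the crawler
```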

At the beginning I had no clue. I looked up some articles online and started experimenting with encode() and decode(), but that was useless: there was still a problem with the output. No exception was thrown, but no Chinese characters were visible at all; everything came out as \x.. escape sequences instead.
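In hindsight, those \x.. escapes have a simple explanation: encode() turns a str into a bytes object, and printing bytes shows their repr rather than readable text. A small illustration of my own (the sample string is arbitrary):

```python
s = "0天下载 test"                 # a str containing Chinese characters
b = s.encode("utf-8")            # encode() produces bytes, not printable text

print(b)                         # bytes repr: the \x.. escapes I was seeing
print(b.decode("utf-8"))         # decode() restores the readable str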

With only a shaky grasp of character encodings and character sets, how could I solve this? So I searched for related articles; here is a link to one that I thought was good:

The article is long, but the author summarizes the topic in great detail, and I gained a lot from reading it.

Then I wondered: why is an error raised only when writing to a file? Output to the command line had no such problem. Is it a problem with the file itself, with the file's encoding? Following that question, I found an article about Python 3. The link is as follows:

The article covers file encoding in Python 3: the encoding can be specified when a file is opened, but if it is not specified, what encoding does the file use by default? The article explains this in detail.
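The default is easy to check for yourself: when open() is called in text mode without encoding=, Python 3 falls back to locale.getpreferredencoding(False), which is GBK (cp936) on Chinese Windows and usually UTF-8 on Linux and macOS. A quick sketch of my own to verify this (the file name is arbitrary):

```python
import locale
import os
import tempfile

# The encoding open() falls back to when encoding= is omitted
print(locale.getpreferredencoding(False))

path = os.path.join(tempfile.gettempdir(), "enc_check.txt")

# A text-mode file object records the encoding it was opened with
with open(path, "w") as f:                    # no encoding= : platform default
    print(f.encoding)
with open(path, "w", encoding="utf-8") as f:  # explicit: utf-8 everywhere
    print(f.encoding)

os.remove(path)
```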

In my source code the file was opened with fp = open("test1.txt", 'w'), and that is the call behind the exception. Since no encoding was specified, the file was opened with the platform default, which on my Windows system is GBK, and GBK cannot encode the character \xa0. Looking that character up, it turns out to be the HTML non-breaking space, &nbsp;. The page being crawled uses UTF-8 as its default encoding, and UTF-8 can encode this character. So can we specify the encoding of the output file? The answer is yes: open() takes an encoding parameter for exactly this purpose. What happens if we set it to "utf-8"? The corrected code follows; the only change is that fp = open("test1.txt", 'w') becomes fp = open("test1.txt", 'w', encoding="utf-8"):

#-*- coding: utf-8 -*-
#-----------------------------------
# version: 0.1
# note:    Crawl the latest 10 pages of 0daydown and print the results
#-----------------------------------
# version: 0.2
# note:    Based on v0.1, write the output to a specified TXT file
#-----------------------------------
import urllib.request
import sys
from bs4 import BeautifulSoup

old = sys.stdout                # save the default system output
#fp = open("test1.txt", 'w')
fp = open("test1.txt", 'w', encoding="utf-8")   # use utf-8 as the file encoding
sys.stdout = fp                 # redirect output to the file

for i in range(1, 11):
    url = "http://www.0daydown.com/page/" + str(i)    # append the page number to the URL
    page = urllib.request.urlopen(url)
    soup_packtpage = BeautifulSoup(page)
    page.close()

    num = "The Page of: " + str(i)    # mark the page the current resources belong to
    print(num)
    print("#" * 40)

    # use find_all to find all resources published on the current page
    for article in soup_packtpage.find_all('article', class_="excerpt"):
        print("Category:".ljust(20), end='')
        print(article.header.a.next)                                   # category
        print("Title:".ljust(20), end='')
        print(article.h2.string)                                       # title
        print("Published_time:".ljust(19), end='')
        print(article.p.find('i', class_="icon-time icon12").next)     # published time
        print("Note:", end='')
        print(article.p.find_next_sibling().string)                    # note
        print('-' * 50)

fp.close()
sys.stdout = old    # restore the default system output
print("Done!")
input()             # wait for input so the console does not close immediately
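As a side note, redirecting sys.stdout works, but it is fragile: if an exception fires between the reassignment and the restore, all later console output silently goes to the file. A cleaner pattern, sketched below with my own placeholder output rather than the author's scraping loop, passes the file object to print() directly:

```python
# Sketch: write to the file without touching sys.stdout.
# The with-block closes the file even if an exception occurs.
with open("test1.txt", "w", encoding="utf-8") as fp:
    for i in range(1, 11):
        print("The Page of: " + str(i), file=fp)
        print("#" * 40, file=fp)
        # ...the find_all loop would print its fields with file=fp as well...
```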

After this change the script ran without errors and the file was written successfully. Opening the file shows the following:


The contents of the file match the earlier command-line output, so the problem is completely solved. In addition, I took some time today to learn about GitHub. I had heard of it before, and after reading the introduction I found it very powerful. I followed the official Hello World tutorial and registered an account; once everything is ready, I will put the code up there.


