Information on crawling "0daydown" websites using BeautifulSoup (2)--character encoding problem solving

Source: Internet
Author: User

The program in the previous article crawled the latest 10 pages of 0daydown and printed the results directly to the console. While improving the code, I decided to write the output to a TXT file instead, and that is where the problem appeared.

My initial code was as follows:

# -*- coding: utf-8 -*-
# -------------------------------------
# version: 0.1
# note: Crawl the latest resources from the first 10 pages of 0daydown.
# -------------------------------------
# -------------------------------------
# version: 0.2
# note: Based on v0.1, write the output to a specified TXT file.
# -------------------------------------
import urllib.request
import sys
import locale
from bs4 import BeautifulSoup

print(locale.getdefaultlocale())

old = sys.stdout                  # save the system default output
fp = open("test1.txt", 'w')
# fp = open("test1.txt", 'w', encoding="utf-8")   # use utf-8 as the file encoding
sys.stdout = fp                   # redirect output to a file

for i in range(1, 11):
    url = "http://www.0daydown.com/page/" + str(i)   # each page's URL just appends an integer
    page = urllib.request.urlopen(url)
    soup_packtpage = BeautifulSoup(page)
    page.close()
    num = "The page of: " + str(i)   # mark which page the current resources belong to
    print(num)
    print("#" * 40)
    # use find_all to find all the latest resources published on the current page
    for article in soup_packtpage.find_all('article', class_="excerpt"):
        print("Category:".ljust(20), end='')
        print(article.header.a.next)                 # category
        print("Title:".ljust(20), end='')
        print(article.h2.string)                     # title
        print("Published_time:".ljust(20), end='')
        print(article.p.find('i', class_="icon-time icon12").next)   # published_time
        print("Note:", end='')
        print(article.p.find_next_sibling().string)  # note
        print('-' * 50)

fp.close()
sys.stdout = old                  # restore the system default output
print("Done!")
input()   # wait for input so the console does not close immediately after finishing

The error message after running the file:

Traceback (most recent call last):
  File "E:\codefile\Soup\0daydown-0.2.py", line …, in <module>
    print(article.p.find_next_sibling().string)    # note
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 117: illegal multibyte sequence

As the traceback shows, this is a Unicode encoding error: the GBK codec cannot encode the character '\xa0'. I read many articles and looked up a lot of material on character encodings. As a novice there was no shortcut, but I managed to understand the basics.
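The failure is easy to reproduce in isolation. A minimal sketch (the sample string below is made up, not taken from the crawl) showing that GBK rejects '\xa0' while UTF-8 encodes it without complaint:

```python
# '\xa0' is U+00A0, the non-breaking space common in HTML pages.
text = "Note:\xa0something"   # hypothetical string containing the offending character

try:
    text.encode("gbk")
except UnicodeEncodeError as e:
    print(e)                  # 'gbk' codec can't encode character '\xa0' ...

# UTF-8 has no such gap: it encodes U+00A0 as the two bytes 0xC2 0xA0.
print(text.encode("utf-8"))
```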

At first I had no clue at all. Imitating some articles I found online, I started blindly chaining encode() and decode() calls, which got me nowhere: the output either still raised the exception or, when it didn't, the Chinese characters were gone, replaced by strings of \x escape sequences.
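That "\x soup" has a simple explanation. A short sketch (the sample string is illustrative, not from the original crawl): encode() turns a str into a bytes object, and printing bytes shows raw hexadecimal escapes instead of readable characters.

```python
s = "中文"                  # sample Chinese text
b = s.encode("utf-8")       # now a bytes object, not a str

print(b)                    # b'\xe4\xb8\xad\xe6\x96\x87' -- the "\x" escapes
print(b.decode("utf-8"))    # decoding with the same codec restores the text
```

So printing the result of encode() was never going to display Chinese; the fix has to happen where the text is written out, not by round-tripping it through bytes.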

To trace the problem to its source: I wasn't even clear on the basics of character encodings and character sets, so how could I solve it? So I searched for relevant articles; the one I found most helpful is linked below:

Character encoding in detail. Although this article is long, the author's summary is very thorough, and I learned a great deal from it.

So I wondered: why does writing to a file raise an error when printing to the command line does not? Is there a problem with the file, specifically with the file's encoding? Following that question, I found a very good article about files in Python 3:

Files in Python 3. This article explains file encodings in detail: you can specify the encoding when opening a file, and it also answers what encoding a file gets by default when you don't.
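The default can be checked directly. A sketch (file name chosen to match the article): per the Python documentation, text-mode open() without an encoding argument falls back to locale.getpreferredencoding(False), which on a Chinese Windows system is typically 'cp936' (GBK); passing encoding explicitly removes that dependency on the OS locale.

```python
import locale

# what open() will use when no encoding is given
print(locale.getpreferredencoding(False))

# specifying the encoding makes the write succeed regardless of system codepage
with open("test1.txt", "w", encoding="utf-8") as fp:
    fp.write("Note:\xa0something")   # '\xa0' is fine under utf-8
```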

The way I opened the file in my source code was fp = open("test1.txt", 'w'), and that is what threw the exception. The traceback above tells us that with a default open(), the file encoding on my system is GBK, and GBK cannot encode the character '\xa0'. Looking that character up, it turns out to be the non-breaking space used in HTML. The pages being crawled default to UTF-8, which can encode this character. So can we specify the file's encoding ourselves? Yes: open() also takes an encoding parameter for exactly this purpose. What happens if we set it to utf-8? Below is the corrected code; the only change is replacing fp = open("test1.txt", 'w') with fp = open("test1.txt", 'w', encoding="utf-8"):

# -*- coding: utf-8 -*-
# -------------------------------------
# version: 0.1
# note: Crawl the latest resources from the first 10 pages of 0daydown.
# -------------------------------------
# -------------------------------------
# version: 0.2
# note: Based on v0.1, write the output to a specified TXT file.
# -------------------------------------
import urllib.request
import sys
from bs4 import BeautifulSoup

old = sys.stdout                  # save the system default output
# fp = open("test1.txt", 'w')
fp = open("test1.txt", 'w', encoding="utf-8")   # use utf-8 as the file encoding
sys.stdout = fp                   # redirect output to a file

for i in range(1, 11):
    url = "http://www.0daydown.com/page/" + str(i)   # each page's URL just appends an integer
    page = urllib.request.urlopen(url)
    soup_packtpage = BeautifulSoup(page)
    page.close()
    num = "The page of: " + str(i)   # mark which page the current resources belong to
    print(num)
    print("#" * 40)
    # use find_all to find all the latest resources published on the current page
    for article in soup_packtpage.find_all('article', class_="excerpt"):
        print("Category:".ljust(20), end='')
        print(article.header.a.next)                 # category
        print("Title:".ljust(20), end='')
        print(article.h2.string)                     # title
        print("Published_time:".ljust(20), end='')
        print(article.p.find('i', class_="icon-time icon12").next)   # published_time
        print("Note:", end='')
        print(article.p.find_next_sibling().string)  # note
        print('-' * 50)

fp.close()
sys.stdout = old                  # restore the system default output
print("Done!")
input()   # wait for input so the console does not close immediately after finishing

After running it, no error was generated and the content was successfully written to the file. Opening the file shows the following:


As you can see, the output matches the earlier command-line output. Problem solved! In addition, I took some time today to learn GitHub. I had heard the name long ago; after reading an introduction I found it very powerful, worked through the official Hello World tutorial, registered an account, and plan to put my code on it.
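As an aside on the stdout-swapping pattern used above: a sketch (not the author's original code) of the same redirection done with contextlib.redirect_stdout, which restores the console output automatically even if the crawl raises an exception halfway through:

```python
import contextlib

with open("test1.txt", "w", encoding="utf-8") as fp:
    with contextlib.redirect_stdout(fp):
        # everything printed inside this block goes to the file, as utf-8
        print("Category:".ljust(20), end="")
        print("Example resource")
# sys.stdout is back to the console here, with no manual old/restore bookkeeping
print("Done!")
```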


