[Solved] BeautifulSoup has already produced a Unicode soup, but the printed output is garbled

Last Update:2014-08-06 Source: Internet

Author: User


[Problem]

Someone ran into the following problem:

Question about using BeautifulSoup to scrape tables and import them into an SAE database (please help me)

Simply put, they used the following code:

    import re,urllib2
    from BeautifulSoup import BeautifulSoup
    from urllib import urlopen
    doc=urllib2.urlopen("http://www.w3school.com.cn/html/html_tables.asp")
    soup = BeautifulSoup(doc,fromEncoding="GB2312")  # It's useless to make changes here.
    a=soup.findAll("td")
    print a

However, the output is garbled.

[Solution process]

1. After testing the code and checking the documentation, the complete code and explanation are as follows:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    """
    Function:
    [Solved] BeautifulSoup has obtained the Unicode soup, but the print output is garbled.
    http://www.crifan.com/beautifulsoup_already_got_unicode_soup_but_print_messy_code

    Author:     Crifan Li
    Version:    2013-05-30
    Contact:    http://www.crifan.com/contact_me/
    """

    import re,urllib2
    from BeautifulSoup import BeautifulSoup
    from urllib import urlopen

    def scrapeW3school():
        html = urllib2.urlopen("http://www.w3school.com.cn/html/html_tables.asp");
        # soup = BeautifulSoup(html);  # this line has the same effect:
        # In actual testing, even without fromEncoding, BeautifulSoup automatically and correctly
        # (detects that the original character encoding is GB2312, and then) parses the page
        # (into a Unicode soup).
        soup = BeautifulSoup(html, fromEncoding="GB2312");
        #print "soup=",soup;
        allTdSoup = soup.findAll("td");
        print "type(allTdSoup)=",type(allTdSoup); # type(allTdSoup)= <class 'BeautifulSoup.ResultSet'>, but it is actually a list
        print "len(allTdSoup)=",len(allTdSoup); # len(allTdSoup)= 32, so the list length here is 32
        print "allTdSoup=",allTdSoup;
        # allTdSoup= [<td>row 1, cell 1</td>, <td>row 1, cell 2</td>, <td>row 2, ......, <td><a href="/tags/tag_tfoot.asp">&lt;tfoot&gt;</a></td>
        # , <td>(garbled text)</td>, <td><a href="/tags/tag_col.asp">&lt;col&gt;</a></td>, <td>(garbled text)</td>,
        # <td><a href="/tags/tag_colgroup.asp">&lt;colgroup&gt;</a></td>, <td>(garbled text)</td>]

        # This looks garbled, but allTdSoup is in fact a list, and each soup in it is
        # internally normal Unicode.
        # It still prints as garbled characters, because:
        # 1. First read the explanation in the official documentation:
        # http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
        # "When you call __str__, prettify or renderContents, you can specify the output encoding.
        #  The default encoding (the one used by str) is UTF-8."
        # So:
        # Printing allTdSoup means printing a list of soups. Each soup in the list (essentially
        # an object) is converted to a string, which calls its __str__ by default.
        # So it is equivalent to:
        # for each soup in allTdSoup:
        #     call the soup's __str__ to obtain the corresponding string (the content of the soup)
        # The combined result is the output you see, such as ["xxx", "xxx", ...],
        # where "xxx" is the result of each soup's __str__.
        # As for the value returned by each soup's __str__:
        # as the official documentation says, the default is UTF-8,
        # so the string obtained here is a UTF-8-encoded string,
        # which is printed to cmd,
        # and cmd uses GBK.
        # So UTF-8-encoded characters, displayed in a GBK cmd, appear garbled.
        # Notes:
        # (1) If you are not familiar with cmd using GBK, see:
        # Setting the character encoding: Simplified Chinese GBK / English
        # http://www.crifan.com/files/doc/docbook/soft_dev_basic/release/html/soft_dev_basic.html#cmd_encoding
        # (2) If you do not understand GBK and UTF-8 themselves, see:
        # Detailed explanation of character encoding
        # http://www.crifan.com/files/doc/docbook/char_encoding/release/html/char_encoding.html
        # (3) The soup itself is Unicode, so, as the official documentation says, you can specify
        # GBK as the output encoding of __str__, and Chinese characters then display correctly.
        for eachTdSoup in allTdSoup:
            print "type(eachTdSoup)=",type(eachTdSoup); # type(eachTdSoup)= <type 'instance'>, i.e. an instance of a BeautifulSoup class
            print "eachTdSoup.string=",eachTdSoup.string; # the string attribute of the soup, i.e. the text inside the tag, is Unicode, so Chinese characters print correctly
            print "type(eachTdSoup.string)=",type(eachTdSoup.string); # note that this is not the unicode type, but: type(eachTdSoup.string)= <class 'BeautifulSoup.NavigableString'>
            print "eachTdSoup=",eachTdSoup; # printing the soup directly is equivalent to eachTdSoup.__str__(), i.e. eachTdSoup.__str__("UTF-8"), so Chinese is garbled
            print "eachTdSoup.renderContents()=",eachTdSoup.renderContents(); # outputs the content itself, also UTF-8 by default, so Chinese is garbled here too
            print "eachTdSoup.__str__('GBK')=",eachTdSoup.__str__('GBK'); # GBK is specified, so Chinese characters display correctly

            # Excerpt of the output:
            # type(eachTdSoup)= <type 'instance'>
            # eachTdSoup.string= row 1, cell 1
            # type(eachTdSoup.string)= <class 'BeautifulSoup.NavigableString'>
            # eachTdSoup= <td>row 1, cell 1</td>
            # eachTdSoup.renderContents()= row 1, cell 1
            # eachTdSoup.__str__('GBK')= <td>row 1, cell 1</td>
            # ......
            # type(eachTdSoup)= <type 'instance'>
            # eachTdSoup.string= (Chinese text, displayed correctly: "defines the group of table columns.")
            # type(eachTdSoup.string)= <class 'BeautifulSoup.NavigableString'>
            # eachTdSoup= <td>(garbled text)</td>
            # eachTdSoup.renderContents()= (garbled text)
            # eachTdSoup.__str__('GBK')= <td>(Chinese text, displayed correctly: "defines the group of table columns.")</td>
            #
            # (4) In addition, BeautifulSoup can guess the encoding from the charset in the HTML.
            # If you are not familiar with this, see:
            # [Summary] Explanation of HTML page source character encoding (charset) formats
            # (GB2312, GBK, UTF-8, ISO8859-1, etc.)
            # http://www.crifan.com/summary_explain_what_is_html_charset_and_common_value_of_gb2312_gbk_utf_8_iso8859_1

    if __name__ == "__main__":
        scrapeW3school();

[Summary]

So:

On the surface, the soup produced by BeautifulSoup looks garbled, but it has actually been parsed correctly (from the original GB2312 encoding) into Unicode.
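The parsing step can be illustrated without BeautifulSoup or a network connection. The following is a minimal Python 3 sketch (the article's own code is Python 2), assuming the page bytes are GB2312-encoded. The helper `decode_page` and the sample text are hypothetical, chosen only for illustration, and are not part of BeautifulSoup.

```python
# Hypothetical helper: decode raw page bytes using the declared charset,
# roughly what an HTML parser does before it builds its Unicode tree.
def decode_page(raw_bytes, declared_encoding):
    return raw_bytes.decode(declared_encoding)

# Sample cell text (an assumption for illustration), as GB2312 bytes on the wire.
original_text = "定义表格列的组。"
raw_bytes = original_text.encode("gb2312")

parsed_text = decode_page(raw_bytes, "gb2312")

# The decoded value is an ordinary Unicode string, independent of any console.
print(type(parsed_text).__name__, "recovered intact:", parsed_text == original_text)
```

Once the bytes are decoded with the right charset, the in-memory text is correct; any garbling seen later has to come from the output step.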

The garbling occurs because printing the soup calls __str__, which defaults to UTF-8; when that output is sent to a GBK cmd console, it is displayed as garbled text.
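The mismatch can be reproduced with plain strings. This is a rough Python 3 sketch under the assumption that the console interprets every incoming byte as GBK; the sample text is illustrative data, not taken from the original output.

```python
text = "定义表格列的组。"          # Unicode text held inside the parsed tree (sample data)

utf8_bytes = text.encode("utf-8")   # what a UTF-8 default rendering would emit
gbk_bytes = text.encode("gbk")      # what a GBK console expects to receive

# A GBK console interprets whatever bytes arrive as GBK.
# errors="replace" keeps the sketch from raising on byte pairs that are invalid in GBK.
shown_wrong = utf8_bytes.decode("gbk", errors="replace")
shown_right = gbk_bytes.decode("gbk")

print("UTF-8 bytes read as GBK match the text:", shown_wrong == text)
print("GBK bytes read as GBK match the text:  ", shown_right == text)
```

The text itself never changes; only the bytes chosen for output and the encoding the console assumes decide whether it is readable.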

To summarize, you need to understand:

How the encodings themselves work: what GBK, UTF-8, and Unicode are
How BeautifulSoup works: with fromEncoding it can correctly parse HTML into Unicode
How printing an object works: it internally calls the object's __str__ to obtain the corresponding string; here, that is the soup's __str__
How the soup's __str__ works: the default encoding is UTF-8
How cmd works: (on a Chinese-language system) it uses GBK

Then we can understand the root cause of the problem.
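The chain above points to the fix used in the article's code (passing 'GBK' to __str__): make the output encoding match the console's. A minimal Python 3 sketch of that idea follows. The helper `encode_for_console` is a hypothetical name, and the fallback to `sys.stdout.encoding` is an assumption about how one might detect the console encoding, not something the original article does.

```python
import sys

def encode_for_console(text, console_encoding=None):
    """Encode text with the console's encoding (hypothetical helper).

    Falls back to sys.stdout.encoding, then UTF-8, when none is given.
    Characters the target encoding cannot represent become '?'.
    """
    encoding = console_encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace")

sample = "定义表格列的组。"  # sample data, assumed for illustration

for_gbk_console = encode_for_console(sample, "gbk")
for_utf8_console = encode_for_console(sample, "utf-8")

# Each byte string round-trips only through the encoding it was produced for.
print(len(for_gbk_console), "bytes for GBK,", len(for_utf8_console), "bytes for UTF-8")
```

In the Python 2 code above, the equivalent choice is made by hand with eachTdSoup.__str__('GBK'), because cmd on a Chinese-language system uses GBK.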

This article is an English version of an article originally published in Chinese on aliyun.com.
