Python characters are garbled (gb2312, gbk, gb18030 related issues)


Reposted from blogger Crifan: http://againinput4.blog.163.com/blog/static/1727994912011111011432810/

While playing with Blogmover, a WordPress blog-migration tool that includes a few Python scripts, I used its 163-blog-mover.py script, which migrates 163 (NetEase) blogs by crawling the blog posts and exporting them to XML.

But the tool is, as of now (2011-12-10), broken. After a little modification, it can at least fetch a post's title.

Usage is the same as the original:

163-blog-mover.py -f http://againinput4.blog.163.com/blog/static/172799491201111893853485/

Fetching the title of this post:

"Resolved" allows hi-baidu-mover_v2.py error: unboundlocalerror:local variable ' linknode ' referenced before assignment

The title contains Chinese, and after printing it to the log I found that the Chinese part displayed as garbled text:

????·?????????? Íhi-baidu-mover_v2.py??? Í?? Unboundlocalerror:local variable ' linknode ' referenced before assignment

So I wanted to get rid of the garbled output and display the Chinese correctly.

"Resolution Process"

1. The text itself was crawled from the web page, and the page also declares the Chinese encoding of the post:

<meta http-equiv="Content-Type" content="text/html; charset=gbk"/>

It looks as if it is GBK.
In Python, the default encoding turned out to be ASCII, so I tried changing it with:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

to set the default encoding to utf-8, but the output was still garbled.
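Looking back, changing the default encoding could not have helped: the bytes still have to be decoded with the codec they were actually written in. A minimal illustration (my own sample string, not from the original script):

# -*- coding: utf-8 -*-
gbk_bytes = u'中文'.encode('gbk')   # simulate bytes fetched from a GBK page

print repr(gbk_bytes.decode('gbk'))               # u'\u4e2d\u6587' -- correct codec
print repr(gbk_bytes.decode('utf-8', 'replace'))  # replacement characters -- wrong
                                                  # codec, regardless of the
                                                  # interpreter's default encoding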

2. Later I dug through a bunch of posts on the Internet, including this one:

http://www.tangc.com.cn/view_article_117.html

which suggested decoding first, then encoding:

string.decode('gbk').encode('utf-8')

The result was still garbled:

Long timestamp lu distinction Lu Mang timestamp long donkey row kinase row riveting hi-baidu-mover_v2.py Lu Lu riveting LU unboundlocalerror:local variable ' linknode ' referenced before assignment

Then I tried other variants, such as:

temp.string.strip().decode('gbk').encode('mbcs')

The output was also garbled, identical to what it was before the conversion.

In short, the garbled-text problem remained unsolved.

3. Later, while learning Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/documentation.zh.html#contents

I read that Beautiful Soup tries the following encodings, in order of priority, to convert your document to Unicode:

    • An encoding you pass to the soup constructor via the fromEncoding argument.
    • An encoding found in the document itself: for instance, an XML declaration or, for HTML documents, an http-equiv meta tag. If Beautiful Soup finds an encoding inside the document, it tries converting the document with it; however, if you explicitly specified an encoding and that encoding worked, it ignores any encoding it finds in the document.
    • An encoding sniffed from the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
    • An encoding sniffed by the chardet library, if you have it installed (see the sketch after this list).
    • UTF-8.
    • Windows-1252.
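As a side note on the chardet step: if the library is installed, you can ask it directly what it would guess for the raw page bytes (a minimal sketch; page stands for the raw bytes of the fetched HTML, as in the snippets below):

import chardet

guess = chardet.detect(page)  # page: the raw bytes of the fetched HTML
print guess                   # e.g. {'encoding': 'GB2312', 'confidence': 0.99}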

From this I learned that, by default, Beautiful Soup can already parse the charset from the http-equiv meta tag in the HTML you hand it, so it should already know that the page encoding is the GBK shown above, decode the page content with it, and automatically give back the content as Unicode.

So, in Python, running:

print "###originalEncoding=", soup.originalEncoding, "declaredHTMLEncoding=", soup.declaredHTMLEncoding, "fromEncoding=", soup.fromEncoding

outputs:

###originalEncoding= windows-1252 declaredHTMLEncoding= gbk fromEncoding= None

But something is still very strange: the HTML of the page declares GBK, yet the originalEncoding that Beautiful Soup works out is windows-1252; the two do not match. (I tried another Baidu page, where declaredHTMLEncoding was gbk and originalEncoding matched it.)

So I manually tried passing the GBK parameter to Beautiful Soup when opening the URL:

page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GBK")

It still didn't work.

On this issue, I later found an explanation on the Web:
Beautiful Soup gb2312 garbled problem

http://groups.google.com/group/python-cn/browse_thread/thread/cb418ce811563524

Note that gb2312 is not really "gb2312"; replace gb2312 with GB18030.

Microsoft maps both gb2312 and GBK to GB18030, which is convenient for some people and confusing for others.
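This superset relationship is easy to verify with Python's own codecs. A small check of my own (using 镕, U+9555, a character famously present in GBK/GB18030 but missing from the original GB2312 table):

# -*- coding: utf-8 -*-
ch = u'\u9555'  # 镕: in GBK and GB18030, absent from GB2312

print repr(ch.encode('gb18030'))  # works
print repr(ch.encode('gbk'))      # works
try:
    ch.encode('gb2312')           # the strict gb2312 codec rejects it
except UnicodeEncodeError, e:
    print 'gb2312 cannot encode it:', e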

That is, the page is actually GB18030-encoded, so I followed the method described here:

Solved this morning: garbled text when parsing a web page

http://blog.csdn.net/fanfan19881119/article/details/6789366

(Original source: http://leeon.me/a/beautifulsoup-chinese-page-resolve)

and passed GB18030 as fromEncoding, which finally worked:

page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GB18030")

It also explains why HTML that nominally declares GBK ends up being detected as windows-1252:

Recently I needed to write an RSS-crawling parser in Python using feedparser. But for Baidu News RSS, whose declared encoding is gb2312, feedparser detected the encoding as windows-1252, and the Chinese content came out as a pile of garbled characters.

The problem is not that feedparser cannot recognize the gb2312 encoding, but that people tend to treat gb2312 and GBK as the same encoding: some content already uses characters that only exist in GBK while still claiming to be gb2312-encoded. feedparser strictly enforces the gb2312 character-set range for gb2312-declared content, and when it detects a character beyond that range, it falls back to windows-1252. Because Baidu's RSS is actually GBK-encoded and contains characters beyond the gb2312 range, feedparser fell back to windows-1252, producing the garbled Chinese.
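The fallback behavior described in that quote can be sketched roughly as follows (my own illustration of the logic, not feedparser's actual code):

def decode_with_fallback(data, declared_encoding):
    # Try the encoding the page declares; if any byte sequence falls
    # outside it, fall back to windows-1252, as described above.
    try:
        return data.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        return data.decode('windows-1252', 'replace')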

Summary

From this experience, when parsing Chinese web pages with Python's Beautiful Soup:

1. If the page's declared encoding matches the characters it actually uses, i.e. no character falls outside the declared encoding (for example, a page labeled GBK whose content really is all GBK, with no characters from a larger set such as GB18030), then it is enough to write:

page = urllib2.urlopen(url)
soup = BeautifulSoup(page)  # no parameter needed here; BeautifulSoup can work out the encoding of the page content on its own
print soup.originalEncoding

and you get the page's real encoding.

2. Before tackling garbled Chinese characters, a word about the encodings themselves:

In order of development they are GB2312, GBK, GB18030, and each later encoding is backward-compatible with the earlier ones.

In terms of how many Chinese characters they cover: GB2312 < GBK < GB18030. That is why the situation above can arise: a page nominally GB2312 uses some characters that exist only in GBK, or a page nominally GBK uses some characters that exist only in GB18030, so another tool's decoder fails on them and falls back to the most basic encoding, windows-1252.

My case here was the latter: the NetEase blog page claims to be GBK but uses many Chinese characters that exist only in GB18030.

In general, Chinese pages come out garbled because the page author's declared encoding differs from the encoding actually used in the page. Once you understand the relationship between these Chinese encodings and the history behind them, this kind of problem is straightforward to solve.
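Given that history, one defensive habit (my own sketch following the reasoning above, not code from the original tool) is to upgrade any declared Chinese encoding to its superset GB18030 before decoding:

def decode_chinese_page(data, declared_encoding=None):
    # GB18030 is a superset of GB2312 and GBK, so it decodes pages
    # that under-declare their encoding without raising errors.
    enc = (declared_encoding or 'gb18030').lower()
    if enc in ('gb2312', 'gbk'):
        enc = 'gb18030'
    return data.decode(enc, 'replace')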

For a more detailed explanation of Chinese character encodings, see:

Chinese character encoding standards + Unicode + code pages

http://bbs.chinaunix.net/thread-3610023-1-1.html

So:

(1) If the page is labeled GB2312 but some characters actually use GBK, the solution is to pass fromEncoding="GBK" when calling BeautifulSoup:

page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GBK")

(2) If the page is labeled GBK but some characters actually use GB18030, the solution is to pass fromEncoding="GB18030" when calling BeautifulSoup:

page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GB18030")

(3) In fact, since GB18030's character repertoire covers both GB2312 and GBK, both of the above cases, i.e. whenever garbled Chinese appears, whether a nominal GB2312 page uses GBK characters or a nominal GBK page uses GB18030 characters, can be handled by passing GB18030 directly:

soup = BeautifulSoup(page, fromEncoding="GB18030")

and it will work.
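Putting it all together, a complete sketch of the final recipe (Python 2 with BeautifulSoup 3; the URL is the example post from the beginning of this article):

# -*- coding: utf-8 -*-
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

url = 'http://againinput4.blog.163.com/blog/static/172799491201111893853485/'
page = urllib2.build_opener().open(url).read()
soup = BeautifulSoup(page, fromEncoding='GB18030')

titleNode = soup.find('title')
if titleNode:
    print titleNode.string  # the Chinese title should now display correctly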
