Reposted from blogger Crifan: http://againinput4.blog.163.com/blog/static/1727994912011111011432810/
While playing with Blogmover, a WordPress blog-migration tool that contains a few Python scripts, including 163-blog-mover.py for migrating 163 (NetEase) blogs, I used it to crawl NetEase blog posts and export them as XML.
However, the tool no longer works as of now (2011-12-10). After a small modification, it can at least fetch a post's title.
Usage is the same as the original:
163-blog-mover.py -f http://againinput4.blog.163.com/blog/static/172799491201111893853485/
which fetches the title of this post:
"Resolved" allows hi-baidu-mover_v2.py error: unboundlocalerror:local variable ' linknode ' referenced before assignment
The title contains Chinese characters. After printing it to the log, the Chinese part displayed as garbage:
????·?????????? Íhi-baidu-mover_v2.py??? Í?? Unboundlocalerror:local variable ' linknode ' referenced before assignment
So I wanted to get rid of the garbage and display the Chinese correctly.
"Resolution Process"
1. The text itself is crawled from the web page, and the page's declared encoding for the Chinese text also comes from the page itself:
<meta http-equiv="Content-Type" content="text/html;charset=gbk"/>
So it looks as if the page is GBK.
Python's default encoding turned out to be ASCII, so I first tried setting the default encoding to UTF-8:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but the resulting output was still garbled.
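As a side note, a minimal sketch of what this workaround does, assuming Python 2 (site.py deletes setdefaultencoding at startup, which is why the reload is needed):

import sys
print sys.getdefaultencoding()    # typically 'ascii' on Python 2
reload(sys)                       # restores the deleted setdefaultencoding
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()    # now 'utf-8'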
2. Later I searched online and found a bunch of posts, among them this one:
http://www.tangc.com.cn/view_article_117.html
which suggested decoding first, then encoding:
string.decode('gbk').encode('utf-8')
The result is still garbled:
(garbled text) hi-baidu-mover_v2.py (garbled text) UnboundLocalError: local variable 'linkNode' referenced before assignment
I then tried other variants, such as:
temp.string.strip().decode('gbk').encode('mbcs')
but the output was also garbled, identical to the text before the conversion.
In short, I still could not solve the garbled-text problem.
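In hindsight the failure is easy to demonstrate: bytes encoded as GB18030 may simply not be decodable as GBK at all. A minimal sketch, assuming Python 2 and a sample character chosen purely for illustration:

# A character from CJK Extension B: representable in GB18030 (as a
# 4-byte sequence) but not in GBK, so decoding its bytes as GBK fails.
data = u'\U00020000'.encode('gb18030')
try:
    data.decode('gbk')
except UnicodeDecodeError as e:
    print "decoding as gbk failed:", e
print repr(data.decode('gb18030'))    # round-trips correctly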
3. Later, while learning Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/documentation.zh.html#contents
Beautiful Soup tries the following encodings, in order of priority, to convert your document to Unicode:
- An encoding type you pass to the soup constructor via the fromEncoding parameter.
- An encoding type found in the document itself: for example, in an XML declaration or an http-equiv meta tag for HTML documents. If Beautiful Soup finds an encoding type in the document, it tries converting the document with it. However, if you explicitly specified an encoding type and that encoding was used successfully, it ignores any encoding types it finds in the document.
- An encoding determined by sniffing the data at the beginning of the file. If an encoding can be detected at this stage, it will be one of: the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have that library installed.
- UTF-8
- Windows-1252
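In Beautiful Soup 3 this detection chain is implemented by the UnicodeDammit class, which can also be used on its own; a minimal sketch under that assumption (the file name is hypothetical):

from BeautifulSoup import UnicodeDammit

raw = open('page.html').read()    # raw bytes, e.g. as fetched with urllib2
dammit = UnicodeDammit(raw)
print dammit.originalEncoding     # the encoding it settled on
text = dammit.unicode             # the document converted to Unicode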
From this I learned that Beautiful Soup can, by default, parse the charset from the meta http-equiv tag in the HTML source of the link you give it, so it already knows the page encoding is GBK as shown above, decodes the page content with it automatically, and hands the result back as Unicode.
So in Python, running:
print "###originalEncoding =", soup.originalEncoding, "declaredHTMLEncoding =", soup.declaredHTMLEncoding, "fromEncoding =", soup.fromEncoding
already outputs:
###originalEncoding = windows-1252 declaredHTMLEncoding = gbk fromEncoding = None
But this was very strange: the original page's HTML declaration is GBK, yet the originalEncoding that Beautiful Soup worked out is windows-1252, so the two do not match. (I tried another Baidu page, where declaredHTMLEncoding was gbk and originalEncoding was the same as declaredHTMLEncoding.)
So I manually tried passing the GBK parameter to Beautiful Soup when opening the URL, namely:
page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GBK")
but the result still did not work.
On this issue, I later found an explanation on the web:
Beautiful Soup gb2312 garbled-text problem
http://groups.google.com/group/python-cn/browse_thread/thread/cb418ce811563524
Please note that gb2312 is usually not really "gb2312": wherever gb2312 appears, use GB18030 instead.
Microsoft maps both gb2312 and GBK to GB18030, which helps some people and confuses others.
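That mapping is safe in one direction, because GB18030 is a byte-compatible superset of GBK (and of GB2312); a quick check using only Python's built-in codecs:

# Any valid GBK byte sequence decodes to the same text under GB18030,
# so decoding with GB18030 is never worse than decoding with GBK.
s = u'\u4e2d\u6587'.encode('gbk')    # the two characters meaning "Chinese"
assert s.decode('gbk') == s.decode('gb18030')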
In other words, the page is actually GB18030-encoded, so I followed the method from here:
Solving garbled text when parsing a web page
http://blog.csdn.net/fanfan19881119/article/details/6789366
(original source: http://leeon.me/a/beautifulsoup-chinese-page-resolve)
and passed GB18030 as fromEncoding, after which it finally worked:
page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GB18030")
That post also explains why HTML whose nominal encoding is GBK ends up resolved as windows-1252:
I recently needed to write an RSS fetcher/parser in Python, using feedparser. But for the Baidu News RSS feed, whose declared encoding is gb2312, feedparser detected the encoding as windows-1252, and the Chinese content came out as a pile of garbage.
The problem is not that feedparser cannot recognize gb2312; rather, people tend to equate gb2312 with GBK, and some feeds already use characters that only exist in GBK while still claiming their content is gb2312-encoded. feedparser strictly follows the gb2312 character-set range when handling gb2312, and once it detects a character beyond that range it rolls the encoding back to windows-1252. Since Baidu's RSS actually uses GBK, which contains characters beyond the gb2312 range, feedparser decided to fall back to windows-1252, hence the garbled Chinese.
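The rollback behavior described above can be mimicked with Python's own codecs, which are equally strict about the gb2312 range; a hedged sketch of the idea, not feedparser's actual code:

def decode_with_fallback(data, declared_encoding):
    # Try the declared encoding; fall back to windows-1252 when the
    # bytes contain characters outside that encoding's range. This is
    # the step that produces mojibake for pages whose declared encoding
    # understates what they actually use.
    try:
        return data.decode(declared_encoding)
    except UnicodeDecodeError:
        return data.decode('windows-1252', 'replace')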
[Summary]
When using Python's Beautiful Soup to parse Chinese web pages:
1. If the page's nominal encoding matches what its characters actually use, i.e., no character falls outside the nominal encoding (for example, a page labeled GBK whose content really is all GBK-encoded, with no characters that exist only in GB18030), then:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)   # no encoding parameter needed here; Beautiful Soup can fully parse the encoding used by the page content on its own
print soup.originalEncoding
is enough, and it reports the real page encoding.
2. Before discussing the garbled-Chinese problem, the encodings themselves need a word of explanation:
In order of development they are GB2312, GBK, GB18030, and each newer encoding is backward-compatible with the older ones.
In terms of the number of Chinese characters covered, GB2312 < GBK < GB18030. That is how the situation above can arise: a page nominally GB2312 uses some characters that exist only in GBK, or a page nominally GBK uses some characters that exist only in GB18030, so another tool's decoder fails on those characters and falls back to the most basic windows-1252.
My case here is the latter: the NetEase blog page claims to be GBK, but many of its Chinese characters are actually encoded as GB18030.
In general, the reality is that the people who write web pages understand character encodings to different degrees, so the encoding a page declares can differ from the encoding it actually uses, and that is what makes Chinese pages come out garbled. Once you understand the relationships among the Chinese encodings and their history, this kind of problem becomes solvable.
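To diagnose such a mismatch yourself, one option is to compare the page's declared charset against what the chardet library (mentioned in the Beautiful Soup docs above) detects; a rough sketch, with url assumed to be defined already:

import re
import urllib2
import chardet    # optional third-party library

page = urllib2.urlopen(url).read()
match = re.search(r'charset\s*=\s*["\']?([\w-]+)', page, re.I)
declared = match.group(1) if match else None
detected = chardet.detect(page)    # returns {'encoding': ..., 'confidence': ...}
print "declared:", declared
print "detected:", detected['encoding'], "confidence:", detected['confidence']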
For a more detailed explanation of Chinese character encodings, see:
Chinese character encoding standards + Unicode + code pages
http://bbs.chinaunix.net/thread-3610023-1-1.html
So:
(1) If the page is labeled GB2312 but some characters actually use GBK, the solution is to pass the parameter fromEncoding="GBK" when calling BeautifulSoup, namely:
page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GBK")
(2) If the page is labeled GBK but some characters actually use GB18030, the solution is to pass the parameter fromEncoding="GB18030", namely:
page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GB18030")
(3) In fact, since GB18030 covers both GB2312 and GBK in terms of characters, either of the above cases, i.e., whenever Chinese characters come out garbled, whether a nominal GB2312 page uses GBK characters or a nominal GBK page uses GB18030 characters, can be handled by passing GB18030 directly, namely:
soup = BeautifulSoup(page, fromEncoding="GB18030")
and that is enough.
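Putting this together, a hedged end-to-end sketch (Beautiful Soup 3 API; the windows-1252 check encodes the symptom observed in this post rather than a general rule, and the URL is hypothetical):

import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://example.com/some-chinese-page'    # hypothetical URL
page = urllib2.urlopen(url).read()

# First let Beautiful Soup detect the encoding on its own.
soup = BeautifulSoup(page)

# If detection fell back to windows-1252 while the page declares a
# Chinese charset, re-parse with GB18030, which covers GB2312 and GBK.
declared = (soup.declaredHTMLEncoding or '').lower()
if soup.originalEncoding == 'windows-1252' and declared in ('gb2312', 'gbk'):
    soup = BeautifulSoup(page, fromEncoding="GB18030")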