Python BeautifulSoup: solving garbled Chinese characters

Source: Internet
Author: User

When I first scraped a webpage with BeautifulSoup I ran into garbled Chinese characters, so I searched the Internet for fixes and am recording them here to see which ones actually work.

 

1. http://leeon.me/a/beautifulsoup-chinese-page-resolve

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.leeon.me')
soup = BeautifulSoup(page, fromEncoding="gb18030")
print soup.originalEncoding
print soup.prettify()

If the Chinese page is encoded as gb2312 or GBK, you can solve the garbled-character problem by passing the fromEncoding="gb18030" parameter to the BeautifulSoup constructor. The original article even claims that parsing a utf-8 page with gb18030 will not produce garbled text (but see section 4 below for the limits of this approach).

This method works after testing, but note that under Python 3 with BeautifulSoup 4 the parameter must be written as from_encoding="gb18030" (lowercase, with an underscore).
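The reason a single gb18030 setting covers both cases: gb18030 is a superset of gb2312 and GBK, so bytes produced by either encoding decode correctly as gb18030. A quick Python 3 check of that claim:

```python
# gb18030 is a superset of gb2312 and GBK: bytes served by a page in
# either encoding decode unchanged under gb18030
for codec in ("gb2312", "gbk"):
    raw = "汉字编码".encode(codec)
    assert raw.decode("gb18030") == "汉字编码"
print("gb18030 round-trips gb2312 and gbk bytes")
```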

2. http://hi.baidu.com/mengjingchao11/item/604b75e5a426fa2e6cabb856

First, import the urllib2 package and use urllib2.urlopen to open the target webpage:

page = urllib2.urlopen(url)  # url is the address of the target page

Next, read the webpage's encoding from its Content-Type header:

charset = page.headers['content-type'].split('charset=')[1].lower()

Then have BeautifulSoup read the page content using the encoding specified by charset:

soup = BeautifulSoup(page.read(), fromEncoding=charset)
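The charset-extraction step can be sketched against a sample header value (the header string below is a made-up example; real pages may omit the charset clause entirely, which this bare split would not handle):

```python
# Example Content-Type header value, as page.headers might return it
content_type = "text/html; charset=GB2312"

# Same split-based extraction as above, lowercased for consistency
charset = content_type.split("charset=")[1].lower()
print(charset)  # gb2312
```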

3. http://hi.baidu.com/dskjfksfj/item/bc658fd1646fef362b35c79b

In the past two days I used Python to crawl product information from Dangdang pages and BeautifulSoup to parse them. During parsing, however, some of the text on the page came out fine while the rest was garbled.

After a long search online I gained a basic understanding of Python's encoding model and of converting strings between encodings. Unfortunately, none of the many methods I tried solved the problem. Finally, thinking carefully about how encoding conversion between texts actually works, I solved it as follows.

Dangdang's pages are encoded as simplified-Chinese gb2312 (the page source contains <meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>), while Python's internal string representation is Unicode. The original code was:

contentAll = urllib.urlopen(urlLink).read()
soup = BeautifulSoup.BeautifulSoup(contentAll)  # generate a BeautifulSoup object

urlopen().read() returns the raw HTML bytes, still in the page's own encoding, gb2312. Passing those bytes straight to the BeautifulSoup constructor leaves BeautifulSoup to guess the encoding; when it guesses wrong, the text is decoded incorrectly, and garbled characters show up later whenever the content is displayed or written out.
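The effect can be reproduced in a short sketch: decoding gb2312 bytes with the wrong codec yields mojibake, while the correct codec recovers the text (Python 3 syntax; latin-1 stands in here for an arbitrary wrong guess):

```python
# Raw bytes as they would come off the wire from a gb2312-encoded page
raw = "当当网".encode("gb2312")

wrong = raw.decode("latin-1")   # wrong guess: mojibake
right = raw.decode("gb2312")    # correct codec: original text recovered

print(wrong)   # unreadable garbage
print(right)   # 当当网
```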

Therefore, before handing contentAll to BeautifulSoup, decode it to Unicode using the gb2312 codec. The modified code is as follows:

contentAll = urllib.urlopen(urlLink).read()
soup = BeautifulSoup.BeautifulSoup(contentAll.decode('gb2312', 'ignore'))  # generate a BeautifulSoup object

The 'ignore' argument is passed to decode because some byte sequences in the page apparently cannot be decoded as gb2312; with 'ignore', those bytes are simply skipped.
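A minimal sketch of that behavior (Python 3 syntax; the stray 0xFF byte is an invented example of data gb2312 cannot decode):

```python
# Valid gb2312 text followed by one byte that is not a valid gb2312 lead byte
data = "中文".encode("gb2312") + b"\xff"

# data.decode("gb2312") alone would raise UnicodeDecodeError;
# the "ignore" error handler silently drops the offending byte instead
text = data.decode("gb2312", "ignore")
print(text)  # 中文
```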

Later, the scraped information also needs to be written to a local text file. Because the strings held in memory are Unicode, they only become readable after being encoded back to gb2312 (with encode), so the file-writing function encodes to gb2312 before writing to the text file. As follows:

def writeFile(data, filePath):
    outFile = open(filePath, 'ab')
    data = data.encode('gb2312', 'ignore')
    outFile.write(data)
    outFile.close()

4. http://www.coder4.com/archives/3621

In fact, fromEncoding="gb18030" is not a cure-all: a Chinese page served as (or mislabeled as) iso-8859-1 will still come out garbled.

The root cause of BS's garbled output is that its internal encoding-guessing mechanism is imperfect.

Therefore, the most fundamental solution is to use an automatic encoding-detection tool, such as chardet, to determine the page's real encoding, and pass the detected encoding to BS's fromEncoding constructor parameter.
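chardet is the robust choice here (its usual call is chardet.detect(raw_bytes)['encoding']). As a rough stdlib-only illustration of the idea, one can try a list of candidate codecs in order; note this is a much cruder heuristic than chardet's statistical detection, and the candidate list is my own example:

```python
def guess_decode(raw, candidates=("utf-8", "gb18030", "big5", "latin-1")):
    # Try each candidate codec in turn and return (text, codec) for the
    # first one that decodes without error. latin-1 accepts any byte
    # sequence, so with it last the loop always returns.
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
```

A caveat that shows why real detectors exist: short gb2312 byte strings can also happen to be valid utf-8, so a try-in-order chain can pick the wrong codec, whereas chardet scores candidates statistically.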

 
