Category: Python/Ruby
I've recently started using Python to crawl data, fetching pages with the built-in urllib and the third-party requests library, and parsing the HTML with BeautifulSoup and lxml.
A word on lxml: it is a Python library for parsing HTML and XML, and it uses XPath to locate elements and extract information quickly and easily. Now, down to business.
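To show the idea, here is a minimal sketch of locating an element with lxml and XPath; the HTML fragment and the class name are made up purely for illustration.

from lxml import html

# A made-up HTML fragment, just to demonstrate an XPath lookup
page = "<html><body><h1 class='title'>Hello</h1></body></html>"
tree = html.fromstring(page)
print(tree.xpath("//h1[@class='title']/text()"))  # ['Hello']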
1. The problem of garbled Chinese characters
1.1 Simple Start
Fetching a site's content with requests is very convenient; a simple snippet needs only two or three lines of code.
import requests
from lxml import html

url = 'http://www.pythonscraping.com/'
req = requests.get(url)
print(req.text)
tree = html.fromstring(req.text)
print(tree.xpath("//h1[@class='title']/text()"))
In the snippet above, the real work is done by the requests.get, html.fromstring, and tree.xpath calls; the imports for requests and lxml.html are included at the top. Because http://www.pythonscraping.com/ is an English-language site, no garbled Chinese appears.
1.2 The beginning of the trouble
The idea was to write some basic modules that would be easy to call later and cut down on repetitive work. To make sure the code would hold up in every case, I pointed the same code at a Chinese site to extract its text.
Modify two lines of the code above:
url = 'http://sports.sina.com.cn/g/premierleague/index.shtml'
print(tree.xpath("//span[@class='sec_blk_title']/text()"))
Running the program shows that the output of print(req.text) already contains garbled Chinese. The final result comes out as ['?????? ȧ\x86é?\x91 ', '?? \x80?\x9c\x9f?\x9b\x9eé?? '].
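This is classic mojibake: UTF-8 bytes decoded with the wrong codec. A minimal sketch (the sample string is my own) reproduces the effect:

# UTF-8 bytes decoded as ISO-8859-1 come out garbled, as in the output above
text = '中文'                    # a sample Chinese string
raw = text.encode('utf-8')      # the bytes the server actually sends
print(raw.decode('iso-8859-1')) # prints 'ä¸\xadæ\x96\x87' (garbled)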
2. Solving the garbled characters
2.1 Trial and Error
An earlier crawl of a CSDN page had produced no garbled text, while the Sina Sports site did, so at the time I assumed it was not an encoding problem but a document-compression problem: the headers returned for the CSDN page had no Content-Encoding attribute, while those returned by Sina Sports did ("Content-Encoding: gzip").
I looked at several related solutions on the web:
1. http://stackoverflow.com/questions/3122145/zlib-error-error-3-while-decompressing-incorrect-header-check
2. http://blog.csdn.net/pxf1234567/article/details/42006697
3. http://blog.csdn.net/bytxl/article/details/21278249
Summary: following the references above still did not solve the problem, which made me suspect the whole direction was wrong. That said, the work was not wasted: many sites do return compressed data, and it will come in handy later; see the sketch below.
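For reference, a minimal sketch of decompressing a gzip response body by hand. Note that requests already undoes gzip transparently, so this mainly matters when using urllib; the URL is just the one from above, and the check only fires if the server actually compressed the body.

import gzip
import urllib.request

url = 'http://sports.sina.com.cn/g/premierleague/index.shtml'
resp = urllib.request.urlopen(url)
raw = resp.read()
# urllib does not undo Content-Encoding for us, so check and decompress
if resp.headers.get('Content-Encoding') == 'gzip':
    raw = gzip.decompress(raw)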
2.2 The ultimate fix for the garbled characters
Later, in the Response Content section of the official documentation, I found the explanation: requests automatically decodes content from the server, making an educated guess about the encoding from the response's HTTP headers when they carry a charset declaration. The documentation also says that if you create your own encoding and register it with the codecs module, you can simply use that codec's name as the value of r.encoding and requests will handle the decoding for you. (I did not use the codecs module myself, but according to the official docs it is the simplest route; a hedged sketch follows.)
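A minimal sketch of that codecs route; the codec name 'my-utf8' is made up here and simply resolves to the built-in UTF-8 codec.

import codecs
import requests

# Register a search function so the made-up name 'my-utf8' resolves to UTF-8
def _search(name):
    if name == 'my-utf8':
        return codecs.lookup('utf-8')
    return None

codecs.register(_search)

req = requests.get('http://sports.sina.com.cn/g/premierleague/index.shtml')
req.encoding = 'my-utf8'  # requests now decodes req.text via our codec name
print(req.text)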
Another fragment of the official documentation spells out how the Response encoding is handled:
Requests follows RFC 2616, under which the default encoding is ISO-8859-1.
Requests will not guess the encoding when no explicit charset is present in the HTTP headers and the Content-Type header contains text; in that case it falls back to ISO-8859-1.
Now for the experimental results. Add the following snippet to the original code:
print(req.headers['content-type'])
print(req.encoding)
print(req.apparent_encoding)
print(requests.utils.get_encodings_from_content(req.text))
The output is:
text/html       # Content-Type header, with no charset declared
ISO-8859-1      # req.encoding, derived from the response headers
utf-8           # req.apparent_encoding, detected from the response content
['utf-8']       # the charset declared in the head of the returned HTML
req.encoding comes back as 'ISO-8859-1', hence the garbled output, while the content is actually encoded as 'utf-8'.
Summary: when the response encoding is 'ISO-8859-1', first look for a charset declared inside the returned HTML; if none is present, fall back to the encoding detected from the content, as follows:
if req.encoding == 'ISO-8859-1':
    encodings = requests.utils.get_encodings_from_content(req.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = req.apparent_encoding
    # re-decode the raw bytes with the right encoding, then normalize to UTF-8
    encode_content = req.content.decode(encoding, 'replace').encode('utf-8', 'replace')
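Putting it all together, here is a hedged sketch of a small helper that applies the whole strategy; the name fetch_text is my own, and it assumes a requests version that still ships get_encodings_from_content.

import requests

def fetch_text(url):
    # Fetch a page and return its text, correcting the ISO-8859-1 fallback
    req = requests.get(url)
    if req.encoding == 'ISO-8859-1':
        encodings = requests.utils.get_encodings_from_content(req.text)
        encoding = encodings[0] if encodings else req.apparent_encoding
        return req.content.decode(encoding, 'replace')
    return req.text

print(fetch_text('http://sports.sina.com.cn/g/premierleague/index.shtml'))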
Resources:
1. http://blog.csdn.net/a491057947/article/details/47292923
2. http://docs.python-requests.org/en/latest/user/quickstart/#response-content