Category: Python/Ruby
I've recently started using Python to crawl data, fetching pages with the built-in urllib and the third-party requests library, and parsing the HTML with BeautifulSoup and lxml.
A word on lxml: it is a Python library for parsing HTML and XML, and it uses XPath to locate elements and extract information quickly and easily. Now, down to business.
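To show the idea, here is a minimal sketch of locating an element with lxml and XPath; the HTML fragment and the class name are made up purely for illustration.

from lxml import html

# A made-up HTML fragment, just to demonstrate an XPath lookup
page = "<html><body><h1 class='title'>Hello</h1></body></html>"
tree = html.fromstring(page)
print(tree.xpath("//h1[@class='title']/text()"))  # ['Hello']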
1. The problem of garbled Chinese characters
1.1 Simple Start
Fetching a site's content with requests is very convenient; a simple snippet needs only two or three lines of code.
import requests
from lxml import html

url = 'http://www.pythonscraping.com/'
req = requests.get(url)
print(req.text)
tree = html.fromstring(req.text)
print(tree.xpath("//h1[@class='title']/text()"))
In the snippet above, the real work is done by the requests.get, html.fromstring, and tree.xpath calls; the imports for requests and lxml.html are included at the top. Because http://www.pythonscraping.com/ is an English-language site, no garbled Chinese appears.
1.2 The beginning of the trouble
The idea was to write some basic modules that would be easy to call later and cut down on repetitive work. To make sure the code would hold up in every case, I pointed the same code at a Chinese site to extract its text.
Modify two lines of the code above:
url = 'http://sports.sina.com.cn/g/premierleague/index.shtml'
print(tree.xpath("//span[@class='sec_blk_title']/text()"))
Running the program shows that the output of print(req.text) already contains garbled Chinese. The final result comes out as ['?????? ȧ\x86é?\x91 ', '?? \x80?\x9c\x9f?\x9b\x9eé?? '].
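This is classic mojibake: UTF-8 bytes decoded with the wrong codec. A minimal sketch (the sample string is my own) reproduces the effect:

# UTF-8 bytes decoded as ISO-8859-1 come out garbled, as in the output above
text = '中文'                    # a sample Chinese string
raw = text.encode('utf-8')      # the bytes the server actually sends
print(raw.decode('iso-8859-1')) # prints 'ä¸\xadæ\x96\x87' (garbled)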
2. Solving the garbled characters
2.1 Trial and Error
An earlier crawl of a CSDN page had produced no garbled text, while the Sina Sports site did, so at the time I assumed it was not an encoding problem but a document-compression problem: the headers returned for the CSDN page had no Content-Encoding attribute, while those returned by Sina Sports did ("Content-Encoding: gzip").
I looked at several related solutions on the web:
1. http://stackoverflow.com/questions/3122145/zlib-error-error-3-while-decompressing-incorrect-header-check
2. http://blog.csdn.net/pxf1234567/article/details/42006697
3. http://blog.csdn.net/bytxl/article/details/21278249
Summary: following the references above still did not solve the problem, which made me suspect the whole direction was wrong. That said, the work was not wasted: many sites do return compressed data, and it will come in handy later; see the sketch below.
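For reference, a minimal sketch of decompressing a gzip response body by hand. Note that requests already undoes gzip transparently, so this mainly matters when using urllib; the URL is just the one from above, and the check only fires if the server actually compressed the body.

import gzip
import urllib.request

url = 'http://sports.sina.com.cn/g/premierleague/index.shtml'
resp = urllib.request.urlopen(url)
raw = resp.read()
# urllib does not undo Content-Encoding for us, so check and decompress
if resp.headers.get('Content-Encoding') == 'gzip':
    raw = gzip.decompress(raw)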
2.2 The ultimate fix for the garbled characters
Later, in the Response Content section of the official documentation, I found the explanation: requests automatically decodes content from the server, making an educated guess about the encoding from the response's HTTP headers when they carry a charset declaration. The documentation also says that if you create your own encoding and register it with the codecs module, you can simply use that codec's name as the value of r.encoding and requests will handle the decoding for you. (I did not use the codecs module myself, but according to the official docs it is the simplest route; a hedged sketch follows.)
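A minimal sketch of that codecs route; the codec name 'my-utf8' is made up here and simply resolves to the built-in UTF-8 codec.

import codecs
import requests

# Register a search function so the made-up name 'my-utf8' resolves to UTF-8
def _search(name):
    if name == 'my-utf8':
        return codecs.lookup('utf-8')
    return None

codecs.register(_search)

req = requests.get('http://sports.sina.com.cn/g/premierleague/index.shtml')
req.encoding = 'my-utf8'  # requests now decodes req.text via our codec name
print(req.text)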
Another fragment of the official documentation spells out how the Response encoding is handled:
Requests follows RFC 2616, under which the default encoding is ISO-8859-1.
Requests will not guess the encoding when no explicit charset is present in the HTTP headers and the Content-Type header contains text; in that case it falls back to ISO-8859-1.
Now for the experimental results. Add the following snippet to the original code:
print(req.headers['content-type'])
print(req.encoding)
print(req.apparent_encoding)
print(requests.utils.get_encodings_from_content(req.text))
The output is:
text/html       # Content-Type header, with no charset declared
ISO-8859-1      # req.encoding, derived from the response headers
utf-8           # req.apparent_encoding, detected from the response content
['utf-8']       # the charset declared in the head of the returned HTML
req.encoding comes back as 'ISO-8859-1', hence the garbled output, while the content is actually encoded as 'utf-8'.
Summary: when the response encoding is 'ISO-8859-1', first look for a charset declared inside the returned HTML; if none is present, fall back to the encoding detected from the content, as follows:
if req.encoding == 'ISO-8859-1':
    encodings = requests.utils.get_encodings_from_content(req.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = req.apparent_encoding
    # re-decode the raw bytes with the right encoding, then normalize to UTF-8
    encode_content = req.content.decode(encoding, 'replace').encode('utf-8', 'replace')
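Putting it all together, here is a hedged sketch of a small helper that applies the whole strategy; the name fetch_text is my own, and it assumes a requests version that still ships get_encodings_from_content.

import requests

def fetch_text(url):
    # Fetch a page and return its text, correcting the ISO-8859-1 fallback
    req = requests.get(url)
    if req.encoding == 'ISO-8859-1':
        encodings = requests.utils.get_encodings_from_content(req.text)
        encoding = encodings[0] if encodings else req.apparent_encoding
        return req.content.decode(encoding, 'replace')
    return req.text

print(fetch_text('http://sports.sina.com.cn/g/premierleague/index.shtml'))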
Resources:
1. http://blog.csdn.net/a491057947/article/details/47292923
2. http://docs.python-requests.org/en/latest/user/quickstart/#response-content