1. Issue background
When crawling web data with the urllib2 module, you may set request headers that ask the server to compress its response, which reduces the amount of data transferred. The returned data is then gzip-compressed. Calling content.decode("utf8") on it directly raises an exception, and the actual encoding of the web page data cannot be detected.
2. Problem analysis
In HTTP, if the request header contains "Accept-Encoding": "gzip, deflate" and the web server supports it, the returned data is compressed. The benefit is reduced network traffic. The client is expected to decompress the data according to the response header (Content-Encoding) and only then decode it. The urllib2 module returns the HTTP response body as raw data that has not been decompressed, and this is the root cause of the garbled output.
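The failure described above can be reproduced without a network connection. The following is a minimal sketch: the page body is a hypothetical placeholder, and the gzip module stands in for a server that honors "Accept-Encoding": "gzip".

```python
import gzip
import io
import zlib

# Hypothetical page body, used only for this illustration.
original = u"<html><body>caf\u00e9 \u4e2d\u6587</body></html>".encode("utf-8")

# Simulate the compressed body a server would send back.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(original)
raw_body = buf.getvalue()

# Decoding the compressed bytes directly fails: a gzip stream starts with
# the bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte.
try:
    raw_body.decode("utf8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decompressing first (16 + MAX_WBITS selects the gzip container) and
# decoding afterwards recovers the text.
text = zlib.decompress(raw_body, 16 + zlib.MAX_WBITS).decode("utf8")
```

The order matters: decompression operates on bytes, and character decoding only makes sense on the decompressed bytes.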
3. Solution
3.1 Remove "Accept-Encoding": "gzip, deflate" from the request header
This is the fastest solution: the data received can be decoded directly. The disadvantage is that the amount of traffic transferred increases considerably.
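As a small sketch of this option, the header can be dropped from the dictionary before the request is built. The helper name without_accept_encoding and the sample headers are hypothetical and exist only for this illustration.

```python
def without_accept_encoding(headers):
    """Return a copy of headers with any Accept-Encoding entry removed.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    return dict(
        (name, value)
        for name, value in headers.items()
        if name.lower() != "accept-encoding"
    )


# Hypothetical request headers for illustration.
headers = {
    "Accept": "text/html",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}
plain_headers = without_accept_encoding(headers)
```

The original dictionary is left untouched, so the same headers can still be reused for requests where compression is wanted.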
3.2 Use the zlib module to decompress, then decode, to get readable plaintext data
This is the approach used in this article.
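A minimal sketch of this approach, assuming the response advertises its compression in the Content-Encoding header. The helper name decompress_body is hypothetical, and the handling of "deflate" (trying the zlib wrapper first, then raw deflate) is an assumption added here, not something the article's script does.

```python
import zlib


def decompress_body(body, content_encoding):
    """Decompress an HTTP body according to its Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            # zlib-wrapped deflate stream.
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send raw deflate without the zlib wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the body unchanged.
    return body
```

After decompression, the resulting bytes can be passed to an encoding detector or decoded directly, for example with decompress_body(content, encoding).decode("utf8") when the page is known to be UTF-8.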
4. Source code analysis
The code below is a typical example that simulates a form submission, sending the request data via POST. It is based on Python 2.7.
    #!/usr/bin/env python2.7
    import sys
    import zlib
    import chardet
    import urllib
    import urllib2
    import cookielib


    def main():
        reload(sys)
        sys.setdefaultencoding('utf-8')
        url = 'http://xxx.yyy.com/test'
        values = {
            "form_field1": "value1",
            "form_field2": "TRUE",
        }
        post_data = urllib.urlencode(values)
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0) Gecko/20100101 Firefox/36.0",
            "Referer": "http://xxx.yyy.com/test0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            # "Cookie": "qsession=",
            "Content-Type": "application/x-www-form-urlencoded",
        }
        req = urllib2.Request(url, post_data, headers)
        response = opener.open(req)
        content = response.read()
        # A Content-Encoding header means the body is still compressed.
        gzipped = response.headers.get('content-encoding')
        if gzipped:
            # 16 + MAX_WBITS makes zlib expect the gzip container format.
            html = zlib.decompress(content, 16 + zlib.MAX_WBITS)
        else:
            html = content
        # Detect the page encoding from the decompressed bytes.
        result = chardet.detect(html)
        print(result)
        print html.decode("utf8")


    if __name__ == '__main__':
        main()
Running this script requires the following environment:
- Mac OS 10.9+
- Python 2.7.x
Solution for garbled, undecodable response content when Python requests HTML data with a gzip Accept-Encoding header