Solution: response content is garbled and cannot be decoded when Python requests HTML data with a gzip Accept-Encoding header

Source: Internet
Author: User

1. Issue background

When crawling web data with the urllib2 module, you may add an Accept-Encoding request header to reduce the amount of data transferred. The returned data is then gzip-compressed. Calling content.decode("utf8") on it directly raises an exception, and the actual encoding of the page data cannot be detected.

2. Problem analysis

In HTTP, if the request header contains "Accept-Encoding": "gzip, deflate" and the web server supports it, the returned data is compressed. The benefit is reduced network traffic. The client is expected to decompress the body according to the Content-Encoding response header, and only then decode it. The urllib2 module returns the HTTP response body as raw data that has not been decompressed, and this is the root cause of the garbled output.
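The failure can be reproduced without a network connection. The following is a minimal sketch, written for Python 3 (where `gzip.compress` is available) rather than the Python 2.7 used in this article; the sample page text and the variable names are made up for illustration.

```python
import gzip

# Simulate the body a server sends when it honours "Accept-Encoding: gzip".
page = u"<html><body>caf\u00e9</body></html>"
raw_body = gzip.compress(page.encode("utf-8"))

# Decoding the compressed bytes directly fails: a gzip stream starts with
# the bytes 0x1f 0x8b, and 0x8b is not valid UTF-8 in that position.
try:
    raw_body.decode("utf-8")
    decoded_directly = True
except UnicodeDecodeError:
    decoded_directly = False

# Decompress first, then decode.
text = gzip.decompress(raw_body).decode("utf-8")
```

Here `decoded_directly` ends up `False`, while `text` equals the original page.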

3. Solution

3.1 Remove "Accept-Encoding": "gzip, deflate" from the request header

This is the fastest fix: the response arrives uncompressed and can be decoded directly. The disadvantage is that the amount of data transferred increases considerably.
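As a rough illustration of this option, the helper below drops the header from a headers dictionary before the request is built. The function name `without_accept_encoding` is hypothetical, not part of urllib2; the comparison is case-insensitive because HTTP header names are.

```python
def without_accept_encoding(headers):
    """Return a copy of `headers` with any Accept-Encoding entry removed."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "accept-encoding"
    }


headers = {
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}
plain_headers = without_accept_encoding(headers)
```

The original dictionary is left untouched, so the same base headers can still be reused for requests that do handle compression.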

3.2 Use the zlib module to decompress, then decode, to get readable plaintext data

This is the approach used in this article.
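A possible shape for the decompression step is sketched below. It is written for Python 3, although the `zlib.decompress` calls themselves work the same way in 2.7. The function name `decompress_body` is hypothetical, and the raw-deflate fallback is an addition that the article's own script does not include.

```python
import gzip
import zlib


def decompress_body(content, content_encoding):
    """Decompress an HTTP body according to its Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(content, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            # Standard form: deflate data inside a zlib wrapper.
            return zlib.decompress(content)
        except zlib.error:
            # Some servers send raw deflate data with no wrapper.
            return zlib.decompress(content, -zlib.MAX_WBITS)
    # No (or unrecognised) encoding: return the bytes unchanged.
    return content


# Usage with a locally compressed sample instead of a real response.
sample = gzip.compress(b"<html>hello</html>")
html = decompress_body(sample, "gzip")
```

Branching on the header value, rather than treating any non-empty Content-Encoding as gzip, avoids a zlib error when the server answers with deflate.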

4. Source code analysis

The code below is a typical simulated form submission that POSTs the request data. It is based on Python 2.7.

    #!/usr/bin/env python2.7
    import sys
    import zlib
    import chardet
    import urllib
    import urllib2
    import cookielib


    def main():
        reload(sys)
        sys.setdefaultencoding('utf-8')
        url = 'http://xxx.yyy.com/test'
        values = {
            "form_field1": "value1",
            "form_field2": "TRUE",
        }
        post_data = urllib.urlencode(values)
        # Keep cookies across requests made through this opener.
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0) Gecko/20100101 Firefox/36.0",
            "Referer": "http://xxx.yyy.com/test0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            # "Cookie": "qsession=",
            "Content-Type": "application/x-www-form-urlencoded",
        }
        req = urllib2.Request(url, post_data, headers)
        response = opener.open(req)
        content = response.read()
        # A Content-Encoding header means the body is still compressed.
        gzipped = response.headers.get('Content-Encoding')
        if gzipped:
            # 16 + MAX_WBITS makes zlib accept the gzip header and trailer.
            html = zlib.decompress(content, 16 + zlib.MAX_WBITS)
        else:
            html = content
        result = chardet.detect(html)
        print(result)
        print html.decode("utf8")


    if __name__ == '__main__':
        main()

Running this script requires the following environment:
- Mac OS 10.9+
- Python 2.7.x
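The script targets Python 2.7, and the urllib2 module does not exist under that name in Python 3. As a hedged sketch of how the request-building part might look there, the snippet below uses `urllib.parse` and `urllib.request` and only constructs the request without sending it. The helper name `build_post_request` is hypothetical, and the URL is the article's placeholder.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form_values, headers):
    """Build (but do not send) a form POST request on Python 3."""
    # In Python 3 the request body must be bytes, not str.
    post_data = urllib.parse.urlencode(form_values).encode("ascii")
    return urllib.request.Request(url, data=post_data, headers=headers)


req = build_post_request(
    "http://xxx.yyy.com/test",
    {"form_field1": "value1"},
    {"Accept-Encoding": "gzip, deflate"},
)
```

Supplying `data` is what makes this a POST. The response body would still need to be decompressed before decoding, exactly as in the 2.7 script.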
