Downloading web page content over a socket: handling chunked and gzip data

Last Update:2018-12-05 Source: Internet

Author: User

When downloading HTML pages over a raw socket, Java and Python both offer good library support. If you write it in C++, you have to work with sockets directly, and the first step is to create the socket. When sending the request headers, imitate a browser request: you only need to set the User-Agent to a browser name such as IE or Firefox. It can also be a robot, such as the name of a search engine crawler. There are plenty of examples of this on the Internet, so I will not repeat them. Here I cover only how to extract chunked data and decompress it.
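
As a rough sketch of the request step described above, the following Python builds a browser-like GET request and sends it over a plain socket. The names `build_request` and `fetch_raw`, the default User-Agent string, and the `Connection: close` header (used so that reading until the server closes the connection is enough) are illustrative assumptions, not part of the original code.

```python
import socket

def build_request(host, path="/", user_agent="Mozilla/5.0"):
    """Return the bytes of a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {user_agent}",   # pretend to be a browser
        "Accept-Encoding: gzip",       # tell the server gzip is acceptable
        "Connection: close",           # assumption: server closes when done
        "",
        "",                            # the blank line ends the header block
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch_raw(host, path="/", port=80, timeout=10.0):
    """Send the request and return the raw response (headers and body)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        parts = []
        while True:
            data = sock.recv(4096)
            if not data:               # empty read: the server closed the socket
                break
            parts.append(data)
    return b"".join(parts)
```

The returned bytes still contain the response headers and, depending on the server, a chunked and/or gzip-compressed body.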

If the request headers include Accept-Encoding: gzip, deflate and the server supports gzip, the server sends the data to the client gzip-compressed. A browser client would decompress it for us. In this case the server usually includes a Content-Length header giving the length of the data it will send. The client's socket code reads this value from the response headers and uses it as the number of bytes to receive from the server. Sometimes, however, the server does not send this header and instead sends Transfer-Encoding: chunked, which means the data is transmitted in chunks.
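
The decision described above can be sketched as follows. The helpers `split_response` and `body_framing` are hypothetical names, and the header parsing is deliberately simplified (it keeps only the last value of a repeated header).

```python
def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:                      # lines[0] is the status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body

def body_framing(headers):
    """Return how the body length is determined for the parsed headers."""
    if "chunked" in headers.get("transfer-encoding", "").lower():
        return "chunked"        # read chunk by chunk until the 0-length chunk
    if "content-length" in headers:
        return "length"         # read exactly Content-Length bytes
    return "until-close"        # read until the server closes the connection

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Encoding: gzip\r\n"
       b"Transfer-Encoding: chunked\r\n"
       b"\r\n"
       b"5\r\nhello\r\n0\r\n\r\n")
status, headers, body = split_response(raw)
print(body_framing(headers))   # chunked
```

Header names are lowercased on the way in because HTTP header names are case-insensitive.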

A chunked body has the following format:

Byte count of the first chunk + \r\n + data of the first chunk + \r\n + byte count of the second chunk + \r\n + data of the second chunk + \r\n + ... + byte count of the Nth chunk + \r\n + data of the Nth chunk + \r\n + 0 + \r\n, followed by a final \r\n (the blank line that ends the body).
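
To make the wire format concrete, here is a small sketch that encodes a list of byte strings as a chunked body. `encode_chunked` is a hypothetical helper written for illustration only; a client normally only needs the decoding direction.

```python
def encode_chunked(pieces):
    """Encode an iterable of byte strings as an HTTP/1.1 chunked body."""
    out = bytearray()
    for piece in pieces:
        if not piece:
            continue                      # a zero-length chunk would end the body
        out += b"%x\r\n" % len(piece)     # chunk size in hexadecimal
        out += piece + b"\r\n"            # chunk data followed by CRLF
    out += b"0\r\n\r\n"                   # terminating chunk and final blank line
    return bytes(out)

print(encode_chunked([b"Hello, ", b"world!"]))
# b'7\r\nHello, \r\n6\r\nworld!\r\n0\r\n\r\n'
```

A 26-byte chunk would be announced as `1a`, not `26`, because the size line is hexadecimal.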

Therefore, when receiving chunked data, first read the length of a chunk, skip the 2 bytes of \r\n, read that many bytes of data, then skip another 2 bytes to reach the length of the next chunk, and so on until the last chunk, whose length must be 0. Chunk lengths are transmitted in hexadecimal and must be converted to decimal. If the data is gzip-compressed, decompress it after all chunks have been joined. If the data was not sent in chunks, decompress it directly.

I searched Google for half a day. The most widely reposted article is http://www.donevii.com/post/468.html, but it has no accompanying processing code, so I am recording the code I wrote here.

    import gzip
    import io

    # content (bytes), chunk (bool) and contenttype (str) come from the
    # earlier part of the download code, which is not shown here.
    if chunk:
        content = content.lstrip(b'\r\n')
        # The first chunk length is hexadecimal and ends with \r\n.
        temp = content.find(b'\r\n')
        readbytes = int(content[0:temp], 16)  # convert to decimal
        offset = temp + 2
        newcont = b''
        # Process all chunks in a loop.
        while readbytes > 0:
            # Take this chunk's data and append it to the data collected so far.
            newcont += content[offset:offset + readbytes]
            offset += readbytes
            # Skip the \r\n after the data and find the end of the next length line.
            endtemp = content.find(b'\r\n', offset + 2)
            if endtemp == -1:
                break  # no further length line: stop instead of looping forever
            readbytes = int(content[offset + 2:endtemp], 16)
            offset = endtemp + 2
        # Replace the raw body with the reassembled data.
        content = newcont

    print(contenttype)
    try:
        # Decompress gzip data. If the body was not chunked, the step above
        # is skipped and the data is decompressed directly.
        if contenttype == 'gzip':
            compressedstream = io.BytesIO(content)
            gzipper = gzip.GzipFile(fileobj=compressedstream)
            content = gzipper.read()
    except IOError as e:
        print(e)

The above is part of my Python code for downloading data over a socket. It extracts the length of each chunk and then the chunk's data, and repeats this for every following chunk until it reaches the final chunk, \r\n + 0 + \r\n, which indicates that all chunks have been received.
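
The same idea can also be applied while reading from a stream instead of a fully received buffer. The following is a minimal sketch, assuming a binary file-like object such as the one returned by `sock.makefile("rb")`. The function names `read_chunked` and `decode_body` are hypothetical, and trailer headers after the last chunk are ignored.

```python
import gzip
import io

def read_chunked(stream):
    """Read a chunked body from a binary file-like object; return the payload."""
    payload = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line:
            raise ValueError("stream ended before the terminating chunk")
        # Drop optional chunk extensions (";name=value") before parsing the hex size.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break                          # a 0-length chunk marks the end
        chunk = stream.read(size)
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        payload += chunk
        stream.readline()                  # consume the \r\n after the chunk data
    return bytes(payload)

def decode_body(stream, content_encoding=""):
    """Join all chunks first, then decompress if the body is gzip-encoded."""
    data = read_chunked(stream)
    if content_encoding.lower() == "gzip":
        data = gzip.decompress(data)
    return data

# Usage with an in-memory stream standing in for the socket.
html = b"<html>" + b"x" * 100 + b"</html>"
packed = gzip.compress(html)
wire = (b"%x\r\n" % 10 + packed[:10] + b"\r\n"
        + b"%x\r\n" % (len(packed) - 10) + packed[10:] + b"\r\n"
        + b"0\r\n\r\n")
assert decode_body(io.BytesIO(wire), "gzip") == html
```

Decompression happens only after every chunk has been joined, because the chunk boundaries have no relation to the gzip stream's internal structure.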

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
