When downloading HTML pages over a raw socket, Java and Python both offer good library support. In C++ you have to work with sockets directly, and the first step is to create the socket. Either way, when sending the request headers you can imitate a browser: just set User-Agent to a browser's name, such as IE or Firefox, or to a robot's name, such as a search engine crawler. There are plenty of write-ups on that elsewhere, so I won't repeat them. Here I only cover extracting chunked data and decompressing it.
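For reference, here is a minimal sketch of such a request in Python 3. The helper names `build_request` and `fetch_raw`, the default User-Agent string and the `Connection: close` header are my own illustrative choices, not part of any library.

```python
import socket

def build_request(host, path="/", user_agent="Mozilla/5.0"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        "GET {} HTTP/1.1".format(path),
        "Host: {}".format(host),
        "User-Agent: {}".format(user_agent),  # pretend to be a browser or a robot
        "Accept-Encoding: gzip, deflate",     # tell the server we accept compressed data
        "Connection: close",                  # assumption: let the server close when done
        "",
        "",                                   # the blank line that ends the header block
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch_raw(host, path="/", port=80):
    """Send the request and return everything the server sends back (headers + body)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        received = []
        while True:
            data = sock.recv(4096)
            if not data:          # an empty read means the server closed the connection
                break
            received.append(data)
    return b"".join(received)
```

Reading until the server closes the connection only works because of `Connection: close`; on a keep-alive connection the body length has to come from the response headers.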
If the request headers include Accept-Encoding: gzip, deflate and the server supports gzip, the server sends the data to the client gzip-compressed. A browser would decompress it for us, but with a raw socket we have to do that ourselves. Normally the server sends a Content-Length header giving the number of bytes it is going to send. The client reads this value from the response headers and uses it as the number of bytes to receive from the server. Sometimes, however, the server omits this header and sends Transfer-Encoding: chunked instead, which means the data is transmitted in chunks.
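As a rough sketch, the decision between the two cases could look like this in Python 3. `split_response` and `body_framing` are hypothetical helper names of mine, and the header parsing is deliberately simple (no folded or repeated headers).

```python
def split_response(raw):
    """Split raw response bytes into (header dict with lowercase names, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    return headers, body

def body_framing(headers):
    """Return ("chunked", None), ("length", n) or ("until-close", None)."""
    if "chunked" in headers.get("transfer-encoding", "").lower():
        return ("chunked", None)
    if "content-length" in headers:
        return ("length", int(headers["content-length"]))
    # Neither header present: the body runs until the server closes the connection.
    return ("until-close", None)
```

Transfer-Encoding is checked first because a chunked body has no usable Content-Length.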
A chunked body has the following format:
hex byte count of chunk 1 + \r\n + data of chunk 1 + \r\n + hex byte count of chunk 2 + \r\n + data of chunk 2 + \r\n + ... + hex byte count of chunk N + \r\n + data of chunk N + \r\n + 0 + \r\n + \r\n.
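To make the format concrete, here is a small sketch that produces such a body from a list of byte strings. `encode_chunked` is a name I made up for illustration, and it writes no optional trailer headers.

```python
def encode_chunked(parts):
    """Encode an iterable of byte strings as a chunked body (no trailers)."""
    out = []
    for part in parts:
        if part:  # skip empty parts: a zero-length chunk would end the body early
            size_line = format(len(part), "x").encode("ascii")  # length in hexadecimal
            out.append(size_line + b"\r\n" + part + b"\r\n")
    out.append(b"0\r\n\r\n")  # terminating zero-length chunk plus the final CRLF
    return b"".join(out)

print(encode_chunked([b"Wiki", b"x" * 26]))
# b'4\r\nWiki\r\n1a\r\nxxxxxxxxxxxxxxxxxxxxxxxxxx\r\n0\r\n\r\n'
```

Note that the 26-byte chunk is announced as 1a, not 26, because the length is written in hexadecimal.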
So when receiving chunked data, first read the length of the chunk in bytes, skip the 2 bytes of \r\n, read that many bytes of data, then skip another 2 bytes (\r\n) to reach the length of the next chunk, and so on until the last chunk, whose length is always 0. The length is transmitted as a hexadecimal string, so it must be converted to a number before use. If the data is gzip-compressed, decompress it only after all the chunks have been joined together. If the transfer is not chunked, decompress the body directly.
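The same steps can also be applied while the data is still arriving, instead of on a complete buffer. Below is a sketch under my own assumptions: `read_chunked` and `decode_body` are hypothetical names, and `stream` is any binary file-like object, for example the result of `sock.makefile("rb")` positioned just after the response headers.

```python
import gzip
import io

def read_chunked(stream):
    """Read a chunked body from a binary file-like object and return the joined data."""
    payload = []
    while True:
        size_line = stream.readline()
        if not size_line:
            raise ValueError("stream ended before the terminating 0 chunk")
        # Ignore any chunk extension (";name=value") and parse the hex length.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # last chunk; any trailer headers are left unread
        data = stream.read(size)
        if len(data) < size:
            raise ValueError("stream ended inside a chunk")
        payload.append(data)
        stream.readline()  # skip the \r\n that follows the chunk data
    return b"".join(payload)

def decode_body(stream, content_encoding=""):
    """Join all chunks first, then decompress if the body is gzip-encoded."""
    body = read_chunked(stream)
    if content_encoding.lower() == "gzip":
        body = gzip.decompress(body)
    return body

demo = io.BytesIO(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
print(read_chunked(demo))  # b'hello world'
```

Decompression happens only after the join because chunk boundaries have nothing to do with the gzip stream: a single chunk is generally not a valid gzip file on its own.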
I searched Google for half a day. The most widely reposted article is http://www.donevii.com/post/468.html, but it comes with no processing code, so I am recording my own code here.
    # Python 3 version of the fragment. It assumes that `content` holds the raw
    # response body as bytes, that `chunk` is True when the response used
    # Transfer-Encoding: chunked, and that `contenttype` holds the
    # Content-Encoding value.
    import gzip
    import io

    if chunk:
        content = content.lstrip(b'\r')
        content = content.lstrip(b'\n')
        # Get the hexadecimal length of the first chunk, which ends with \r\n.
        temp = content.find(b'\r\n')
        strtemp = content[0:temp]
        readbytes = int(strtemp, 16)  # hexadecimal string to integer
        offset = temp + 2
        newcont = b''
        # Process all chunks in a loop.
        while readbytes > 0:
            # Take this chunk's data and append it to the data collected so far.
            newcont += content[offset:readbytes + offset]
            offset += readbytes
            # Skip the \r\n after the data and look for the end of the next size line.
            endtemp = content.find(b'\r\n', offset + 2)
            if endtemp > -1:
                strtemp = content[offset + 2:endtemp]
                readbytes = int(strtemp, 16)
                if readbytes == 0:
                    break
                else:
                    offset = endtemp + 2
            else:
                # No further size line found: stop instead of looping forever.
                break
        # Replace the original data with the de-chunked data.
        content = newcont
    print(contenttype)
    try:
        # Decompress gzip data. If the response was not chunked, the steps
        # above are skipped and the body is decompressed directly.
        if contenttype == 'gzip':
            compressedstream = io.BytesIO(content)
            gzipper = gzip.GzipFile(fileobj=compressedstream)
            content = gzipper.read()
    except IOError as e:
        print(e)
The above is part of my code for downloading data over a Python socket. It reads the length of each chunk and then extracts that chunk's data. If another chunk follows, it processes that one too, until it reaches the final chunk, \r\n + 0 + \r\n, which indicates that all chunks have been received.