HTTP/1.1 uses persistent connections by default, so the end of an HttpMessage cannot be detected simply by the TCP connection being closed.
Here are several ways to determine where an HttpMessage ends:
1. The HTTP protocol specifies that responses with status code 1xx, 204, or 304 must not contain a message body, so the message body is simply ignored.
[Applies to response messages]
Http Message = Http Header
2. If the request's Method is HEAD, its message body is ignored. [Applies to request messages]
Http Message = Http Header
3. If the HTTP message headers contain "Transfer-Encoding: chunked", the length is determined from the chunk sizes.
4. If the HTTP message headers contain Content-Length and no Transfer-Encoding (when both Content-Length and Transfer-Encoding are present, Content-Length is ignored),
the message body length is given by Content-Length.
5. With a short-lived connection (the message headers contain "Connection: close"), the length of the transmitted message can be determined by the server closing the connection.
[Applies to response messages; HTTP request messages cannot be delimited this way]
6. A receive timeout can also be used, but it is unreliable. The httpProxy server implemented by Python Proxy relies on a timeout mechanism; its source (see References [7]) is only a bit over 100 lines.
Section 4.4 "Message Length" of RFC 2616, the HTTP/1.1 specification, discusses this in detail (https://tools.ietf.org/html/rfc2616#section-4.4). The rules above can be condensed into the short decision sketch below.
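The sketch is my own, not code from any library: the function name body_framing and its argument, a dict of lower-cased header names, are assumptions of this example.

def body_framing(status, method, headers):
    """Decide how the end of a response body is determined.

    headers is a dict mapping lower-cased header names to values.
    Returns 'no-body', 'chunked', 'content-length' or 'until-close'.
    """
    # Rules 1 and 2: 1xx/204/304 responses and responses to HEAD carry no body.
    if method == 'HEAD' or status in (204, 304) or 100 <= status < 200:
        return 'no-body'
    # Rule 3: "Transfer-Encoding: chunked" takes precedence over Content-Length.
    if 'chunked' in headers.get('transfer-encoding', '').lower():
        return 'chunked'
    # Rule 4: Content-Length gives the exact body length.
    if 'content-length' in headers:
        return 'content-length'
    # Rule 5: otherwise the body ends when the server closes the connection.
    return 'until-close'

For example, body_framing(200, 'GET', {'content-length': '42'}) returns 'content-length', while body_framing(304, 'GET', {}) returns 'no-body'.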
A concrete example: reading the source of httplib.py from the Python standard library (a client-side implementation of the HTTP protocol).
The simplest way to use httplib:
import httplib

conn = httplib.HTTPConnection("google.com")
conn.request('GET', '/')
print conn.getresponse().read()
conn.close()
In practice, however, httplib is usually not used directly; the higher-level wrappers urllib and urllib2 are used instead.
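For comparison, here is a minimal sketch of the same request made through urllib2 (the URL is only an example):

import urllib2

# urllib2 builds on httplib and handles redirects and headers for us
resp = urllib2.urlopen('http://google.com/')
print resp.read()
resp.close()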
conn = httplib.HTTPConnection("google.com") creates an HTTPConnection object, specifying the web server to request.
conn.request('GET', '/') sends an HTTP request to google.com with Method GET.
conn.getresponse() creates an HTTPResponse object and receives and parses the HTTP response headers; read() reads the response message body.
Call chain:
getresponse() -> [create HTTPResponse object response] -> response.begin() -> response.read()
The key methods are begin() and read(). begin() does four things:
(1) Creates an HTTPMessage object and parses the headers of the HTTP response.
(2) Checks whether the headers contain "Transfer-Encoding: chunked".
(3) Checks whether the TCP connection should be closed after the response has been received (by calling _check_close()).
(4) If the headers contain "Content-Length" and not "Transfer-Encoding: chunked", records the length of the message body.
_check_close() decides that the TCP connection should be closed after the response has been received if the response headers contain "Connection: close"; it also contains some code for backward compatibility with HTTP/1.0. HTTP/1.1 defaults to "Connection: Keep-Alive" even when no such header is present.
read() reads the HTTP response body according to Content-Length or the chunked transfer encoding; it can read everything at once or read a specified number of bytes. For chunked responses it calls _read_chunked().
_read_chunked() reads chunks according to their chunk sizes; once the last chunk (whose chunk size is 0) has been read, the HTTP response has been fully received. The relevant parts of the HTTP specification are RFC 2616 sections 3.6.1 and 19.4.6.
RFC 2616 section 19.4.6 gives pseudocode for parsing a chunked HTTP message:
length := 0
read chunk-size, chunk-extension (if any) and CRLF
while (chunk-size > 0) {
    read chunk-data and CRLF
    append chunk-data to entity-body
    length := length + chunk-size
    read chunk-size and CRLF
}
read entity-header
while (entity-header not empty) {
    append entity-header to existing header fields
    read entity-header
}
Content-Length := length
Remove "chunked" from Transfer-Encoding
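Translated into Python, the pseudocode might look like the minimal sketch below. The function name decode_chunked_body is my own; fp is assumed to be a file-like object wrapping the socket (e.g. from socket.makefile()), and trailer headers are read but discarded:

def decode_chunked_body(fp):
    length = 0
    body = []
    # read chunk-size, chunk-extension (if any) and CRLF
    chunk_size = int(fp.readline().split(';', 1)[0].strip(), 16)
    while chunk_size > 0:
        # read chunk-data and CRLF, append chunk-data to entity-body
        body.append(fp.read(chunk_size))
        fp.read(2)                     # discard the CRLF after the chunk data
        length += chunk_size
        # read the next chunk-size and CRLF
        chunk_size = int(fp.readline().split(';', 1)[0].strip(), 16)
    # read entity-header lines (the trailer) until a blank line
    line = fp.readline()
    while line.strip():
        line = fp.readline()
    return ''.join(body), length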
Now let us look at the main code of begin(), _check_close(), read(), and _read_chunked():
(1) begin():

def begin(self):
    ......
    self.msg = HTTPMessage(self.fp, 0)

    # don't let the msg keep an fp
    self.msg.fp = None

    # are we using the chunked-style of transfer encoding?
    tr_enc = self.msg.getheader('transfer-encoding')
    if tr_enc and tr_enc.lower() == "chunked":
        self.chunked = 1
        self.chunk_left = None
    else:
        self.chunked = 0

    # will the connection close at the end of the response?
    self.will_close = self._check_close()

    # do we have a Content-Length?
    # NOTE: RFC 2616, S4.4, #3 says we ignore this if tr_enc is "chunked"
    length = self.msg.getheader('content-length')
    if length and not self.chunked:
        try:
            self.length = int(length)
        except ValueError:
            self.length = None
        else:
            if self.length < 0:  # ignore nonsensical negative lengths
                self.length = None
    else:
        self.length = None

    # does the body have a fixed length? (of zero)
    # NO_CONTENT = 204, NOT_MODIFIED = 304
    # end of the HTTP response message: see point 1 at the top of this article
    if (status == NO_CONTENT or status == NOT_MODIFIED or
        100 <= status < 200 or      # 1xx codes
        self._method == 'HEAD'):
        self.length = 0

    # if the connection remains open, and we aren't using chunked, and
    # a content-length was not provided, then assume that the connection
    # WILL close.
    # end of the HTTP response message: if neither chunked nor Content-Length
    # is used, the end is marked by closing the connection
    if not self.will_close and \
       not self.chunked and \
       self.length is None:
        self.will_close = 1
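To see the effect of begin(), one can simply inspect the attributes it sets on the response object. A small sketch (google.com is just an example host; chunked, length and will_close are the attributes assigned in the code above):

import httplib

conn = httplib.HTTPConnection("google.com")
conn.request('GET', '/')
resp = conn.getresponse()              # getresponse() calls begin() internally
# how will the body be delimited? these flags were set by begin()
print resp.chunked, resp.length, resp.will_close
conn.close()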
(2) _check_close():

def _check_close(self):
    # end of the HTTP response message: see point 5 at the top of this article
    conn = self.msg.getheader('connection')
    if self.version == 11:
        # An HTTP/1.1 proxy is assumed to stay open unless
        # explicitly closed.
        conn = self.msg.getheader('connection')
        if conn and "close" in conn.lower():
            return True
        return False

    # Some HTTP/1.0 implementations have support for persistent
    # connections, using rules different than HTTP/1.1.

    # For older HTTP, Keep-Alive indicates persistent connection.
    if self.msg.getheader('keep-alive'):
        return False

    # At least Akamai returns a "Connection: Keep-Alive" header,
    # which was supposed to be sent by the client.
    if conn and "keep-alive" in conn.lower():
        return False

    # Proxy-Connection is a netscape hack.
    pconn = self.msg.getheader('proxy-connection')
    if pconn and "keep-alive" in pconn.lower():
        return False

    # otherwise, assume it will close
    return True
(3) read():

def read(self, amt=None):
    if self.fp is None:
        return ''

    if self._method == 'HEAD':
        self.close()
        return ''

    if self.chunked:
        return self._read_chunked(amt)

    if amt is None:
        # unbounded read
        if self.length is None:
            s = self.fp.read()
        else:
            try:
                s = self._safe_read(self.length)
            except IncompleteRead:
                self.close()
                raise
            self.length = 0
        self.close()        # we read everything
        return s

    if self.length is not None:
        if amt > self.length:
            # clip the read to the "end of response"
            amt = self.length

    # we do not use _safe_read() here because this may be a .will_close
    # connection, and the user is reading more bytes than will be provided
    # (for example, reading in 1k chunks)
    s = self.fp.read(amt)
    if not s:
        # Ideally, we would raise IncompleteRead if the content-length
        # wasn't satisfied, but it might break compatibility.
        self.close()

    if self.length is not None:
        # track the remaining length for the next read
        self.length -= len(s)
        if not self.length:
            self.close()

    return s
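As a usage sketch, read(amt) can pull the body in fixed-size pieces; self.length (or chunk_left for chunked responses) is decremented on every call until an empty string signals the end of the body (again, google.com is only an example host):

import httplib

conn = httplib.HTTPConnection("google.com")
conn.request('GET', '/')
resp = conn.getresponse()
body = []
while True:
    data = resp.read(1024)             # read at most 1 KB per call
    if not data:                       # empty string: the whole body has been read
        break
    body.append(data)
print len(''.join(body))
conn.close()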
(4) _read_chunked():

def _read_chunked(self, amt):
    assert self.chunked != _UNKNOWN
    # self.chunk_left is None when reading a chunk for the first time (see self.begin())
    # chunk_left: bytes left in the current chunk
    # chunk_left = None means that reading hasn't been started
    chunk_left = self.chunk_left
    value = []
    while True:
        if chunk_left is None:
            # read a new chunk
            line = self.fp.readline(_MAXLINE + 1)
            if len(line) > _MAXLINE:
                raise LineTooLong("chunk size")
            i = line.find(';')
            if i >= 0:
                line = line[:i]  # strip chunk-extensions
            try:
                chunk_left = int(line, 16)
            except ValueError:
                # close the connection as protocol synchronisation is
                # probably lost
                self.close()
                raise IncompleteRead(''.join(value))
            if chunk_left == 0:
                # RFC 2616 3.6.1: last-chunk
                break
        if amt is None:
            value.append(self._safe_read(chunk_left))
        elif amt < chunk_left:
            value.append(self._safe_read(amt))
            self.chunk_left = chunk_left - amt
            return ''.join(value)
        elif amt == chunk_left:
            value.append(self._safe_read(amt))
            self._safe_read(2)  # toss the CRLF at the end of the chunk
            self.chunk_left = None
            return ''.join(value)
        else:
            value.append(self._safe_read(chunk_left))
            amt -= chunk_left

        # we read the whole chunk, get another
        self._safe_read(2)      # toss the CRLF at the end of the chunk
        chunk_left = None

    ......

    # we read everything; close the "file"
    self.close()

    return ''.join(value)
Another real-world example: in PythonProxy, reception stops once the timeout is reached. _read_write() reads from and writes to the already-open sockets.
def _read_write(self):
    time_out_max = self.timeout / 3
    socs = [self.client, self.target]
    count = 0
    while 1:
        count += 1
        # time_out = 3
        (recv, _, error) = select.select(socs, [], socs, 3)
        if error:
            break
        if recv:
            for in_ in recv:
                data = in_.recv(BUFLEN)
                if in_ is self.client:
                    out = self.target
                else:
                    out = self.client
                if data:
                    out.send(data)
                    count = 0
        # stop receiving and sending after time_out_max consecutive rounds
        # with no data (i.e. the transfer has timed out)
        if count == time_out_max:
            break
With the analysis and source code above, this question should be easy to answer:
When HTTP runs in keep-alive mode, how does the client determine, after the server has responded to its request, that the HTTP response message has been received in full?
Finally, References [2] and [3] point to Stack Overflow answers on how to detect the end of an HTTP message.
References
[1] Hypertext Transfer Protocol -- HTTP/1.1
https://tools.ietf.org/html/rfc2616
[2] Detect end of HTTP request body
http://stackoverflow.com/questions/4824451/detect-end-of-http-request-body
[3] Detect the end of a HTTP packet
http://stackoverflow.com/questions/3718158/detect-the-end-of-a-http-packet
[4] Determining the end of a Keep-Alive HTTP request (判斷Keep-Alive模式的HTTP請求的結束)
http://blog.quanhz.com/archives/141
[5] 這樣被判了死刑!
http://www.cnblogs.com/skynet/archive/2010/12/11/1903347.html
[6] Notes on Nginx and the HTTP protocol (雜談Nginx與HTTP協議)
http://blog.xiuwz.com/tag/content-length/
[7] Python Proxy - A Fast HTTP proxy
https://code.google.com/p/python-proxy/
[8] HTTP programming in Python: httplib, urllib and urllib2 (python基於http協議編程:httplib,urllib和urllib2)
http://www.cnblogs.com/chenzehe/archive/2010/08/30/1812995.html
When reposting this article, please credit the author and the source [Gary的影響力] http://garyelephant.me, and do not use it for any commercial purpose.
Author: Gary Gao (garygaowork[at]gmail.com), interested in the Internet, distributed systems, high performance, NoSQL, automation, and software teams.