Python crawlers encountering status codes 304 and 705
What is the 304 status code?
If the client sends a conditional GET request, the request is allowed, and the content of the document has not changed (since the last access, or according to the condition in the request), the server should return the 304 status code. Put simply, the client performed a GET, but the file has not changed.
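
The decision described above can be sketched on the server side as follows. This is a minimal illustration, not how any particular web server implements it: the function name `decide_status`, the plain `dict` of request headers, and the assumption that `last_modified` is a timezone-aware UTC `datetime` are all hypothetical choices made for the example. The conditions are carried in the standard `If-None-Match` and `If-Modified-Since` request headers.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Drop the weak-validator prefix so that W/"x" compares equal to "x"."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def decide_status(request_headers, current_etag, last_modified):
    """Return 304 if the client's copy is still current, otherwise 200.

    request_headers: dict of header name -> value (hypothetical shape).
    current_etag:    ETag of the current representation, e.g. '"v1"'.
    last_modified:   timezone-aware UTC datetime of the last change.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When If-None-Match is present it takes precedence over
        # If-Modified-Since. Splitting on commas is a simplification.
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        if "*" in candidates or _opaque(current_etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # Unparseable date: treat the request as unconditional.
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        if last_modified.replace(microsecond=0) <= since:
            return 304

    return 200
```

With a resource whose ETag is `"v1"`, a request carrying `If-None-Match: "v1"` yields 304, while a request with a different tag or with no condition at all yields 200.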
Under what circumstances is a 304 status code returned? How does the client know that the content has not been updated? In fact, this is decided by the server, not the client. A server can be configured with a caching mechanism to speed up website access: when you send a GET request, the server serves the requested content from its cache. At that point the server can determine whether the page has been updated, and if it has not, it returns a 304 status code.
For example, how do search engines know whether a website has been updated? The most direct way to detect a change is to designate one part of a page as a monitoring area, capture the content of that area on every visit, and compare it with the locally saved or previously captured content. If there is a difference, the page has changed and only then needs to be parsed. This method is reliable and almost foolproof. However, it downloads the page content on every scan, extracts the monitoring area, and then performs a string comparison, so the whole process is time-consuming.
In practice, many websites serve static content such as sliced images, HTML, and JS files. These static pages may already be prepared on the server, and a user simply downloads them on access. For such pages, the 304 status code alone is enough to determine whether the content has changed.
How does this work? As stated above, the server returns 304 when a conditional GET request is allowed and the document has not changed. A 304 response must not contain a message body, so it always ends at the first empty line after the header fields. The response must include the following header fields: Date, unless the server has no clock.
If a clockless server follows these rules, and proxies and clients add their own Date field to any response received without one (as specified in RFC 2068), the caching mechanism works correctly. The response must also include ETag and/or Content-Location, if that header would have been sent in a 200 response to the same request, and Expires, Cache-Control, and/or Vary, if the field value might differ from the one sent in any previous response for the same variant.
If the conditional GET used a strong cache validator, the response should not include other entity headers. Otherwise (that is, the conditional GET used a weak validator), the response must not include other entity headers. This prevents inconsistencies between cached entity bodies and updated headers.
If a 304 response indicates an entity that is not currently cached, the cache must disregard the response and repeat the request without the condition. If a cache uses a received 304 response to update a cache entry, it must update the entry to reflect any new field values given in the response.
To make a conditional request, the client sends the server an If-Modified-Since request header whose value is the date from the Last-Modified response header the server returned last time, and an If-None-Match request header whose value is the ETag response header the server returned last time.
When the website's status code is 304, the crawler returns status 705, which means that the connection between the WAP gateway and the remote server failed.
Reference status code information:
http://tool.oschina.net/commons?type=5
https://wenku.baidu.com/view/4e06018483d049649b66581c.html
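
The client-side rules described above (sending If-None-Match and If-Modified-Since, then updating the cache on a 304) can be sketched for a crawler as follows. This is only an illustration under stated assumptions: `CacheEntry`, `conditional_headers`, and `apply_response` are hypothetical names, header names are assumed to be stored in their canonical spelling, and the actual network call is left to whatever HTTP client the crawler uses.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CacheEntry:
    """Hypothetical record the crawler keeps for one URL."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def conditional_headers(entry: Optional[CacheEntry]) -> Dict[str, str]:
    """Build the validators for the next request from the last response."""
    headers: Dict[str, str] = {}
    if entry is None:
        return headers  # Nothing cached: send an unconditional GET.
    if "ETag" in entry.headers:
        headers["If-None-Match"] = entry.headers["ETag"]
    if "Last-Modified" in entry.headers:
        headers["If-Modified-Since"] = entry.headers["Last-Modified"]
    return headers


def apply_response(entry, status, response_headers, body=b""):
    """Return the updated cache entry, or None if the crawler should
    repeat the request without conditions."""
    if status == 304:
        if entry is None:
            # 304 for something that is not cached: ignore it and retry.
            return None
        # A 304 has no body: keep the cached body and refresh the
        # stored header fields with the values from the response.
        merged = dict(entry.headers)
        merged.update(response_headers)
        return CacheEntry(body=entry.body, headers=merged)
    if status == 200:
        # The content changed: replace the whole entry.
        return CacheEntry(body=body, headers=dict(response_headers))
    raise ValueError(f"unhandled status code: {status}")
```

A crawler would pass the result of `conditional_headers` as extra request headers and feed the status, headers, and body it receives into `apply_response`. Note that some HTTP clients, including Python's `urllib.request`, raise an `HTTPError` for a 304 response instead of returning it, so the status may have to be read from the exception.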