Python crawler: HTTP request headers explained in detail

Source: Internet
Author: User

Reposted from: http://www.cnblogs.com/yizhenfeng168/p/7078480.html
This article is based on RFC 2616 (the HTTP/1.1 specification), with reference to:

http://www.w3.org/Protocols/rfc2068/rfc2068

http://www.w3.org/Protocols/rfc2616/rfc2616

http://www.ietf.org/rfc/rfc3229.txt

Typically, HTTP messages include the client's request message to the server and the server's response message to the client. Both types of message consist of a start line, zero or more header fields, an empty line that marks the end of the header fields, and an optional message body. HTTP header fields fall into four groups: general headers, request headers, response headers, and entity headers. Each header field consists of three parts: a field name, a colon (:), and a field value. Field names are case-insensitive, the field value may be preceded by any amount of whitespace, and a header field can be extended over multiple lines by starting each continuation line with at least one space or tab.
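The rules above (case-insensitive names, optional whitespace before the value, continuation lines that start with a space or tab) can be sketched in Python as follows. This is a minimal illustration, not a complete parser; the function name parse_headers and the ExampleCrawler user-agent string are assumptions made up for the example.

```python
def parse_headers(raw):
    """Parse an HTTP header block into a dict keyed by lower-cased field name."""
    headers = {}
    last_name = None
    for line in raw.split("\r\n"):
        if line == "":
            break  # the empty line ends the header section
        if line[0] in " \t" and last_name is not None:
            # Continuation line: append it to the previous field value.
            headers[last_name] += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        last_name = name.strip().lower()  # field names are case-insensitive
        value = value.strip()             # leading whitespace is not significant
        if last_name in headers:
            # A repeated field is combined into one comma-separated value.
            headers[last_name] += ", " + value
        else:
            headers[last_name] = value
    return headers


raw = (
    "Host: download.google.com\r\n"
    "ACCEPT:   */*\r\n"
    "User-Agent: ExampleCrawler/1.0\r\n"   # hypothetical client name
    "\t(hypothetical; folded line)\r\n"
    "\r\n"
    "body starts here"
)
headers = parse_headers(raw)
print(headers["accept"])
print(headers["user-agent"])
```

Keying the dictionary by the lower-cased name is one simple way to honor case-insensitivity; a real crawler would more likely rely on the header mapping that its HTTP library already provides.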

General header fields (general headers)

The general header fields are those supported by both request and response messages. They provide the most basic information about the message and include:

Connection allows clients and servers to specify options related to a request/response connection

Date provides a date and time stamp indicating when the message was created

MIME-Version gives the MIME version used by the sender

Trailer If the message uses chunked transfer encoding, this header lists the header fields that appear in the trailer section of the message

Transfer-Encoding tells the receiver which encoding was applied to the message so that it could be transferred reliably

Upgrade gives the new version or protocol that the sender would like to "upgrade" to

Via shows the intermediate nodes (proxies, gateways) that the message passed through

Extensions to the general header fields require both communicating parties to support them; unsupported general header fields are normally treated as entity headers. The following briefly introduces a few general header fields that are used in UPnP messages.

Cache-Control header field

Cache-Control specifies the caching directives that requests and responses must follow. Setting Cache-Control in a request or a response does not change the caching behavior applied while another message is being handled. The request directives are no-cache, no-store, max-age, max-stale, min-fresh, and only-if-cached; the response directives are public, private, no-cache, no-store, no-transform, must-revalidate, proxy-revalidate, and max-age. The directives have the following meanings:

public indicates that the response can be cached by any cache.

private indicates that all or part of the response is intended for a single user and must not be stored by a shared cache. This lets the server mark parts of a response as belonging to one user only and not valid for other users' requests.

no-cache indicates that a cached response must not be used to satisfy a request without first being revalidated with the origin server.

no-store prevents sensitive information from being inadvertently released or retained. When it is sent in a request, a cache must not store any part of either the request or its response.

max-age indicates that the client is willing to accept a response whose age is no greater than the specified time in seconds.

min-fresh indicates that the client is willing to accept a response that will still be fresh for at least the specified number of seconds.

max-stale indicates that the client is willing to accept a response that has exceeded its expiration time. If max-stale is given a value, the client accepts a response that has exceeded its expiration time by no more than that number of seconds.
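As a rough illustration of how a crawler might read these directives, the sketch below splits a Cache-Control value into a dictionary and applies the max-age rule. The helper names parse_cache_control and is_fresh_enough are made up for this example, and the comma split is a simplification: it does not handle quoted arguments that themselves contain commas.

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: argument-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Directive names are compared case-insensitively.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def is_fresh_enough(age, directives):
    """Check a cached response's age (in seconds) against the max-age directive."""
    max_age = directives.get("max-age")
    return max_age is None or age <= int(max_age)


cc = parse_cache_control("No-Cache, max-age=60, max-stale")
print(cc)
print(is_fresh_enough(30, cc), is_fresh_enough(90, cc))
```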

Date header field

The Date header field gives the time at which the message was sent; the time format is defined by RFC 822. For example, Date: Mon, 31 Dec 2001 04:25:57 GMT. The time in Date is expressed in universal time (GMT), so converting it to local time requires knowing the user's time zone.
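In Python, the standard library's email.utils module can parse and produce this date format. The sketch below parses the example value and converts it to a local time; the UTC+8 zone is only an assumption chosen for illustration.

```python
from datetime import timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Parse the Date value into a timezone-aware datetime (GMT is offset zero).
sent = parsedate_to_datetime("Mon, 31 Dec 2001 04:25:57 GMT")

# Converting to local time requires the user's zone; UTC+8 is assumed here.
local_zone = timezone(timedelta(hours=8))
local = sent.astimezone(local_zone)

print(sent.isoformat())
print(local.isoformat())

# Going the other way: format a UTC datetime as an HTTP date string.
print(format_datetime(sent, usegmt=True))
```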

Pragma header field

The Pragma header field carries implementation-specific directives, the most common being Pragma: no-cache. In the HTTP/1.1 protocol it has the same meaning as Cache-Control: no-cache.

Request message

The first line of the request message has the following format:

Method SP Request-URI SP HTTP-Version CRLF

Method indicates the operation to perform on the Request-URI. The field is case-sensitive and includes OPTIONS, GET, HEAD, POST, PUT, DELETE, and TRACE. The GET and HEAD methods must be supported by all general-purpose web servers; implementing the other methods is optional. The GET method retrieves the information identified by the Request-URI. The HEAD method also retrieves the information identified by the Request-URI, except that the response carries no message body. The POST method asks the server to accept the entity enclosed in the request; it can be used to submit forms and to send messages to newsgroups, BBSs, mailing lists, and databases.

SP represents a space. Request-URI follows the URI format; when it is an asterisk (*), the request applies not to a particular resource but to the server itself. HTTP-Version represents the supported HTTP version, for example HTTP/1.1. CRLF represents a carriage return followed by a line feed. The request header fields allow the client to pass additional information about the request, or about the client itself, to the server. The request header fields include Accept, Accept-Charset, Accept-Encoding, Accept-Language, Authorization, From, Host, If-Modified-Since, If-Match, If-None-Match, If-Range, If-Unmodified-Since, Max-Forwards, Proxy-Authorization, Range, Referer, and User-Agent. Extensions to the request header fields require support from both communicating parties; an unsupported request header field is normally treated as an entity header field.
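The request-line format (method, Request-URI, and HTTP version separated by single spaces and terminated by CRLF) can be checked with a few lines of Python. This is a sketch only; the function name parse_request_line and the path /index.html are hypothetical.

```python
def parse_request_line(line):
    """Split 'Method SP Request-URI SP HTTP-Version CRLF' into its three parts."""
    if not line.endswith("\r\n"):
        raise ValueError("request line must end with CRLF")
    parts = line[:-2].split(" ")
    if len(parts) != 3:
        raise ValueError("expected exactly three space-separated parts")
    method, uri, version = parts
    if not version.startswith("HTTP/"):
        raise ValueError("bad HTTP-Version: %r" % version)
    # The method is case-sensitive, so it is returned unchanged.
    return method, uri, version


print(parse_request_line("GET /index.html HTTP/1.1\r\n"))
# An asterisk addresses the server itself rather than a resource.
print(parse_request_line("OPTIONS * HTTP/1.1\r\n"))
```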

A typical request message:

GET http://download.google.com/somedata.exe
Host: download.google.com
Accept: */*
Pragma: no-cache
Cache-Control: no-cache
Referer: http://download.google.com/
User-Agent: Mozilla/4.04en
Range: bytes=554554-

The first line of the example indicates that the HTTP client (a browser or a download program, for instance) retrieves the file at the specified URL with the GET method. Host, Accept, Referer, User-Agent, and Range are request header fields; Pragma and Cache-Control are general header fields.
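A Python crawler could build the same request with the standard library's urllib.request module. The sketch below only constructs and inspects the request; nothing is sent over the network. Note that urllib normalizes the capitalization of header names it stores (for example "User-agent"), and it adds the Host header itself when the request is actually sent.

```python
from urllib.request import Request

req = Request(
    "http://download.google.com/somedata.exe",
    headers={
        "Accept": "*/*",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Referer": "http://download.google.com/",
        "User-Agent": "Mozilla/4.04en",
        "Range": "bytes=554554-",
    },
    method="GET",
)

# Inspect what would be sent; header names are stored capitalized by urllib.
print(req.get_method(), req.full_url)
for name, value in req.header_items():
    print(name + ": " + value)

# urllib.request.urlopen(req) would perform the request; it is left out here
# so that the sketch does not touch the network.
```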

Host header field

The Host header field specifies the Internet host and port number of the requested resource, and must represent the location of the origin server or gateway given by the requested URL. An HTTP/1.1 request must contain a Host header field; otherwise the server returns a 400 status code.

Referer header field

The Referer header field lets the client specify the address of the resource from which the Request-URI was obtained. This allows the server to generate lists of back-links for logging, cache optimization, and so on. It also allows obsolete or mistyped links to be traced for maintenance. If the Request-URI was obtained from a source that has no URI of its own, Referer must not be sent. If a partial URI is given, it should be interpreted relative to the Request-URI.

Range header field

The Range header field can request one or more sub-ranges of an entity. For example:
The first 500 bytes: bytes=0-499
The second 500 bytes: bytes=500-999
The last 500 bytes: bytes=-500
Everything from byte offset 500 onward: bytes=500-
The first and last bytes only: bytes=0-0,-1
Several ranges at once: bytes=500-600,601-999

However, the server can ignore this request header. If an unconditional GET contains the Range request header and the server honors it, the response is returned with status code 206 (Partial Content) instead of 200 (OK).
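The range forms listed above can be resolved to concrete byte offsets once the entity length is known. The following Python sketch does that; resolve_ranges is a hypothetical helper, the 1234-byte length is an arbitrary example value, and ranges that cannot be satisfied are silently dropped.

```python
def resolve_ranges(header, length):
    """Turn a 'bytes=...' Range value into inclusive (first, last) offsets."""
    unit, _, spec = header.partition("=")
    if unit.strip().lower() != "bytes":
        raise ValueError("only the bytes unit is handled in this sketch")
    result = []
    for item in spec.split(","):
        first, _, last = item.strip().partition("-")
        if first == "":
            # Suffix form "-N": the last N bytes of the entity.
            count = int(last)
            if count > 0 and length > 0:
                result.append((max(length - count, 0), length - 1))
        else:
            start = int(first)
            # Open-ended form "N-": everything from offset N to the end.
            end = int(last) if last else length - 1
            if start < length and start <= end:
                result.append((start, min(end, length - 1)))
    return result


for value in ("bytes=0-499", "bytes=-500", "bytes=500-", "bytes=0-0,-1"):
    print(value, resolve_ranges(value, 1234))
```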

User-Agent header field

The User-Agent header field contains information about the user agent (the client software) that made the request.

Response message

The first line of the response message has the following format:

HTTP-Version SP Status-Code SP Reason-Phrase CRLF

HTTP-Version represents the supported HTTP version, for example HTTP/1.1. Status-Code is a three-digit result code. Reason-Phrase gives a short textual description of the Status-Code. The Status-Code is intended for machines, while the Reason-Phrase is intended to help human readers. The first digit of the Status-Code defines the class of the response; the last two digits play no role in classification. The first digit can take five values:

1xx: Informational. The request was received and processing continues

2xx: Success. The action was successfully received, understood, and accepted

3xx: Redirection. Further action must be taken in order to complete the request

4xx: Client error. The request contains bad syntax or cannot be fulfilled

5xx: Server error. The server failed to fulfill an apparently valid request
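The status-line format and the five classes can be sketched in Python as below. The names parse_status_line and status_class are made up for this example, and the class labels are short paraphrases of the list above.

```python
STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}


def parse_status_line(line):
    """Split 'HTTP-Version SP Status-Code SP Reason-Phrase CRLF'."""
    # The reason phrase may itself contain spaces, so split at most twice.
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    if len(code) != 3 or not code.isdigit():
        raise ValueError("Status-Code must be three digits: %r" % code)
    return version, int(code), reason


def status_class(code):
    """Classify a status code by its first digit."""
    return STATUS_CLASSES.get(str(code)[0], "unknown")


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found\r\n")
print(version, code, reason, "->", status_class(code))
```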

The response header fields allow the server to pass additional information that cannot be placed in the status line. They mainly describe the server and give further information about the resource identified by the Request-URI. The response header fields include Age, Location, Proxy-Authenticate, Public, Retry-After, Server, Vary, Warning, and WWW-Authenticate. Extensions to the response header fields require support from both communicating parties; an unsupported response header field is normally treated as an entity header field.

Typical response message:

HTTP/1.0 200 OK
Date: Mon, 31 Dec 2001 04:25:57 GMT
Server: Apache/1.3.14 (Unix)
Content-Type: text/html
Last-Modified: Tue, 17 Apr 2001 06:46:28 GMT
ETag: "a030f020ac7c01:1e9f"
Content-Length: 39725426
Content-Range: bytes 554554-40279979/40279980

The first line of the example indicates that the HTTP server is responding to a GET request. Server is a response header field, Date is a general header field, and Content-Type, Last-Modified, ETag, Content-Length, and Content-Range are entity header fields.

Location response header

The Location response header is used to redirect the recipient to a new URI.

Server response header

The Server response header contains information about the software that the origin server used to handle the request. The field can contain multiple product tokens and comments; product tokens are generally listed in order of importance.

Entity

Both request and response messages can contain an entity, which generally consists of entity header fields and an entity body. The entity header fields carry meta-information about the entity and include Allow, Content-Base, Content-Encoding, Content-Language, Content-Length, Content-Location, Content-MD5, Content-Range, Content-Type, ETag, Expires, Last-Modified, and extension-header. The extension-header mechanism allows clients to define new entity headers, but the recipient may not recognize them. The entity body is a byte stream whose encoding is described by Content-Encoding and Content-Type, and whose length is given by Content-Length or Content-Range.

Content-Type entity header

The Content-Type entity header indicates the media type of the entity sent to the receiver. For the HEAD method, it indicates the media type that would have been sent had the request been a GET.

Content-Range entity header

The Content-Range entity header specifies where a partial entity body belongs within the full entity, and also indicates the length of the full entity. When the server returns a partial response to the client, it must describe both the range covered by the response and the length of the entire entity. General format:

Content-Range: bytes-unit SP first-byte-pos-last-byte-pos/entity-length

For example, the first 500 bytes of a 1234-byte entity are transferred as Content-Range: bytes 0-499/1234. If an HTTP message contains this header (for example, a response to a range request or to a set of overlapping range requests), Content-Range gives the range being transferred and Content-Length gives the number of bytes actually transferred.
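A crawler that resumes downloads needs to read this header back. The Python sketch below parses a Content-Range value and derives the number of bytes carried by the response; parse_content_range is a hypothetical helper, and it accepts only the "first-last/length" form (with "*" allowed for an unknown length), not every variant the specification permits.

```python
import re

CONTENT_RANGE = re.compile(r"bytes\s+(\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value):
    """Return (first, last, total) from a Content-Range value; total may be None."""
    match = CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError("unsupported Content-Range value: %r" % value)
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    if first > last or (total is not None and last >= total):
        raise ValueError("inconsistent Content-Range value: %r" % value)
    return first, last, total


first, last, total = parse_content_range("bytes 554554-40279979/40279980")
# Both positions are inclusive, so the transferred size is last - first + 1,
# which should equal the Content-Length of the same response.
print(last - first + 1)
```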

Last-Modified entity header
Response headers and their descriptions
Allow lists the request methods the server supports (such as GET, POST, and so on).
Content-Encoding gives the encoding method of the document. The content type specified by the Content-Type header is obtained only after decoding. Compressing documents with gzip can significantly reduce the download time of HTML documents. Java's GZIPOutputStream makes gzip compression easy, but only Netscape on Unix and IE 4 and IE 5 on Windows support it. The servlet should therefore check whether the browser supports gzip by looking at the Accept-Encoding header (that is, request.getHeader("Accept-Encoding")), returning a gzip-compressed HTML page to browsers that support gzip and a normal page to other browsers.
Content-Length gives the length of the content. This value is required only when the browser uses a persistent HTTP connection. To take advantage of persistent connections, you can write the output document to a ByteArrayOutputStream, check its size when it is complete, put that value into the Content-Length header, and finally send the content with byteArrayStream.writeTo(response.getOutputStream()).
Content-Type indicates the MIME type of the document that follows. Servlets default to text/plain, but usually need to specify text/html explicitly. Because Content-Type has to be set so often, HttpServletResponse provides a dedicated method, setContentType.
Date gives the current GMT time. You can use setDateHeader to set this header and avoid the trouble of converting the time format.
Expires gives the time at which the document should be considered expired and therefore no longer cached.
Last-Modified gives the time the document was last changed. The client can supply a date through the If-Modified-Since request header; such a request is treated as a conditional GET, and the document is returned only if it was modified after the specified time. Otherwise a 304 (Not Modified) status is returned. Last-Modified can also be set with the setDateHeader method.
Location indicates where the client should go to fetch the document. Location is usually not set directly but through the sendRedirect method of HttpServletResponse, which also sets the status code to 302.
Refresh indicates after how many seconds the browser should refresh the document. Besides refreshing the current document, you can also make the browser load a specified page with setHeader("Refresh", "5; URL=http://host/path").
Note that this is usually done by placing <meta http-equiv="Refresh" content="5; URL=http://host/path"> in the HEAD section of the HTML page, because automatic refresh or redirection matters to HTML authors who cannot use CGI or servlets. For servlets, however, setting the Refresh header directly is more convenient.

Note that Refresh means "refresh this page or load the specified page after n seconds", not "refresh this page or load the specified page every n seconds". Continuous refreshing therefore requires a Refresh header to be sent each time, and sending a 204 status code stops the browser from refreshing further, whether the refresh comes from the Refresh header or from the <meta http-equiv="Refresh" ...> tag.

Note that the Refresh header is not part of the formal HTTP 1.1 specification but an extension; both Netscape and IE support it.
Server gives the server name. Servlets typically do not set this value; the web server sets it itself.
Set-Cookie sets a cookie associated with the page. Servlets should not use response.setHeader("Set-Cookie", ...), but should use the dedicated addCookie method provided by HttpServletResponse.
WWW-Authenticate states what type of authorization information the client should provide in the Authorization header. This header is required in a response that carries a 401 (Unauthorized) status line. For example, response.setHeader("WWW-Authenticate", "BASIC realm=\"executives\"").
Note that servlets typically do not handle this themselves, but leave access control for password-protected pages to a dedicated web server mechanism (for example, .htaccess).
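From the crawler's side, answering a Basic challenge means sending an Authorization header whose value is "Basic " followed by the Base64 encoding of "username:password". The Python sketch below shows both halves; the helper names are made up for this example, the credentials are sample values, and the challenge parser handles only a single scheme with a realm parameter.

```python
import base64


def basic_authorization(username, password):
    """Build the value of an Authorization header for the Basic scheme."""
    raw = (username + ":" + password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def parse_challenge(value):
    """Extract (scheme, realm) from a simple WWW-Authenticate value."""
    scheme, _, params = value.partition(" ")
    realm = None
    for param in params.split(","):
        key, _, val = param.strip().partition("=")
        if key.lower() == "realm":
            realm = val.strip('"')
    return scheme, realm


print(parse_challenge('BASIC realm="executives"'))
# Sample credentials, for illustration only.
print(basic_authorization("Aladdin", "open sesame"))
```

Basic credentials are only encoded, not encrypted, so this scheme is normally used over HTTPS.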

The header() function in PHP

1. Page not found

header('HTTP/1.1 404 Not Found');

2. Use this header to fix the 404 header generated by URL rewriting

header('HTTP/1.1 200 OK');

3. Access forbidden

header('HTTP/1.1 403 Forbidden');
  
4. The page has moved permanently. Use this for all redirections, because search engines then know what is going on and can easily update their URLs

header('HTTP/1.1 301 Moved Permanently');

5. Server error

header('HTTP/1.1 500 Internal Server Error');
  
6. Redirect to a new location

header('Location: http://www.example.org/');

7. Redirect after a delay

header('Refresh: 10; url=http://www.example.org/');

echo 'You will be redirected in 10 seconds';
  
8. Send a file for download:

header('Content-Transfer-Encoding: binary');

readfile('example.zip');

9. The delayed redirect in item 7 can also be implemented with HTML syntax (a meta refresh tag) instead of a header
  
10. Disable caching of the current document:

header('Cache-Control: no-cache, no-store, max-age=0, must-revalidate');

header('Expires: Mon, 26 Jul 1997 05:00:00 GMT'); // a date in the past

header('Pragma: no-cache');
  
11. Display the login dialog box, which can be used for HTTP authentication

header('HTTP/1.1 401 Unauthorized');

header('WWW-Authenticate: Basic realm="Top Secret"');

echo 'Text that will be displayed if the user hits cancel or ';

echo 'enters wrong login data';
  
12. Set the content type:

header('Content-Type: text/html; charset=iso-8859-1');

header('Content-Type: text/html; charset=utf-8');

header('Content-Type: text/plain'); // plain text file

header('Content-Type: image/jpeg'); // JPG picture

header('Content-Type: application/zip'); // ZIP file

header('Content-Type: application/pdf'); // PDF file

header('Content-Type: audio/mpeg'); // audio MPEG (MP3, ...) file

header('Content-Type: application/x-shockwave-flash'); // Flash animation
