Python crawler: HTTP request headers explained

Source: Internet
Author: User

This article is based on RFC 2616 (the HTTP/1.1 specification). References:

http://www.w3.org/Protocols/rfc2068/rfc2068

http://www.w3.org/Protocols/rfc2616/rfc2616

http://www.ietf.org/rfc/rfc3229.txt

HTTP messages are either request messages sent from client to server or response messages sent from server to client. Both kinds consist of a start line, zero or more header fields, a blank line marking the end of the header fields, and an optional message body. HTTP header fields fall into four groups: general headers, request headers, response headers, and entity headers. Each header field consists of three parts: a field name, a colon (:), and a field value. Field names are case-insensitive. Any amount of whitespace may precede the field value, and a header field may be extended over multiple lines by starting each continuation line with at least one space or tab.
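The field syntax described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete parser: the function names `parse_header_block` and `get_header` are made up for this example, and the input is assumed to use CRLF line endings.

```python
def parse_header_block(raw):
    """Split a raw header block into a list of (name, value) pairs.

    Continuation lines (lines starting with a space or tab) are folded
    into the value of the preceding field.
    """
    fields = []
    for line in raw.split("\r\n"):
        if not line:
            break  # a blank line marks the end of the header fields
        if line[0] in " \t" and fields:
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, _, value = line.partition(":")
        # Whitespace around the field value is not significant.
        fields.append((name.strip(), value.strip()))
    return fields


def get_header(fields, name):
    """Return all values for a field name, compared case-insensitively."""
    wanted = name.lower()
    return [value for key, value in fields if key.lower() == wanted]


raw = "Host: example.org\r\nX-Long: first\r\n  second\r\n\r\nbody"
fields = parse_header_block(raw)
print(fields)
print(get_header(fields, "host"))
```

A list of pairs is used instead of a dictionary because the same field name may legitimately appear more than once in a message.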

General Header Fields (general headers)

The general header fields apply to both request and response messages and provide the most basic information about the message. They include:

Connection lets clients and servers specify options for the request/response connection

Date gives the date and time at which the message was created

MIME-Version gives the MIME version used by the sender

Trailer lists the header fields present in the trailer of a message that uses chunked transfer encoding

Transfer-Encoding tells the receiver what encoding was applied to the message to ensure its reliable transfer

Upgrade gives a new version or protocol that the sender would like to "upgrade" to

Via shows the intermediaries (proxies, gateways) through which the message has passed

Extending the general header fields requires both parties to the communication to support the extension; an unrecognized general header field is generally treated as an entity header field. The following briefly introduces several general header fields commonly used in UPnP messages.


Cache-Control Header Field

Cache-Control specifies the caching behavior that requests and responses must follow. Setting Cache-Control in a request or response message does not change the caching behavior applied while processing any other message. The cache directives available in a request are no-cache, no-store, max-age, max-stale, min-fresh, and only-if-cached; those available in a response are public, private, no-cache, no-store, no-transform, must-revalidate, proxy-revalidate, and max-age. The directives have the following meanings:

public indicates that the response may be cached by any cache.

private indicates that all or part of the response is intended for a single user and must not be stored by a shared cache. This lets the server mark parts of a response as meant for one user only, so that they are not used to answer requests from other users.

no-cache indicates that a cached copy must not be used to satisfy a request without first revalidating it with the origin server.

no-store prevents the inadvertent release or retention of sensitive information. When sent in a request, neither the request nor its response may be stored by a cache.

max-age indicates that the client will accept a response whose age is no greater than the specified time (in seconds).

min-fresh indicates that the client wants a response that will still be fresh for at least the specified number of seconds.

max-stale indicates that the client will accept a response that has passed its expiration time. If a value is given, the client accepts a response that has exceeded its expiration time by no more than that many seconds.
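As an illustration of how a crawler might read these directives, the sketch below splits a Cache-Control value into a dictionary. The function name `parse_cache_control` is an assumption made for this example, and the parser is deliberately simple: it does not handle quoted arguments that themselves contain commas.

```python
def parse_cache_control(value):
    """Turn a Cache-Control field value into a {directive: argument} dict.

    Directives without an argument map to True; numeric arguments
    (such as max-age) are converted to int.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        name = name.strip().lower()  # directive names are case-insensitive
        arg = arg.strip().strip('"')
        if sep and arg.isdigit():
            directives[name] = int(arg)
        elif sep:
            directives[name] = arg
        else:
            directives[name] = True
    return directives


parsed = parse_cache_control("no-cache, no-store, max-age=0, must-revalidate")
print(parsed)
# A client-side cache could refuse to store anything marked no-store:
print("may store:", "no-store" not in parsed)
```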


Date Header Field

The Date header field gives the time at which the message was sent, in the format defined by RFC 822. For example: Date: Mon, 31 Dec 2001 04:25:57 GMT. The time in Date is expressed in Greenwich Mean Time (GMT); converting it to local time requires knowing the user's time zone.
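In Python, the standard library's `email.utils` module can parse and produce this date format. The sketch below parses the example value and converts it to a local time; the UTC+8 zone is only an assumed example of "the user's time zone".

```python
from datetime import timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Parse the Date value into a timezone-aware datetime.
sent = parsedate_to_datetime("Mon, 31 Dec 2001 04:25:57 GMT")

# Convert to a local zone; UTC+8 is an arbitrary choice for illustration.
utc_plus_8 = timezone(timedelta(hours=8))
local = sent.astimezone(utc_plus_8)
print(local.isoformat())  # 2001-12-31T12:25:57+08:00

# Format a datetime back into the form used in HTTP headers.
print(format_datetime(sent, usegmt=True))  # Mon, 31 Dec 2001 04:25:57 GMT
```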

Pragma Header Field

The Pragma header field carries implementation-specific directives, the most common being Pragma: no-cache. In HTTP/1.1 it has the same meaning as Cache-Control: no-cache.

Request Message

The first line of a request message has the following format:

Method SP Request-URI SP HTTP-Version CRLF

Method indicates the method to be performed on the Request-URI. This field is case-sensitive and includes OPTIONS, GET, HEAD, POST, PUT, DELETE, and TRACE. The GET and HEAD methods should be supported by all general-purpose web servers; implementation of all other methods is optional. The GET method retrieves the information identified by the Request-URI. The HEAD method also retrieves the information identified by the Request-URI, but the response does not include a message body. The POST method asks the server to accept the entity enclosed in the request; it can be used to submit forms and to send messages to newsgroups, bulletin boards, mailing lists, and databases.

SP stands for a space. Request-URI follows the URI format; when it is an asterisk (*), the request applies to the server itself rather than to a particular resource. HTTP-Version is the HTTP version in use, for example HTTP/1.1. CRLF stands for a carriage return followed by a line feed. Request header fields let the client pass additional information about the request, or about the client itself, to the server. The request header fields include Accept, Accept-Charset, Accept-Encoding, Accept-Language, Authorization, From, Host, If-Modified-Since, If-Match, If-None-Match, If-Range, If-Unmodified-Since, Max-Forwards, Proxy-Authorization, Range, Referer, and User-Agent. Extending the request header fields requires support from both parties; an unrecognized request header field is generally treated as an entity header field.
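The request-line format can be checked with a short Python sketch. The names `parse_request_line` and `STANDARD_METHODS` are assumptions for this example; the method set simply mirrors the methods listed in this article.

```python
STANDARD_METHODS = {"OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE", "TRACE"}


def parse_request_line(line):
    """Split 'Method SP Request-URI SP HTTP-Version CRLF' into its parts."""
    if not line.endswith("\r\n"):
        raise ValueError("request line must end with CRLF")
    parts = line[:-2].split(" ")
    if len(parts) != 3:
        raise ValueError("expected: Method SP Request-URI SP HTTP-Version")
    method, uri, version = parts
    if not version.startswith("HTTP/"):
        raise ValueError("unexpected HTTP-Version: %r" % version)
    return method, uri, version


print(parse_request_line("GET /index.html HTTP/1.1\r\n"))
print(parse_request_line("OPTIONS * HTTP/1.1\r\n"))

# Methods are case-sensitive, so "get" is not the standard GET method.
method, _, _ = parse_request_line("get /index.html HTTP/1.1\r\n")
print(method in STANDARD_METHODS)
```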

A typical request message:

GET http://download.google.com/somedata.exe
Host: download.google.com
Accept: */*
Pragma: no-cache
Cache-Control: no-cache
Referer: http://download.google.com/
User-Agent: Mozilla/4.04 [en] (Win95; I; Nav)
Range: bytes=554554-

The first line of the example above shows an HTTP client (possibly a browser or a download tool) using the GET method to fetch the file at the specified URL. Host, Accept, Referer, User-Agent, and Range are request header fields; Pragma and Cache-Control are general header fields.
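A crawler written in Python could assemble a similar request with the standard library's `urllib.request.Request`. The sketch below only builds the request object and does not send it; the helper name `build_range_request` and the User-Agent string are hypothetical.

```python
from urllib.request import Request


def build_range_request(url, offset):
    """Build (but do not send) a GET request that resumes at `offset`."""
    headers = {
        "Accept": "*/*",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "example-crawler/0.1",  # hypothetical client name
        "Range": "bytes=%d-" % offset,
    }
    return Request(url, headers=headers, method="GET")


req = build_range_request("http://download.google.com/somedata.exe", 554554)
print(req.get_method(), req.full_url)
for name, value in sorted(req.header_items()):
    print("%s: %s" % (name, value))
# urllib.request.urlopen(req) would send it; that step is left out here.
```

Note that `Request` normalizes stored header names with `str.capitalize()`, so the field is looked up as "Cache-control", and that the Host header is not stored on the object: urllib adds it from `req.host` when the request is actually sent.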

Host Header Field

The Host header field specifies the Internet host and port number of the requested resource, and must identify the location of the origin server or gateway given by the original URL. An HTTP/1.1 request must contain a Host header field; otherwise the server responds with a 400 status code.
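As a rough illustration, the Host value can be derived from a URL with `urllib.parse.urlsplit`. The helper name `host_header_for` is made up for this example, and leaving out the default port is a common convention assumed here rather than something this article specifies. IPv6 literal hosts are not handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header_for(url):
    """Return a Host field value for `url`, omitting the default port."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError("URL has no host: %r" % url)
    if parts.port is None or parts.port == DEFAULT_PORTS.get(parts.scheme):
        return parts.hostname
    return "%s:%d" % (parts.hostname, parts.port)


print(host_header_for("http://download.google.com/somedata.exe"))
print(host_header_for("http://example.org:8080/path"))
```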

Referer Header Field

The Referer header field lets the client specify the address of the resource from which the Request-URI was obtained. This allows the server to generate lists of back-links for logging, cache optimization, and similar purposes. It also allows obsolete or mistyped links to be traced for maintenance. If the Request-URI was obtained from a source that has no URI of its own, Referer must not be sent. If a partial URI is given, it should be interpreted relative to the Request-URI.

Range Header Field

The Range header field requests one or more sub-ranges of an entity. For example:
The first 500 bytes: bytes=0-499
The second 500 bytes: bytes=500-999
The last 500 bytes: bytes=-500
Everything from byte 500 onward: bytes=500-
The first and last bytes only: bytes=0-0,-1
Several ranges at once: bytes=500-600,601-999

A server may ignore this request header. If an unconditional GET includes a Range header and the server honors it, the response is returned with status code 206 (Partial Content) instead of 200 (OK).
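To make the range forms concrete, the sketch below resolves a Range value into absolute (first, last) byte positions for an entity of known length. The function name `resolve_byte_ranges` is an assumption for this example, and ranges that cannot be satisfied are simply dropped, which is a simplification of what a real server would do.

```python
def resolve_byte_ranges(header_value, entity_length):
    """Resolve 'bytes=...' into a list of inclusive (first, last) positions."""
    unit, _, spec = header_value.partition("=")
    if unit.strip() != "bytes":
        raise ValueError("only the 'bytes' range unit is handled here")
    resolved = []
    for item in spec.split(","):
        first, _, last = item.strip().partition("-")
        if first == "":
            # Suffix form, e.g. '-500': the last N bytes of the entity.
            start = max(entity_length - int(last), 0)
            end = entity_length - 1
        else:
            start = int(first)
            # Open-ended form, e.g. '500-': everything up to the last byte.
            end = int(last) if last else entity_length - 1
            end = min(end, entity_length - 1)
        if start <= end:
            resolved.append((start, end))
    return resolved


for value in ("bytes=0-499", "bytes=-500", "bytes=500-", "bytes=0-0,-1"):
    print(value, "->", resolve_byte_ranges(value, 1234))
```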

User-Agent Header Field

The User-Agent header field contains information about the user agent (the client software) that originated the request.

Response Message

The first line of a response message has the following format:

HTTP-Version SP Status-Code SP Reason-Phrase CRLF

HTTP-Version is the supported HTTP version, for example HTTP/1.1. Status-Code is a three-digit result code. Reason-Phrase gives a short textual description of the Status-Code. Status-Code is intended mainly for automated processing, while Reason-Phrase is intended to help human users. The first digit of the Status-Code defines the class of the response; the last two digits have no categorization role. The first digit can take five values:

1xx: Informational. The request was received and processing continues.

2xx: Success. The action was successfully received, understood, and accepted.

3xx: Redirection. Further action must be taken to complete the request.

4xx: Client error. The request contains bad syntax or cannot be fulfilled.

5xx: Server error. The server failed to fulfill an apparently valid request.
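The status-line format and the five classes above can be sketched as follows. The names `parse_status_line`, `status_class`, and `STATUS_CLASSES` are assumptions for this example; `http.HTTPStatus` is part of the Python standard library and supplies the standard reason phrases.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def parse_status_line(line):
    """Split 'HTTP-Version SP Status-Code SP Reason-Phrase CRLF'."""
    parts = line.rstrip("\r\n").split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError("malformed status line: %r" % line)
    version, code = parts[0], parts[1]
    if len(code) != 3 or not code.isdigit():
        raise ValueError("Status-Code must be three digits: %r" % code)
    reason = parts[2] if len(parts) == 3 else ""
    return version, int(code), reason


def status_class(code):
    """Classify a status code by its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found\r\n")
print(version, code, reason, "->", status_class(code))
print(HTTPStatus(206).phrase)  # Partial Content
```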

Response header fields let the server pass additional information that cannot be placed in the status line. These fields mainly describe the server and give further information about access to the Request-URI. The response header fields include Age, Location, Proxy-Authenticate, Public, Retry-After, Server, Vary, Warning, and WWW-Authenticate. Extending the response header fields requires support from both parties; an unrecognized response header field is generally treated as an entity header field.

A typical response message:

HTTP/1.0 200 OK
Date: Mon, 31 Dec 2001 04:25:57 GMT
Server: Apache/1.3.14 (Unix)
Content-Type: text/html
Last-Modified: Tue, 17 Apr 2001 06:46:28 GMT
ETag: "a030f020ac7c01:1e9f"
Content-Length: 39725426
Content-Range: bytes 554554-40279979/40279980

The first line of the example above is an HTTP server's response to a GET request. Server is a response header field, Date is a general header field, and Content-Type, Last-Modified, ETag, Content-Length, and Content-Range are entity header fields.

Location Response Header

The Location response header redirects the recipient to a new URI.

Server Response Header

The Server response header contains information about the software the origin server uses to handle the request. The field can contain multiple product tokens and comments; product tokens are generally listed in order of significance.

Entity

Both request and response messages can carry an entity, which generally consists of entity header fields and an entity body. The entity header fields carry metainformation about the entity and include Allow, Content-Base, Content-Encoding, Content-Language, Content-Length, Content-Location, Content-MD5, Content-Range, Content-Type, ETag, Expires, Last-Modified, and extension-header. extension-header lets clients define new entity header fields, though the recipient may not recognize them. The entity body is a stream of bytes encoded as described by Content-Encoding and Content-Type, whose length is given by Content-Length or Content-Range.

Content-Type Entity Header

The Content-Type entity header indicates the media type of the entity sent to the receiver or, for the HEAD method, the media type that would have been sent had the request been a GET.

Content-Range Entity Header

The Content-Range entity header specifies where in the complete entity a partial entity belongs, and also gives the length of the complete entity. When a server returns a partial response to a client, it must describe both the range covered by the response and the length of the complete entity. The general format is:

Content-Range: bytes-unit SP first-byte-pos "-" last-byte-pos "/" entity-length

For example, a transfer of the first 500 bytes has the form: Content-Range: bytes 0-499/1234. If an HTTP message contains this field (for example, a response to a range request, or to a request for a set of overlapping ranges), Content-Range gives the range transferred and Content-Length gives the number of bytes actually transferred.
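The relationship between Content-Range and Content-Length can be verified with a small sketch. The function name `parse_content_range` is an assumption for this example; it accepts only the bytes unit and also allows `*` for an unknown entity length.

```python
import re

_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value):
    """Return (first, last, entity_length, transferred_bytes).

    entity_length is None when the server sends '*' (length unknown).
    """
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError("unsupported Content-Range: %r" % value)
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        raise ValueError("last-byte-pos precedes first-byte-pos")
    total = None if match.group(3) == "*" else int(match.group(3))
    # Positions are inclusive, so the number of bytes is last - first + 1.
    return first, last, total, last - first + 1


print(parse_content_range("bytes 0-499/1234"))
print(parse_content_range("bytes 554554-40279979/40279980"))
```

For the second value the transferred byte count is 39725426, which matches the Content-Length in the typical response message shown earlier.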

Last-Modified Entity Header

Response header: Description
Allow: The request methods supported by the server (such as GET and POST).
Content-Encoding: The encoding method of the document. The content type given by the Content-Type header is obtained only after decoding. Compressing documents with gzip can significantly reduce the download time of HTML documents. Java's GZIPOutputStream makes gzip compression easy, but only Netscape on Unix and IE 4 and IE 5 on Windows support it. A servlet should therefore check the Accept-Encoding header (that is, request.getHeader("Accept-Encoding")) to see whether the browser supports gzip, return the gzip-compressed HTML page to browsers that do, and return the normal page to other browsers.
Content-Length: The length of the content. This value is needed only when the browser uses a persistent (keep-alive) HTTP connection. To take advantage of persistent connections, write the output document to a ByteArrayOutputStream, check its size when done, put that value in the Content-Length header, and finally send the content with byteArrayStream.writeTo(response.getOutputStream()).
Content-Type: The MIME type of the document that follows. Servlets default to text/plain, but it usually needs to be set explicitly to text/html. Because Content-Type is set so often, HttpServletResponse provides a dedicated method, setContentType.
Date: The current GMT time. You can set this header with setDateHeader to avoid the trouble of converting the time format.
Expires: The time after which the document should be considered expired and no longer cached.
Last-Modified: The time the document was last modified. The client can supply a date in the If-Modified-Since request header; such a request is treated as a conditional GET, and the document is returned only if it was modified after the specified time. Otherwise a 304 (Not Modified) status is returned. Last-Modified can also be set with the setDateHeader method.
Location: Where the client should go to retrieve the document. Location is usually not set directly but through HttpServletResponse's sendRedirect method, which also sets the status code to 302.
Refresh: The number of seconds after which the browser should refresh the document. Besides refreshing the current document, you can make the browser load a specified page with setHeader("Refresh", "5; URL=http://host/path").
Note: this functionality is usually implemented with a <meta http-equiv="Refresh"> tag in the HEAD section of the HTML page.
Note: Refresh means "refresh this page or visit the specified page after N seconds", not "refresh this page or visit the specified page every N seconds". Continuous refreshing therefore requires sending a Refresh header each time, and sending a 204 status code stops the browser from refreshing further, whether the refresh was set up with the Refresh header or with the <meta http-equiv="Refresh"> tag.
Note: the Refresh header is not part of the official HTTP/1.1 specification but an extension; both Netscape and IE support it.
Server: The server name. Servlets generally do not set this value; the web server sets it itself.
Set-Cookie: Sets a cookie associated with the page. Servlets should not use response.setHeader("Set-Cookie", ...) but the dedicated addCookie method provided by HttpServletResponse.
WWW-Authenticate: The type of authorization information the client should supply in the Authorization header. This header is required in responses with a 401 (Unauthorized) status line. For example, response.setHeader("WWW-Authenticate", "BASIC realm=\"executives\"").
Note: servlets generally do not handle this themselves; access to password-protected pages is left to a dedicated web server mechanism (for example, .htaccess).
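The conditional GET described for Last-Modified can be sketched from the server's point of view in Python. The function name `is_not_modified` is made up for this example, and treating an unparseable date as "modified" is an assumption chosen so that the full document is returned whenever there is doubt.

```python
from email.utils import parsedate_to_datetime


def is_not_modified(last_modified, if_modified_since):
    """Return True when a 304 (Not Modified) response would be appropriate."""
    if not if_modified_since:
        return False  # plain GET: always send the document
    try:
        modified = parsedate_to_datetime(last_modified)
        since = parsedate_to_datetime(if_modified_since)
        return modified <= since
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: fall back to a full response.
        return False


doc_time = "Tue, 17 Apr 2001 06:46:28 GMT"
print(is_not_modified(doc_time, "Mon, 31 Dec 2001 04:25:57 GMT"))  # True
print(is_not_modified(doc_time, "Mon, 01 Jan 2001 00:00:00 GMT"))  # False
```

A crawler can use the same idea in reverse: store the Last-Modified value of each fetched page and send it back in If-Modified-Since on the next visit, so that unchanged pages cost only a 304 response.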

In PHP, the header function can be used as follows:

1. Page not found

header('HTTP/1.1 404 Not Found');

2. Use this header to fix the 404 headers produced by URL rewriting

header('HTTP/1.1 200 OK');
  
3. Access forbidden

header('HTTP/1.1 403 Forbidden');

4. Page moved permanently. Use this for all redirections, because search engines understand it and can easily update their URLs.

header('HTTP/1.1 301 Moved Permanently');
  
5. Server error

header('HTTP/1.1 500 Internal Server Error');

6. Redirect to a new location

header('Location: .example.org/');

7. Redirect after a delay

header('Refresh: 10; url=.example.org/');

echo 'You will be redirected in 10 seconds';
  
8. Send a file for download

header('Content-Transfer-Encoding: binary');

readfile('example.zip');

9. The delayed redirect in item 7 can also be written in HTML syntax, using a <meta http-equiv="refresh"> tag in the page head.
  
10. Disable caching of the current document

header('Cache-Control: no-cache, no-store, max-age=0, must-revalidate');

header('Expires: Mon, 05:00:00 GMT'); // meant to be a date in the past; the day, month, and year are missing here

header('Pragma: no-cache');
  
11. Show a login dialog, which can be used for HTTP authentication

header('HTTP/1.1 401 Unauthorized');

header('WWW-Authenticate: Basic realm="Top Secret"');

echo 'Text that will be displayed if the user hits cancel or ';

echo 'enters wrong login data';
  
12. Set the content type

header('Content-Type: text/html; charset=iso-8859-1');

header('Content-Type: text/html; charset=utf-8');

header('Content-Type: text/plain'); // plain text file

header('Content-Type: image/jpeg'); // JPG picture

header('Content-Type: application/zip'); // ZIP file

header('Content-Type: application/pdf'); // PDF file

header('Content-Type: audio/mpeg'); // audio MPEG (MP3, ...) file

header('Content-Type: application/x-shockwave-flash'); // Flash animation
