Crawler basics: HTTP requests

Source: Internet
Author: User

Baidu Encyclopedia's introduction to web crawlers:

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules.

Tools used when developing crawlers: the Chrome browser, Fiddler, and the Postman plugin.

An introduction to Fiddler: http://kb.cnblogs.com/page/130367/

The most basic knowledge comes first: HTTP requests. (The following material is drawn from: http://www.runoob.com/http/http-intro.html)

Definition:

HTTP is an abbreviation of Hypertext Transfer Protocol, the protocol used to transfer hypertext from a World Wide Web (WWW) server to the local browser. HTTP transmits data (HTML files, image files, query results, and so on) on top of the TCP/IP communication protocol.

Working principle:

HTTP works on a client-server architecture. The browser, acting as the HTTP client, sends its requests by URL to the HTTP server, that is, the web server.

Common web servers include Apache and IIS (Internet Information Services).

The Web server sends a response message to the client, based on the received request. The HTTP default port number is 80, but you can also change to 8080 or another port.
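
As a rough illustration of this request/response exchange, the sketch below starts a throwaway HTTP server from Python's standard library on a non-default port and queries it with a client. The names `HelloHandler` and `fetch_once` and the response text are hypothetical, chosen only for this example; port 0 asks the operating system for any free port.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once(path="/"):
    """Start a server on a free port, send one GET, return (port, status, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: OS picks a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", path)
        response = conn.getresponse()
        status, body = response.status, response.read()
        conn.close()
        return port, status, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    port, status, body = fetch_once()
    print(port, status, body.decode())
```

The client has to name the port explicitly here because the server is not listening on the default port 80.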

Three points to note about HTTP:

    • HTTP is connectionless: connectionless means that each connection handles only one request. After the server has processed the client's request and the client has acknowledged the response, the connection is closed. This saves transmission time. (This is the original behavior; HTTP 1.1 defaults to persistent connections.)
    • HTTP is media independent: any type of data can be sent over HTTP as long as the client and the server both know how to handle the content. The client and the server indicate the content type with the appropriate MIME type.
    • HTTP is stateless: stateless means that the protocol keeps no memory of earlier transactions. If later processing needs earlier information, that information must be sent again, which may increase the amount of data transferred per connection. On the other hand, the server responds faster when it does not need the earlier information.
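
Because the protocol is stateless, any state has to travel with each request. One common way to do that is with cookies. The sketch below uses the standard library's `http.cookies` module to turn a `Set-Cookie` value into the `Cookie` value the client would send back on later requests. The function name and the sample cookie are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a Set-Cookie value into the Cookie header value to send back."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Only name=value pairs are sent back; attributes such as Path stay on the client.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


if __name__ == "__main__":
    # Hypothetical value a server might send with its first response.
    first_response = "session=abc123; Path=/; HttpOnly"
    print("Cookie:", cookie_header_from_set_cookie(first_response))
```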

HTTP message structure:

HTTP is based on a client/server (C/S) architecture. It exchanges information over a reliable connection and is a stateless request/response protocol.

An HTTP "client" is an application (Web browser or any other client) that connects to the server to achieve the purpose of sending one or more HTTP requests to the server.

An HTTP "server" is also an application (typically a Web service, such as an Apache Web server or an IIS server) that receives requests from clients and sends HTTP response data to the client.

HTTP uses Uniform Resource Identifiers (URIs) to identify resources, transfer data, and establish connections.

Once the connection is established, messages are transmitted in a format similar to the one used by Internet mail [RFC5322] and Multipurpose Internet Mail Extensions (MIME) [RFC2045].

Structure of the request message:

The request message that the client sends to the server consists of four parts: a request line, request headers, a blank line, and the request data.
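
The four parts can be sketched by assembling a request by hand and splitting it again. This is a minimal illustration, not a full parser: the helper names `build_request` and `split_request`, the path `/login`, and the form data are hypothetical.

```python
def build_request(method, path, headers, body=b""):
    """Assemble request line, headers, blank line, and request data."""
    request_line = f"{method} {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    head = "\r\n".join([request_line, *header_lines]) + "\r\n\r\n"
    return head.encode("ascii") + body


def split_request(raw):
    """Split a raw request back into (request line, headers dict, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")  # the blank line ends the header part
    lines = head.decode("ascii").split("\r\n")
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return lines[0], headers, body


if __name__ == "__main__":
    raw = build_request(
        "POST",
        "/login",
        {"Host": "example.com", "Content-Length": "9"},
        b"user=demo",
    )
    print(raw.decode("ascii"))
    print(split_request(raw))
```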

HTTP request headers in detail: (Ref.: https://kb.cnblogs.com/page/92320/)

Request header: Description
Cache-Control: Specifies the caching mechanism the request should follow. Request cache directives: no-cache, no-store, max-age, max-stale, min-fresh, only-if-cached.
    • no-cache: a cached response must not be used without first being revalidated with the origin server
    • no-store: neither the request nor the response may be stored by a cache, to prevent inadvertent disclosure of important information
    • max-age: the client accepts a response whose age does not exceed the specified number of seconds (maximum lifetime)
    • max-stale: the client accepts a response that has expired; if a value is given, only a response that expired no more than that many seconds ago is accepted
    • min-fresh: the client accepts a response that will still be fresh for at least the specified number of seconds from now
    • only-if-cached: the client accepts only cached content
Date: The time the message was sent, in the format defined by RFC 822. For example, Date: Mon, 31 Dec 2001 04:25:57 GMT. The time is given in GMT (world standard time); converting it to local time requires knowing the user's time zone.
Pragma: Prevents the page from being cached. In HTTP 1.1 it has the same effect as Cache-Control: no-cache; HTTP 1.0 did not implement Cache-Control. Pragma has only one usage: Pragma: no-cache
Host: Specifies the Internet host and port number of the requested resource, usually extracted from the HTTP URL. The port must be given if it is not the default port 80. In HTTP 1.1 a 400 error is returned if Host is not specified.
Referer: Gives the server context by telling it which page the request was linked from.
Range: Requests only part of the entity; the server may ignore the request. First 500 bytes: bytes=0-499. Second 500 bytes: bytes=500-999. Last 500 bytes: bytes=-500. Everything from byte 500 onward: bytes=500-. First and last byte: bytes=0-0,-1. Several ranges at once: bytes=500-600,601-999
User-Agent: Identifies the client that sends the request. When a browser sends the request, this is essentially the browser's information.
Accept: The content types the client can accept, for example: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8. For more MIME types, see: http://www.w3school.com.cn/media/media_mimeref.asp
Accept-Charset: The character encodings the browser can accept. The most common ones in China are UTF-8 and GBK. More information: https://zh.wikipedia.org/wiki/%E5%AD%97%E7%AC%A6%E7%BC%96%E7%A0%81
Accept-Encoding: The compression encodings the browser supports for content returned by the web server. Common values: compress, gzip
Accept-Language: The languages the browser can accept. Common values: en, zh
Accept-Ranges: Indicates that one or more sub-ranges of a web page entity can be requested. Example: Accept-Ranges: bytes
Authorization: Authorization credentials for HTTP authentication
Connection: Whether to keep a persistent connection. close: do not keep a persistent connection; keep-alive: keep a persistent connection. (HTTP 1.1 defaults to persistent connections)
Cookie: Information stored on the client. When an HTTP request is sent, all cookies for that domain name are sent to the server with it.
Content-Length: The length of the request body
Content-Type: The MIME type of the request entity. For more MIME types, see: http://www.w3school.com.cn/media/media_mimeref.asp
Expect: The specific server behavior the client requires. I am not too clear on this one; if you understand it, please help me explain.
From: The email address of the user who made the request
If-Match: The request is valid only if the request content matches the entity
If-Modified-Since: The request succeeds if the requested part was modified after the specified time; if it was not modified, a 304 code is returned
If-None-Match: A 304 code is returned if the content has not changed. The parameter is an ETag previously sent by the server, which is compared with the server's current ETag to determine whether the content changed
If-Range: If the entity has not changed, the server sends the part the client is missing; otherwise it sends the entire entity. The parameter is also an ETag
If-Unmodified-Since: The request succeeds only if the entity has not been modified since the specified time
Max-Forwards: Limits the number of times the message may be forwarded through proxies and gateways
Proxy-Authorization: Authorization credentials for connecting to a proxy
TE: The transfer encodings the client is willing to accept; also tells the server that the client accepts trailer header fields
Upgrade: Names a transport protocol for the server to switch to (if supported)
Via: Reports the address and communication protocol of intermediate gateways or proxy servers
Warning: Warning information about the message entity
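
To make the Range examples concrete, here is a minimal sketch that resolves a `bytes=` range value against a resource of known length. The function names are hypothetical, and the sketch deliberately ignores details a real server handles, such as merging overlapping ranges.

```python
def resolve_byte_range(spec, total_length):
    """Resolve one spec such as "0-499", "-500", or "500-" to inclusive (first, last) offsets."""
    first_text, _, last_text = spec.partition("-")
    if first_text == "":
        # Suffix form: the last N bytes of the resource.
        first, last = max(total_length - int(last_text), 0), total_length - 1
    elif last_text == "":
        # Open-ended form: from the given offset to the end.
        first, last = int(first_text), total_length - 1
    else:
        first, last = int(first_text), min(int(last_text), total_length - 1)
    if first > last:
        raise ValueError(f"unsatisfiable range: {spec}")
    return first, last


def parse_range_header(value, total_length):
    """Parse a Range header value such as "bytes=0-0,-1" into a list of offset pairs."""
    unit, _, specs = value.partition("=")
    if unit.strip().lower() != "bytes":
        raise ValueError(f"unsupported range unit: {unit}")
    return [resolve_byte_range(spec.strip(), total_length) for spec in specs.split(",")]


if __name__ == "__main__":
    for header in ("bytes=0-499", "bytes=-500", "bytes=500-", "bytes=0-0,-1"):
        print(header, "->", parse_range_header(header, 1000))
```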

HTTP response headers in detail:

Response header: Description
Cache-Control: Specifies the caching mechanism that caches should follow for this response. Of the directives listed for requests, no-cache, no-store, and max-age also apply to responses; max-stale, min-fresh, and only-if-cached are used only in requests.
    • no-cache: the stored response must be revalidated with the origin server before it is reused
    • no-store: neither the request nor the response may be stored by a cache, to prevent inadvertent disclosure of important information
    • max-age: the maximum age, in seconds, at which the response is still considered fresh (maximum lifetime)
Date: The time the origin server sent the message, in the format defined by RFC 822. For example, Date: Mon, 31 Dec 2001 04:25:57 GMT. The time is given in GMT (world standard time); converting it to local time requires knowing the user's time zone.
Expires: The date and time at which the response expires
Pragma: The page must not be cached. In HTTP 1.1 it has the same effect as Cache-Control: no-cache; HTTP 1.0 did not implement Cache-Control. Pragma has only one usage: Pragma: no-cache
User-Agent (request header): Identifies the client that sends the request. When a browser sends the request, this is essentially the browser's information.
Accept-Ranges: Indicates that one or more sub-ranges of a web page entity can be requested. Example: Accept-Ranges: bytes
Age: Estimated time in seconds (non-negative) since the proxy cache obtained the response from the origin server
Allow: The request methods that are valid for a network resource (GET, POST, HEAD, and so on, covered below); a method that is not allowed gets a 405
Content-Encoding: The compression encoding the web server applied to the returned content
Content-Language: The language of the response body
Content-Location: An alternative address for the requested resource
Content-MD5: The MD5 checksum of the returned resource
Content-Range: The byte position of this part within the entire response body
Connection: Whether to keep a persistent connection. close: do not keep a persistent connection; keep-alive: keep a persistent connection. (HTTP 1.1 defaults to persistent connections)
Cookie (request header): Information stored on the client. When an HTTP request is sent, all cookies for that domain name are sent to the server with it.
Content-Length: The length of the response body
Content-Type: The MIME type of the returned content. For more MIME types, see: http://www.w3school.com.cn/media/media_mimeref.asp
Via: Reports the address and communication protocol of intermediate gateways or proxy servers
Warning: Warning information about the message entity
ETag: The current value of the entity tag for the requested variant
Last-Modified: The time the requested resource was last modified
Location: Redirects the recipient to a location other than the request URL to complete the request or to identify a new resource
Proxy-Authenticate: Indicates the authentication scheme and parameters that apply to the proxy for this URL
Refresh: Used for a redirect or when a new resource has been created, for example to redirect after 5 seconds (proposed by Netscape, supported by most browsers)
Retry-After: If the entity is temporarily unavailable, tells the client to try again after the specified time
Server: The name of the web server software
Set-Cookie: Sets an HTTP cookie
Trailer: Indicates which header fields are present in the trailer of a chunked transfer encoding
Transfer-Encoding: The transfer encoding of the message body
Vary: Tells downstream proxies whether to use a cached response or to request from the origin server
WWW-Authenticate: Indicates the authentication scheme the client should use for the requested entity

For more information about request and response headers, visit: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
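
Several of these headers (Date, Expires, Last-Modified) carry timestamps in GMT. As a small sketch of the time zone conversion mentioned for Date, the standard library's `email.utils.parsedate_to_datetime` can parse such a value. The helper name `to_local` and the UTC+8 offset are assumptions made for the example.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime


def to_local(http_date, utc_offset_hours):
    """Parse an HTTP date string and convert it to a fixed-offset local time."""
    moment = parsedate_to_datetime(http_date)  # timezone-aware, offset 0 for GMT
    return moment.astimezone(timezone(timedelta(hours=utc_offset_hours)))


if __name__ == "__main__":
    # The sample value from the Date row, shown in an assumed UTC+8 time zone.
    print(to_local("Mon, 31 Dec 2001 04:25:57 GMT", 8).isoformat())
```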

HTTP request methods:

(Reposted from: http://www.runoob.com/http/http-methods.html)

HTTP 1.0 defines three request methods: GET, POST, and HEAD.

HTTP 1.1 adds five new request methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.

Request method: Description
GET: Requests the specified page information and returns the entity body.
HEAD: Similar to a GET request, except that the response carries no body; used to get the headers.
POST: Submits data to the specified resource for processing (such as submitting a form or uploading a file). The data is included in the request body. A POST request may result in the creation of new resources and/or the modification of existing resources.
PUT: Replaces the contents of the specified document with the data the client sends to the server.
DELETE: Requests that the server delete the specified page.
CONNECT: Reserved in HTTP/1.1 for proxy servers that can switch the connection to a tunnel.
OPTIONS: Allows the client to query the capabilities the server supports.
TRACE: Echoes the request received by the server, primarily for testing or diagnostics.
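
As a small sketch of how a crawler might choose among these methods, the standard library's `urllib.request.Request` lets the method be set explicitly. With no explicit method it defaults to GET, or to POST when a body is supplied. The URL and payloads below are placeholders, and nothing is sent over the network.

```python
from urllib.request import Request

URL = "http://example.com/resource"  # placeholder URL, never contacted here

get_request = Request(URL)                                       # no body: GET
post_request = Request(URL, data=b"name=value")                  # body given: POST
head_request = Request(URL, method="HEAD")                       # headers only
put_request = Request(URL, data=b"new contents", method="PUT")   # replace the document

if __name__ == "__main__":
    for request in (get_request, post_request, head_request, put_request):
        print(request.get_method(), request.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would actually send it.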

HTTP status codes:

(Reposted from: http://www.runoob.com/http/http-status-codes.html)

Class: Description
1**: Informational; the server has received the request and the requester should continue the operation
2**: Successful; the operation was received and processed successfully
3**: Redirect; further action is required to complete the request
4**: Client error; the request contains a syntax error or cannot be completed
5**: Server error; the server encountered an error while processing the request
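
A crawler usually branches on the class of the status code rather than on each individual code. The sketch below maps a code to the classes in the table by its first digit and uses the standard library's `http.HTTPStatus` for the reason phrase. The names `CATEGORIES` and `classify` are hypothetical.

```python
from http import HTTPStatus

# First digit of the status code -> class name, following the table above.
CATEGORIES = {
    1: "Informational",
    2: "Successful",
    3: "Redirect",
    4: "Client error",
    5: "Server error",
}


def classify(code):
    """Return the class name for a three-digit HTTP status code."""
    try:
        return CATEGORIES[code // 100]
    except KeyError:
        raise ValueError(f"not an HTTP status code: {code}") from None


if __name__ == "__main__":
    for code in (200, 304, 404, 503):
        print(code, HTTPStatus(code).phrase, "->", classify(code))
```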
