Baidu Encyclopedia's introduction to web crawlers:
A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically crawls information on the World Wide Web according to certain rules.
Tools used when developing crawlers: the Chrome browser, Fiddler, and the Postman plugin.
An introduction to Fiddler: http://kb.cnblogs.com/page/130367/
The most basic knowledge comes first: HTTP requests. (The material below is taken from: http://www.runoob.com/http/http-intro.html)
Definition:
HTTP is short for Hypertext Transfer Protocol. It is the protocol used to transfer hypertext from a World Wide Web (WWW) server to the local browser. HTTP runs on top of TCP/IP and transmits data such as HTML files, image files, and query results.
How it works:
HTTP works on a client-server architecture. The browser acts as the HTTP client and sends all requests, addressed by URL, to the HTTP server, that is, the web server.
Common web servers include the Apache server and the IIS server (Internet Information Services).
The web server sends a response message back to the client based on the request it received. The default HTTP port is 80, but it can be changed to 8080 or another port.
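The request/response exchange described above can be sketched with Python's standard library. This is a minimal illustration, not a real crawler: the handler class `HelloHandler` and the helper `fetch_once` are hypothetical names introduced here, and the server listens on a random free local port rather than port 80 so that it can run without special privileges.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short plain-text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default request logging to keep the example quiet


def fetch_once(path="/"):
    """Start a throwaway local server, send one GET, return (status, body)."""
    # Port 0 asks the OS for any free port; a public web server would use 80.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", path)        # the client sends the request
        response = conn.getresponse()    # the server answers with a response
        status, body = response.status, response.read()
        conn.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once("/index.html"))
```

A crawler plays the role of the client in this exchange; only the host, port, and path change when it talks to a real site.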
Three points to note about HTTP:
- HTTP is connectionless: each connection is limited to handling a single request. Once the server has processed the client's request and received the client's acknowledgement, the connection is closed. This saves transmission time. (This is the HTTP/1.0 behavior; HTTP/1.1 keeps the connection open by default.)
- HTTP is media-independent: any type of data can be sent over HTTP as long as the client and the server both know how to handle the content. The client and server indicate the content type with the appropriate MIME type.
- HTTP is stateless: the protocol keeps no memory of earlier transactions. If later processing needs earlier information, that information must be retransmitted, which can increase the amount of data sent per connection. On the other hand, the server responds faster when it does not need the earlier information.
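To make the media-independence point concrete, here is a small sketch of how a MIME type can be chosen for a file. It uses Python's standard `mimetypes` module; the helper name `content_type_for` and the sample file names are made up for illustration, and the fallback to `application/octet-stream` is one common convention rather than a requirement.

```python
import mimetypes


def content_type_for(filename):
    """Guess a Content-Type value from a file name.

    Falls back to application/octet-stream (generic binary data)
    when the extension is not recognized.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or "application/octet-stream"


if __name__ == "__main__":
    for name in ("index.html", "logo.png", "data.unknownext"):
        print(name, "->", content_type_for(name))
```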
HTTP message structure:
HTTP is based on the client/server (C/S) architecture model. It exchanges information over a reliable connection and is a stateless request/response protocol.
An HTTP "client" is an application (a web browser or any other client) that connects to the server in order to send it one or more HTTP requests.
An HTTP "server" is also an application (typically a web server such as Apache or IIS) that receives requests from clients and sends HTTP response data back to them.
HTTP uses Uniform Resource Identifiers (URIs) to transfer data and establish connections.
Once the connection is established, messages are transferred in a format similar to the one used by Internet mail [RFC5322] and Multipurpose Internet Mail Extensions (MIME) [RFC2045].
Structure diagram of the request message:
The HTTP request message that the client sends to the server consists of four parts: a request line, request headers, a blank line, and the request data. This is the general format of a request message.
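The four parts can be seen by assembling a request message by hand. The sketch below is illustrative only: `build_request` and `parse_request` are hypothetical helpers, the host and form data are placeholders, and real clients handle many details (encodings, repeated headers) that are ignored here.

```python
def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.1 request message as text."""
    request_line = f"{method} {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # Lines end with CRLF; the empty string produces the blank line
    # that separates the headers from the request data.
    return "\r\n".join([request_line, *header_lines, "", body])


def parse_request(raw):
    """Split a raw request into (method, path, version, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        # Split on the first colon only, so values such as host:port survive.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, path, version, headers, body


if __name__ == "__main__":
    raw = build_request(
        "POST",
        "/login",
        {"Host": "example.com", "Content-Length": "9"},
        "user=test",
    )
    print(repr(raw))
    print(parse_request(raw))
```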
HTTP request headers in detail: (Ref.: https://kb.cnblogs.com/page/92320/)
Request header |
Description |
Cache-Control |
Specifies the caching behavior the request asks for. Cache directives: no-cache, no-store, max-age, max-stale, min-fresh, only-if-cached. no-cache: a cached response must not be used without first revalidating it with the origin server. no-store: neither the request nor the response may be stored by a cache, to prevent the inadvertent disclosure of important information. max-age: the client accepts only a response whose age does not exceed the specified number of seconds. max-stale: the client accepts an expired response; if a value is given, the response may be expired by at most that many seconds. min-fresh: the client accepts only a response that will remain fresh for at least the specified number of seconds from now. only-if-cached: the client accepts only cached content |
Date |
The time at which the message was sent, in the format defined by RFC 822. For example, Date: Mon, 31 Dec 2001 04:25:57 GMT. The time is given in GMT; converting it to local time requires knowing the user's time zone. |
Pragma |
Prevents the page from being cached. In HTTP/1.1 it has the same effect as Cache-Control: no-cache; HTTP/1.0 did not implement Cache-Control. Pragma has only one usage: Pragma: no-cache |
Host |
Specifies the Internet host and port number of the requested resource, usually extracted from the HTTP URL. The port must be given if it is not the default port 80. An HTTP/1.1 request without a Host header gets a 400 error |
Referer |
Gives the server context by telling it which page the current request was linked from. |
Range |
Requests only part of the entity; the server may ignore the header. First 500 bytes: bytes=0-499. Second 500 bytes: bytes=500-999. Last 500 bytes: bytes=-500. Everything from byte 500 onward: bytes=500-. First and last byte: bytes=0-0,-1. Several ranges at once: bytes=500-600,601-999 |
User-Agent |
Identifies the client that sent the request. When the request comes from a browser, this is essentially the browser's information |
Accept |
The content types the client can accept, for example: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8. For more MIME types, see: http://www.w3school.com.cn/media/media_mimeref.asp |
Accept-Charset |
The character encodings the browser can accept. The most common ones in China are UTF-8 and GBK. For more, see: https://zh.wikipedia.org/wiki/%E5%AD%97%E7%AC%A6%E7%BC%96%E7%A0%81 |
Accept-Encoding |
The content compression encodings that the browser supports in the server's response. Common values: compress, gzip |
Accept-Language |
The languages the browser can accept. Common values: en, zh |
Accept-Ranges |
Indicates that one or more sub-ranges of an entity can be requested. Example: Accept-Ranges: bytes |
Authorization |
Credentials for HTTP authentication |
Connection |
Whether to keep a persistent connection. close: do not keep the connection open; keep-alive: keep the connection open. (HTTP/1.1 uses persistent connections by default) |
Cookie |
Information stored on the client. When an HTTP request is sent, all cookies stored for that domain name are sent to the server along with it. |
Content-Length |
The length of the request body |
Content-Type |
The MIME type of the request entity. For more MIME types, see: http://www.w3school.com.cn/media/media_mimeref.asp |
Expect |
The specific server behavior the client requires. I am not very clear on this one; if you understand it, please help explain |
From |
Email address of the user who made the request |
If-Match |
The request takes effect only if the entity matches the given ETag |
If-Modified-Since |
If the requested resource has been modified after the specified time, the request succeeds; if it has not been modified, a 304 code is returned |
If-None-Match |
If the content has not changed, a 304 code is returned. The parameter is the ETag previously sent by the server, which the server compares with the current ETag to decide whether the content has changed |
If-Range |
If the entity has not changed, the server sends only the part the client is missing; otherwise it sends the entire entity. The parameter is also an ETag |
If-Unmodified-Since |
The request succeeds only if the entity has not been modified since the specified time |
Max-Forwards |
Limits the number of times the message may be forwarded through proxies and gateways |
Proxy-Authorization |
Credentials for authenticating with the proxy |
TE |
The transfer encodings the client is willing to accept, and whether it accepts trailer fields in a chunked transfer |
Upgrade |
Names a protocol the client would like to switch to, so that the server can switch (if it supports it) |
Via |
Reports the intermediate gateways or proxy servers (address and communication protocol) the message passed through |
Warning |
Warning information about the message entity |
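The Range examples in the table above can be checked with a short sketch that resolves each range against a body of known length. The helpers `resolve_range` and `parse_range_header` are hypothetical names, and the handling of edge cases (unsatisfiable ranges returned as `None`, oversized end offsets clamped) is a simplification of what a real server does.

```python
def resolve_range(spec, total_length):
    """Turn one byte-range spec such as '0-499' into (first, last) offsets.

    Offsets are inclusive, as in the Range header. Returns None when the
    range cannot be satisfied for a body of total_length bytes.
    """
    first_text, _, last_text = spec.strip().partition("-")
    if first_text == "":
        # Suffix form '-N': the last N bytes of the body.
        suffix_length = int(last_text)
        if suffix_length == 0 or total_length == 0:
            return None
        return max(total_length - suffix_length, 0), total_length - 1
    first = int(first_text)
    if first >= total_length:
        return None
    # Open-ended form 'N-' runs to the end; an oversized end is clamped.
    if last_text == "":
        last = total_length - 1
    else:
        last = min(int(last_text), total_length - 1)
    if last < first:
        return None
    return first, last


def parse_range_header(value, total_length):
    """Resolve every range in a header value like 'bytes=0-0,-1'."""
    unit, _, specs = value.partition("=")
    if unit.strip().lower() != "bytes":
        raise ValueError("only the bytes unit is handled in this sketch")
    return [resolve_range(spec, total_length) for spec in specs.split(",")]


if __name__ == "__main__":
    for header in ("bytes=0-499", "bytes=-500", "bytes=500-", "bytes=0-0,-1"):
        print(header, "->", parse_range_header(header, 1000))
```

For a 1000-byte body, `bytes=-500` and `bytes=500-` select the same bytes, which is why both forms appear in the table.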
HTTP response headers in detail:
Response header |
Description |
Cache-Control |
Specifies the caching behavior that should be followed. Cache directives: no-cache, no-store, max-age, max-stale, min-fresh, only-if-cached (the last three are used in requests). no-cache: a cached response must not be used without first revalidating it with the origin server. no-store: neither the request nor the response may be stored by a cache, to prevent the inadvertent disclosure of important information. max-age: the maximum age, in seconds, at which the response is still considered fresh. max-stale: the client accepts an expired response; if a value is given, the response may be expired by at most that many seconds. min-fresh: the client accepts only a response that will remain fresh for at least the specified number of seconds from now. only-if-cached: the client accepts only cached content |
Date |
The time at which the origin server sent the message, in the format defined by RFC 822. For example, Date: Mon, 31 Dec 2001 04:25:57 GMT. The time is given in GMT; converting it to local time requires knowing the user's time zone. |
Expires |
Date and time when the response expires |
Pragma |
The page is not allowed to be cached. In HTTP/1.1 it has the same effect as Cache-Control: no-cache; HTTP/1.0 did not implement Cache-Control. Pragma has only one usage: Pragma: no-cache |
User-Agent |
Identifies the client that sent the request. When the request comes from a browser, this is essentially the browser's information |
Accept-Ranges |
Indicates that one or more sub-ranges of an entity can be requested. Example: Accept-Ranges: bytes |
Age |
Estimated time, in seconds (non-negative), that has passed since the response was generated at the origin server, as reported by the proxy cache |
Allow |
The request methods that are valid for a network resource; a method that is not allowed gets a 405. Request methods include GET, POST, HEAD, etc., and are covered below |
Content-Encoding |
The compression encoding the web server applied to the returned content |
Content-Language |
The language of the response body |
Content-Location |
An alternate address for the requested resource |
Content-MD5 |
The MD5 checksum of the returned resource |
Content-Range |
The byte position of this part within the entire response body |
Connection |
Whether to keep a persistent connection. close: do not keep the connection open; keep-alive: keep the connection open. (HTTP/1.1 uses persistent connections by default) |
Cookie |
Information stored on the client. When an HTTP request is sent, all cookies stored for that domain name are sent to the server along with it. |
Content-Length |
The length of the response body |
Content-Type |
The MIME type of the returned content. For more MIME types, see: http://www.w3school.com.cn/media/media_mimeref.asp |
Via |
Reports the intermediate gateways or proxy servers (address and communication protocol) the message passed through |
Warning |
Warning information about the message entity |
ETag |
The current value of the entity tag for the requested resource |
Last-Modified |
The time the requested resource was last modified |
Location |
Redirects the recipient to a location other than the request URL in order to complete the request or identify a new resource |
Proxy-Authenticate |
Indicates the authentication scheme, and its parameters, that apply to the proxy for this URL |
Refresh |
Used for redirects or when a new resource has been created; for example, redirect after 5 seconds (proposed by Netscape, supported by most browsers) |
Retry-After |
If the entity is temporarily unavailable, tells the client to try again after the specified time |
Server |
The name of the web server software |
Set-Cookie |
Sets an HTTP cookie |
Trailer |
Indicates which header fields are present in the trailer of a chunked transfer |
Transfer-Encoding |
The transfer encoding of the message body |
Vary |
Tells downstream proxies whether to use a cached response or to request one from the origin server |
WWW-Authenticate |
Indicates the authentication scheme the client should use for the requested entity |
For more information about request and response headers, see: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
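The interplay between the ETag response header and the If-None-Match request header can be sketched as a small simulation with no network involved. Everything here is an assumption made for illustration: `make_etag` and `respond` are hypothetical helpers, and deriving the ETag from a SHA-256 hash of the body is just one possible choice, since servers are free to generate entity tags however they like.

```python
import hashlib


def make_etag(body):
    """Derive a quoted ETag from the body bytes (hashing is an assumption)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(request_headers, body):
    """Return (status, response_headers, body) for a GET on one resource."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if request_headers.get("If-None-Match") == etag:
        # The client's cached copy is still current: send 304 and no body.
        return 304, headers, b""
    headers["Content-Length"] = str(len(body))
    return 200, headers, body


if __name__ == "__main__":
    page = b"<html>v1</html>"
    # First visit: no validator yet, so the full body comes back.
    status, headers, _ = respond({}, page)
    print(status, headers["ETag"])
    # Repeat visit: the stored ETag is sent back in If-None-Match.
    cached = {"If-None-Match": headers["ETag"]}
    print(respond(cached, page)[0])
    # After the page changes, the same validator no longer matches.
    print(respond(cached, b"<html>v2</html>")[0])
```

A crawler that stores the ETag of each page it fetches can use this exchange to avoid downloading pages that have not changed.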
HTTP request methods:
(Reposted from: http://www.runoob.com/http/http-methods.html)
HTTP/1.0 defines three request methods: GET, POST, and HEAD.
HTTP/1.1 adds five request methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.
Request method |
Description |
GET |
Requests the specified page and returns the entity body. |
HEAD |
Similar to a GET request, except that the response contains no body; it is used to obtain the headers |
POST |
Submits data to the specified resource for processing (for example, submitting a form or uploading a file). The data is included in the request body. A POST request may result in the creation of new resources and/or the modification of existing resources. |
PUT |
Replaces the contents of the specified document with the data the client sends to the server. |
DELETE |
Requests that the server delete the specified page. |
CONNECT |
Reserved in HTTP/1.1 for proxy servers that can switch the connection to a tunnel. |
OPTIONS |
Allows the client to query the server's capabilities. |
TRACE |
Echoes back the request received by the server; used mainly for testing or diagnostics. |
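As a rough illustration of how a client chooses among these methods, the sketch below builds request objects with Python's standard `urllib.request.Request` without sending anything. The URL is a placeholder that is never contacted, and `describe` is a hypothetical helper added for this example.

```python
from urllib.request import Request

URL = "http://example.com/resource"  # placeholder URL, never contacted


def describe(request):
    """Return (method, body length in bytes) for a prepared request."""
    body = request.data
    return request.get_method(), len(body) if body is not None else 0


# Without a body, urllib defaults to GET; with a body it defaults to POST.
get_request = Request(URL)
post_request = Request(URL, data=b"name=value")
# Other methods are selected explicitly with the method argument.
head_request = Request(URL, method="HEAD")
put_request = Request(URL, data=b"new contents", method="PUT")

if __name__ == "__main__":
    for req in (get_request, post_request, head_request, put_request):
        print(describe(req))
```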
HTTP status codes:
(Reposted from: http://www.runoob.com/http/http-status-codes.html)
Class |
Description |
1** |
Informational: the server has received the request and the requester should continue the operation |
2** |
Success: the operation was received and processed successfully |
3** |
Redirection: further action is needed to complete the request |
4** |
Client error: the request contains a syntax error or cannot be completed |
5** |
Server error: the server encountered an error while processing the request |
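Because the class is determined by the first digit, a crawler can decide how to react to a response with very little code. The following sketch assumes nothing beyond the table above and Python's standard `http.HTTPStatus` enum; `status_category` and `describe_status` are hypothetical helper names.

```python
from http import HTTPStatus

CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def status_category(code):
    """Map a status code to its class using the first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return CATEGORIES[code // 100]


def describe_status(code):
    """Return 'code phrase (category)'; the phrase comes from HTTPStatus."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # valid range, but not a registered code
    return f"{code} {phrase} ({status_category(code)})"


if __name__ == "__main__":
    for code in (200, 304, 404, 503):
        print(describe_status(code))
```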
Crawler basics: basic knowledge of HTTP requests