1 About URLs
URL (Uniform Resource Locator): a standard way of fully describing the address of a web page or other resource on the Internet.
URLs are the entry point for crawlers, so they are very important.
Basic format:
scheme://host[:port#]/path/.../[?query-string][#anchor]
Scheme: Protocol (for example: HTTP, HTTPS, FTP)
Host: The IP address or domain name of the server
port#: Server port (optional; when omitted, the scheme's default is used, for example 80 for HTTP)
Path: The path to the resource on the server
Query-string: Parameters sent to the HTTP server
Anchor: A named position within the page to jump to
Example:
http://www.baidu.com
http://item.jd.com/11963485.html#product-detail
ftp://192.168.1.118:8081/index
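The basic format can be checked against the example URLs with Python's standard library. This is a minimal sketch; the helper name describe_url is made up for illustration, and only urllib.parse.urlsplit comes from the standard library.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the components named in the basic format."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,        # None when the URL does not name a port
        "path": parts.path,
        "query": parts.query,
        "anchor": parts.fragment,
    }


page = describe_url("http://item.jd.com/11963485.html#product-detail")
ftp = describe_url("ftp://192.168.1.118:8081/index")
print(page)
print(ftp)
```

Note that the port comes back as None when the URL leaves it out; the parser does not substitute the scheme's default.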
2 HTTP protocol, HTTPS protocol
2.1 HTTP protocol
HTTP (Hypertext Transfer Protocol) is a protocol for publishing and receiving HTML pages. It is an application-layer protocol that is connectionless (originally each connection handled only one request; HTTP/1.1 can reuse a connection for several requests) and stateless (each request is handled independently, with no memory of earlier ones).
2.2 HTTPS protocol
HTTPS (Hypertext Transfer Protocol over Secure Socket Layer) is simply the secure version of HTTP: it adds an SSL layer beneath HTTP, so HTTPS = HTTP + SSL. SSL (Secure Sockets Layer) is a security protocol used mainly on the web. It encrypts the network connection on top of the transport layer and so protects data in transit over the Internet.
Note:
The HTTP port number is 80, and the HTTPS port number is 443.
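The default ports are also available as constants in Python's http.client module. The sketch below uses them to work out which port a URL will actually connect to. The helper effective_port and the :8080 URL are assumptions made for illustration.

```python
import http.client
from urllib.parse import urlsplit

# Default ports taken from the standard library constants.
DEFAULT_PORTS = {
    "http": http.client.HTTP_PORT,
    "https": http.client.HTTPS_PORT,
}


def effective_port(url):
    """Return the explicit port if the URL has one, else the scheme default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Returns None for schemes this sketch does not know about.
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://www.baidu.com"))
print(effective_port("https://www.baidu.com"))
print(effective_port("http://www.baidu.com:8080/"))  # illustrative explicit port
```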
3 HTTP Request
Two HTTP request methods are used most often:
(1) GET: obtains information from the server. Any data sent goes in the URL's query string, where it is plainly visible, so it is less private, and its size is limited in practice by URL length limits;
(2) POST: sends data to the server. The data travels in the request body rather than in the URL, and its size is theoretically unlimited. The body is only encrypted when HTTPS is used;
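The difference in where the data travels can be seen by building both kinds of request without sending them. This is a sketch only: the /s path and the parameter names are hypothetical, and nothing here touches the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical parameters, used only to show where the data ends up.
params = {"wd": "python", "page": 1}
encoded = urlencode(params)

# GET: the parameters become the query string of the URL.
get_req = Request("http://www.baidu.com/s?" + encoded)

# POST: the parameters are carried as bytes in the request body.
post_req = Request("http://www.baidu.com/s", data=encoded.encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

urllib chooses the method from the presence of data: a Request with no body is a GET, and one with a body is a POST.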
Other HTTP request methods include HEAD, PUT, DELETE, OPTIONS, TRACE, CONNECT, and PATCH.
4 User-Agent
The User-Agent (UA) HTTP header is one of the request header fields. It is a special string that tells the site being visited the browser type and version, the operating system and version, the browser engine, and similar identifying information. Sites use it to serve different layouts for a better user experience, or to gather statistics. For example, Google looks different on a phone than on a computer because Google decides what to serve based on the visitor's UA.
The UA can be spoofed, which a crawler can use to present itself as an ordinary browser.
The standard format of a browser UA string is: browser identifier (operating system identifier; encryption level identifier; browser language) rendering engine identifier, version information. In practice each browser differs.
Note: For compatibility and promotional reasons, many browsers share the same leading identifier, so that identifier does not reveal the actual browser. The actual browser and version can be found at the tail of the UA string.
User-Agent: Mozilla/5.0 (Windows NT 10.0; ...) Gecko/20100101 Firefox/59.0
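A crawler can set its own UA when it builds a request. The sketch below makes two assumptions: the UA string is an illustrative Firefox-style value, not the elided one above, and the helper names build_request and tail_token are made up. No request is actually sent.

```python
from urllib.request import Request

# Illustrative UA string (an assumption, not taken from the example above).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that identifies itself with the given UA string."""
    return Request(url, headers={"User-Agent": user_agent})


def tail_token(user_agent):
    """Return the last token of the UA, where the actual browser is named."""
    return user_agent.split()[-1]


req = build_request("http://www.baidu.com")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
print(tail_token(BROWSER_UA))
```

Without the explicit header, urllib identifies itself as Python-urllib, which many sites treat differently from a browser.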
Network traffic can be captured with tools such as Fiddler and Wireshark.
5 Status Codes of the HTTP Response
Depending on the type of response result, it is broadly divided into the following categories:
5.1 1XX (Information Class)
This class of status code indicates that the request was received and processing continues.
- 100, the client should continue with its request.
- 101, the server is switching protocols as the client requested.
5.2 2XX (Success Class)
This class of status code indicates that the request was successfully received, understood, and accepted.
- 200, the request completed successfully and the requested resource was sent to the client.
- 201, a new resource was created, and its URL is given in the response.
- 202, the request was accepted for processing, but processing is not yet complete.
- 203, the returned information does not come from the origin server and may be incomplete.
- 204, the request was received, but the response has no body.
- 205, the server completed the request, and the client should reset the document it is currently displaying.
- 206, the server fulfilled a partial GET request.
5.3 3XX (Redirect Class)
This class of status code indicates that further action must be taken to complete the request.
- 300, the requested resource is available in multiple places.
- 301, the page has moved permanently to another URL.
- 302, the requested page is temporarily redirected to a new address.
- 303, the client should fetch the result from another URL using GET.
- 304, the requested page has not been modified since the last request.
- 305, the requested resource must be accessed through the proxy specified by the server.
- 306, used in an earlier version of HTTP and no longer in use.
- 307, the requested resource is temporarily at another URL, and the client should repeat the request there with the same method.
5.4 4XX (Client Error Class)
This class of status code indicates that the request contains bad syntax or cannot be fulfilled.
- 400, the client request has a syntax error.
- 401, the request is unauthorized and requires authentication.
- 402, reserved for future use (Payment Required).
- 403, access is forbidden: the server received the request but refuses to serve it.
- 404, the server was reached, but the requested resource does not exist.
- 405, the method given in the request line is not allowed for this resource.
- 406, the resource cannot be delivered in a form that matches the Accept headers the client sent.
- 407, similar to 401, but the client must first authenticate with the proxy server.
- 408, the client did not complete the request within the time the server was prepared to wait.
- 409, the request cannot be completed because it conflicts with the current state of the resource.
- 410, the resource is no longer available on the server.
- 411, the server refuses the request because it lacks a Content-Length header.
- 412, one or more preconditions given in the request header fields were not met.
- 413, the request body is larger than the server allows.
- 414, the requested URL is longer than the server allows.
- 415, the format of the request body is not supported by the requested resource.
- 416, the Range header in the request specifies a range that lies outside the requested resource.
- 417, the server cannot meet the expectation given in the Expect request header field.
5.5 5XX (Server Error Class)
This class of status code indicates a server or gateway error.
- 500, internal server error.
- 501, the server does not support the requested feature.
- 502, bad gateway: the gateway received an invalid response from the upstream server.
- 503, the service is unavailable.
- 504, the gateway timed out.
- 505, the HTTP version is not supported.
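Because the first digit determines the class, a crawler can decide how to react from the code alone. Here is a minimal sketch built on the standard library's http.HTTPStatus; the describe_status helper and the class labels are assumptions chosen for illustration.

```python
from http import HTTPStatus

# Class names keyed by the first digit of the status code.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return (class name, standard reason phrase) for a status code."""
    category = CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one that HTTPStatus defines.
        phrase = "unregistered"
    return category, phrase


for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

Classifying by the first digit means that a code the crawler has never seen still gets a sensible default reaction.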
6 Response Headers
Response headers describe and qualify the response and contain many fields. The commonly used ones are:
- Location, the URL to redirect the request to.
- Server, basic information about the server.
- Content-Encoding, the compression format the server used when sending the data.
- Content-Language, the language of the data being sent.
- Content-Type, the type of the data being sent.
- Content-Length, the size of the data being sent.
- Set-Cookie, sends a cookie to the client.
- Last-Modified, the date and time the resource was last modified.
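HTTP headers use the same "Name: value" layout as e-mail headers, so the standard library's email package can read them. In the sketch below every header value in RAW_HEADERS is invented for illustration; only the field names come from the list above.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Made-up response headers, for illustration only.
RAW_HEADERS = (
    "Server: nginx\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Length: 1024\n"
    "Last-Modified: Wed, 21 Oct 2015 07:28:00 GMT\n"
    "Set-Cookie: sessionid=abc123; Path=/\n"
    "\n"
)

headers = message_from_string(RAW_HEADERS)

content_type = headers.get_content_type()      # media type without parameters
charset = headers.get_content_charset()        # charset parameter, if present
length = int(headers["content-length"])        # lookups ignore letter case
modified = parsedate_to_datetime(headers["Last-Modified"])

print(content_type, charset, length, modified.isoformat())
```

Header names are case-insensitive, which is why looking up "content-length" finds the "Content-Length" field.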
Python Learning notes-crawler-related network knowledge