Python crawler
I. What is the nature of a crawler?
A crawler simulates a browser opening a web page and extracts the part of the data on the page that we want.
The process by which the browser opens a web page:
When you enter an address in the browser, the browser first asks a DNS server to resolve the domain name to the server host, then sends a request to that server. The server processes the request and returns the results, including HTML, JS, CSS and other file content, to the user's browser; the browser parses them and finally presents the page the user sees.
So what the user sees in the browser is built from HTML code. A crawler obtains this content and, by analyzing and filtering the HTML, extracts the resources we want (text, images, videos, ...).
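The process described above can be made concrete with a short, low-level sketch of what "opening a web page" means on the wire. This is a minimal illustration, assuming Python 3.8+ and that `example.com` (a placeholder host) is reachable on port 80; real crawlers normally use an HTTP library instead of raw sockets.

```python
import socket

host = "example.com"                      # placeholder domain
ip = socket.gethostbyname(host)           # DNS lookup: domain name -> server IP
print("Resolved", host, "to", ip)

# Open a TCP connection and send a bare-bones HTTP/1.1 request by hand.
sock = socket.create_connection((ip, 80))
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "Connection: close\r\n"
    "\r\n"
)
sock.sendall(request.encode("ascii"))

# Read until the server closes the connection, then show the start of the
# raw response (status line and headers, followed by the HTML body).
response = b""
while chunk := sock.recv(4096):
    response += chunk
sock.close()
print(response.decode("utf-8", errors="replace")[:300])
```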
II. The basic workflow of a crawler
- Initiate a request: send a request to the target site through an HTTP library; the request can carry additional headers and other information. Then wait for the server to respond.
- Get the response content: if the server responds normally, we get a response whose content is the requested page; the type may be an HTML or JSON string, binary data (an image or video), or something else.
- Parse the content: if the result is HTML, it can be parsed with regular expressions or a page-parsing library; if it is JSON, it can be converted directly to a JSON object; if it is binary data, it can be saved or processed further.
- Save the data: store it in various forms, e.g. as text, in a database, or as a file in a specific format. A minimal end-to-end sketch of all four steps follows this list.
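A minimal sketch of the four steps, assuming the third-party `requests` library is installed (`pip install requests`); the URL, the regular expression, and the output file name are illustrative assumptions, not from the original post.

```python
import re
import requests

# 1. Initiate a request (extra headers can be attached here).
resp = requests.get("https://example.com",
                    headers={"User-Agent": "my-crawler/0.1"})

# 2. Get the response content -- HTML text in this case.
html = resp.text

# 3. Parse the content; a simple regular expression pulls out the <title>.
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else ""

# 4. Save the data, here as a plain text file.
with open("page_title.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")
```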
III. Analysis of Request and Response
The browser sends a message to the server where the URL is located; this process is called an HTTP Request. The server receives the message sent by the browser, processes it according to its content, and sends a message back to the browser; this process is the HTTP Response.
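With `requests`, this pairing is easy to observe, because the library attaches the prepared request to the response object. A small sketch (the URL is a placeholder):

```python
import requests

resp = requests.get("https://example.com")

# The HTTP Request the client sent...
print(resp.request.method, resp.request.url)
print(dict(resp.request.headers))

# ...and the HTTP Response the server returned.
print(resp.status_code, resp.reason)
print(resp.headers.get("Content-Type"))
```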
1. Request Analysis:
1.1. Request Method:
"" "There are: Get/post two types commonly used, there are also head/put/delete/optionsget and post the difference is: The requested data GET is in the URL, post is stored in the head GET:" Display "request to the specified resource. Using the Get method should be used only for reading data, not for actions that produce "side effects", such as in Web application. One reason for this is that get may be randomly accessed by web spiders such as post: submitting data to a specified resource, requesting the server to process it (such as submitting a form or uploading a file). The data is included in the request for this article. This request may create new resources or modify existing resources, or both. HEAD: As with the Get method, it is a request to the server for the specified resource. Only the server will not return the resources to this section of this article. The advantage of this approach is that it allows you to get "information about this resource" (meta-information or meta-data) without having to transfer the entire content. PUT: Uploads its latest content to the specified resource location. OPTIONS: This method enables the server to return all HTTP request methods supported by the resource. Use the ' * ' instead of the resource name to send the options request to the Web server to test whether the server function is functioning properly. Delete: The request server deletes the resource identified by the Request-uri. """
1.2. URL of the request
"" "url, that is, the Uniform Resource Locator, that is, we say the URL, the Uniform Resource Locator is a resource available from the Internet location and access methods of a concise representation of the Internet is the address of standard resources. Each file on the Internet has a unique URL that contains information that indicates the location of the file and how the browser should handle it. The format of a URL consists of three parts: the first part is the protocol (or service mode). The second part is the host IP address (and sometimes the port number) where the resource is stored. The third part is the specific address of the host resource, such as directory and file name. Crawlers crawl data must have a target URL to get the data, so it is the basis for the crawler to obtain data. """
1.3. Request Header
[Screenshot: the request headers sent for the Baidu homepage]
Interpretation of the request header fields:
- Accept: the content types that are acceptable
- Accept-Charset: the character encodings that are acceptable
- Accept-Encoding: the encoding (compression) formats that are acceptable
- Accept-Datetime: the acceptable version in time
- Accept-Language: the languages that are acceptable
- Authorization: credentials for HTTP authentication
- Cache-Control: caching directives that all caching mechanisms along the request/response chain must obey
- Connection: control options for the current connection and the list of hop-by-hop request fields
- Content-Length: the length of the request body in bytes
- Content-MD5: a Base64-encoded binary MD5 checksum of the request body
- Content-Type: the MIME type of the request body (for POST and PUT requests)
- Cookie: an HTTP cookie previously set by the server via Set-Cookie
- Date: the date and time at which the message was sent
- Expect: particular behaviors that the client requires of the server
- Forwarded: discloses the source information of a client connecting to a web service through an HTTP proxy
- From: the email address of the user sending the request
- Host: the server's domain name and TCP port number; the port number may be omitted if the standard port for the requested service is used
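A sketch of setting a few of these request headers explicitly with `requests`; the header values are typical browser-like examples, not requirements.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
}
resp = requests.get("https://example.com", headers=headers)
print(resp.request.headers["User-Agent"])   # confirm what was actually sent
```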
1.4. The request body
The request body carries the data sent with the request, such as the form data submitted with a POST request.
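Two common ways to fill the request body with `requests` are form-encoded data and JSON; `httpbin.org` echoes back what it receives (assumed reachable).

```python
import requests

# Form-encoded body (application/x-www-form-urlencoded), like an HTML form.
r = requests.post("https://httpbin.org/post", data={"user": "alice", "age": "30"})
print(r.json()["form"])     # -> {'age': '30', 'user': 'alice'}

# JSON body; requests serializes the dict and sets the Content-Type header.
r = requests.post("https://httpbin.org/post", json={"user": "alice"})
print(r.json()["json"])     # -> {'user': 'alice'}
```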
2. Response Analysis:
2.1. Response Status:
There are many kinds of response status codes, for example: 200 for success, 301 for a redirect, 404 for page not found, 502 for a server error.
- 1xx Informational: the request has been received by the server and processing continues
- 2xx Success: the request has been successfully received, understood, and accepted by the server
- 3xx Redirection: further action is required to complete the request
- 4xx Client Error: the request contains a syntax error or cannot be fulfilled
- 5xx Server Error: the server failed while handling a valid request

Common codes:
- 200 OK: the request succeeded
- 400 Bad Request: the client request has a syntax error and cannot be understood by the server
- 401 Unauthorized: the request is not authorized; this status code must be used together with the WWW-Authenticate header field
- 403 Forbidden: the server received the request but refuses to provide service
- 404 Not Found: the requested resource does not exist, e.g. a wrong URL was entered
- 500 Internal Server Error: an unexpected error occurred on the server
- 503 Service Unavailable: the server currently cannot handle the client's request and may return to normal after some time
- 301 Moved Permanently: the target has moved permanently
- 302 Found: the target has moved temporarily

A short status-handling sketch follows this list.
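A sketch of reacting to these status-code classes with `requests`; the URL is a placeholder, and note that `raise_for_status()` raises an exception on any 4xx/5xx response.

```python
import requests

resp = requests.get("https://example.com/some-page")   # placeholder URL

if resp.history:                       # any 3xx hops that were followed
    print("redirected via", [r.status_code for r in resp.history])

if resp.status_code == 200:
    print("OK, got", len(resp.content), "bytes")
elif resp.status_code == 404:
    print("page not found")
elif 500 <= resp.status_code < 600:
    print("server-side error:", resp.status_code)

resp.raise_for_status()                # raises requests.HTTPError on 4xx/5xx
```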
2.2. Response Headers:
The response headers contain metadata about the response, such as the content type, content length, server information, and any cookies to set.
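A sketch of reading response headers with `requests` (placeholder URL):

```python
import requests

resp = requests.get("https://example.com")
print(resp.headers.get("Content-Type"))    # e.g. 'text/html; charset=UTF-8'
print(resp.headers.get("Content-Length"))
print(resp.headers.get("Server"))

for name, value in resp.headers.items():   # headers behave like a dict
    print(name, ":", value)
```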
2.3. Response Body:
The response body contains the content of the requested resource, such as a web page's HTML, an image, or other binary data.
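The three shapes of response body mentioned above, as exposed by `requests`; the `httpbin.org` endpoints are illustrative assumptions.

```python
import requests

# Text body: the decoded HTML of a page.
html = requests.get("https://example.com").text

# JSON body: converted straight into Python objects.
data = requests.get("https://httpbin.org/json").json()

# Binary body: raw bytes, e.g. an image, saved to disk as-is.
img = requests.get("https://httpbin.org/image/png")
with open("sample.png", "wb") as f:
    f.write(img.content)
```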