A web crawler is a program, used mainly by search engines, that reads all the content and links of a site, builds a full-text index of them in a database, and then jumps on to another site — much like a big spider working its way across the web.
When people search for a keyword on the web (on Google, for example), the engine is really matching that keyword against the content in its database and returning the best results to the user. The quality of the crawler therefore determines the capability of the search engine; Google's search engine outperforms Baidu's partly because its crawler is highly efficient and well engineered.
I. What is a crawler
Let's start with a simple definition: a crawler is a program that requests a web site and extracts the data it needs. Exactly what to crawl and how to crawl it is material for later, so there is no need to dig into it yet. The key idea is that a program can send requests to the server on our behalf and then download data in bulk.
II. The basic workflow of a crawler
Initiate a request: send a request to the server via a URL; the request may carry additional header information.
Get the response content: if the server responds normally, we receive a response whose body is the content of the page we requested; it may be HTML, a JSON string, or binary data (video, images, etc.).
Parse the content: HTML can be parsed with an HTML parser; JSON data can be converted to a JSON object for parsing; binary data can be saved to a file for further processing.
Save the data: the result can be saved to a local file or to a database (MySQL, Redis, MongoDB, etc.).
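The four steps above can be sketched with Python's requests library. This is a minimal illustration, not the article's original code: the target URL, the trivial title "parsing", and the output filename are all placeholder choices.

```python
import requests

url = "https://www.example.com"  # placeholder target site

# 1. Initiate a request (with an extra User-Agent header)
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

# 2. Get the response content
if response.status_code == 200:
    html = response.text  # the raw HTML of the page

    # 3. Parse the content (here: a trivial extraction of the <title> text)
    title = html.split("<title>")[1].split("</title>")[0]

    # 4. Save the data to a local file
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(html)
```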
III. What a request contains
What information does the request carry when we send it to the server through the browser? We can inspect it with Chrome's developer tools (see the notes on the Chrome browser at the end of this article if you are not familiar with them).
Request method: the most common request methods are GET and POST. In development, POST is most often seen in form submission; from the user's point of view the most familiar case is login verification — when you have to enter some information to log in, that request is a POST request.
URL (Uniform Resource Locator): a web page, a picture, a video, and so on can each be located by a URL. When we request a web page and look at the Network tab, the first entry is usually a document — the bare HTML code, before any external images, CSS, JS, and other resources are rendered in. Below that document we see a series of jpg, js, and other entries: these are the requests the browser issues one after another based on the HTML code, and their addresses are the URLs of the images, JS files, and other resources referenced by the HTML document.
Request headers: the headers of the request, including the request type, cookie information, browser type, and so on. Request headers matter for web scraping: the server parses them to audit the request and judge whether it is legitimate. So when our program disguises itself as a browser to make a request, we can set the request header information accordingly.
Request body: a POST request wraps the user information in form data, so compared with a GET request, the Headers tab of a POST request additionally contains a Form Data block. A GET request can be understood roughly as an ordinary search: the information is simply appended to the end of the URL after a separator.
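The difference can be seen without touching the network by preparing (but not sending) two requests with Python's requests library. This is an illustrative sketch; the URL and form fields are made up.

```python
import requests

# A GET request: the parameters are appended to the end of the URL
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "crawler"}
).prepare()

# A POST request: the form data travels in the request body instead
post_req = requests.Request(
    "POST", "https://example.com/login",
    data={"user": "alice", "password": "secret"},
).prepare()

print(get_req.url)    # the query string is visible in the URL
print(post_req.url)   # the URL itself carries no parameters
print(post_req.body)  # the form data is urlencoded into the body
```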
IV. What a response contains
Response status: the status code can be seen under General in Headers. 200 indicates success, 301 a redirect, 404 page not found, 502 a server error, and so on.
Response headers: include the content type, cookie information, and so on.
Response body: the body is the whole point of the request; it contains the HTML code, JSON, or binary data we asked for.
V. Simple Request Demo
Request a web page with Python's requests library:
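A minimal sketch — the original snippet is not reproduced here, so the target URL (Baidu's home page) is an assumption based on the surrounding examples:

```python
import requests

response = requests.get("https://www.baidu.com")
response.encoding = "utf-8"  # decode the body with an explicit charset
print(response.text)         # the page source, i.e. the response body
```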
The output is the page source before any rendering takes place, i.e. the content of the response body. You can also view the response headers:
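For instance (again using Baidu's home page as a stand-in for the original example):

```python
import requests

response = requests.get("https://www.baidu.com")
# response.headers is a case-insensitive dict of the response headers
print(response.headers)
print(response.headers.get("Content-Type"))
```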
View the status code:
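The status code is available directly on the response object (the URL is again a stand-in):

```python
import requests

response = requests.get("https://www.baidu.com")
print(response.status_code)  # 200 indicates the request succeeded
```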
You can also add request headers to the request:
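A sketch of setting a browser-like User-Agent; the UA string here is an example value, not the one from the original snippet:

```python
import requests

headers = {
    # disguise the program as an ordinary browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
response = requests.get("https://www.baidu.com", headers=headers)
print(response.status_code)
```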
Grab an image (the Baidu logo):
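Binary data such as an image is read from response.content and written to disk in binary mode. The logo URL below is an assumption — it was the logo's address at one time and may have changed:

```python
import requests

url = "https://www.baidu.com/img/bd_logo1.png"  # may have moved since
response = requests.get(url)

with open("baidu_logo.png", "wb") as f:  # binary mode for image bytes
    f.write(response.content)            # .content is raw bytes, .text is str
```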
VI. How to handle JavaScript rendering
Using Selenium WebDriver
Run print(driver.page_source) and you can see that this time the output is the code after rendering.
Notes: using the Chrome browser
The Elements tab shows the HTML code after rendering.
The Network tab lists the data requested by the browser; click an entry to view its details, such as the request headers and response headers mentioned above.