What is a web crawler? What is the basic flow of a crawler?

Last Update:2017-07-23 Source: Internet

Author: User


A web crawler is a program used mainly by search engines. It reads all the content and links of a site, builds a full-text index of them in a database, and then jumps to another site. It behaves like a big spider.
When people search for keywords on the web (with Google, for example), the search engine is actually matching the query against the content in its database and returning the entries that fit the user's request. The quality of the web crawler largely determines the capability of the search engine. For example, Google's search engine is often considered better than Baidu's, which the author attributes to its more efficient and better-structured crawler.

First, what is a crawler

Let us start with a simple understanding of a crawler: it is a program that sends requests to a website and extracts the data it needs. How exactly to crawl is covered in later material, so there is no need to go into it for now. Our program sends requests to the server in our place and can then download large amounts of data in bulk.

Second, the basic flow of a crawler

Initiating a request: A request is sent to the server for a URL, and it may carry additional header information.
Getting the response content: If the server responds normally, we receive a response whose body is the content of the page we requested. It may be HTML, a JSON string, or binary data (video, images), and so on.
Parsing the content: HTML can be parsed with an HTML parser, JSON can be converted into a JSON object, and binary data can be saved to a file for further processing.
Saving the data: The results can be saved to a local file or to a database (MySQL, Redis, MongoDB, etc.).
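
The four steps above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not a production crawler: the names `TitleParser`, `fetch`, `parse`, `save`, and `crawl`, the `demo-crawler/0.1` user agent, and the choice to extract only the page title are all invented for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Steps 1 and 2: send the request, return (content type, body bytes)."""
    request = Request(url, headers={"User-Agent": "demo-crawler/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.headers.get_content_type(), response.read()


def parse(content_type, body):
    """Step 3: choose a parser based on the kind of content received."""
    if content_type == "text/html":
        parser = TitleParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        return {"title": parser.title.strip()}
    if content_type == "application/json":
        return json.loads(body)
    # Binary data (images, video): only record the size in this sketch.
    return {"binary_bytes": len(body)}


def save(record, path):
    """Step 4: append the record to a local file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def crawl(url, path, fetcher=fetch):
    """Run the whole flow; `fetcher` can be swapped out for offline testing."""
    content_type, body = fetcher(url)
    record = parse(content_type, body)
    save(record, path)
    return record
```

Calling `crawl("https://example.com", "pages.jsonl")` would fetch the page, extract its title, and append one line to `pages.jsonl`. A database could replace the file in `save` without changing the other steps.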

Third, what the request contains

What information does a request contain when we send it to the server through the browser? We can inspect it with Chrome's developer tools (if you are not familiar with them, see the notes at the end of this article).

Request method: The most common request methods are GET and POST. In development, POST requests are most often used for form submission. From the user's point of view, the most familiar case is login verification: when you enter some information to log in, that request is a POST request.
URL (Uniform Resource Locator): A web page, a picture, a video, and so on can each be identified by a URL. When we request a web page and look at the Network tab, the first entry is usually a document, that is, the HTML code of the page before external images, CSS, JS, and other resources are loaded. Below this document we see a series of entries such as jpg and js files. These are further requests that the browser issues based on the HTML code, and the requested addresses are the URLs of the images, JS files, and other resources referenced in the HTML document.
Request headers: These include the request type, cookie information, the browser type, and so on. Request headers matter when crawling web pages, because the server parses them to judge whether the request is legitimate. So when our program disguises itself as a browser, we can set the request header information accordingly.
Request body: A POST request wraps the user's information in form data, so in the Headers tab a POST request shows an extra Form Data section compared with a GET request. A GET request can be thought of as an ordinary lookup, and its parameters are appended to the end of the URL after a "?" separator.
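
To make the four parts of a request concrete, the sketch below builds one GET and one POST request with the standard library without sending them. The `example.com` URLs, the header values, and the helper names `build_get` and `build_post` are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Headers a browser would normally send; these values are illustrative only.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) demo",
    "Cookie": "session=abc123",
}


def build_get(base_url, params):
    """GET: the parameters are appended to the URL after a '?'."""
    return Request(base_url + "?" + urlencode(params), headers=HEADERS, method="GET")


def build_post(url, form):
    """POST: the form fields travel in the request body, not in the URL."""
    body = urlencode(form).encode("ascii")
    return Request(url, data=body, headers=HEADERS, method="POST")


search = build_get("https://example.com/search", {"q": "web crawler"})
login = build_post("https://example.com/login", {"user": "alice", "password": "secret"})

# Nothing is sent here; we only inspect what would go on the wire.
print(search.get_method(), search.full_url)
print(login.get_method(), login.full_url, login.data)
```

The GET request carries its query in the URL, while the POST request keeps the same kind of key-value data in `login.data`, which corresponds to the Form Data section shown by the developer tools.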

Fourth, what the response contains

Response status: The status code can be seen under General in the Headers tab. 200 indicates success, 301 a permanent redirect, 404 page not found, 502 a bad gateway (a server-side error), and so on.
Response headers: These include the content type, cookie information, and so on.
Response body: The purpose of a request is to get the response body, which may be HTML code, JSON, or binary data.
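
The three parts of a response can be handled along the following lines. This is a small sketch using only the standard library. The helper names `describe_status` and `decode_body` are assumptions for the example, and real crawlers usually need more careful charset handling than the fixed UTF-8 decoding used here.

```python
import json
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


def decode_body(content_type, body):
    """Interpret the response body according to the Content-Type header."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    if media_type.startswith("text/"):
        return body.decode("utf-8", errors="replace")
    return body  # binary data such as images or video stays as bytes


for code in (200, 301, 404, 502):
    print(code, describe_status(code))
```

The status code tells the crawler whether to continue, the `Content-Type` response header tells it which branch of `decode_body` applies, and the body is what finally gets parsed or saved.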

Fifth, a simple request demo

Request a web page with Python's requests library:
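
The original listing is not available here, so the following is a minimal sketch of what such a request might look like. It assumes the third-party `requests` package is installed. The function name `fetch_html` and the 10-second timeout are choices made for this example.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its source code as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return response.text
```

For instance, `print(fetch_html("https://www.baidu.com"))` would print the page source returned by the server.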

The output is the page source before rendering, that is, the content of the response body. You can also view the response headers:
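
A possible way to look at the response headers with `requests` is shown below. The helper name `fetch_headers` is an assumption for this sketch. `response.headers` itself is a case-insensitive mapping provided by the library, copied here into a plain dict for printing.

```python
import requests


def fetch_headers(url, timeout=10):
    """Return the response headers of a GET request as a plain dict."""
    response = requests.get(url, timeout=timeout)
    return dict(response.headers)
```

Typical keys in the result include `Content-Type` and `Set-Cookie`, which match the content type and cookie information described earlier.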

View the status code:
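
Reading the status code might look like this. The sketch assumes the `requests` package, and `check_status` is a hypothetical helper name. `requests.codes.ok` is the library's named constant for 200.

```python
import requests


def check_status(url, timeout=10):
    """Return the status code and whether it signals success (200)."""
    response = requests.get(url, timeout=timeout)
    return response.status_code, response.status_code == requests.codes.ok
```

A successful request would give `(200, True)`, while a missing page would give `(404, False)`.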

You can also add request headers to the request:
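
One way to attach headers with `requests` is sketched below. The `User-Agent` string is only an illustrative browser-like value, and `fetch_with_headers` is a name invented for this example. The `prepared` object is built offline so the headers can be inspected without contacting the server.

```python
import requests

# An illustrative browser-like User-Agent; any realistic value could be used.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )
}


def fetch_with_headers(url, headers=HEADERS, timeout=10):
    """Send a GET request that carries the given request headers."""
    response = requests.get(url, headers=headers, timeout=timeout)
    return response.status_code


# Build the request without sending it, to see which headers it would carry.
prepared = requests.Request("GET", "https://www.baidu.com", headers=HEADERS).prepare()
print(prepared.headers["User-Agent"])
```

Some servers reject requests that lack a browser-like `User-Agent`, so setting this header is a common first step when a plain request is refused.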

Grab a picture (the Baidu logo):
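
Downloading an image follows the same pattern, except that the raw bytes in `response.content` are written to a file opened in binary mode. The logo's address is not reproduced here, so `image_url` is a parameter the reader must supply. `download_image` is a hypothetical helper name.

```python
import requests


def download_image(image_url, path, timeout=10):
    """Download binary content and write it to a local file."""
    response = requests.get(image_url, timeout=timeout)
    response.raise_for_status()
    with open(path, "wb") as f:  # "wb": write raw bytes, not text
        f.write(response.content)
    return len(response.content)  # number of bytes saved
```

The image address can be copied from the Network tab of the developer tools and passed in as `image_url`.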

Sixth, how to solve JavaScript rendering problems

Use Selenium WebDriver:

Enter print(driver.page_source) and you will see that this time the code is the page source after rendering.
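
The Selenium steps can be sketched as follows. This assumes the third-party `selenium` package and a Chrome installation with a matching driver are available. The function names `make_driver` and `fetch_rendered` are chosen for this example, and the import is placed inside `make_driver` only so the rest of the code can be read and tested without Selenium installed.

```python
def make_driver():
    """Start a Chrome browser controlled by Selenium (assumed installed)."""
    from selenium import webdriver  # third-party package

    return webdriver.Chrome()


def fetch_rendered(url, driver_factory=make_driver):
    """Load a page in a real browser and return the HTML after rendering."""
    driver = driver_factory()
    try:
        driver.get(url)            # the browser runs the page's JavaScript
        return driver.page_source  # HTML as it stands after rendering
    finally:
        driver.quit()              # always close the browser window
```

Compared with a plain HTTP request, this is slower because a full browser is started, but the returned source includes content that JavaScript added to the page.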

Notes: using the Chrome browser

Press F12 to open the developer tools.

The Elements tab shows the HTML code after rendering.

Network tab

The Network tab lists the requests made by the browser. Click an entry to view detailed information, such as the request headers and response headers mentioned above.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
