Python Crawler One

Source: Internet
Author: User

What is a crawler?

What can a crawler do?

The nature of crawlers

The basic workflow of a crawler

What are Request & Response?

What to do with the crawled data

What is a crawler?

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less frequently used names include ant, auto-indexer, emulator, and worm.

In plain terms, a crawler is a program that fetches the web-page data you want, that is, it fetches data automatically.

What can a crawler do?

You can crawl pictures of people you like, videos you are interested in, or anything else you want, provided that the resource can be accessed through a browser.

What is the nature of a crawler?

The above describes what a crawler can do, with one premise: the resource must be accessible through a browser. For anyone familiar with the lifecycle of a web request, the nature of a crawler is simple: it simulates a browser opening a web page and extracts the part of the page data we want.

The process by which a browser opens a web page:

1. Enter the URL you want to visit in the browser's address bar.

2. After the DNS server resolves the server host, the browser sends a request to that server.

3. The server parses and processes the request, then returns the result (HTML, JS, CSS files, and other content) to the user.

4. The browser receives the result and renders it for the user on the screen.

As stated above, the essence of a crawler is an automated program that simulates a browser sending requests to a server, then receives, processes, and parses the results.

Key points of a crawler: simulating requests, parsing and processing, automation.

The basic workflow of a crawler

Initiate a request

A request is sent to the target site via an HTTP library. The request can carry additional headers and other information; we then wait for the server to respond.

Get the response content
If the server responds normally, we get a Response whose content is the page we requested. It may be an HTML or JSON string, binary data (an image or a video), or some other type.

Parse the content
The content obtained may be HTML, which can be parsed with regular expressions or a page-parsing library; it may be JSON, which can be converted directly into a JSON object; or it may be binary data, which can be saved or processed further.

Save the data
Data can be saved in many forms: as plain text, in a database, or as a file in a specific format.
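The four steps above can be sketched in a few lines of Python. This is a minimal sketch, not a complete crawler: the URL is a placeholder, the sample HTML stands in for a real response, and the actual network call is only indicated in a comment.

```python
import re

# Placeholder for the HTML a real request would return (an assumption
# for this sketch; a live crawler would receive this from the server).
SAMPLE_HTML = '<html><head><title>Example Page</title></head><body></body></html>'

def fetch(url):
    # Steps 1-2: initiate the request and receive the response.
    # In a real crawler this would be: requests.get(url, headers={...}).text
    return SAMPLE_HTML

def parse(html):
    # Step 3: extract the part of the page we want (here, the <title>).
    match = re.search(r'<title>(.*?)</title>', html)
    return match.group(1) if match else None

def save(data, path):
    # Step 4: persist the result as plain text.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(data)

title = parse(fetch('http://example.com'))
save(title, 'page_title.txt')
print(title)  # Example Page
```

Swapping `fetch` for a real HTTP call (e.g., with the `requests` library) turns the sketch into a working crawler without changing the other steps.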

Request & Response

The browser sends a message to the server where the URL is located; this process is called an HTTP Request.

After the server receives the browser's message, it processes it accordingly and sends a message back to the browser; this process is the HTTP Response.

When the browser receives the server's response, it processes the information accordingly and presents it to the user on the screen.

Let's take visiting Baidu as an example:

What does a request contain?

Request method

The most common methods are GET and POST; HEAD, PUT, DELETE, and OPTIONS also exist.
A key difference between GET and POST: with GET, the request data is appended to the URL, while with POST it is carried in the request body.

GET: Requests a representation of the specified resource. GET should be used only for reading data, not for actions that produce "side effects", such as in web applications. One reason is that GET requests may be made indiscriminately by web spiders and other robots.

POST: Submits data to the specified resource and asks the server to process it (e.g., submitting a form or uploading a file). The data is included in the request body. The request may create new resources, modify existing resources, or both.

HEAD: Like GET, it asks the server for the specified resource, except that the server does not return the resource's body. Its advantage is that you can obtain "information about the resource" (meta-information, or metadata) without transferring the entire content.

PUT: Uploads its latest content to the specified resource location.

OPTIONS: Makes the server return all HTTP request methods supported by the resource. Sending an OPTIONS request with '*' in place of a resource name can be used to test whether the web server is functioning properly.

DELETE: Asks the server to delete the resource identified by the Request-URI.

Request URL

A URL, the Uniform Resource Locator, is a concise representation of the location of a resource available on the Internet and the method for accessing it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating where the file is and how the browser should handle it.

The format of a URL consists of three parts:
The first part is the protocol (or service mode).
The second part is the host IP address (and sometimes the port number) where the resource is stored.
The third part is the specific address of the host resource, such as directory and file name.

A crawler must have a target URL before it can fetch any data, so the URL is the foundation on which a crawler obtains data.
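The three parts of a URL can be pulled apart with the standard library's `urllib.parse`. The URL below is just an illustrative example.

```python
from urllib.parse import urlparse

# Split a URL into the three parts described above.
url = 'https://www.example.com:8080/docs/index.html'
parts = urlparse(url)

print(parts.scheme)    # protocol (service mode): https
print(parts.hostname)  # host where the resource lives: www.example.com
print(parts.port)      # optional port number: 8080
print(parts.path)      # specific address of the resource: /docs/index.html
```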

Request headers

Contain header information such as User-Agent, Host, and Cookies. When requesting Baidu, for example, all of these request-header parameters are sent along with the request.

Request body
The data carried by the request, such as the form data sent when submitting a form (POST).
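To make the GET/POST distinction concrete, the sketch below builds (without sending) two requests with the standard library's `urllib.request`. The URLs, header values, and form fields are all made up for illustration.

```python
from urllib.request import Request
from urllib.parse import urlencode

# Typical request headers (illustrative values).
headers = {'User-Agent': 'Mozilla/5.0', 'Host': 'www.example.com'}

# GET: the request data rides in the URL itself (the ?q=python part).
get_req = Request('http://www.example.com/search?q=python', headers=headers)

# POST: the form data is encoded and attached as the request body.
form = urlencode({'username': 'alice', 'password': 'secret'}).encode('utf-8')
post_req = Request('http://www.example.com/login', data=form, headers=headers)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST (chosen automatically once data is attached)
print(post_req.data)          # the form data travels in the request body
```

Passing either object to `urllib.request.urlopen` would actually send it; here we only inspect how the data is packaged.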

What does a response contain?

The first line of every HTTP response is the status line: the HTTP version number, a 3-digit status code, and a phrase describing the status, separated from each other by spaces.

Response status

There are many response statuses, for example: 200 for success, 301 for a redirect, 404 for page not found, 502 for a server error.

    • 1xx Informational: the request has been received; the server continues processing.
    • 2xx Success: the request has been successfully received, understood, and accepted by the server.
    • 3xx Redirection: further action is required to complete the request.
    • 4xx Client error: the request contains a syntax error or cannot be fulfilled.
    • 5xx Server error: the server failed while handling an apparently valid request.

Common status codes:

    • 200 OK: the client's request succeeded.
    • 400 Bad Request: the request has a syntax error and cannot be understood by the server.
    • 401 Unauthorized: the request is not authorized; this status code must be used together with the WWW-Authenticate header field.
    • 403 Forbidden: the server received the request but refuses to serve it.
    • 404 Not Found: the requested resource does not exist, e.g., a wrong URL was entered.
    • 500 Internal Server Error: an unexpected error occurred on the server.
    • 503 Service Unavailable: the server cannot currently handle the client's request; it may return to normal after some time.
    • 301 Moved Permanently: the target has been permanently moved.
    • 302 Found: the target has been temporarily moved.
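Python's standard library already knows the standard reason phrases, so a crawler can classify a response code without a hand-written table. The `status_class` helper below is our own illustrative function, not a library API.

```python
from http import HTTPStatus

def status_class(code):
    # Map a 3-digit code to its class, mirroring the 1xx-5xx table above.
    return {1: 'informational', 2: 'success', 3: 'redirection',
            4: 'client error', 5: 'server error'}[code // 100]

for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, '-', status_class(code))
# 200 OK - success
# 301 Moved Permanently - redirection
# 404 Not Found - client error
# 503 Service Unavailable - server error
```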

Response headers

Contain information such as the content type, content length, server information, and cookie settings (Set-Cookie).

Response body

The most important part, containing the content of the requested resource, such as the page's HTML, an image, or other binary data.

Types of data crawled

Web page text: e.g., HTML documents, JSON-formatted text, etc.
Images: binary files that can be saved in an image format.
Video: likewise binary files.
Other: anything that can be requested can be obtained.
Ways to parse data

1. Direct processing
2. JSON parsing
3. Regular expressions
4. BeautifulSoup
5. pyquery
6. XPath
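Two of these approaches can be shown with the standard library alone (BeautifulSoup, pyquery, and XPath via lxml require third-party packages). The HTML and JSON snippets below are made-up examples.

```python
import json
import re

# Regular-expression parsing: pull the href targets out of some HTML.
html = '<a href="/page1">First</a><a href="/page2">Second</a>'
links = re.findall(r'href="(.*?)"', html)
print(links)  # ['/page1', '/page2']

# JSON parsing: convert a JSON string directly into a Python object.
payload = '{"status": "ok", "items": [1, 2, 3]}'
data = json.loads(payload)
print(data['items'])  # [1, 2, 3]
```

Regular expressions are fine for small, well-known snippets; for real pages a proper parser such as BeautifulSoup is usually more robust.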
Why crawled page data differs from what the browser shows

This happens because much of the data on many sites is loaded dynamically via JS and Ajax, so a page fetched directly with a GET request is displayed differently from what the browser renders. How to handle JS rendering? Analyze the Ajax requests, or use tools such as Selenium/WebDriver, Splash, PyV8, or Ghost.py.
Save data

Text: plain text, JSON, XML, etc.

Relational databases: structured databases such as MySQL, Oracle, SQL Server.

Non-relational databases: key-value stores such as MongoDB and Redis.
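A sketch of two of these storage options using only the standard library: a JSON text file, and a relational database (an in-memory SQLite database stands in for MySQL/Oracle here). The records are made-up sample data.

```python
import json
import sqlite3

records = [{'title': 'Page A', 'url': '/a'},
           {'title': 'Page B', 'url': '/b'}]

# Option 1: save as a text file (JSON).
with open('crawl_results.json', 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Option 2: save into a relational database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pages (title TEXT, url TEXT)')
conn.executemany('INSERT INTO pages VALUES (:title, :url)', records)
count = conn.execute('SELECT COUNT(*) FROM pages').fetchone()[0]
print(count)  # 2
```

The same pattern carries over to MySQL or MongoDB by swapping in the corresponding client library.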
