Python Crawler Learning, Part 1: HTTP Fundamentals

Source: Internet
Author: User

Yesterday I bought Cui's "Python 3 Network Crawler Development in Action", and it arrived today. I happily read through the chapter on crawler fundamentals, and I'm recording my (admittedly shallow) understanding here. If anything is off the mark, I hope you'll point it out.

1. HTTP Fundamentals

① We often type www.baidu.com into the browser's address bar, but what exactly is that? It is a URL, i.e. a Uniform Resource Locator. The URL pins down the location of the page that Baidu returns to us. URLs are in fact a subset of URIs; a URI (Uniform Resource Identifier) identifies that such a resource exists.

② What is hypertext? We know that HTTP stands for Hypertext Transfer Protocol, but what exactly is hypertext? Simply put, hypertext is the source code of a web page, HTML: text composed of a series of tags.

③ HTTP and HTTPS. Borrowing the explanation from Cui's book: HTTP is the protocol used to transfer hypertext data from the network to the local browser, and it ensures that hypertext documents are transmitted efficiently and accurately.

HTTPS stands for Hyper Text Transfer Protocol over Secure Socket Layer: a Secure Sockets Layer (SSL) is inserted between the application layer and the transport layer to protect the data in transit. The SSL layer first negotiates a session key using an asymmetric encryption algorithm, and then uses that key with a symmetric encryption algorithm to encrypt and decrypt the transmitted data. For the specifics of the encryption, refer to this article: 50378855
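The TLS/SSL handshake described above (negotiate keys first, then encrypt the traffic) is what Python's standard `ssl` module performs when it wraps a plain TCP socket. A minimal sketch, with www.baidu.com used only as an example hostname:

```python
import socket
import ssl

def tls_handshake_info(hostname: str, port: int = 443, timeout: float = 10.0):
    """Connect, perform the TLS handshake, and return (protocol, cipher name)."""
    # create_default_context() verifies the server's certificate by default
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        # wrap_socket() runs the handshake: key exchange happens here,
        # after which all data on `tls` is encrypted symmetrically
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.version(), tls.cipher()[0]

# Example (requires network access):
# print(tls_handshake_info("www.baidu.com"))
```

Note that certificate verification and hostname checking are on by default in `create_default_context()`, which is what makes the "secure" part of HTTPS meaningful.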

④ The HTTP request process. When we type www.baidu.com into the browser and press Enter, the page appears. What exactly happens in between?

1) The browser sends a query for the domain name to a DNS server, and the DNS server returns the IP address of the server that www.baidu.com resides on.

2) The browser performs a TCP three-way handshake with the IP address of the server hosting www.baidu.com.

3) After the connection is established, the browser sends an HTTP request message.

4) When the server receives the request, it will return the corresponding response to the user.

5) When the browser receives the response, it parses and displays it. If no more data needs to be transmitted and a short-lived connection is in use, the connection is closed with the four-way (FIN) teardown.
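The five steps above can be sketched with nothing but the standard library's `socket` module: resolve the name, connect over TCP, send a hand-written HTTP/1.1 request message, and read the raw response. This is a minimal sketch, not production crawler code:

```python
import socket

def build_request(host: str, path: str = "/") -> bytes:
    """A minimal HTTP/1.1 GET request message: request line + headers."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: close\r\n"   # ask for a short-lived connection
        f"\r\n"                    # blank line: end of headers, empty body
    ).encode("ascii")

def fetch(host: str, path: str = "/", port: int = 80) -> bytes:
    # Steps 1) and 2): DNS resolution and the TCP three-way handshake
    # both happen inside create_connection()
    with socket.create_connection((host, port), timeout=10) as sock:
        # Step 3): send the HTTP request message
        sock.sendall(build_request(host, path))
        # Step 4): receive the server's response until it closes the stream
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Step 5): leaving the `with` block closes our side of the connection
    return b"".join(chunks)

# Example (requires network access):
# raw = fetch("www.baidu.com")
```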

⑤ An HTTP request consists of four parts: the request method (GET/POST), the request path, the request headers, and the request body.

1) Request methods:

The difference between GET and POST: with a GET request, the parameters are appended to the URL and sent to the backend, so they are visible in the URL; a POST request sends its data to the backend in the request body. The biggest benefit of POST is that the request content is not exposed in the URL. Just imagine: if a login used a GET request, the password would be displayed in the browser's address bar for all to see.

Other common request methods. HEAD request: similar to a GET request, but only the response headers are returned.

DELETE request: asks the server to delete the specified page.

CONNECT request (I haven't actually seen one in the wild, haha): uses the server as a stepping stone, letting the server access other pages on the client's behalf.
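The GET-vs-POST difference is easy to see by encoding the same parameters both ways. The URL and credential values below are purely illustrative:

```python
from urllib.parse import urlencode

# Hypothetical login parameters, for illustration only
params = {"username": "alice", "password": "s3cret"}

# GET: the parameters become part of the URL, visible to anyone who sees it
get_url = "http://example.com/login?" + urlencode(params)

# POST: the very same parameters travel in the request body instead,
# so they never appear in the address bar
post_body = urlencode(params).encode("ascii")

print(get_url)    # the password is right there in the URL
print(post_body)  # same data, but carried in the body
```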

2) Request headers: the request headers contain many important fields. When writing a crawler, we construct the request headers according to the specific page.

Accept: specifies which content types the client can accept.

Host: specifies the host (domain name) and port number of the request.

Cookies (important, a reminder to myself): data that a site stores on the user's local machine in order to identify the user and track the session. Their main job is to maintain the session. When we visit a site we haven't opened in a long time and it prompts that we need to log in again, that is the cookie at work: it has expired. Cookies are especially important when crawling sites that require a login before handing over data.

User-Agent (also important): when we first start writing crawlers, we may send requests without any headers at all. The server receives such a request, checks the headers, finds no User-Agent, and may well think: "this one is obviously a novice, didn't even construct a UA before sending me a request, do you take me for a fool?" and simply deny the service. So, just in case, always construct a User-Agent.

Content-Type: also known as the Internet media type (MIME type), it indicates the specific type of content the client is sending. For example, for HTML set Content-Type to text/html; for JSON data, set it to application/json. Note that when logging in through a form you usually need to set Content-Type to application/x-www-form-urlencoded, so the data is submitted as a form.

3) Request body: generally, a POST request puts its data in the request body, while a GET request's body is empty.
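Putting the pieces together, here is a sketch of constructing a request with the headers discussed above, using the standard library's `urllib.request`. The endpoint URL, User-Agent string, and payload are all made-up examples:

```python
import json
import urllib.request

# Illustrative header values; a real crawler would tailor these per site
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
    "Content-Type": "application/json",
}

# A hypothetical JSON payload for the request body
payload = json.dumps({"query": "python"}).encode("utf-8")

req = urllib.request.Request(
    "http://example.com/api/search",  # hypothetical endpoint
    data=payload,                      # supplying data makes this a POST
    headers=headers,
)

print(req.get_method())  # POST, because the request carries a body
# Sending it would be: urllib.request.urlopen(req)  (requires network)
```

Note how the method follows from the body: `urllib.request` switches to POST automatically once `data` is given, matching the point above that GET requests have an empty body.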

⑥ Response

1) Common response status codes:

200: the request succeeded

301: the requested page has been permanently moved to a new location

302: the requested page temporarily redirects to another page

403: Forbidden, the server refuses this request

404: Not Found, the server could not find the requested page

405: Method Not Allowed, the request method is not permitted for this resource

500: Internal Server Error

2) Response headers; only a few common ones are listed here.

Content-Type: specifies the type of the returned data.

Set-Cookie: tells the browser to store this content in a cookie and send it with the next request.

Content-Encoding: specifies the encoding (such as gzip compression) applied to the response content.

3) Response body: usually the data the server returns to us, such as the web page's source code or JSON data. If you want to see exactly what it contains, press F12 to open the developer tools and inspect it in the Network panel.
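In a crawler, what you do with the response body usually depends on the Content-Type header from the previous section. A small sketch, with a made-up JSON body standing in for a real server response:

```python
import json

# Made-up example data standing in for a real server's response
body = '{"status": "ok", "results": [{"title": "page 1"}]}'
content_type = "application/json"

if content_type.startswith("application/json"):
    # JSON body: parse the text into Python data structures
    data = json.loads(body)
    print(data["results"][0]["title"])  # page 1
else:
    # Otherwise treat it as page source (HTML) and hand it to a parser
    print(body[:200])
```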

That's it for the HTTP fundamentals; next up are web page basics and crawler fundamentals. Enough said, I'm off to study.

