Python Crawler II: HTTP/HTTPS Requests and Responses

Last Update:2018-07-27 Source: Internet

Author: User

HTTP and HTTPS

HTTP (Hypertext Transfer Protocol) is a protocol for publishing and receiving HTML pages.

HTTPS (Hypertext Transfer Protocol over Secure Socket Layer) is simply the secure version of HTTP: it adds an SSL layer beneath HTTP.

SSL (Secure Sockets Layer) is a security protocol used mainly on the Web. It encrypts the network connection at the transport layer to protect data in transit over the Internet.

The default port for HTTP is 80.
The default port for HTTPS is 443.

How HTTP Works

A crawler's fetching process can be understood as simulating what a browser does.

The main function of a browser is to send requests to a server and display the chosen network resources in the browser window. HTTP is the set of rules that computers follow to communicate over the network.

HTTP requests and Responses

HTTP communication consists of two parts: a client request message and a server response message.

The process by which the browser sends an HTTP request:

When a user enters a URL in the browser's address bar and presses Enter, the browser sends an HTTP request to the HTTP server. HTTP requests mainly use the GET and POST methods.
When we enter the URL http://www.baidu.com in the browser, the browser sends a request for the HTML file at http://www.baidu.com, and the server sends the response back to the browser.
The browser parses the HTML in the response and finds that it references many other files, such as images, CSS files, and JS files. The browser automatically sends further requests to fetch each of them.
Once all the files have been downloaded, the web page is displayed in full according to the HTML structure.
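The third step above, finding the files a page references, can be sketched with the standard library's HTML parser. This is a minimal illustration, not a full crawler: the `ResourceCollector` class and the sample HTML are hypothetical and only cover `img`, `script`, and `link` tags.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects URLs of images, scripts, and linked files referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])


# Hypothetical page content, used only for illustration.
SAMPLE_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="/static/site.css">
    <script src="/static/app.js"></script>
  </head>
  <body><img src="/static/logo.png"></body>
</html>
"""

collector = ResourceCollector()
collector.feed(SAMPLE_HTML)
print(collector.resources)
```

A browser (or a crawler that wants the full page) would then issue one further request per collected URL.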

URL (Uniform Resource Locator): a way of identifying the complete address of a web page or other resource on the Internet.

Basic format:

scheme://host[:port#]/path/…/[?query-string][#anchor]

scheme: the protocol (for example, http, https, ftp)
host: the IP address or domain name of the server
port#: the server port (optional; if omitted, the protocol's default port is used, such as 80 for HTTP)
path: the path to the resource
query-string: parameters, the data sent to the HTTP server
anchor: an anchor (jumps to the specified position within the web page)

For example:

ftp://192.168.0.116:8080/index
http://www.baidu.com
http://item.jd.com/11936238.html#product-detail
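As a rough illustration of the format above, Python's `urllib.parse.urlsplit` breaks a URL into the same components. The variable names are arbitrary; the URLs are the examples from this section.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://item.jd.com/11936238.html#product-detail")
print(parts.scheme)    # protocol
print(parts.hostname)  # domain name or IP address
print(parts.port)      # None when the URL does not name a port
print(parts.path)      # resource path
print(parts.query)     # query string (empty here)
print(parts.fragment)  # anchor

ftp = urlsplit("ftp://192.168.0.116:8080/index")
print(ftp.hostname, ftp.port, ftp.path)
```

When `port` is `None`, the client falls back to the default port of the scheme.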

Client HTTP request

A URL only identifies the location of a resource; HTTP is what submits and fetches the resource. The client sends the server an HTTP request message made up of four parts:
request line, request headers, blank line, and request data. The detailed message format is omitted here.
A typical example of an HTTP request:

GET https://www.baidu.com/ HTTP/1.1
Host: www.baidu.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://www.baidu.com/
Accept-Encoding: gzip, deflate, sdch, br
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: BAIDUID=04E4001F34EA74AD4601512DD3C41A7B:FG=1; BIDUPSID=04E4001F34EA74AD4601512DD3C41A7B; PSTM=1470329258; MCITY=-343%3A340%3A; BDUSS=nF0MVFiMTVLcUh-Q2MxQ0M3STZGQUZ4N2hBa1FFRkIzUDI3QlBCZjg5cFdOd1pZQVFBQUFBJCQAAAAAAAAAAAEAAADpLvgG0KGyvLrcyfrG-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFaq3ldWqt5XN; H_PS_PSSID=1447_18240_21105_21386_21454_21409_21554; BD_UPN=12314753; sug=3; sugstore=0; ORIGIN=0; bdime=0; H_PS_645EC=7e2ad3QHl181NSPbFbd7PRUCE1LlufzxrcFmwYin0E6b%2BW8bbTMKHZbDP0g; BDSVRTM=0
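To make the four-part layout concrete, here is a minimal sketch that assembles a request message by hand. The `build_request` helper is hypothetical and only shows how the parts are joined (lines end with CRLF, and a blank line separates headers from data); real clients such as `urllib` do this for you.

```python
def build_request(method, target, headers, body=""):
    """Assemble a raw HTTP/1.1 request: request line, headers, blank line, body."""
    lines = [f"{method} {target} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Each line ends with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n" + body


raw = build_request(
    "GET",
    "/",
    {"Host": "www.baidu.com", "Connection": "keep-alive"},
)
print(raw)
```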

Request method

GET https://www.baidu.com/ HTTP/1.1

HTTP requests can use a variety of request methods, depending on the version of the HTTP standard.

HTTP 0.9: Only a basic GET for text.

HTTP 1.0: Completed the request/response model and filled out the protocol, defining three request methods: GET, POST, and HEAD.

HTTP 1.1: Built on 1.0, adding five request methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.

HTTP 2.0 (not yet widely used): The definition of request/response headers is basically unchanged, but all header names must be lowercase, and the request line is split into separate :method, :scheme, :authority (the counterpart of Host), and :path key-value pairs.

Serial Number	Method	Description
1	GET	Requests the specified page and returns the entity body.
2	HEAD	Like GET, except that the response has no body; used to fetch only the headers.
3	POST	Submits data to the specified resource for processing (for example, submitting a form or uploading a file); the data is carried in the request body. A POST request may create new resources and/or modify existing ones.
4	PUT	Replaces the content of the specified document with the data the client sends to the server.
5	DELETE	Asks the server to delete the specified page.
6	CONNECT	Reserved in HTTP/1.1 for proxy servers that can switch the connection to a tunnel.
7	OPTIONS	Lets the client query the capabilities of the server, such as the methods it supports.
8	TRACE	Echoes back the request received by the server; used mainly for testing or diagnostics.
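In Python, the method of a request can be chosen when building a `urllib.request.Request` object. The sketch below only constructs the requests and does not send them; the URL is the one used in this section and the form data is made up.

```python
from urllib.request import Request

url = "http://www.baidu.com/"

# Without data, urllib defaults to GET; with data, it defaults to POST.
get_req = Request(url)
post_req = Request(url, data=b"wd=python")
# Any other method can be requested explicitly.
head_req = Request(url, method="HEAD")

print(get_req.get_method(), post_req.get_method(), head_req.get_method())
```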

HTTP requests mainly use two methods, GET and POST.

GET fetches data from the server; POST sends data to the server.
GET request parameters are shown in the browser URL, and the HTTP server generates the response from the parameters contained in the requested URL. In other words, the parameters of a GET request are part of the URL. Example: http://www.baidu.com/s?wd=Chinese
POST request parameters are carried in the request body. The message length is not limited and the data is sent implicitly, so POST is usually used to submit large amounts of data to the HTTP server (for example, requests with many parameters or file uploads). The Content-Type header indicates the media type and encoding of the message body.

Note: Avoid submitting forms with GET, because it can cause security problems. For example, in a login form submitted with GET, the user name and password that the user entered are exposed in the address bar.
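The difference in where the parameters travel can be sketched as follows. The search URL is the example above; the login URL and the form field names are hypothetical, and nothing is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"wd": "Chinese"}

# GET: parameters are encoded into the URL itself.
get_url = "http://www.baidu.com/s?" + urlencode(params)
get_req = Request(get_url)

# POST: parameters travel in the request body instead of the URL.
# The /login path and the field names are hypothetical.
form = urlencode({"user": "alice", "password": "secret"}).encode("ascii")
post_req = Request(
    "http://www.example.com/login",
    data=form,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_req.full_url)
print(post_req.full_url, post_req.data)
```

Because the POST body is not part of the URL, the password does not appear in the address bar or in the request line.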

Common Request Headers

1. Host (hostname and port number)
Host: The host name and port number from the URL, specifying the Internet host and port of the requested resource. It is usually part of the URL.

2. Connection (connection type)
Connection: Indicates the type of connection between the client and the server.

The client sends a request containing Connection: keep-alive. HTTP/1.1 uses keep-alive as the default.
After the server receives the request:

If the server supports keep-alive, it replies with a response containing Connection: keep-alive and does not close the connection;
If the server does not support keep-alive, it replies with a response containing Connection: close and closes the connection.

If the client receives a response containing Connection: keep-alive, it sends the next request over the same connection, until one side actively closes the connection.

Keep-alive lets connections be reused in many cases, reducing resource consumption and shortening response times. For example, when a browser needs several files (such as an HTML file and its related images), it does not need to open a new connection for each one.
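The decision described above can be written down as a small rule. This is a simplified sketch: the `is_persistent` function is hypothetical, treats the Connection header as a single token, and ignores other factors that can also end a connection.

```python
def is_persistent(http_version, connection_header=None):
    """Return True if the connection should stay open after this message."""
    token = (connection_header or "").strip().lower()
    if token == "close":
        return False
    if token == "keep-alive":
        return True
    # No explicit header: HTTP/1.1 defaults to keep-alive, HTTP/1.0 does not.
    return http_version == "HTTP/1.1"


print(is_persistent("HTTP/1.1"))
print(is_persistent("HTTP/1.1", "close"))
print(is_persistent("HTTP/1.0", "Keep-Alive"))
```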

3. Upgrade-Insecure-Requests (upgrade to HTTPS requests)
Upgrade-Insecure-Requests: Upgrades insecure requests, meaning that HTTP resources are automatically replaced with HTTPS requests when loaded, so that the browser no longer shows a warning about HTTP requests on an HTTPS page.

HTTPS is an HTTP channel designed for security, so pages served over HTTPS are not allowed to make HTTP requests; if one occurs, the browser shows a warning or an error.

4. User-Agent (browser name)
User-Agent: The name of the client's browser. It will be covered in detail later.

5. Accept (acceptable file types)
Accept: The MIME (Multipurpose Internet Mail Extensions) types that the browser or other client can accept. The server can use it to decide which file format to return.

Example:

Accept: */* means that anything is acceptable.
Accept: image/gif means that the client wants resources in GIF image format.
Accept: text/html means that the client wants HTML text.
Accept: text/html, application/xhtml+xml;q=0.9, image/*;q=0.8 means that the MIME types the browser supports are HTML text, XHTML documents, and all image formats.

q is a weight factor in the range 0 <= q <= 1. The larger the q value, the more the request prefers the type listed before the ";". If no q value is given, it defaults to 1, and types are ranked from left to right. A q value of 0 means that the browser does not accept that content type.
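The q-value ranking can be illustrated with a short parser. This is a minimal sketch under the rules stated above (default q of 1, ties kept in left-to-right order); the `parse_accept` function is hypothetical and does not handle every detail of the header grammar.

```python
def parse_accept(header):
    """Return (media_type, q) pairs sorted by descending q, keeping header order on ties."""
    entries = []
    for item in header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # default weight when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((media_type, q))
    # sorted() is stable, so equal q values keep their left-to-right order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


ranked = parse_accept("text/html, application/xhtml+xml;q=0.9, image/*;q=0.8")
print(ranked)
```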

text: used for the standardized representation of textual information; text messages can use various character sets and/or formats. application: used to transfer application data or binary data.

6. Referer (source page)
Referer: The URL of the page from which the request originated, that is, the page the user was on when the current request was made. This header can be used to track which page or site a web request came from.

Sometimes an image on a website cannot be downloaded without the matching Referer. This is because the site uses hotlink protection: it checks the Referer to decide whether the request comes from its own pages, refuses the request if it does not, and allows the download if it does.
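A crawler can supply the expected Referer itself when it builds the request. The sketch below only constructs the request; the image URL and referring page are hypothetical, and whether a given site accepts the request depends on its own checks.

```python
from urllib.request import Request

# Hypothetical image URL and referring page, for illustration only.
image_url = "http://www.example.com/images/photo.jpg"
req = Request(
    image_url,
    headers={
        "Referer": "http://www.example.com/gallery.html",
        "User-Agent": "Mozilla/5.0",
    },
)
print(req.header_items())
# The image would then be fetched with urllib.request.urlopen(req).
```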

7. Accept-Encoding (content encoding)
Accept-Encoding: The encodings the browser can accept. An encoding is different from a file format: its purpose is to compress the file and speed up transfer. After receiving the response, the browser decodes it first and then checks the file format, which in many cases reduces download time considerably.

Example: Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
If several encodings match, they are ranked by q value. In this example, gzip is preferred, then identity (no encoding), and a server replying to a gzip-capable browser returns a gzip-encoded HTML page. If this header is not set in the request, the server assumes that the client accepts any content encoding.
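The effect of gzip encoding can be seen with the standard `gzip` module. This is a sketch with made-up page content; the `decode_body` helper is hypothetical and stands in for what a client does after reading the response's Content-Encoding header.

```python
import gzip

# Made-up, repetitive page content so that compression has something to gain.
html = ("<html><body>" + "hello crawler " * 200 + "</body></html>").encode("utf-8")

# Roughly what a server does when the client sends "Accept-Encoding: gzip".
compressed = gzip.compress(html)


def decode_body(body, content_encoding):
    """Undo the content encoding named in the response's Content-Encoding header."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body  # "identity" or no header: the body was sent as-is


restored = decode_body(compressed, "gzip")
print(len(html), len(compressed))
```

A crawler that advertises gzip in Accept-Encoding has to perform this decoding step itself before parsing the page.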

8. Accept-Language (language type)
Accept-Language: The languages the browser can accept, such as en or en-us for English, or zh or zh-cn for Chinese. It is used when the server can provide more than one language version.

9. Accept-Charset (character encoding)
Accept-Charset: The character encodings the browser can accept.

Example: Accept-Charset: iso-8859-1,gb2312,utf-8

ISO-8859-1: usually called Latin-1. Latin-1 includes the additional characters needed to write all Western European languages, and it is the default for English-language browsers.
GB2312: the standard Simplified Chinese character set.
UTF-8: a variable-length character encoding for Unicode that solves the problem of displaying text in multiple languages, enabling application internationalization and localization.

If the header is not set in the request, the default is to accept any character set.
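The practical difference between these character sets shows up when the same text is encoded with each one. A small sketch, using a two-character Chinese sample string chosen only for illustration:

```python
text = "中文"  # sample text: two Chinese characters

utf8_bytes = text.encode("utf-8")
gb_bytes = text.encode("gb2312")
print(len(utf8_bytes), len(gb_bytes))

# ISO-8859-1 has no code points for Chinese characters, so encoding fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```

A crawler has to decode a response body with the same charset the server used, otherwise the text comes out garbled.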

10. Cookie (cookies)
Cookie: The browser uses this header to send cookies to the server. A cookie is a small piece of data stored in the browser. It can record user information related to the server and can also be used to implement sessions.
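A Cookie header is a list of name=value pairs separated by semicolons. As a minimal sketch, the standard `http.cookies` module can split it apart; the header value below is a shortened version of the cookie in the request example earlier.

```python
from http.cookies import SimpleCookie

cookie_header = "BD_UPN=12314753; sug=3; sugstore=0; ORIGIN=0"

jar = SimpleCookie()
jar.load(cookie_header)

# Each entry is a Morsel object; .value holds the cookie's value.
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)
```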

11. Content-Type (POST data type)
Content-Type: The type of content carried in a POST request.

Example: Content-Type: text/xml; charset=gb2312
This indicates that the message body of the request contains plain-text XML data, encoded as gb2312.
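The media type and the charset can be pulled out of a Content-Type value with the standard `email.message` module, which understands this header syntax. A small sketch using the example value above:

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/xml; charset=gb2312"

media_type = msg.get_content_type()
charset = msg.get_content_charset()
print(media_type, charset)
```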

Server-Side HTTP response

An HTTP response is also made up of four parts: status line, response headers, blank line, and response body.

HTTP/1.1 200 OK
Server: Tengine
Connection: keep-alive
Date: Wed, 30 Nov 2016 07:58:21 GMT
Cache-Control: no-cache
Content-Type: text/html;charset=UTF-8
Keep-Alive: timeout=20
Vary: Accept-Encoding
Pragma: no-cache
X-NWS-LOG-UUID: bd27210a-24e5-4740-8f6c-25dbafa9c395
Content-Length: 180945

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ....
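The four parts of a response can be separated with a few string operations. This is a simplified sketch: the `parse_response` function is hypothetical, works on text rather than bytes, and assumes each header appears only once. The sample message is a shortened version of the response above.

```python
def parse_response(raw):
    """Split a raw HTTP response into (version, status, reason, headers, body)."""
    # The first blank line separates the header section from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, status, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(status), reason, headers, body


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Tengine\r\n"
    "Content-Type: text/html;charset=UTF-8\r\n"
    "\r\n"
    "<!DOCTYPE html>"
)
version, status, reason, headers, body = parse_response(raw)
print(version, status, reason)
print(headers)
```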

Response Status Code

The response status code consists of three digits. The first digit defines the category of the response, and there are five possible values.

Common Status Codes:

100~199: The server has received part of the request and needs the client to continue sending the rest in order to complete processing.
200~299: The server has received the request and finished processing it. Common: 200 (OK, request successful).
300~399: The client needs to take further action to complete the request, for example because the requested resource has moved to a new address. Common: 302 (the requested page has temporarily moved to a new URL), 307, and 304 (use the cached resource).
400~499: The client's request contains an error. Common: 404 (the server cannot find the requested page) and 403 (the server denied access; insufficient permissions).
500~599: An error occurred on the server side. Common: 500 (the request was not completed because the server ran into an unexpected condition).
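Since the first digit determines the category, a crawler can classify any status code with integer division. A minimal sketch: the `CATEGORIES` table and `describe` function are hypothetical names, while `http.HTTPStatus` is the standard library's list of status codes and reason phrases.

```python
from http import HTTPStatus

# Category names keyed by the first digit of the status code.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return the category and standard reason phrase for a status code."""
    return CATEGORIES[code // 100], HTTPStatus(code).phrase


for code in (200, 302, 304, 403, 404, 500):
    print(code, describe(code))
```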

Cookies and Session:

The interaction between the server and the client is limited to the request/response exchange. Once it ends, the connection is dropped, and on the next request the server treats the client as a new one.

To maintain a link between them and let the server know that a request comes from a previous user, the client's information must be saved somewhere.

Cookie: Identifies the user through information recorded on the client.

Session: Identifies the user through information recorded on the server.
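The two mechanisms usually work together: the server keeps the user data in a session and hands the client only a session id in a cookie. The sketch below is a toy in-memory model of that idea, not how any particular framework implements it; `SESSIONS`, `login`, `current_user`, and the cookie name `sessionid` are all hypothetical.

```python
import secrets

SESSIONS = {}  # server-side store: session id -> user data


def login(username):
    """Create a session and return the cookie value the client should store."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": username}
    return f"sessionid={session_id}"


def current_user(cookie_header):
    """Look up the user from the Cookie header sent back by the client."""
    name, _, session_id = cookie_header.partition("=")
    if name != "sessionid":
        return None
    session = SESSIONS.get(session_id)
    return session["user"] if session else None


cookie = login("alice")       # first response: server issues the cookie
print(current_user(cookie))   # later request: client sends it back
```

A crawler that needs to stay logged in therefore has to keep the cookies it receives and send them with later requests.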

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
