Python Crawler Basics (to be continued)

Last Update:2018-08-23 Source: Internet

Author: User


After studying crawlers for a while, I am organizing the relevant knowledge points here today and will keep updating them.

HTTP and HTTPS

HTTP (Hypertext Transfer Protocol) is a protocol for publishing and receiving HTML pages.

HTTPS (Hypertext Transfer Protocol over Secure Socket Layer) is, simply put, the secure version of HTTP: an SSL layer is added underneath HTTP.

SSL (Secure Sockets Layer) is a security protocol used mainly on the Web. It encrypts the network connection on top of the transport layer (TCP) and protects data in transit over the Internet.

The default HTTP port number is 80.
The default HTTPS port number is 443.
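
As a small illustration of the default ports, the sketch below works out which port a client would connect to for a given URL. The helper name `effective_port` and the `localhost:8080` URL are made up for this example; only `urllib.parse.urlsplit` comes from the standard library.

```python
from urllib.parse import urlsplit

# Default ports named in the text above
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the port a client would connect to for the given URL."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not spell out a port
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://www.baidu.com/s?wd=Chinese"))  # 80
print(effective_port("https://www.baidu.com/"))             # 443
print(effective_port("http://localhost:8080/"))             # 8080
```

An explicit port in the URL always wins; the defaults only apply when the URL leaves the port out.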

How HTTP Works

A crawler's fetching process can be understood as simulating what a browser does.

The main job of a browser is to send requests to a server and display the network resources you choose in the browser window. HTTP is the set of rules that computers follow to communicate over the network.

HTTP requests mainly use two methods: GET and POST.

GET fetches data from the server; POST sends data to the server.
The parameters of a GET request are part of the URL and are visible in the browser's address bar; the HTTP server generates its response based on the parameters in the requested URL. For example: http://www.baidu.com/s?wd=Chinese
The parameters of a POST request are carried in the request body, so they are not shown in the URL and the message length is not limited. POST is usually used to submit a large amount of data to the HTTP server (for example, requests with many parameters or file uploads). The "Content-Type" header of the request indicates the media type and encoding of the message body.

Note: Avoid submitting forms with GET, because it can cause security issues. For example, if a login form uses GET, the user name and password the user entered will be exposed in the address bar.
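
The difference between the two methods can be sketched with the standard library's `urllib`. Nothing is sent over the network here; the code only builds the request objects. The `www.example.com/login` URL and the `username`/`password` fields are invented for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"wd": "Chinese"}

# GET: parameters are encoded into the URL's query string
get_request = Request("http://www.baidu.com/s?" + urlencode(params))

# POST: the parameters travel in the request body as bytes.
# The /login path and the form fields are hypothetical.
form = {"username": "alice", "password": "secret"}
post_request = Request(
    "http://www.example.com/login",
    data=urlencode(form).encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

With GET the parameter shows up in the URL itself, while with POST the URL stays clean and the form data sits in `data`. Passing either object to `urllib.request.urlopen` would actually send it.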

Common Request Headers:

1. Host (host and port number)

Host: Specifies the Internet host name and port number of the requested resource, usually taken from the URL.

2. Connection (connection type)

Connection: Indicates the type of connection between the client and the server.

The client sends a request containing Connection: keep-alive; HTTP/1.1 uses keep-alive as the default value.
After the server receives the request:
- If the server supports keep-alive, it replies with a response containing Connection: keep-alive and does not close the connection;
- If the server does not support keep-alive, it replies with a response containing Connection: close and closes the connection.
If the client receives a response containing Connection: keep-alive, it sends the next request over the same connection, until one side actively closes the connection.

Keep-alive lets connections be reused in many cases, which reduces resource consumption and shortens response times. For example, when a browser needs several files (such as an HTML file and its related image files), it does not need to open a new connection for each one.
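
Connection reuse can be observed locally with the standard library. The sketch below starts a throwaway HTTP/1.1 server on 127.0.0.1 that answers each request with the client's source port, then makes two requests over one `http.client.HTTPConnection`. The handler class, function name, and the two paths are invented for this example; if both replies carry the same port number, both requests travelled over the same TCP connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class EchoPortHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 makes http.server keep the connection open between requests
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # Reply with the client's source port so connection reuse is visible
        body = str(self.client_address[1]).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet

def fetch_twice_on_one_connection():
    """Send two GET requests over a single connection; return the echoed ports."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), EchoPortHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_address[1], timeout=5
        )
        ports = []
        for path in ("/index.html", "/logo.png"):
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read fully before the connection is reused
            ports.append(response.read().decode("ascii"))
        conn.close()
        return ports
    finally:
        server.shutdown()
        server.server_close()

print(fetch_twice_on_one_connection())
```

Opening a fresh `HTTPConnection` for each path instead would normally show two different source ports, which is the extra connection setup that keep-alive avoids.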

3. Upgrade-Insecure-Requests (upgrade to HTTPS requests)

Upgrade-Insecure-Requests: Upgrades insecure requests, meaning that HTTP resources are automatically replaced with HTTPS requests when they are loaded, so that the browser no longer shows an HTTP request warning on HTTPS pages.

HTTPS is an HTTP channel designed for security, so HTTP requests are not allowed on pages served over HTTPS; if one occurs, the browser shows a warning or an error.

4. User-Agent (browser name)

User-Agent: The name of the client's browser.
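
A crawler usually sets this header itself, since `urllib` otherwise identifies itself with a `Python-urllib/...` value. The sketch below only builds the request object and sends nothing; the User-Agent string is an illustrative browser-like value, not one taken from the article.

```python
from urllib.request import Request

# The User-Agent string below is only an illustrative browser-like value
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

request = Request("http://www.baidu.com/", headers={"User-Agent": BROWSER_UA})

# urllib stores header names in capitalized form, hence "User-agent"
print(request.get_header("User-agent"))
print(request.host)
```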

5. Accept (transfer file type)

Accept: The MIME (Multipurpose Internet Mail Extensions) types that the browser or other client can accept. The server can use it to decide which file format to return.

Example:

Accept: */*: Indicates that anything can be received.

Accept: image/gif: Indicates that the client wants to receive resources in GIF image format;

Accept: text/html: Indicates that the client wants to receive HTML text.

Accept: text/html, application/xhtml+xml;q=0.9, image/*;q=0.8: Indicates that the MIME types supported by the browser are HTML text, XHTML and XML documents, and all image format resources.

q is the weight factor, in the range 0 <= q <= 1. The larger the q value, the more the request prefers the content of the type that precedes the ";". If the q value is not specified, it defaults to 1 and the order is left to right; a q value of 0 indicates that the browser does not accept that content type.
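
The weighting rule can be sketched as a small parser that splits an Accept value and sorts the entries by q. The function name `parse_accept` is invented for this example, and the parser is deliberately simplified: it ignores parameters other than q.

```python
def parse_accept(header_value):
    """Return (media_type, q) pairs sorted by descending q."""
    entries = []
    for item in header_value.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((media_type, q))
    # sorted() is stable, so equal weights keep their left-to-right order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)

print(parse_accept("text/html, application/xhtml+xml;q=0.9, image/*;q=0.8"))
# [('text/html', 1.0), ('application/xhtml+xml', 0.9), ('image/*', 0.8)]
```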

text: Used for the standardized representation of textual information; text messages can use a variety of character sets and formats. application: Used to transfer application data or binary data.

6. Referer (referring page)

Referer: Indicates the URL of the page from which the request originated, that is, the user reached the currently requested page from the Referer page. This header can be used to track which page, and which site, a Web request came from.

Sometimes downloading an image from a website requires the matching Referer; otherwise the image cannot be downloaded. That is because the site uses hotlink protection, which checks the Referer to decide whether the request comes from the site's own address: if not, the request is refused; if so, the download is allowed.
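
The hotlink check described above can be sketched from the server's point of view. This is only a guess at the general principle, not any particular site's implementation; the function name and the `example.com` hosts are hypothetical.

```python
from urllib.parse import urlsplit

def is_allowed_referer(referer, own_host):
    """Return True when the Referer points at the site's own host."""
    if not referer:
        return False  # no Referer at all: treated as a refusal in this sketch
    return urlsplit(referer).hostname == own_host

print(is_allowed_referer("http://www.example.com/gallery.html", "www.example.com"))  # True
print(is_allowed_referer("http://other.example.org/page.html", "www.example.com"))   # False
print(is_allowed_referer(None, "www.example.com"))                                   # False
```

This is why a crawler that downloads such images typically sends a Referer header naming a page on the same site.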

7. Accept-Encoding (file encoding and decoding format)

Accept-Encoding: Indicates the encodings the browser can accept. An encoding is different from a file format: its purpose is to compress files and speed up transfer. After receiving the Web response, the browser decodes it first and then checks the file format, which in many cases reduces download time considerably.

Example: Accept-Encoding: gzip;q=1.0, identity;q=0.5, *;q=0

If several encodings match at the same time, they are ranked by q value. In this example, gzip is preferred, followed by identity (no compression); a server that supports gzip returns a gzip-encoded HTML page. If this header is not set in the request message, the server assumes the client can accept any content encoding.
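
The decoding step on the client side can be sketched with the `gzip` module. The helper `decode_body` and the sample HTML are invented for this example, and only the gzip and identity cases are handled.

```python
import gzip

def decode_body(body, content_encoding):
    """Undo the content encoding named in a response's Content-Encoding header."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    # "identity" (or no header) means the body was sent as-is
    return body

html = b"<html><body>" + b"hello " * 200 + b"</body></html>"
compressed = gzip.compress(html)  # what a gzip-capable server would send

print(len(html), len(compressed))
print(decode_body(compressed, "gzip") == html)
```

The compressed body is much shorter than the original for repetitive content like this, which is where the saving in download time comes from.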

8. Accept-Language (language type)

Accept-Language: Indicates the languages the browser can accept, such as en or en-us for English, or zh or zh-cn for Chinese. It is used when the server can provide more than one language version.

9. Accept-Charset (character encoding)

Accept-Charset: Indicates the character encodings that the browser can accept.

Example: Accept-Charset: iso-8859-1,gb2312,utf-8

ISO-8859-1: usually called Latin-1. Latin-1 includes the additional characters that are indispensable for writing all Western European languages; the default value for English browsers is iso-8859-1.
GB2312: the standard Simplified Chinese character set;
UTF-8: a variable-length character encoding for Unicode that solves the problem of displaying text in multiple languages, enabling application internationalization and localization.

If the field is not set in the request message, the default is to accept any character set.
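
The three character sets from the example behave quite differently on the same text. The sketch below encodes a short Chinese string and a Western European word; the sample strings are chosen only for illustration.

```python
text = "中文"

# GB2312 and UTF-8 can both represent Chinese, with different byte lengths
for charset in ("gb2312", "utf-8"):
    encoded = text.encode(charset)
    print(charset, len(encoded), encoded.decode(charset) == text)

# Latin-1 has no Chinese characters, so encoding fails
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError:
    print("iso-8859-1 cannot represent", text)

# A Western European word fits in Latin-1 with one byte per character
print(len("café".encode("iso-8859-1")), len("café".encode("utf-8")))
```

Decoding a response with the wrong charset is a common source of garbled text in crawlers, so it helps to check which charset the server actually used.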

10. Cookie (cookies)

Cookie: The browser uses this header to send cookies to the server. Cookies are small pieces of data stored in the browser; they can record user information related to the server and can be used to implement sessions, which will be covered in detail later.

11. Content-Type (POST data type)

Content-Type: The type of content used in a POST request.

Example: Content-Type: text/xml; charset=gb2312

Indicates that the message body of the request contains plain-text XML data, with the character encoding "gb2312".
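
A Content-Type value like the one in the example can be taken apart with the standard library's `email.message.Message`, which understands this header syntax. The helper name `parse_content_type` and the sample XML body are invented for this sketch.

```python
from email.message import Message

def parse_content_type(header_value):
    """Split a Content-Type value into (media type, charset)."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()

media_type, charset = parse_content_type("text/xml; charset=gb2312")
print(media_type, charset)

# The charset tells the receiver how to turn the body bytes back into text
xml_body = "<name>中文</name>".encode(charset)
print(xml_body.decode(charset))
```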

Cookies and Session:

The interaction between the server and the client is limited to the request/response exchange; the connection is dropped afterwards, and on the next request the server treats the client as a new one.

To keep a link between them and let the server know that a request comes from a previous user, the client's information must be saved somewhere.

Cookie: Determines the user's identity by the information recorded on the client.

Session: Determines the user's identity by information logged on the server side.
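
How the two work together can be sketched in a few lines: the server keeps the user data in a session store and hands the client only a session id inside a cookie. The `SESSIONS` dict, the `sessionid` cookie name, and both functions are assumptions made for this example, not a real framework's API.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # server-side store: session id -> user data

def login(username):
    """Server side: create a session and return a Set-Cookie header value."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"username": username}
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    return cookie["sessionid"].OutputString()

def identify(cookie_header):
    """Server side: look up the user from the Cookie header of a later request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if "sessionid" not in cookie:
        return None
    session = SESSIONS.get(cookie["sessionid"].value)
    return session["username"] if session else None

set_cookie = login("alice")           # the client stores "sessionid=..."
print(identify(set_cookie))           # the client echoes the cookie back
print(identify("sessionid=unknown"))  # an id the server never issued
```

The cookie lives on the client and the session data lives on the server; the session id is the only thing linking the two.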


This article is an English version of an article originally published in Chinese on aliyun.com.
