HTTP (Hypertext Transfer Protocol) is a method for publishing and receiving HTML pages.
HTTPS (Hypertext Transfer Protocol over Secure Socket Layer) is, put simply, the secure version of HTTP: an SSL layer is added beneath HTTP.
SSL (Secure Sockets Layer) is a security protocol used mainly on the Web. It encrypts the network connection at the transport layer and protects data transmitted over the Internet.
The HTTP port number is 80.
The HTTPS port number is 443.
The crawler crawl process can be understood as
The main function of a browser is to send requests to a server and display the network resources you choose in the browser window. HTTP is the set of rules computers use to communicate over the network.

HTTP Requests and Responses

HTTP communication consists of two parts: a client request message and a server response message.

The process by which the browser sends an HTTP request:
When a user enters a URL in the browser's address bar and presses Enter, the browser sends an HTTP request to the HTTP server. HTTP requests are mainly divided into two methods, GET and POST.
When we enter the URL http://www.baidu.com in the browser, the browser sends a request for the HTML file at http://www.baidu.com, and the server sends the response file back to the browser.
The browser parses the HTML in the response and finds that it references many other files, such as image, CSS, and JS files. The browser automatically sends further requests to fetch each of them.
When all the files have been downloaded successfully, the web page is displayed in full according to the HTML structure.
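The step in which the browser scans the HTML for further files to request can be sketched roughly as follows. This is only an illustration using Python's standard `html.parser`; the `ResourceCollector` class, the sample page, and the `example.com` URLs are assumptions made for the example, and a real browser does far more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect the URLs of images, stylesheets, and scripts referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative references against the page URL.
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(urljoin(self.base_url, attrs["href"]))


# Hypothetical page content, used only for illustration.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="js/app.js"></script>
</head><body><img src="http://img.example.com/logo.png"></body></html>
"""

collector = ResourceCollector("http://www.example.com/index.html")
collector.feed(SAMPLE_HTML)
for url in collector.resources:
    print(url)  # each of these would be fetched with a separate HTTP request
```

A crawler follows the same idea: fetch the HTML first, then decide which of the referenced URLs are worth requesting.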
URL (Uniform Resource Locator, sometimes expanded as Universal Resource Locator): a way of fully describing the address of a web page or other resource on the Internet.
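To make the parts of a URL concrete, here is a small sketch that splits one with the standard `urllib.parse` module. The URL itself is a made-up example, not one taken from the text.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL chosen for illustration.
url = "http://www.example.com:8080/search/index.html?wd=python&page=2#results"
parts = urlsplit(url)

print(parts.scheme)           # protocol used to reach the resource
print(parts.hostname)         # host that serves it
print(parts.port)             # port number (None if the URL omits it)
print(parts.path)             # location of the resource on that host
print(parse_qs(parts.query))  # query parameters as a dict of lists
print(parts.fragment)         # anchor inside the page
```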
A URL only identifies the location of a resource; HTTP is used to submit and fetch the resource. The request message that the client sends to the server has the following format:
It consists of four parts: the request line, the request headers, a blank line, and the request data.

A typical example of an HTTP request:
GET https://www.baidu.com/ HTTP/1.1
Host: www.baidu.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://www.baidu.com/
Accept-Encoding: gzip, deflate, sdch, br
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: baiduid=04e4001f34ea74ad4601512dd3c41a7b:fg=1; Bidupsid=04e4001f34ea74ad4601512dd3c41a7b; pstm=1470329258; mcity=-343%3a340%3 A; bduss= Nf0mvfimtvlcuh-q2mxq0m3stzgquz4n2hba1ffrkizudi3qlbczjg5cfdod1pzqvfbqufbjcqaaaaaaaaaaaeaaadplvgg0kgyvlrcyfrg-aaaaaaaaaaaaa AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFAQ3LDWQT5XN; H_ps_pssid=1447_18240_21105_21386_21454_21409_21554; bd_upn=12314753; sug=3; sugstore=0; origin=0; bdime=0; h_ps_645ec=7e2ad3qhl181nspbfbd7pruce1llufzxrcfmwyin0e6b%2bw8bbtmkhzbdp0g; bdsvrtm=0
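A request message like this one can be taken apart in a few lines of code. The sketch below is a simplified illustration rather than a full HTTP parser: `parse_request` is a hypothetical helper, and the shortened sample request is an assumption made for the example.

```python
def parse_request(raw: str):
    """Split a raw HTTP request into request line, headers, and body."""
    # The first blank line separates the head (request line + headers) from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers, body


# Shortened, hypothetical request used only for illustration.
raw_request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.baidu.com\r\n"
    "Connection: keep-alive\r\n"
    "Accept: text/html\r\n"
    "\r\n"
)
method, target, version, headers, body = parse_request(raw_request)
print(method, target, version)
print(headers)
```

A GET request normally has an empty body, which is why `body` comes back as an empty string here.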
GET https://www.baidu.com/ HTTP/1.1
HTTP requests can use a variety of request methods, depending on the version of the HTTP standard.
HTTP 0.9: only the basic text GET function.
HTTP 1.0: completed the request/response model and filled out the protocol, defining three request methods: GET, POST, and HEAD.
HTTP 1.1: updated on the basis of 1.0 with five new request methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.
HTTP 2.0 (not yet widely adopted when this was written): the definition of request/response headers is essentially unchanged, but all header keys must be lowercase, and the request line is split into independent key-value pairs: :method, :scheme, :host (named :authority in the final HTTP/2 specification), and :path.
| No. | Method | Description |
|---|---|---|
| 1 | GET | Requests the specified page and returns the entity body. |
| 2 | HEAD | Similar to GET, except that the response has no body; used to retrieve only the headers. |
| 3 | POST | Submits data to the specified resource for processing (such as submitting a form or uploading a file). The data is included in the request body. A POST request may create new resources and/or modify existing ones. |
| 4 | PUT | Replaces the content of the specified document with the data the client sends to the server. |
| 5 | DELETE | Requests that the server delete the specified page. |
| 6 | CONNECT | Reserved in HTTP/1.1 for proxy servers that can switch the connection to a tunnel. |
| 7 | OPTIONS | Allows the client to query the server's capabilities. |
| 8 | TRACE | Echoes the request received by the server; used mainly for testing or diagnostics. |
GET fetches data from the server; POST sends data to the server.
The parameters of a GET request are shown in the browser's URL, and the HTTP server generates the response based on the parameters in the request URL. In other words, the parameters of a GET request are part of the URL.
The parameters of a POST request are placed in the request body. The message length is not limited and the data is not shown in the URL, so POST is usually used to submit large amounts of data to the HTTP server (for example, requests with many parameters or file uploads). The Content-Type header indicates the media type and encoding of the message body.
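The difference in where the parameters travel can be seen by building both kinds of request with the standard `urllib.request` module. This is a minimal sketch: the `example.com` URL and the parameter names are assumptions, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"wd": "python", "page": 1}  # hypothetical parameters

# GET: the parameters become part of the URL.
get_request = Request("http://www.example.com/s?" + urlencode(params))

# POST: the same parameters travel in the request body instead,
# and Content-Type describes how that body is encoded.
post_request = Request(
    "http://www.example.com/s",
    data=urlencode(params).encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Passing either object to `urllib.request.urlopen` would send it; `Request` switches to POST automatically once `data` is supplied.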
Note: avoid submitting forms with GET, because it can cause security issues. For example, in a login form submitted with GET, the user name and password the user enters are exposed in the address bar.

Common Request Headers

1. Host (host and port number)

Host: the web name and port number in the URL, specifying the Internet host and port number of the requested resource; usually part of the URL.

2. Connection (connection type)
Connection: indicates the type of connection between the client and the server.
The client initiates a request containing Connection: keep-alive; HTTP/1.1 uses keep-alive as the default value.
After the server receives the request, it either keeps the connection open and replies with a response containing Connection: keep-alive, or replies with Connection: close and closes the connection.
If the client receives a response containing Connection: keep-alive, it sends the next request over the same connection, until one party actively closes the connection.
Keep-alive lets connections be reused in many cases, which reduces resource consumption and shortens response times. For example, when a browser needs several files (such as an HTML file and its related images), it does not need to open a new connection for each one.

3. Upgrade-Insecure-Requests (upgrade to HTTPS requests)
Upgrade-Insecure-Requests: upgrades insecure requests, meaning that HTTP resources are automatically requested over HTTPS when loaded, so the browser no longer shows an HTTP request warning on an HTTPS page.
HTTPS is HTTP over a secure channel, so HTTP requests are not allowed on pages served over HTTPS; if one occurs, the browser shows a warning or an error.

4. User-Agent (browser name)

User-Agent: the name of the client's browser; it is covered in detail later.

5. Accept (acceptable file types)
Accept: the MIME (Multipurpose Internet Mail Extensions) types that the browser or other client can accept. The server can use it to decide which format to return.

Example:
Accept: */*: indicates that anything can be received.
Accept: image/gif: indicates that the client wants to receive resources in GIF image format.
Accept: text/html: indicates that the client wants to receive HTML text.
Accept: text/html, application/xhtml+xml;q=0.9, image/*;q=0.8: indicates that the MIME types the browser supports are HTML text, XHTML documents, and all image formats.
q is a weight factor in the range 0 <= q <= 1. The larger the q value, the more the request prefers the type that precedes the ";". If no q value is specified, it defaults to 1, and types are ranked from left to right. A value of 0 indicates that the browser does not accept that content type.
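The q-value rules can be expressed as a short function. The sketch below is an illustration only: `parse_accept` is a hypothetical helper, it ignores any parameter other than q, and the `text/plain;q=0` entry is added to the example header to show a rejected type.

```python
def parse_accept(header: str):
    """Return (media_type, q) pairs sorted from most to least preferred."""
    entries = []
    for item in header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # default weight when no q value is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their left-to-right order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


accepted = parse_accept(
    "text/html, application/xhtml+xml;q=0.9, image/*;q=0.8, text/plain;q=0"
)
for media_type, q in accepted:
    print(media_type, q)

# Types with q=0 are explicitly not acceptable.
acceptable = [media_type for media_type, q in accepted if q > 0]
```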
text: used for the standardized representation of textual information; text messages can use a variety of character sets and formats.
application: used to transfer application data or binary data.

6. Referer (referring page)
Referer: indicates the URL of the page from which the request originated, that is, the page the user was on when making the current request. It can be used to track which page, and which site, a web request came from.
Sometimes downloading an image from a website requires the matching Referer, otherwise the download fails. This is because the site uses hotlink protection: it checks the Referer to see whether the request comes from its own site, refuses it if not, and allows the download if so.

7. Accept-Encoding (content encoding formats)
Accept-Encoding: indicates the encodings the browser can accept. Encoding is different from file format: its purpose is to compress files and speed up transfer. After receiving the response, the browser decodes it first and then checks the file format, which in many cases reduces download time.

Example: Accept-Encoding: gzip;q=1.0, identity;q=0.5, *;q=0

If several encodings match, they are ranked by q value. In this example gzip is preferred, followed by identity (no encoding), and a server that supports gzip returns the HTML page gzip-encoded. If this header is not set in the request message, the server assumes that the client accepts any content encoding.

8. Accept-Language (language types)
Accept-Language: indicates the languages the browser can accept, such as en or en-US for English and zh or zh-CN for Chinese. It is used when the server can provide more than one language version.

9. Accept-Charset (character encoding)

Accept-Charset: indicates the character encodings the browser can accept.

Example: Accept-Charset: iso-8859-1,gb2312,utf-8

If this header is not set in the request message, the default is to accept any character set.

10. Cookie (Cookie)
Cookie: the browser uses this header to send cookies to the server. A cookie is a small piece of data stored in the browser. It can record user information related to the server and can be used to implement sessions, which are covered in detail later.

11. Content-Type (POST data type)

Content-Type: the type of content used in a POST request.

Example: Content-Type: text/xml; charset=gb2312

This indicates that the message body of the request contains data of the plain-text XML type, with the character encoding gb2312.

Server-Side HTTP Response
The HTTP response is also made up of four parts: the status line, the response headers, a blank line, and the response body.

Common Response Headers (for reference)

HTTP/1.1 200 OK
Server: Tengine
Connection: keep-alive
Date: Wed, 30 Nov 2016 07:58:21 GMT
Cache-Control: no-cache
Content-Type: text/html;charset=UTF-8
Keep-Alive: timeout=20
Vary: Accept-Encoding
Pragma: no-cache
X-NWS-LOG-UUID: bd27210a-24e5-4740-8f6c-25dbafa9c395
Content-Length: 180945

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ....
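A response message can be split into status line, headers, and body in much the same way as a request. The sketch below is a simplified illustration: `parse_response` is a hypothetical helper, and the sample response is shortened from the one above with a body and Content-Length invented for the example.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into status line fields, headers, and body."""
    # The first blank line separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


# Shortened, hypothetical response used only for illustration.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Tengine\r\n"
    b"Content-Type: text/html;charset=UTF-8\r\n"
    b"Content-Length: 15\r\n"
    b"\r\n"
    b"<!DOCTYPE html>"
)
version, code, reason, headers, body = parse_response(raw_response)
print(version, code, reason)
print(headers["content-type"], len(body))
```

The body is kept as bytes because its interpretation depends on headers such as Content-Type and Content-Encoding.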
In theory, every response header should correspond to a request header. For efficiency, security, and other reasons, however, the server adds further response headers, as the example above shows:

1. Cache-Control: must-revalidate, no-cache, private

This value tells the client that the server does not want the client to cache the resource; the next time the resource is needed, the client must request it again and cannot take it from a cached copy.
Cache-Control is an important piece of information in the response headers. When the client's request headers contain Cache-Control: max-age=0, explicitly indicating that it does not want a cached copy of the server resource, the Cache-Control in the response usually comes back as no-cache, meaning "then don't cache it".
When the client's request headers do not contain Cache-Control, the server usually decides for itself and applies different caching policies to different resources. For example, oschina's policy for cached image resources is Cache-Control: max-age=86400, which means that for 86,400 seconds from the current time the client can read the resource directly from its cached copy without requesting it from the server.
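How a client might act on these Cache-Control values can be sketched as follows. This is a deliberately simplified illustration, not a complete caching implementation: `parse_cache_control` and `is_fresh` are hypothetical helpers, and only the no-cache, no-store, and max-age directives are considered.

```python
def parse_cache_control(value: str) -> dict:
    """Turn 'a, b=1' into {'a': None, 'b': '1'}."""
    directives = {}
    for item in value.split(","):
        name, _, arg = item.strip().partition("=")
        if name:
            directives[name.lower()] = arg or None
    return directives


def is_fresh(cache_control: str, age_seconds: int) -> bool:
    """Decide whether a cached copy may be used without contacting the server."""
    directives = parse_cache_control(cache_control)
    if "no-cache" in directives or "no-store" in directives:
        return False
    max_age = directives.get("max-age")
    # max-age is relative: the copy is fresh while its age is below the limit.
    return max_age is not None and age_seconds < int(max_age)


print(is_fresh("max-age=86400", age_seconds=3600))    # copy is one hour old
print(is_fresh("max-age=86400", age_seconds=90000))   # copy is older than a day
print(is_fresh("must-revalidate, no-cache, private", age_seconds=0))
```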
2. Connection: keep-alive

This header answers the client's Connection: keep-alive and tells the client that the server's TCP connection is also persistent, so the client can keep sending HTTP requests over it.

3. Content-Encoding: gzip

Tells the client that the resource sent by the server is gzip-encoded; on seeing this, the client should decode the resource with gzip.

4. Content-Type: text/html;charset=UTF-8

Tells the client the type and character encoding of the resource: the client decodes the resource as UTF-8 and then parses it as HTML. When a website shows garbled text, it is often because the server did not return the correct encoding.

5. Date: Sun, Sep 06:18:21 GMT

This is the server time at which the server sent the resource. GMT is Greenwich Mean Time. All times sent in HTTP are in GMT, mainly to avoid confusion when resources are requested across different time zones on the Internet.

6. Expires: Sun, 1 Jan 01:00:00 GMT

This response header is also related to caching: it tells the client that it may use the cached copy directly until this time. The value is clearly problematic, because the client's and server's clocks are not necessarily the same, and a difference causes errors. This header is therefore less accurate than Cache-Control: max-age=*, because the value in max-age is a relative time, which is both easier to understand and more accurate.

7. Pragma: no-cache

This has the same meaning as Cache-Control: no-cache.

8. Server: Tengine/1.4.6

This is the server software and its version; it simply tells the client about the server.

9. Transfer-Encoding: chunked

This response header tells the client that the server sends the resource in chunks. Chunked resources are generally generated dynamically by the server, which does not know the total size when sending begins. Each chunk is independent and is labelled with its own length; the last chunk has length 0, and when the client reads this zero-length chunk it knows the resource has been fully transmitted.

10. Vary: Accept-Encoding

Tells the cache server to cache two versions, compressed and uncompressed. This header is not very useful today, because browsers now support compression.

Response Status Codes
The response status code consists of three digits. The first digit defines the category of the response, and there are five possible values.

Common status codes:
100~199: the server has successfully received part of the request and requires the client to submit the rest in order to complete the process.
200~299: the server has successfully received the request and completed the entire process. Common: 200 (OK, request successful).
300~399: to complete the request, the client needs to take further action. For example, the requested resource has moved to a new address. Common: 302 (the requested page has temporarily moved to a new URL), 307, and 304 (use the cached resource).
400~499: the client's request contains an error. Common: 404 (the server cannot find the requested page) and 403 (the server denied access; insufficient permissions).
500~599: an error occurred on the server side. Common: 500 (the request was not completed; the server encountered an unexpected condition).
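Because the first digit determines the category, a crawler can classify any status code with integer division. The sketch below uses the standard `http.HTTPStatus` enum for the reason phrases; the `describe` function and the category labels are assumptions made for this example.

```python
from http import HTTPStatus

CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code: int) -> str:
    """Return the code with its standard phrase and its category."""
    category = CATEGORIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know still have a category.
        phrase = "unregistered code"
    return f"{code} {phrase} ({category})"


for code in (200, 302, 304, 403, 404, 500):
    print(describe(code))
```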
The interaction between server and client is limited to the request/response exchange, after which the connection is closed; on the next request, the server treats the client as a new one.
To maintain a link between requests, so that the server knows a request comes from a previous user, the client's information must be saved somewhere.
Cookie: identifies the user through information recorded on the client.
Session: identifies the user through information recorded on the server.
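The two mechanisms usually work together: the server keeps the session data and gives the client a cookie containing only an identifier. The sketch below illustrates this with the standard `http.cookies` module; the cookie name `sessionid`, its value, and the in-memory `SESSIONS` dictionary are all assumptions for the example.

```python
from http.cookies import SimpleCookie

# Server side: session data lives on the server, keyed by an identifier.
SESSIONS = {"abc123": {"user": "alice"}}  # hypothetical session store

# Server side: hand the identifier to the client in a Set-Cookie header.
response_cookie = SimpleCookie()
response_cookie["sessionid"] = "abc123"
response_cookie["sessionid"]["path"] = "/"
set_cookie_header = response_cookie.output(header="Set-Cookie:")
print(set_cookie_header)

# Client side: the browser returns it in the Cookie header of later requests.
request_cookie = SimpleCookie()
request_cookie.load("sessionid=abc123; theme=dark")

# Server side: the identifier recovers the session, so the user is recognized.
session = SESSIONS.get(request_cookie["sessionid"].value)
print(session)
```

Only the identifier crosses the network; the user information itself never leaves the server.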
1.2. (review) Http/https's request and response