HTTP protocol (Hypertext Transfer Protocol):
The Hypertext Transfer Protocol is a way to publish and receive HTML pages.
HTTPS protocol (Hypertext Transfer Protocol over Secure Sockets Layer):
HTTPS can be understood as the secure version of HTTP, that is, HTTP with an SSL layer added underneath it.
SSL (Secure Sockets Layer):
SSL is primarily used for secure transport on the web; it encrypts data in transit between the client and the server. It was developed by Netscape.
Protocol port numbers: HTTP uses port 80 by default, and HTTPS uses port 443.
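As a small illustration, Python's standard urllib.parse can split a URL and report an explicit port if one is present. The DEFAULT_PORTS table and the effective_port helper below are hypothetical names introduced for this sketch; they rely on the well-known defaults of 80 for HTTP and 443 for HTTPS.

```python
from urllib.parse import urlsplit

# Well-known default ports; this table is an illustrative helper, not a library API.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port in the URL, or the default port for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]
```

For example, effective_port("http://example.com:8080/") reports the explicit port, while a URL without one falls back to the scheme default.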
How the Crawler works:
A web crawler's work can be understood as simulating what a browser does.
The main job of a browser is to send requests to a server and display the resources the server returns in its window.
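A minimal sketch of "simulating the browser", assuming Python's standard urllib.request: the crawler builds a request that carries browser-like headers. The helper name build_browser_like_request and the header values are illustrative choices, and nothing is sent over the network here.

```python
from urllib.request import Request

# A browser-like User-Agent string; any real browser's value could be used instead.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/64.0.3253.3 Safari/537.36"
)


def build_browser_like_request(url):
    """Build (but do not send) a request that looks like it came from a browser."""
    headers = {
        "User-Agent": BROWSER_USER_AGENT,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
    return Request(url, headers=headers)


request = build_browser_like_request("https://www.douban.com/")
# Passing `request` to urllib.request.urlopen would actually send it
# (that step needs network access and is left out of this sketch).
```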
HTTP requests and Responses:
HTTP communication consists of two parts: the client request message and the server response message.
The process by which the browser sends an HTTP request:
When the user enters a URL in the address bar and presses Enter, the browser sends an HTTP request to the HTTP server. The most common HTTP request methods are GET and POST.
The browser sends a request for the server-side HTML file, and the server returns a response.
The browser parses the HTML in the response and finds that it references other files, such as images, CSS, and JS, so it sends further requests to fetch those resources.
When all the files have been downloaded successfully, the browser renders the final page according to the structure of the HTML.
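The step where the browser discovers the files a page references can be sketched with Python's standard html.parser. The ResourceCollector class, the find_resources helper, and the sample HTML are illustrative names and data, not part of any crawler library; the sketch only looks at img, script, and stylesheet link tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs of images, scripts, and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative paths against the page URL, as a browser would.
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(urljoin(self.base_url, attrs["href"]))


def find_resources(html, base_url):
    """Return the absolute URLs of the sub-resources referenced in `html`."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources


SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="app.js"></script>
</head><body><img src="https://img.example.com/logo.png"></body></html>"""

resources = find_resources(SAMPLE_HTML, "https://example.com/index.html")
```

A crawler would then request each URL in resources, just as the browser does for the page's images, CSS, and JS.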
Client HTTP Request:
The client sends an HTTP request to the server in the format:
Request line | Request headers | Blank line | Request data
[Figure: general format of the request message (01_request.png)]
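The four-part layout can be sketched by assembling a request message by hand. The function build_request_message and the example host, path, and form data are made up for illustration; real clients build this for you.

```python
def build_request_message(method, target, headers, body=b""):
    """Assemble request line, headers, blank line, and body into raw bytes."""
    lines = [f"{method} {target} HTTP/1.1"]  # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]  # headers
    head = "\r\n".join(lines) + "\r\n\r\n"  # the empty line ends the header section
    return head.encode("ascii") + body  # request data follows the blank line


message = build_request_message(
    "POST",
    "/login",
    {
        "Host": "www.example.com",
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": "9",
    },
    b"user=demo",
)
```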
An example of a typical HTTP request:
GET https://www.douban.com/ HTTP/1.1
Host: www.douban.com
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3253.3 Safari/537.36
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: zh,en-US;q=0.9,en;q=0.8,zh-TW;q=0.7,zh-CN;q=0.6
Cookie: bid=2mybpxuz2yq; __yadk_uid=Bxinrohoukkeb7tkisiezyglyuyp2kxo; gr_user_id=14916ea7-aee0-43ad-83ee-7a236df37d47; viewed="20451827_25861795"; _vwo_uuid_v2=c055442d3b3854f97dde6ac4d757e5bc|34b0bccb8c4f1faab1336ba5e19cea3c; ll="108288"; _ga=ga1.2.310445079.1508424221; ps=y; push_noty_num=0; push_doumail_num=0; __utmv=30149280.14370; ap=1; __utmz=30149280.1509712941.8.4.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; _pk_ref.100001.8cb4=%5b%22%22%2c%22%22%2c1509723845%2c%22https%3a%2f%2fwww.baidu.com%2flink%3furl%3dmlpogezkcppqdqzj-phnxbptzkvux6diiqswuidgr7pluzgf-adra2ucwynayejf%26wd%3d%26eqid%3dfb0db111000170200000000359fc642a%22%5d; _pk_id.100001.8cb4=280d7bc2f732b51c.1508424213.8.1509723845.1509712941.; _pk_ses.100001.8cb4=*; __utma=30149280.310445079.1508424221.1509712941.1509723846.9; __utmc=30149280; __utmt=1; __utmb=30149280.1.10.1509723846
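Header lines copied from a request like the one above are often turned into a dictionary before being handed to a crawler. The helper headers_from_text below is a hypothetical name for a minimal sketch of that conversion; it assumes one "Name: value" pair per line.

```python
def headers_from_text(raw):
    """Convert 'Name: value' lines into a dict, skipping lines without a colon."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only, so values such as URLs stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


copied = """
Host: www.douban.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Referer: https://www.baidu.com/link
"""
headers = headers_from_text(copied)
```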
When capturing data with a crawler, it is generally simplest not to ask for compressed responses, that is, to comment out (omit) the following header:
Accept-Encoding: gzip, deflate, sdch, br
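If that header is sent anyway, the server may return a compressed body, and a client that does not decompress automatically has to do so itself. A minimal sketch for the gzip case, using Python's standard gzip module; decode_body is an illustrative helper name, and the other encodings (deflate, br) are deliberately not handled here.

```python
import gzip


def decode_body(body, content_encoding):
    """Return the decoded body, given the response's Content-Encoding value."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    # No (or an unhandled) encoding: return the bytes unchanged.
    return body


original = b"<html>hello</html>"
compressed = gzip.compress(original)  # stands in for a gzip-encoded response body
```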
Request Method:
HTTP defines several request methods, but the two most common are GET and POST:
GET fetches data from the server; POST sends data to the server.
The parameters of a GET request are appended to the URL and are visible in the browser's address bar; the HTTP server generates the response based on the parameters in the request.
The parameters of a POST request are carried in the request body (typically form data) rather than in the URL. The protocol places no limit on the body length, so POST is typically used to submit large amounts of data to the server. The "Content-Type" request header indicates the media type and encoding of the body.
Forms should generally not be submitted with GET, because sensitive information would be exposed in the URL.
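The difference in where the parameters travel can be seen with Python's standard urllib, shown here as a sketch without sending anything. The URL and parameters are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "python", "page": "2"}
encoded = urlencode(params)  # "q=python&page=2"

# GET: the parameters become part of the URL.
get_request = Request("https://www.example.com/search?" + encoded)

# POST: the parameters travel in the request body; supplying `data`
# makes urllib use the POST method.
post_request = Request("https://www.example.com/search", data=encoded.encode("ascii"))
```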
Common Request headers:
1. Host (hostname and port number): The host name and port number from the requested URL, identifying the Internet host and port of the requested resource.
2. Connection (connection type): Indicates the type of connection the client wants with the server, for example keep-alive.
3. Upgrade-Insecure-Requests (upgrade to HTTPS requests): Tells the server that the client prefers secure responses and can have insecure HTTP resource requests automatically replaced with HTTPS requests, so the browser no longer shows warnings on HTTPS pages.
4. User-Agent (browser name): Identifies the client software, such as the browser name and version.
5. Accept (accepted content types): The MIME (Multipurpose Internet Mail Extensions) types that the browser or other client can accept; the server can use it to decide which format to return.
Accept: */* means any type can be received.
Accept: image/gif means the client wants a GIF image.
6. Referer (referring page): Indicates the URL of the page from which the request originated, that is, the page the user was on when the current request was made. It can be used to track which page or source site a web request came from.
Sometimes downloading an image from a website only works if the matching Referer is sent. This is because the site uses hotlink protection: it checks whether the Referer is an address on its own site, allows the download if it is, and rejects the request if it is not.
7. Accept-Encoding (content encoding): The encodings the browser can accept. Encoding is different from file format; its purpose is to compress content to speed up transfer. After receiving the response, the browser first decodes it and then checks the file format.
8. Accept-Language (language type): The languages the browser can accept, such as en, en-US, or zh-CN. It is used when the server can provide more than one language version.
9. Accept-Charset (character encoding): The character encodings the browser can accept. If the request does not set this field, any character set is acceptable by default.
10. Cookie: The browser uses this header to send cookies to the server. A cookie is a small piece of data stored in the browser; it can record user information related to the server and can also be used to implement sessions.
11. Content-Type (POST data type): The type of content used in the POST request.
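The Referer-based hotlink protection described under item 6 can be sketched from both sides. The function names is_allowed_referer and build_image_request, and all URLs, are hypothetical; a real site's check may be stricter or more lenient than this exact host comparison.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def is_allowed_referer(referer, allowed_host):
    """Server-side style check: accept only requests coming from allowed_host."""
    if not referer:
        return False
    return urlsplit(referer).hostname == allowed_host


def build_image_request(image_url, page_url):
    """Crawler side: send the page that embeds the image as the Referer."""
    headers = {"Referer": page_url, "User-Agent": "Mozilla/5.0"}
    return Request(image_url, headers=headers)


image_request = build_image_request(
    "https://img.example.com/photo.jpg",
    "https://www.example.com/gallery",
)
```

With the Referer set to a page on the site itself, a check like is_allowed_referer passes; without it, or with another site's address, the request would be rejected.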
Server-Side HTTP response:
The HTTP response is also made up of four parts: the status line, the response headers, a blank line, and the response body.
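To make the four parts concrete, here is a minimal sketch that splits a raw response into them. The parse_response helper and the sample response are made up for illustration; a real client should rely on a proper HTTP library instead.

```python
def parse_response(raw):
    """Split raw response bytes into (status code, reason, headers, body)."""
    # The first blank line separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol version, status code, reason phrase.
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers, body


RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 14\r\n"
    b"\r\n"
    b"<h1>Hello</h1>"
)
status, reason, headers, body = parse_response(RAW_RESPONSE)
```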
Common Status Codes:
100-199: The server has successfully received part of the request and the client should continue sending the rest to complete the process.
200-299: The server successfully received the request and has finished processing it.
300-399: The client needs to take further action to complete the request, for example because the requested resource has moved to a new address (302 and 307 indicate that the requested page has been temporarily moved to a new URL; 304 means the cached copy should be used).
400-499: The client's request is incorrect (404: the server cannot find the requested page; 403: the server denied access because of insufficient permissions).
500-599: Server-side error; the most common is 500 (the server hit an unexpected error and could not complete the request).
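A crawler usually branches on the status class before deciding what to do with a response. The status_class helper below is an illustrative sketch of the ranges above; the reason phrases come from Python's standard http.HTTPStatus.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to the class described by its hundreds range."""
    if 100 <= code <= 199:
        return "informational"
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "redirection"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    raise ValueError(f"not an HTTP status code: {code}")


# Standard reason phrases for two of the codes mentioned above.
not_found_phrase = HTTPStatus(404).phrase
not_modified_phrase = HTTPStatus(304).phrase
```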
Cookies and session:
The interaction between client and server is limited to a request and its response. Once the exchange ends, the next interaction is treated as a new one, so for the server to keep track of a user's state, the user's information has to be recorded somewhere.
Cookie: Identifies the user through information recorded on the client.
Session: Identifies the user through information recorded on the server.
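How the two fit together can be sketched with Python's standard http.cookies: the server keeps the user data in a session store and gives the client only a session id in a cookie. The SESSIONS dict, the cookie name sessionid, and the login and current_user helpers are assumptions made for this example, not how any particular site works.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # server-side store: session id -> user data


def login(username):
    """Create a server-side session and return the Set-Cookie header value."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": username}
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    return cookie["sessionid"].OutputString()


def current_user(cookie_header):
    """Look up the user from the Cookie header the client sends back."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("sessionid")
    if morsel is None:
        return None
    session = SESSIONS.get(morsel.value)
    return session["user"] if session else None


set_cookie_value = login("alice")  # the client stores this and sends it back later
```

The client only ever holds the session id; the user's state stays on the server and is found again through that id on each later request.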
Introduction to HTTP and HTTPS