HTTP and HTTPS
HTTP (Hypertext Transfer Protocol) is a protocol for publishing and receiving HTML pages.
HTTPS (Hypertext Transfer Protocol over Secure Socket Layer) is simply the secure version of HTTP: an SSL layer is added beneath HTTP.
SSL (Secure Sockets Layer) is mainly used for secure transport on the Web; it encrypts the network connection at the transport layer and guarantees the security of data transmitted over the Internet.
HTTP uses port 80; HTTPS uses port 443.
How HTTP Works
The crawling process of a web crawler can be understood as simulating the way a browser works.
The main function of a browser is to send requests to a server and display the network resources you choose in the browser window; HTTP is the set of rules by which computers communicate over the network to do this.
HTTP requests and Responses
HTTP communication consists of two parts: a client request message and a server response message
The process by which the browser sends an HTTP request:
When a user enters a URL in the browser's address bar and presses the Enter key, the browser sends an HTTP request to the HTTP server. HTTP requests are mainly divided into GET and POST methods.
When we enter the URL http://www.baidu.com in the browser, the browser sends a request to get the http://www.baidu.com HTML file, and the server sends the response file object back to the browser.
The browser parses the HTML in the response and finds that it references many other files, such as image files, CSS files, and JS files. The browser then automatically sends further requests to fetch those images, CSS files, and JS files.
When all the files are downloaded successfully, the Web page will be fully displayed according to the HTML syntax structure.
URL (Uniform Resource Locator): a way of completely describing the address of a web page or other resource on the Internet.
Basic format:
scheme://host[:port#]/path/.../[?query-string][#anchor]
- scheme: the protocol (for example HTTP, HTTPS, FTP)
- host: the IP address or domain name of the server
- port#: the server's port (may be omitted when the protocol's default port is used, e.g. 80 for HTTP)
- path: the path of the resource being accessed
- query-string: parameters, data sent to the HTTP server
- anchor: anchor (jumps to the specified anchor position in the web page)
For example:
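A small sketch in Python 2 can make the parts concrete (the URL below is an illustrative placeholder; urlparse ships with the standard library):

    # parse_url.py - splitting an example URL into the components listed above
    from urlparse import urlparse

    url = "http://www.example.com:8080/path/index.html?wd=query#section1"
    parts = urlparse(url)

    print parts.scheme    # "http"              -> scheme
    print parts.hostname  # "www.example.com"   -> host
    print parts.port      # 8080                -> port#
    print parts.path      # "/path/index.html"  -> path
    print parts.query     # "wd=query"          -> query-string
    print parts.fragment  # "section1"          -> anchor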
Client HTTP request
The URL only identifies the location of a resource, while HTTP is used to submit and fetch the resource. The request message a client sends to the server consists of four parts: a request line, request headers, a blank line, and the request data (body).
A typical example of an HTTP request
    GET https://www.baidu.com/ HTTP/1.1
    Host: www.baidu.com
    Connection: keep-alive
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    Referer: http://www.baidu.com/
    Accept-Encoding: gzip, deflate, sdch, br
    Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
    Cookie: BAIDUID=04E4001F34EA74AD4601512DD3C41A7B:FG=1; BIDUPSID=04E4001F34EA74AD4601512DD3C41A7B; PSTM=1470329258; MCITY=-343%3A340%3A; BDUSS=...; H_PS_PSSID=1447_18240_21105_21386_21454_21409_21554; BD_UPN=12314753; sug=3; sugstore=0; ORIGIN=0; bdime=0; H_PS_645EC=...; BDSVRTM=0
Request method
GET https://www.baidu.com/ HTTP/1.1
HTTP requests can use a variety of request methods, depending on the HTTP standard.
HTTP 0.9: only the basic text GET function.
HTTP 1.0: completed the request/response model and extended the protocol, defining three request methods: GET, POST, and HEAD.
HTTP 1.1: updated on the basis of 1.0 with five new request methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.
HTTP 2.0 (not yet widely adopted): the definition of request/response headers is basically unchanged, but all header names must be lowercase, and the request line is split into independent key-value pairs: :method, :scheme, :host, :path.
| Serial Number | Method | Description |
| --- | --- | --- |
| 1 | GET | Requests the specified page information and returns the entity body. |
| 2 | HEAD | Similar to a GET request, except that the response contains no body; it is used to obtain only the headers. |
| 3 | POST | Submits data to the specified resource for processing (such as submitting a form or uploading a file); the data is included in the request body. A POST request may result in the creation of new resources and/or the modification of existing resources. |
| 4 | PUT | Replaces the contents of the specified document with the data the client sends to the server. |
| 5 | DELETE | Requests that the server delete the specified page. |
| 6 | CONNECT | Reserved by HTTP/1.1 for proxy servers that can switch the connection into a tunnel. |
| 7 | OPTIONS | Allows the client to query the capabilities of the server. |
| 8 | TRACE | Echoes the request received by the server, mainly for testing or diagnostics. |
HTTP requests are mainly divided into two methods: GET and POST.
GET fetches data from the server, while POST sends data to the server.
GET request parameters are displayed in the browser's URL, and the HTTP server generates the response based on the parameters contained in the requested URL; that is, the parameters of a GET request are part of the URL. For example: http://www.baidu.com/s?wd=Chinese
POST request parameters are placed in the request body; the message length is not limited and the data is sent implicitly. POST is usually used to submit relatively large amounts of data to the HTTP server (for example, a request with many parameters or a file upload). The Content-Type header of the request indicates the media type and encoding of the message body.
Note: Avoid submitting forms with GET, because it can cause security issues. For example, if a login form uses GET, the username and password the user enters are exposed in the address bar.
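As a rough sketch in Python 2 with urllib2 (the URL and parameter name are placeholders, not a real endpoint), the difference is only where the parameters go:

    # get_vs_post.py - illustrative only
    import urllib
    import urllib2

    params = urllib.urlencode({"wd": "Chinese"})

    # GET: the parameters are appended to the URL itself
    get_response = urllib2.urlopen("http://www.example.com/s?" + params)

    # POST: passing a data argument makes urllib2 send a POST request,
    # so the parameters travel in the request body instead of the URL
    post_request = urllib2.Request("http://www.example.com/s", data=params)
    post_response = urllib2.urlopen(post_request)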
Common request headers
1. Host (host and port number)
Host: specifies the Internet host and port number of the requested resource, corresponding to the host name and port in the URL; it is usually part of the URL.
2. Connection (connection type)
Connection: indicates the type of connection between the client and the server.
The client sends a request containing Connection: keep-alive; HTTP/1.1 uses keep-alive as its default value.
After the server receives the request:
- If the server supports keep-alive, it replies with a response containing Connection: keep-alive and does not close the connection;
- If the server does not support keep-alive, it replies with a response containing Connection: close and closes the connection.
If the client receives a response containing Connection: keep-alive, it sends its next request over the same connection, until one side actively closes the connection.
Keep-alive lets connections be reused in many cases, reducing resource consumption and shortening response time; for example, when a browser needs several files (such as an HTML file and the images it references), it does not have to establish a new connection for each request.
3. Upgrade-Insecure-Requests (upgrade to HTTPS requests)
Upgrade-Insecure-Requests: asks that insecure requests be upgraded, meaning HTTP resources are automatically replaced with HTTPS requests when the page loads, so that the browser no longer shows warnings about HTTP requests inside an HTTPS page.
HTTPS is an HTTP channel whose goal is security, so HTTP requests are not allowed on pages served over HTTPS; as soon as one appears, a warning or an error is raised.
4. User-Agent (browser name)
User-Agent: the name of the client's browser; it will be covered in detail later.
5. Accept (transfer file type)
Accept: the MIME (Multipurpose Internet Mail Extensions) file types acceptable to the browser or other client; the server can use it to decide which file format to return.
Example:
Accept: */* : the client can accept anything.
Accept: image/gif : the client wants to receive resources in GIF image format.
Accept: text/html : the client wants to receive HTML text.
Accept: text/html, application/xhtml+xml;q=0.9, image/*;q=0.8 : the MIME types supported by the browser are HTML text, XHTML and XML documents, and all image formats.
q is a weight factor in the range 0 <= q <= 1; the larger the q value, the more the request prefers the content type given before the ";". If no q value is specified, the default is 1 and types are ranked left to right; a q value of 0 indicates that the browser does not accept that content type.
text: used for the standardized representation of textual information; text messages can use various character sets and/or formats. application: used to transfer application data or binary data.
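To make the weighting rule concrete, here is a small illustrative helper (not part of any library) that orders the media types of an Accept header by their q values:

    # accept_q.py - illustrative parsing of Accept header weights
    def sort_by_q(accept_header):
        types = []
        for item in accept_header.split(","):
            parts = item.strip().split(";")
            media_type = parts[0].strip()
            q = 1.0  # default weight when no q value is given
            for param in parts[1:]:
                if param.strip().startswith("q="):
                    q = float(param.strip()[2:])
            types.append((q, media_type))
        # highest q first; q = 0 means "not accepted", so it is filtered out
        return [t for q, t in sorted(types, reverse=True) if q > 0]

    print sort_by_q("text/html, application/xhtml+xml;q=0.9, image/*;q=0.8")
    # ['text/html', 'application/xhtml+xml', 'image/*']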
6. Referer (page jump)
Referer: indicates the URL of the page from which the request originated; the user reached the currently requested page from that Referer page. This property can be used to track which page and which site a web request came from.
Sometimes when downloading an image from a website, the correct Referer is required, otherwise the image cannot be downloaded. That is hotlink protection: the server checks the Referer to decide whether the request comes from its own site; if not, the request is refused, and if it does, the download is allowed.
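A sketch in Python 2 (the image URL and referring page are placeholders) of supplying a Referer so that such a server will serve the image:

    # download_with_referer.py - illustrative only
    import urllib2

    image_url = "http://www.example.com/images/pic.jpg"
    request = urllib2.Request(image_url)
    # pretend the request came from a page on the same site
    request.add_header("Referer", "http://www.example.com/")

    response = urllib2.urlopen(request)
    with open("pic.jpg", "wb") as f:
        f.write(response.read())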
7. Accept-Encoding (file encoding format)
Accept-Encoding: indicates the encodings the browser can accept. Encoding is different from the file format: its purpose is to compress files to speed up transfer. The browser decodes the web response after receiving it and then checks the file format; in many cases this greatly reduces download time.
Example: Accept-Encoding: gzip;q=1.0, identity;q=0.5, *;q=0
If several encodings match, they are ranked by q value: in this example gzip is preferred, then identity (no compression). A server that supports gzip will return a gzip-encoded HTML page to a gzip-capable browser. If this header is not set in the request message, the server assumes the client can accept any content encoding.
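A sketch in Python 2 (the URL is a placeholder): urllib2 does not decode Content-Encoding for you, so a gzip body has to be decompressed by hand.

    # gzip_response.py - illustrative only
    import gzip
    import StringIO
    import urllib2

    request = urllib2.Request("http://www.example.com/")
    request.add_header("Accept-Encoding", "gzip")
    response = urllib2.urlopen(request)

    body = response.read()
    # only decompress when the server actually answered with gzip
    if response.info().getheader("Content-Encoding") == "gzip":
        body = gzip.GzipFile(fileobj=StringIO.StringIO(body)).read()

    print body[:200]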
8. Accept-Language (language type)
Accept-Language: indicates the languages the browser can accept, such as en or en-US for English, or zh-CN for Simplified Chinese; it matters when the server can provide more than one language version.
9. Accept-Charset (character encoding)
Accept-Charset: indicates the character encodings the browser can accept.
Example: Accept-Charset: iso-8859-1, gb2312, utf-8
- ISO-8859-1: usually called Latin-1. Latin-1 includes the additional characters indispensable for writing all Western European languages; it is the default value for English-language browsers.
- GB2312: the standard Simplified Chinese character set;
- UTF-8: a variable-length character encoding of Unicode that solves the problem of displaying text in multiple languages, enabling internationalization and localization of applications.
If the field is not set in the request message, the default is to accept any character set.
10. Cookie (Cookies)
Cookie: the browser uses this header to send cookies to the server. Cookies are small pieces of data stored in the browser; they can record user information related to the server and can be used to implement session functionality, which will be covered in detail later.
11. Content-Type (POST data type)
Content-Type: the content type used in the body of a POST request.
Example: Content-Type: text/xml; charset=gb2312
Indicates that the message body of the request contains data of the plain text XML type, with the character encoding "gb2312".
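A sketch in Python 2 (the endpoint and payload are made up) of declaring the body type through Content-Type when posting XML:

    # post_xml.py - illustrative only
    import urllib2

    xml_body = "<?xml version='1.0' encoding='gb2312'?><msg>hello</msg>"
    request = urllib2.Request("http://www.example.com/api", data=xml_body)
    # tell the server the body is XML text encoded as gb2312
    request.add_header("Content-Type", "text/xml; charset=gb2312")

    response = urllib2.urlopen(request)
    print response.read()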
Server-Side HTTP response
The HTTP response is also made up of four parts: a status line, message headers, a blank line, and the response body.
    HTTP/1.1 200 OK
    Server: Tengine
    Connection: keep-alive
    Date: Wed, 07:58:21 GMT
    Cache-Control: no-cache
    Content-Type: text/html;charset=UTF-8
    Keep-Alive: timeout=20
    Vary: Accept-Encoding
    Pragma: no-cache
    X-NWS-LOG-UUID: bd27210a-24e5-4740-8f6c-25dbafa9c395
    Content-Length: 180945

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ...
Common response headers (for reference)
In theory, all response header information should answer the request headers. In practice, the server adds corresponding response headers for reasons of efficiency, security, and other considerations.
1. Cache-Control: must-revalidate, no-cache, private
This value tells the client that the server does not want the client to cache the resource, and the next time the resource is requested, it must be re-requested and cannot get the resource from the cached copy.
Cache-Control is an important piece of information in the response headers. When the client's request headers contain Cache-Control: max-age=0, explicitly indicating that the cached copy of the server resource should not be used, the server usually returns Cache-Control: no-cache in the response, meaning roughly "fine, do not cache it."
When the client's request headers do not contain Cache-Control, the server decides on its own and applies different cache policies to different resources. For example, oschina's policy for image resources is Cache-Control: max-age=86400, which means that for 86,400 seconds from the current time the client may read the resource directly from its cached copy without requesting it from the server.
2. Connection: keep-alive
This field answers the client's Connection: keep-alive, telling the client that this TCP connection is also a persistent connection and that the client may continue to send HTTP requests over it.
3. Content-Encoding: gzip
Tells the client that the resource the server sent is gzip-encoded; after seeing this, the client should decode the resource with gzip.
4. Content-Type: text/html;charset=UTF-8
Tells the client the type of the resource file and its character encoding: the client should decode the resource with UTF-8 and then parse it as HTML. When a website shows up garbled, it is often because the server did not return the correct encoding.
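A sketch in Python 2 (the URL is a placeholder) of decoding the body with the charset the server declares; getparam() is a method of the message object returned by response.info():

    # decode_by_charset.py - illustrative only
    import urllib2

    response = urllib2.urlopen("http://www.example.com/")

    # read the charset from the Content-Type header, falling back to utf-8
    charset = response.info().getparam("charset") or "utf-8"
    html = response.read().decode(charset)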
5. Date: Sun, Sep 06:18:21 GMT
This is the server's time when the resource was sent; GMT is Greenwich Mean Time. Times sent in the HTTP protocol are in GMT, mainly to avoid confusion when resources are requested across different time zones on the Internet.
6. Expires: Sun, 1 Jan 01:00:00 GMT
This response header is also cache-related: it tells the client that before this time it may access the cached copy directly. There is an obvious problem with this value: the client's and server's clocks are not necessarily in sync, and any difference causes trouble. So this header is less accurate than Cache-Control: max-age=*, because the value in max-age is a relative time, which is both easier to understand and more precise.
7. Pragma: no-cache
Its meaning is equivalent to Cache-Control: no-cache.
8. Server: Tengine/1.4.6
This is the server software and its version; it simply tells the client about the server.
9. Transfer-Encoding: chunked
This response header tells the client that the server is sending the resource in chunks. Chunked resources are generally generated dynamically by the server, which does not know the total size when it starts sending, so it sends them chunk by chunk. Each chunk is independent and declares its own length, and the last chunk has length 0; when the client reads this zero-length chunk, it knows the resource has been transmitted completely.
10. Vary: Accept-Encoding
Tells the cache server to keep two versions of the cached file, compressed and uncompressed. This field is not very useful nowadays, because modern browsers all support compression.
Response Status Code
The response status code consists of three digits; the first digit defines the category of the response, and there are five possible categories.
Common Status Codes:
- 100~199: the server successfully received part of the request and requires the client to submit the remainder to complete processing.
- 200~299: the server successfully received the request and completed processing. Common: 200 (OK, request successful).
- 300~399: the client must refine the request to complete it, for example because the requested resource has moved to a new address. Common: 302 (the requested page has temporarily moved to a new URL), 307, and 304 (use the cached resource).
- 400~499: the client's request contains an error. Common: 404 (the server cannot find the requested page), 403 (the server refuses access, insufficient permissions).
- 500~599: an error occurred on the server side. Common: 500 (the request was not completed; the server encountered an unforeseen condition).
Cookies and Session:
The interaction between the server and the client is limited to the request/response process; the connection is closed when it finishes, and on the next request the server treats the client as a new one.
To maintain a link between them and let the server know that a request was sent by a previously seen user, the client's information must be stored somewhere.
Cookie: Determines the user's identity by the information recorded on the client.
Session: Determines the user's identity by information logged on the server side.
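A sketch in Python 2 (the URLs are placeholders) of keeping cookies across requests with cookielib, which is how a session can be maintained on top of stateless HTTP:

    # session_cookies.py - illustrative only
    import cookielib
    import urllib2

    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

    # the first response may set cookies; the opener stores them in cookie_jar
    opener.open("http://www.example.com/login")

    # later requests through the same opener send those cookies back
    opener.open("http://www.example.com/profile")

    for cookie in cookie_jar:
        print cookie.name, cookie.value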
The HTTP proxy tool Fiddler
Fiddler is a powerful web debugging tool that records all HTTP traffic between clients and servers. When Fiddler starts, it sets IE's proxy to 127.0.0.1:8888 by default; other browsers need to be configured manually.
Working principle
Fiddler works as a proxy web server, using the proxy address 127.0.0.1 and port 8888.
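If you also want scripted requests to appear in Fiddler, a sketch in Python 2 (the target URL is a placeholder) is to route urllib2 through the same proxy:

    # via_fiddler.py - illustrative only
    import urllib2

    proxy = urllib2.ProxyHandler({
        "http": "http://127.0.0.1:8888",
        "https": "http://127.0.0.1:8888",
    })
    opener = urllib2.build_opener(proxy)

    response = opener.open("http://www.example.com/")
    print response.getcode()

Capturing HTTPS this way additionally depends on the Fiddler HTTPS settings described next.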
Fiddler settings for capturing HTTPS
Start Fiddler and open Tools > Telerik Fiddler Options from the menu bar; the Fiddler Options dialog box opens.
Configure Fiddler as follows:
3. Configure Windows to trust Fiddler's root certificate to resolve the security warning: Trust Root Certificate.
4. Fiddler main menu: Tools -> Fiddler Options... -> Connections
5. Restart Fiddler for the configuration to take effect (this step is important and must be done).
How Fiddler captures Chrome's sessions
1. Install the SwitchyOmega proxy-management plugin for the Chrome browser.
2. Set the proxy server to 127.0.0.1:8888.
3. Use the browser plugin to switch to the configured proxy.
Fiddler interface
Once this is set, all local HTTP traffic passes through the 127.0.0.1:8888 proxy and is therefore intercepted by Fiddler.
Request section
- Headers: displays the headers of the HTTP request the client sends to the server as a hierarchical view, including web client information, cookies, transfer state, and so on.
- TextView: displays the body of a POST request as text.
- WebForms: displays the GET parameters and POST body of the request.
- HexView: displays the request as hexadecimal data.
- Auth: displays the Proxy-Authorization (proxy authentication) and Authorization (authorization) headers of the request.
- Raw: displays the entire request as plain text.
- JSON: displays JSON-format content.
- XML: if the request body is in XML format, displays it as a hierarchical XML tree.
Response section
- Transformer: displays the encoding information of the response.
- Headers: displays the response headers as a hierarchical view.
- TextView: displays the response body as text.
- ImageView: if the request was for an image resource, displays the response as a picture.
- HexView: displays the response as hexadecimal data.
- WebView: previews the response as it would appear in a web browser.
- Auth: displays the Proxy-Authorization (proxy authentication) and Authorization (authorization) headers of the response.
- Caching: displays the caching information of this request.
- Privacy: displays the privacy (P3P) information of this request.
- Raw: displays the entire response as plain text.
- JSON: displays JSON-format content.
- XML: if the response body is in XML format, displays it as a hierarchical XML tree.
Next, let's really set off on our crawler journey!
Basic use of the urllib2 library
Web scraping means reading the network resource specified by a URL address out of the network stream and saving it locally. There are many libraries in Python that can fetch web pages; we will learn urllib2 first.
urllib2 is a built-in Python 2.7 module (no download is needed; just import and use it).
urllib2 official documentation: https://docs.python.org/2/library/urllib2.html
urllib2 source code: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py
In Python 3.x, urllib2 was renamed to urllib.request.
urlopen
Let's start with a piece of code:
    # urllib2_urlopen.py

    # import the urllib2 library
    import urllib2

    # send a request to the specified URL and get the server's response as a file-like object
    response = urllib2.urlopen("http://www.baidu.com")

    # the file-like object supports file methods, e.g. read() reads the whole content and returns a string
    html = response.read()

    # print the string
    print html
Running the Python code above will print the result:

    $ python urllib2_urlopen.py

In fact, if we open the Baidu home page in a browser and right-click "View source", we will find it is exactly the same as what we just printed. In other words, the four lines of code above have already crawled all the source code of Baidu's home page for us.
The Python code for a basic URL request really is that simple.
Request
In our first example, the parameter of urlopen() was a URL address;
However, if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance and pass it as the argument to urlopen(); the URL to be accessed is in turn passed as an argument when creating the Request instance.
We edit urllib2_request.py
    # urllib2_request.py

    import urllib2

    # With the URL as its parameter, Request() constructs and returns a Request object
    request = urllib2.Request("http://www.baidu.com")

    # The Request object is passed to urlopen(), which sends it to the server and receives the response
    response = urllib2.urlopen(request)

    html = response.read()

    print html
The result is exactly the same:
When creating a new Request instance, two optional parameters can be set in addition to the URL:
data (empty by default): the data submitted along with the URL (for example, data to POST); when it is supplied, the HTTP request changes from GET to POST.
headers (empty by default): a dictionary containing the key-value pairs of the HTTP headers to send.
These two parameters are mentioned below.
User-Agent
But sending a request to a website directly with urllib2 like this is indeed a little abrupt. It is as if every house has a door, and walking straight in as a passer-by is obviously not very polite. Moreover, some sites do not like being visited by programs (non-human access) and may deny your requests.
If instead we request someone else's site with a legitimate identity, they will clearly welcome us, so we should add an identity to our code, namely the User-Agent header.
- The browser is an accepted identity in the Internet world. If we want our program to look more like a real user, the first step is to pretend to be a recognized browser. Different browsers send different User-Agent headers with their requests.
Add more header information
A complete HTTP request message is constructed by adding specific headers to the HTTP request.
You can add or modify a specific header by calling Request.add_header(), and view an existing header by calling Request.get_header().
    # urllib2_headers.py

    import urllib2

    url = "http://www.itcast.cn"

    # User-Agent of IE 9.0
    header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)"}

    request = urllib2.Request(url, headers=header)

    # You can also add/modify a specific header by calling Request.add_header()
    request.add_header("Connection", "keep-alive")

    # You can view header information by calling Request.get_header()
    # request.get_header(header_name="Connection")

    response = urllib2.urlopen(request)

    # you can check the response status code
    print response.code

    html = response.read()
    print html
- Randomly add/modify the User-Agent
    # urllib2_add_headers.py

    import urllib2
    import random

    url = "http://www.itcast.cn"

    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
        "Mozilla/5.0 (Macintosh; Intel Mac OS... "
    ]

    user_agent = random.choice(ua_list)

    request = urllib2.Request(url)

    # You can also add/modify a specific header by calling Request.add_header()
    request.add_header("User-Agent", user_agent)

    # When reading it back, only the first letter is capitalized, the rest lowercase
    request.get_header("User-agent")

    response = urllib2.urlopen(request)

    html = response.read()
    print html
Review the HTTP protocol again