Objective
In the previous article, "Python crawler getting-started case: crawling rental pictures from a Shanghai site", the explanation of headers may not have been enough to really understand crawlers. Many people think crawling is a particularly simple technique, and perhaps because it seems so simple, the systematic documents, books, and videos about it online feel very scarce. So this time we are going to make a systematic study and summary of the points involved in web crawling.
View headers with a browser
Open the browser and press F12 (developer tools), switch to the Network tab, select the page address you visited, and open Headers. There you can see the information you want. (Anyone with a bit of web-development background should already know this.)
We can see that the headers panel contains General information, Request Headers, and Response Headers. From the names you can roughly tell what each part does. To understand this piece of knowledge properly you can study the HTTP protocol; I remember buying a book called "The Illustrated Guide to HTTP" (图解HTTP), which is worth a look if you are interested.
Request headers
The first step of a crawler is to fetch page information, but site owners often do not want you to crawl their sites. As for why, just think about it for a moment; in fact, one of my previous projects (a .NET MVC anti-network-attack case) also implemented anti-crawler features. Where there is oppression there is resistance, and against anti-crawler measures, forging headers is the first step. I will mainly cover seven request headers: Host, Connection, Accept, Accept-Encoding, Accept-Language, User-Agent, and Referer.
Host in detail
As you probably know, the Host header was introduced in HTTP/1.1; before that, a site reachable only by IP address, with no Host, could still operate normally. So why was Host added?
If we ping a host: csblogs.com resolves to the IP 104.27.132.253. Now I ask: could some other domain also resolve to 104.27.132.253? The answer is yes. Anyone who has done web development has probably deployed many websites on one machine, distinguishing them by port or host name. Host is the domain name, and its main purpose is to enable this one-to-many mapping: a single IP running virtual hosts can serve thousands of websites. When a request for one of these sites arrives, the server uses the value of the Host header to determine which specific site the request is for.
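The virtual-host idea above can be sketched with the standard library's urllib (the IP and domain are the hypothetical examples from the paragraph above; no request is actually sent here):

```python
import urllib.request

# A minimal sketch: when several sites share one IP, you connect to the
# IP but the Host header names the virtual host you actually want.
req = urllib.request.Request(
    "http://104.27.132.253/",          # connect by shared IP...
    headers={"Host": "csblogs.com"},   # ...but ask for this virtual host
)
print(req.get_header("Host"))  # csblogs.com
```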
Connection in detail
If you look at a request and its response, you will find a Connection header in both. What is it for? It controls whether the HTTP client and server keep the underlying TCP connection open (a persistent, or "long", connection). HTTP/1.1 specifies keep-alive by default, but a Python crawler may well end up using short connections. So what is a long connection?
A long connection keeps the TCP channel open after a data transfer completes (no RST packet, no four-way teardown), waiting for further requests to the same host to reuse the channel; a short connection is the opposite.
You can simply set it in the request headers:
Connection: keep-alive    # long (persistent) connection
Connection: close         # short connection
Keep-Alive: timeout=20    # keep the TCP channel open for 20 seconds
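As a small sketch with the standard library (csblogs.com reused from the Host section; no request is actually sent), one http.client connection object holds a single TCP channel, which is what gets reused under keep-alive:

```python
import http.client

# One HTTPConnection object owns one TCP channel; issuing several
# requests through it reuses that channel -- the "long connection" idea.
conn = http.client.HTTPConnection("csblogs.com", 80, timeout=5)
headers = {
    "Connection": "keep-alive",   # ask the server to keep the channel open
    "Keep-Alive": "timeout=20",   # suggested idle timeout of 20 seconds
}
# conn.request("GET", "/", headers=headers) would send over this channel;
# a second conn.request(...) afterwards would reuse the same connection.
print(headers["Connection"])  # keep-alive
conn.close()
```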
Accept in detail
Specifies what content types (MIME types) the client can accept. The only thing to be reminded of is that it is merely advice to the server; the server is not obliged to return exactly what you asked for.
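For example, you can declare preferred types with q-weights, again sketched with urllib (the URL reuses the article's csblogs.com example; no request is sent):

```python
import urllib.request

# A minimal sketch: declare preferred content types, with q-weights.
# The server treats this as advice only -- it may return something else.
req = urllib.request.Request(
    "http://csblogs.com/",
    headers={"Accept": "text/html;q=1.0, application/json;q=0.8, */*;q=0.1"},
)
print(req.get_header("Accept"))
```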
Accept-Encoding in detail
Sent by the browser to the server, declaring which content encodings (compression types) the browser supports.
Accept-Encoding: compress, gzip                       # supports compress and gzip
Accept-Encoding:                                      # empty: no preference stated
Accept-Encoding: *                                    # any encoding is acceptable
Accept-Encoding: compress;q=0.5, gzip;q=1.0           # gzip preferred (weight 1.0) over compress (0.5)
Accept-Encoding: gzip;q=1.0, identity;q=0.5, *;q=0    # gzip preferred, identity acceptable, everything else refused
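What gzip content encoding actually means can be shown with a small round-trip sketch: the server compresses the body, and a client that declared gzip support decompresses it on receipt (libraries such as requests do this for you automatically):

```python
import gzip

# A minimal sketch of gzip content encoding: compress the body the way
# a server would, then decompress it the way the client does.
body = "<html>hello</html>".encode("utf-8")
compressed = gzip.compress(body)          # what travels over the wire
restored = gzip.decompress(compressed)    # what the client ends up with
print(restored.decode("utf-8"))  # <html>hello</html>
```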
Accept-Language in detail
This request header lets the client declare the natural languages it understands, as well as preferred regional dialects.
Accept-Language: zh-CN, zh;q=0.8, en-gb;q=0.8, en;q=0.7    # the preferred language is Chinese (China, default weight 1), then generic Chinese with weight 0.8, then British English with weight 0.8, and finally generic English with weight 0.7
User-Agent in detail
Tells the website the type and version of the browser you are using, the operating system and version, the browser engine, and other information. Through this identifier, a site can serve different layouts to different visitors for a better experience, or collect statistics; for example, Google serves different pages to phones and desktops, deciding by the visitor's UA. Anyone who has touched crawlers will use this header whether or not they know what it means, because without it most sites will not respond properly.
# a pool of User-Agent strings
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 '
    '(KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.50 '
    '(KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 '
    '(KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; '
    'SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
]

# randomly select one and pass it to the headers
import random
user_agent = random.choice(user_agent_list)
headers = {'User-Agent': user_agent}
Why not send a random one each time? In practice I mostly use a single one. The point is simply this: the more your forged request looks like it does not come from a crawler, the better your crawler works.
Referer in detail
When the browser sends a request to a web server, it usually carries a Referer header telling the server which page the request was linked from; the server can use this information for things like traffic statistics and anti-hotlinking. What does that mean? For example, to query whether train tickets are available, you must first have logged in at the 12306 website.
# Against anti-hotlinking (the server checks whether the Referer in the
# headers is its own site; if not, it may refuse to respond), build
# headers like this:
headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Referer": "https://www.cnblogs.com",
}
Other
Authorization: authorization credentials, usually sent in response to a WWW-Authenticate header from the server;
Cookie: one of the most important request headers; it can often be copied directly from the browser, and where it varies you can construct it yourself (some Python libraries can also do this)
(I plan to cover cookies separately.)
From: the email address of the requesting user; used by some special web clients, not by browsers;
If-Modified-Since: the server returns the content only if it has been modified after the specified date; otherwise it answers 304 Not Modified;
Pragma: the value "no-cache" indicates that the server must return a fresh document, even if it is a proxy server holding a local copy of the page;
UA-Pixels, UA-Color, UA-OS, UA-CPU: non-standard request headers sent by some versions of Internet Explorer, indicating screen size, color depth, operating system, and CPU type.
Origin: contains only who initiated the request, with no additional information. Unlike Referer, and importantly for user privacy, Origin contains no URL path or request content.
Also, Origin is typically only present in POST requests, while Referer is present in all types of requests.
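As a small sketch of the Cookie point above (the cookie names and values here are hypothetical), a Cookie string copied from the browser's developer tools can be parsed with the standard library and reused in your own requests:

```python
from http.cookies import SimpleCookie

# A minimal sketch: parse a Cookie string copied from the browser
# (hypothetical values) into a plain dict for reuse by a crawler.
raw = "sessionid=abc123; theme=dark"
jar = SimpleCookie()
jar.load(raw)
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)  # {'sessionid': 'abc123', 'theme': 'dark'}
```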
Conclusion
That's it for this installment. In the last article I said it would be the final one before the Spring Festival, but I could not hold back and wrote one more. I wish you all a Happy New Year; let's keep working hard together in the coming year.
Python 3.x Crawler Basics---HTTP headers detailed