Foreword
Defeating anti-crawler mechanisms is not the real goal here; the point is mainly to understand in more detail how websites defend against crawlers. If you genuinely want to increase your blog's readership, high-quality content is still essential.
Understanding a website's anti-crawler mechanisms
Websites generally defend against crawlers in the following ways:
1. Anti-crawling through request headers
Checking the headers of the user request is the most common anti-crawler strategy. Many sites check the User-Agent in the headers, and a number of sites also check the Referer (some sites' hotlink protection works by checking the Referer).
If you run into this kind of mechanism, you can simply add the right headers to the crawler: copy the browser's User-Agent into the crawler's headers, or set the Referer to the target site's domain name. For anti-crawling that only inspects headers, adding or modifying headers in the crawler is usually enough to get around it.
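For instance, a minimal sketch with Python's urllib; the URL http://example.com/page is only a placeholder, not a site from this article:
from urllib import request

# Copy a browser-style User-Agent and point the Referer at the target domain,
# so that simple header checks treat the request as coming from a browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    'Referer': 'http://example.com/',
}
req = request.Request('http://example.com/page', headers=headers)
html = request.urlopen(req).read().decode('utf-8')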
2. Anti-crawling based on user behavior
Some sites detect user behavior instead, for example the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period.
Most sites are in the first category, and in that case IP proxies solve the problem. We could crawl proxy IPs and save them to a file, but this approach is not ideal because proxy IPs go stale very quickly, so crawling them in real time from a site that publishes proxy IPs is a better choice.
For the second case, you can wait a random number of seconds after each request before sending the next one. On some sites with logical loopholes you can also get around the "same account cannot repeat the same request within a short period" restriction by making a few requests, logging out, logging back in, and continuing.
Cookies matter as well: sites that require login often decide whether a user is legitimate by checking cookies. Going a step further, some login pages update their validation dynamically; for example, the tuicool login assigns a random authenticity_token, which has to be sent back to the server together with the submitted username and password.
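As a rough sketch of the cookie-and-token case (the URLs and form field names below are assumptions for illustration, not any particular site's real endpoints), a crawler can keep the session cookies with an HTTPCookieProcessor and send the hidden token back with the login form:
from http.cookiejar import CookieJar
from urllib import request, parse
import re

# Hypothetical login flow: keep cookies across requests and re-submit the
# hidden token found in the login form along with the credentials.
cj = CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cj))
login_page = opener.open('http://example.com/login').read().decode('utf-8')
token = re.search(r'name="authenticity_token" value="([^"]+)"', login_page).group(1)
data = parse.urlencode({'authenticity_token': token,
                        'login': 'user', 'password': 'pass'}).encode('utf-8')
opener.open('http://example.com/session', data=data)
# Later requests made through this opener carry the session cookies automatically.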
3. Anti-crawling based on dynamic pages
Sometimes you crawl the target page and find the key content blank, with only the skeleton HTML present. This is because the site returns the information dynamically through XHR requests issued by the page. The way to handle it is to analyse the site's traffic with a developer tool (Firebug, the browser dev tools, and so on), find the individual request that actually carries the content (often JSON), and crawl that request directly to get what you need.
It gets harder when the dynamic requests are encrypted and the parameters cannot be worked out; then they cannot be reproduced directly. In that case you can use Mechanize or Selenium to drive a browser kernel and visit the site the way a real browser would, which maximises the chance of success but costs efficiency. In my own test, grabbing 30 pages of Lagou job listings with urllib took a little over 30 seconds, while crawling them through a simulated browser kernel took 2-3 minutes.
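A minimal sketch of both approaches follows; the endpoint URL is a placeholder standing in for whatever the developer tools reveal, and the browser-driving part uses Selenium WebDriver (a newer cousin of the Selenium RC mentioned above) with chromedriver assumed to be installed:
import json
from urllib import request

# Case 1: the XHR endpoint found in the developer tools returns JSON directly
# (the endpoint URL here is a placeholder, not a real API).
req = request.Request('http://example.com/api/list?page=1',
                      headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(request.urlopen(req).read().decode('utf-8'))

# Case 2: the request cannot be reproduced, so drive a real browser kernel.
# Requires `pip install selenium` and a matching chromedriver.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://example.com/dynamic-page')
html = driver.page_source   # the fully rendered page, scripts already executed
driver.quit()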
4. Restricting access from certain IPs
Free proxy IPs can be obtained from a number of websites. Since crawlers can use these proxy IPs to crawl a site, the site can also use them the other way around: crawl the same proxy lists, save the IPs on the server, and restrict crawlers that come in through them.
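Seen from the server's side, the idea is nothing more than a blocklist check, roughly like this (the addresses are placeholders):
# Hypothetical server-side check: refuse requests whose source IP appears in a
# blocklist harvested from public free-proxy pages.
known_proxy_ips = {'1.2.3.4', '5.6.7.8'}   # placeholder entries

def is_blocked(client_ip):
    return client_ip in known_proxy_ips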
Getting to the point
OK, now let's actually write a crawler that accesses a website through proxy IPs.
First, get some proxy IPs to crawl with.
from urllib import request
import re

def get_proxy_ip():
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    # xicidaili.com is a site that publishes free proxy IPs
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)      # IP addresses in the page
    port_list = re.findall(r'<td>\d+</td>', html)           # port table cells
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])      # strip the <td> tags
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list
Incidentally, some sites restrict crawlers by checking the real IP behind a proxy IP, so here is a little background on proxy IPs.
What do "transparent", "anonymous" and "high-anonymity" mean for a proxy IP?
A transparent proxy means the client does not need to know that a proxy server exists at all, but it still passes along the real IP. With a transparent proxy you cannot get around a limit on the number of accesses from one IP within a given period.
An ordinary anonymous proxy hides the client's real IP but modifies the request, so the server can tell that a proxy is being used. Although the visited site does not learn your IP address, it still knows you are using a proxy, and such IPs tend to get banned.
A high-anonymity proxy does not alter the client's request, so to the server it looks like a real browser visiting from a real client IP: the client's real IP is hidden and the site does not realise a proxy is in use.
To sum up, a crawler's proxy IPs are best chosen from the high-anonymity kind.
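A rough way to check which category a proxy falls into, assuming access to a header-echoing service such as httpbin.org, is to make a request through the proxy and look at what the server actually received: a transparent proxy leaks the real address (for example in X-Forwarded-For), an ordinary anonymous proxy still adds proxy headers such as Via, and a high-anonymity proxy adds neither.
from urllib import request

# Placeholder proxy address; httpbin.org/headers echoes back the request headers
# it received, which shows whether the proxy added X-Forwarded-For or Via.
opener = request.build_opener(request.ProxyHandler({'http': 'http://1.2.3.4:8080'}))
print(opener.open('http://httpbin.org/headers').read().decode('utf-8'))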
user_agent_list contains the User-Agent request headers of the current mainstream browsers; picking from it lets us mimic requests from a variety of browsers.
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
By waiting a random amount of time before each visit, you can get around some sites' restrictions on the interval between requests.
import random
import time

def proxy_read(proxy_list, user_agent_list, i):
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Wait time: %s s' % sleep_time)
    time.sleep(sleep_time)  # random wait time between requests
    headers = {
        'Host': 's9-im-notify.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u010620031/article/details/51068703',
    }
    # route this request through the current proxy IP
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u010620031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception:
        print('****** Open failed! ******')
    else:
        count += 1
        print('ok! %s successful requests in total!' % count)
That covers the proxy-related knowledge a crawler needs. It is still fairly basic, but it is enough to deal with most scenarios.
The complete source code is attached below.
#! /usr/bin/env python3
# -*- coding: utf-8 -*-
from urllib import request
import random
import re
import time

# User-Agent request headers of mainstream browsers, used to mimic different browsers.
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

count = 0  # number of successful requests


def get_proxy_ip():
    """Crawl a page that publishes free proxy IPs and return a list of 'ip:port' strings."""
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list


def proxy_read(proxy_list, user_agent_list, i):
    """Visit the target page through the i-th proxy with a random User-Agent and a random delay."""
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Wait time: %s s' % sleep_time)
    time.sleep(sleep_time)  # random wait time between requests
    headers = {
        'Host': 's9-im-notify.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u010620031/article/details/51068703',
    }
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u010620031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception:
        print('****** Open failed! ******')
    else:
        count += 1
        print('ok! %s successful requests in total!' % count)


if __name__ == '__main__':
    proxy_list = get_proxy_ip()
    for i in range(len(proxy_list)):
        proxy_read(proxy_list, user_agent_list, i)