On Anti-Crawler "Policies and Countermeasures"

Source: Internet
Author: User
Tags: redis

I write a blog partly so that in the future I can quickly review what I learned before and keep my thinking organized, and partly to help others ("fellow learners") who run into similar problems. But the blogging habit is hard to keep up, for all sorts of reasons. In the final analysis, though, it comes down to a lack of "resonance".

High mountains and flowing water; a bosom friend is hard to find.

In fact, what builds the blogging habit are the little things: watching the visit count and the number of likes grow day by day, seeing your articles get comments from others, and so on.

Well, enough chatter. Today let's talk about brushing (inflating) page views. Although this is far from the original intention of blogging, it is still worth understanding this kind of problem; after all, "technology itself is not illegal".

Anti-(Anti-)Crawler Mechanisms

When it comes to anti-crawling, we first have to talk about crawlers. The concept is simple: a crawler takes work that would otherwise be done by hand and lets code carry it out automatically. An anti-crawler mechanism is a means of probing whether a visitor is a real user or a piece of code. Anti-anti-crawling, in turn, is a set of countermeasures against anti-crawler mechanisms.

All say "double negative, affirmative", then the crawler and the anti-anti crawler should be the same. In fact, the surface behavior is consistent, but actually the anti-anti crawler does more processing, rather than a simple little reptile.

In general, anti-crawling starts from the following levels:
- Headers: the browser's request headers.
- User-Agent: the user agent, one way of identifying the source of an access.
- Referer: which link the request to the target URL jumped from (hotlink protection can start from this).
- Host: same-origin address checks; useful where applicable.
- IP: many accesses from the same IP in a short time very likely indicate a crawler, and anti-crawler systems act on this.
- Access frequency: a burst of highly concurrent accesses in a short time is basically a sign of a problem visitor.
These are the common anti-crawler measures, and of course there are more advanced mechanisms, such as the most annoying of all, captchas (tesseract can handle the simpler kinds of captcha recognition), user behavior analysis, and so on. The sketch below shows where the header-level fields from this list appear in an ordinary request.
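A minimal illustration (mine, not from the original article; the URL and header values are made-up placeholders). The IP and access-frequency levels are judged on the server from the traffic itself rather than sent by the client:

import requests

# Each header below corresponds to one item in the list above.
headers = {
    # User-Agent: a default value such as "python-requests/2.x" is an easy giveaway.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36",
    # Referer: the page this request claims to have jumped from; hotlink protection checks it.
    "Referer": "http://blog.csdn.net/",
    # Host: the target host, which same-origin checks can inspect.
    "Host": "blog.csdn.net",
}
resp = requests.get("http://blog.csdn.net/marksinoberg", headers=headers)
print(resp.status_code)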

Now that we understand the common anti-crawler mechanisms, working out the corresponding "policies and countermeasures" for anti-anti-crawling is no longer so hopeless. Indeed, we have a countermeasure for each of the restrictions above. For User-Agent, collect some common browser agent strings and randomly use one of them on each access. For IP limits, use proxy IPs; for frequency limits, a random sleep between accesses works quite well. A minimal sketch combining these ideas follows.
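Something like this (a sketch under stated assumptions: the UA strings and proxy addresses are placeholders that would come from pools collected beforehand, and polite_get is a name I made up):

import random
import time

import requests

uas = ["Mozilla/5.0 (Windows NT 6.1) ...", "Opera/9.80 (Android 2.3.4) ..."]  # collected UAs (placeholders)
proxy_pool = ["123.57.10.10:8080", "47.94.1.2:3128"]  # crawled proxy IPs (placeholders)

def polite_get(url):
    headers = {"User-Agent": random.choice(uas)}  # random User-Agent per request
    proxies = {"http": "http://{}".format(random.choice(proxy_pool))}  # random proxy IP
    time.sleep(random.randint(1, 3))  # random sleep to dodge frequency limits
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)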

In Practice

I have been blogging on CSDN for a while. To be honest, its anti-crawler mechanism is fairly shallow. On the one hand the need for it is not that great, and on the other hand serious anti-crawling is not very cost-effective, so presumably they do not want to waste the effort.

Brushing page views on CSDN is therefore quite casual. Here is my thinking:
- Crawl proxy IPs, verify and clean the data, and update it regularly.
- Collect browser User-Agent strings to add randomness to the accesses.
- Sleep policies, log handling, error logging, timed retries, and so on.

Proxy IP Processing

# coding: utf8
# @Author: Guopu
# @File: proxyip.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Crawl proxy IPs and save them to the related Redis key.

import requests
from bs4 import BeautifulSoup

from redishelper import RedisHelper


class ProxyIP(object):
    """Crawl, clean, and validate proxy IPs. Both HTTP and HTTPS entries are stored."""

    def __init__(self):
        self.rh = RedisHelper()

    def crawl(self):
        # First handle HTTP-mode proxy IPs.
        httpurl = "http://www.xicidaili.com/nn/"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
        }
        html = requests.get(url=httpurl, headers=headers).text
        soup = BeautifulSoup(html, "html.parser")
        ips = soup.find_all("tr")
        for index in range(1, len(ips)):
            tds = ips[index].find_all('td')
            ip = tds[1].text
            port = tds[2].text
            ipinfo = "{}:{}".format(ip, port)
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
                # print(ipinfo)

    def _check(self, ip):
        """Check the validity of a proxy IP (note: checkurl is defined but never queried)."""
        checkurl = "http://47.94.19.186/common/checkip.php"
        localip = self._getlocalip()
        # print("Local: {}, Proxy: {}".format(localip, ip))
        return False if localip == ip else True

    def _getlocalip(self):
        """Get the local machine's IP. The interface is not very reliable, so for now
        copy and paste it by hand from https://www.baidu.com/s?ie=utf-8&wd=ip."""
        return "223.91.239.159"

    def clean(self):
        ips = self.rh.sGetAllAvalibleIps()
        for ipinfo in ips:
            ip, port = ipinfo.split(":")
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
            else:
                self.rh.sRemoveAvalibeIp(ipinfo)

    def update(self):
        pass


if __name__ == '__main__':
    pip = ProxyIP()
    # result = pip._check("223.91.239.159", 53281)
    # print(result)
    pip.crawl()
    # pip.clean()
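One caveat about _check above: it only compares the candidate against the local IP and never routes a request through the proxy, so dead proxies still pass. A stricter check, sketched here with the public httpbin echo service rather than anything from the article, would exercise the proxy itself:

import requests

def check_proxy(ipinfo, timeout=5):
    """Return True only if an HTTP request genuinely succeeds through the proxy."""
    proxies = {"http": "http://{}".format(ipinfo)}
    try:
        return requests.get("http://httpbin.org/ip", proxies=proxies, timeout=timeout).ok
    except requests.RequestException:
        return False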
Redis Tool Class
# coding: utf8
# @Author: Guopu
# @File: redishelper.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Helper tools around the Redis operations involved.

import redis


class RedisHelper(object):
    """Save the crawled blog article IDs and the proxy IPs."""

    def __init__(self):
        self.articlepool = "redis:set:article:pool"
        self.avalibleips = "redis:set:avalible:ips"
        self.unavalibleips = "redis:set:unavalibe:ips"
        pool = redis.ConnectionPool(host="localhost", port=6379)
        self.redispool = redis.Redis(connection_pool=pool)

    def sAddArticleId(self, articleid):
        """Add a crawled blog article ID to the pool."""
        self.redispool.sadd(self.articlepool, articleid)

    def sRemoveArticleId(self, articleid):
        self.redispool.srem(self.articlepool, articleid)

    def popupArticleId(self):
        return int(self.redispool.srandmember(self.articlepool))

    def sAddAvalibeIp(self, ip):
        self.redispool.sadd(self.avalibleips, ip)

    def sRemoveAvalibeIp(self, ip):
        self.redispool.srem(self.avalibleips, ip)

    def sGetAllAvalibleIps(self):
        return [ip.decode('utf8') for ip in self.redispool.smembers(self.avalibleips)]

    def popupAvalibeIp(self):
        return self.redispool.srandmember(self.avalibleips)

    def sAddUnavalibeIp(self, ip):
        self.redispool.sadd(self.unavalibleips, ip)

    def sRemoveUnavaibleIp(self, ip):
        self.redispool.srem(self.unavalibleips, ip)
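Worth noting: Redis's srandmember returns a random member without removing it, so popupArticleId() and popupAvalibeIp() can hand back the same value over and over. That is fine for brushing, but if pop semantics were ever wanted, redis-py's spop removes the member as well. A tiny sketch (the key name reuses the one defined above):

import redis

r = redis.Redis(host="localhost", port=6379)
r.sadd("redis:set:article:pool", "78058279")
print(r.srandmember("redis:set:article:pool"))  # random member, stays in the set
print(r.spop("redis:set:article:pool"))         # random member, removed from the set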
CSDN Blog Post Tool Class
# coding: utf8
# @Author: Guopu
# @File: csdn.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: A tool class that crawls all of a blogger's article links, plus related operations.

import re

import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """Crawl all article link IDs under the given blogger ID."""

    def __init__(self, bloger="Marksinoberg"):
        self.bloger = bloger
        # self.blogpagelink = "http://blog.csdn.net/{}/article/list/{}".format(self.bloger, 1)

    def _getTotalPages(self):
        blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, 1)
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        # A rather hacky operation; real development should not be this casual.
        temptext = soup.find('div', {"class": "pagelist"}).find("span").get_text()
        restr = re.findall(re.compile(r"(\d+).*?(\d+)"), temptext)
        # print(restr)
        pages = restr[0][-1]
        return pages

    def _parsePage(self, pagenumber):
        blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, int(pagenumber))
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        links = soup.find("div", {"id": "article_list"}).find_all("span", {"class": "link_title"})
        articleids = []
        for link in links:
            temp = link.find("a").attrs['href']
            articleids.append(temp.split("/")[-1])
        # print(len(articleids))
        # print(articleids)
        return articleids

    def get_all_articleids(self):
        pages = int(self._getTotalPages())
        articleids = []
        for index in range(pages):
            tempids = self._parsePage(int(index + 1))
            articleids.extend(tempids)
        return articleids


if __name__ == '__main__':
    bs = BlogScanner(bloger="Marksinoberg")
    # print(bs._getTotalPages())
    # bs._parsePage(1)
    articleids = bs.get_all_articleids()
    print(len(articleids))
    print(articleids)
Brush Tool Class
# coding: utf8
# @Author: Guopu
# @File: brushhelper.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Start brushing.

import random
import time

import requests

from redishelper import RedisHelper


class FakeUserAgent(object):
    """A collection of User-Agent strings; popping up different UAs each time
    reduces the impact of the anti-crawler mechanism.
    More: http://www.73207.com/useragent
    """

    def __init__(self):
        self.uas = [
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "JUC (Linux; U; 2.3.7; zh-cn; MB200; 320*480) UCWEB7.9.3.103/139/999",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110623 Firefox/7.0a1 Fennec/7.0a1",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/1A542a Safari/419.3",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
            "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
            "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
            "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
            "Openwave/UCWEB7.0.2.37/28/999",
            "NOKIA5700/UCWEB7.0.2.37/28/999",
            "UCWEB7.0.2.37/28/999",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        ]

    def _generateIndexes(self):
        # Pick a random number of distinct indexes into the UA list.
        numbers = random.randint(0, len(self.uas))
        indexes = []
        while len(indexes) < numbers:
            temp = random.randrange(0, len(self.uas))
            if temp not in indexes:
                indexes.append(temp)
        return indexes

    def popupUAs(self):
        uas = []
        indexes = self._generateIndexes()
        for index in indexes:
            uas.append(self.uas[index])
        return uas


class Brush(object):
    """Brush up the page views."""

    def __init__(self, bloger="Marksinoberg"):
        self.bloger = "http://blog.csdn.net/{}".format(bloger)
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        self.rh = RedisHelper()

    def getRandProxyIp(self):
        ip = self.rh.popupAvalibeIp()
        proxyip = {}
        ipinfo = "http://{}".format(str(ip.decode('utf8')))
        proxyip['http'] = ipinfo
        # print(proxyip)
        return proxyip

    def brushLink(self, articleid, randuas=[]):
        # e.g. http://blog.csdn.net/marksinoberg/article/details/78058279
        bloglink = "{}/article/details/{}".format(self.bloger, articleid)
        for ua in randuas:
            self.headers['User-Agent'] = ua
            timeseed = random.randint(1, 3)
            print("Sleeping temporarily: {} seconds".format(timeseed))
            time.sleep(timeseed)
            for index in range(timeseed):
                # requests.get(url=bloglink, headers=self.headers, proxies=self.getRandProxyIp())
                requests.get(url=bloglink, headers=self.headers)


if __name__ == '__main__':
    # fua = FakeUserAgent()
    # indexes = [0, 2, 5, 7]
    # indexes = generate_random_numbers(0, 18, 7)
    # randuas = fua.popupUAs(indexes)
    # randuas = fua.popupUAs()
    # print(len(randuas))
    # print(randuas)
    # print(fua._generateIndexes())
    brush = Brush("Marksinoberg")
    # brush.brushLink(78058279, randuas)
    print(brush.getRandProxyIp())
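_generateIndexes above assembles a duplicate-free random index list by trial and error; for what it is worth, the standard library's random.sample does the same job in one call. An equivalent sketch (popup_uas is a hypothetical stand-in, not the author's method):

import random

def popup_uas(uas):
    # A random-sized, duplicate-free sample of the UA list.
    count = random.randint(1, len(uas))
    return random.sample(uas, count)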
Entry Point
# coding: utf8
# @Author: Guopu
# @File: main.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Entry point.

import threading

from csdn import *
from redishelper import RedisHelper
from brushhelper import *


def main():
    rh = RedisHelper()
    bs = BlogScanner(bloger="Marksinoberg")
    fua = FakeUserAgent()
    brush = Brush(bloger="Marksinoberg")
    counter = 0
    while counter < 12:
        # Start brushing.
        print("Round {}.".format(counter))
        try:
            uas = fua.popupUAs()
            articleid = rh.popupArticleId()
            brush.brushLink(articleid, uas)
        except Exception as e:
            print(e)
            # TODO: add log handling here
        counter += 1


if __name__ == '__main__':
    for i in range(280):
        temp = threading.Thread(target=main)
        temp.start()
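Starting 280 raw threads against a single site is heavy-handed, and under CPython's GIL most of them simply queue up on I/O. If gentler concurrency were wanted, a bounded pool from the standard library would do; this is my suggestion, not part of the original code:

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    # Cap simultaneous workers instead of spawning 280 threads at once.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for i in range(280):
            pool.submit(main)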
Run Results

I took an article I had written before and tested it.
Blog link: http://blog.csdn.net/marksinoberg/article/details/78058279

Before the brushing run the article had 301 page views; after a simple round of brushing, the view count was as shown in the figure below:

[Figure: page view count after brushing]

Summary

That is roughly the whole of it. It is at best a prototype; the code is only about 45% complete. If you are interested, add my QQ (1064319632) or leave your suggestions in the comments, and we can exchange ideas and learn together.
