Python learning notes - Crawler 2 - Dealing with anti-crawler measures

Source: Internet
Author: User
Tags: mkdir

These are beginner notes written while following http://cuiqingcai.com/3256.html; the original write-up is really good. Thanks to the author.

You will run into the following anti-crawler strategies on websites:

1. Limiting the access frequency per IP and cutting connections that exceed it. (The fix is either to slow the crawler down by adding time.sleep to each request, or to keep switching proxy IPs so the limit never triggers; a minimal sleep-based throttle is sketched right after this list.)
2. Counting accesses per User-Agent in the backend and blocking any single User-Agent that goes over the threshold. (Surprisingly effective, but the collateral damage is huge as well, so ordinary sites rarely use it; we still take it into account below.)
3. Checks based on cookies. (This one is simple to deal with, and ordinary sites rarely use it.)
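Option 1 is easy to picture with a minimal sketch; the URLs and the 2-second delay below are placeholder values I made up, not anything from the original post:

    import time
    import requests

    urls = ['http://example.com/page/%d' % n for n in range(1, 4)]   ## placeholder page list

    for url in urls:
        response = requests.get(url)       ## one ordinary request
        print(url, response.status_code)
        time.sleep(2)                      ## pause so we stay under the site's frequency limit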
To build the anti-anti-crawl module you need the following modules: requests, re (regular expressions) and random (random choice); time is also imported later for the delays.
———— Handling the User-Agent access frequency problem

Some sites limit how often the same User-Agent may visit; when they do, we simply hand them a random User-Agent on every request.

The idea: collect a pile of User-Agent strings (a quick Baidu search turns up plenty), then pick one with random.choice. Code:

    import requests
    import re
    import random

    # a class that fights back against anti-crawler measures
    class Download:

        def __init__(self):
            self.user_agent_list = [
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
                "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
                ## ... the rest of the list is identical to the full user_agent_list
                ## in the anti-crawl module at the end of this post
            ]

        def get(self, url):
            ua = random.choice(self.user_agent_list)        ## pick one string at random from user_agent_list
            headers = {'User-Agent': ua}                    ## build a complete User-Agent header
            response = requests.get(url, headers=headers)
            return response
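A quick sanity check (a minimal sketch, assuming the class above has just been defined; the URL is the one used later in this post):

    download = Download()
    response = download.get('http://www.mzitu.com/all')   ## every call goes out with a randomly chosen User-Agent
    print(response.status_code)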
———— Handling the IP access frequency problem

This part uses regular expressions; if you are not comfortable with them yet, have a look at a basic regular-expression tutorial first. Then find an IP proxy site; there are plenty of them, for example http://haoip.cc/tiqu.htm.
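Before the full module, here is the bare mechanism it builds on: requests accepts a proxies dict that maps the scheme to an 'ip:port' string. The address below is a made-up placeholder; substitute one taken from the proxy site:

    import requests

    proxy = {'http': '123.123.123.123:8080'}   ## placeholder ip:port scraped from the proxy list page
    response = requests.get('http://www.mzitu.com', proxies=proxy, timeout=3)
    print(response.status_code)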
———— All the code for the second stage: the main program (it imports the anti-crawl module)
    # import the required packages
    from bs4 import BeautifulSoup
    import os
    from backclimb import request   ## the anti-crawl module at the end of this post, saved as backclimb.py

    # a class that grabs the photo sets
    class Mzitu():

        # main entry function
        def all_url(self, url):
            html = request.get(url, 3)   ## pass the list-page address to request.get; a response comes back
            all_a = BeautifulSoup(html.text, 'lxml').find('div', class_='all').find_all('a')
            ## with the Soup object's find method, first locate the div whose class is "all",
            ## then find_all collects every <a> tag inside it
            for a in all_a:
                title = a.get_text()
                print(u'Start saving:', title)          ## progress message
                path = str(title).replace('?', '_')     ## '?' is not allowed in a folder name, so replace it with '_'
                self.mkdir(path)                        ## create the folder; path is the title
                os.chdir('F:\mzitu\\' + path)           ## change the working directory
                href = a['href']                        ## the href attribute is the address of one photo set
                self.html(href)                         ## hand it to the html function

        # turn one photo-set address into the addresses of its pages
        def html(self, href):
            html = request.get(href, 3)                 ## fetch the photo-set page
            max_span = BeautifulSoup(html.text, 'lxml').find_all('span')[10].get_text()
            ## collect all <span> tags; the text of this one is the number of the last page
            for page in range(1, int(max_span) + 1):    ## range produces the page sequence
                page_url = href + '/' + str(page)       ## build each page address by hand
                self.img(page_url)                      ## hand it to the img function

        # turn a picture-page address into the actual address of the image
        def img(self, page_url):
            img_html = request.get(page_url, 3)         ## fetch the picture page
            img_url = BeautifulSoup(img_html.text, 'lxml').find('div', class_='main-image').find('img')['src']
            ## locate the div whose class is "main-image", then read the src of the <img> inside it
            self.save(img_url)                          ## pass img_url on to be saved

        # save one image
        def save(self, img_url):
            name = img_url[-9:-4]                       ## use the 9th-to-5th characters from the end of the url as the file name
            print(u'Start saving:', img_url)
            img = request.get(img_url, 3)               ## fetch the image itself
            f = open(name + '.jpg', 'ab')               ## media files must be written in binary mode
            f.write(img.content)                        ## media files need .content, not .text
            f.close()                                   ## close the file object

        # create a folder
        def mkdir(self, path):
            path = path.strip()                         ## strip the spaces around path
            isExists = os.path.exists(os.path.join('F:\mzitu', path))
            ## join assembles the parts into one path; os.path.exists checks whether it already exists
            if not isExists:                            ## False means the folder is missing, so create it
                print(u'Created a folder named', path, u'!')
                os.makedirs(os.path.join('F:\mzitu', path))   ## makedirs can create nested folders
                return True
            else:
                print(u'A folder named', path, u'already exists.')
                return False

        ## the old helper that fetched a response with one fixed browser User-Agent;
        ## it has been replaced by the anti-crawl module imported above
        # def request(self, url):
        #     headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
        #     content = requests.get(url, headers=headers)   ## get the page content with the browser request header
        #     return content


    Mzitu = Mzitu()                                     ## instantiate; this is the entry point
    Mzitu.all_url('http://www.mzitu.com/all')
———— The anti-crawl module (save it as backclimb.py so the import in the main program above can find it)
    import requests
    import re
    import random
    import time

    # a class that fights back against anti-crawler measures
    class Download:

        def __init__(self):
            self.iplist = []   ## a list to hold the proxy IPs we scrape
            html = requests.get('http://haoip.cc/tiqu.htm')   ## fetch the free proxy list page with requests.get
            iplistn = re.findall(r'r/>(.*?)<b', html.text, re.S)
            ## regular expression: grab everything between "r/>" and "<b" in the html;
            ## re.S lets "." match line breaks as well, and findall returns a list
            for ip in iplistn:
                i = re.sub('\n', '', ip)          ## re.sub replaces the line breaks with nothing
                self.iplist.append(i.strip())     ## strip the spaces on both ends and store the ip in the list above
            self.user_agent_list = [
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
                "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
                "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
            ]

        def get(self, url, timeout, proxy=None, num_retries=6):   ## proxy defaults to None, num_retries defaults to 6
            ua = random.choice(self.user_agent_list)   ## pick one string at random from user_agent_list
            headers = {'User-Agent': ua}               ## build a complete User-Agent header
            if proxy is None:                          ## no proxy yet: try a plain request first
                try:
                    return requests.get(url, headers=headers, timeout=timeout)
                    ## with the random headers the server takes us for a real browser
                except:                                ## if the request above fails, run the code below
                    if num_retries > 0:                ## num_retries limits how many times we retry
                        time.sleep(10)                 ## wait 10 seconds
                        print(u'Error getting the page; retrying in 10 seconds,', num_retries, u'attempts left')
                        return self.get(url, timeout, num_retries=num_retries - 1)   ## call itself with retries minus 1
                    else:
                        print(u'Start using a proxy')
                        time.sleep(10)
                        ip = ''.join(str(random.choice(self.iplist)).strip())
                        ## pick an ip from self.iplist at random and strip the spaces on both ends
                        proxy = {'http': ip}
                        return self.get(url, timeout, proxy)   ## call itself again, this time with a proxy
            else:                                      ## a proxy was requested
                try:
                    ip = ''.join(str(random.choice(self.iplist)).strip())   ## pick and clean up a random ip
                    proxy = {'http': ip}               ## build the proxies dict
                    return requests.get(url, headers=headers, proxies=proxy, timeout=timeout)   ## fetch through the proxy
                except:
                    if num_retries > 0:
                        time.sleep(10)
                        ip = ''.join(str(random.choice(self.iplist)).strip())
                        proxy = {'http': ip}
                        print(u'Switching proxy; retrying in 10 seconds,', num_retries, u'attempts left')
                        print(u'The current proxy is:', proxy)
                        return self.get(url, timeout, proxy, num_retries - 1)
                    else:
                        print(u'The proxies did not work either; dropping the proxy')
                        return self.get(url, 3)        ## fall back to a plain request


    request = Download()   ## the instance that the main program imports
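A minimal standalone check, assuming the module above has been saved as backclimb.py (the file name implied by the import in the main program):

    from backclimb import request

    response = request.get('http://www.mzitu.com/all', 3)   ## 3-second timeout; retries, User-Agents and proxies are handled inside
    print(response.status_code)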











