The most complete summary of Python crawlers

Source: Internet
Author: User
Tags: urlencode

Lately I keep having to crawl things, so I might as well sum up everything related to Python crawlers in one place; doing more hands-on work yourself is always a good thing.
(1) Basic content crawling
(2) Saving crawled pictures/videos, files, and pages
(3) Basic simulated login
(4) Logging in with a verification code (CAPTCHA)
(5) Crawling JavaScript-rendered websites
(6) A whole-web crawler
(7) Crawling all the pages within a single site
(8) Multithreading
(9) The crawler framework Scrapy


One, basic content crawling
#coding=utf-8
import urllib
import urllib2

url = 'http://www.dataanswer.top'
headers = {
    'Host': 'www.dataanswer.top',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0',
    #'Accept': 'application/json, text/javascript, */*; q=0.01',
    #'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
    #'Accept-Encoding': 'gzip,deflate',
    #'Referer': 'http://www.dataanswer.top'
}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
page = response.read()
print page
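The example above uses Python 2's urllib2. For readers on Python 3, a minimal equivalent sketch using urllib.request (which replaces urllib2) would look roughly like this:

# a minimal Python 3 sketch of the same request; urllib.request replaces urllib2
from urllib.request import Request, urlopen

url = 'http://www.dataanswer.top'
headers = {
    'Host': 'www.dataanswer.top',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = Request(url, headers=headers)
page = urlopen(request).read().decode('utf-8', 'ignore')
print(page)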
Two, saving crawled pictures/videos, files, and pages
After capturing the address of the picture/video/file/page, use the urlretrieve() method in the urllib module to download it:
#coding=utf-8
import urllib
import urllib2
import os

def getpage(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response.read()

# change this address to the address of the picture/file/video/page you want to save
url = 'http://www.dataanswer.top/'
result = getpage(url)

file_name = 'test.doc'
file_path = 'doc'
if os.path.exists(file_path) == False:
    os.makedirs(file_path)
local = os.path.join(file_path, file_name)

# method 1: write the downloaded content to a local file
# (use "wb" so binary files such as pictures are not corrupted)
f = open(local, "wb")
f.write(result)
f.close()

# method 2: let urllib download and save it in a single call
urllib.urlretrieve(url, local)
Three, basic simulated login
import urllib
import urllib2
import cookielib

filename = 'cookie.txt'
# declare a MozillaCookieJar instance to hold the cookie, then write it to a file
cookie = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
    'name': 'spring',
    'pwd': '1222222'
})
# the login URL
loginUrl = 'http://www.dataanswer.top/LoginService?action=tologin'
# simulate the login and save the cookie in the variable
result = opener.open(loginUrl, postdata)
# save the cookie to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)
# use the cookie to request another page
gradeUrl = 'http://www.dataanswer.top/LoginService?action=myHome'
result = opener.open(gradeUrl)
print result.read()
Four, logging in with a verification code (CAPTCHA)
# First download and save the CAPTCHA image, then read it manually and type it in
#coding=utf-8
import sys, time, os, re
import urllib, urllib2, cookielib

loginurl = 'https://www.douban.com/accounts/login'
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

params = {
    "form_email": "13161055481",
    "form_password": "wwwwwww",
    "source": "index_nav"  # without this field the login does not succeed
}

# fetch the login page first; it contains the CAPTCHA
response = opener.open(loginurl)
print(response.geturl())
if response.geturl() == "https://www.douban.com/accounts/login":
    html = response.read()
    print(html)
    # CAPTCHA image address -- what if the image address is encrypted???
    # NOTE: the original regular expressions were lost when this article was formatted;
    # the patterns below are assumptions based on the Douban login form of that time
    imgurl = re.search(r'<img id="captcha_image" src="(.+?)"', html)
    if imgurl:
        # download the CAPTCHA image so it can be read manually
        urllib.urlretrieve(imgurl.group(1), 'captcha.jpg')
        captcha = re.search(r'name="captcha-id" value="(.+?)"', html)
        if captcha:
            vcode = raw_input('Please enter the verification code shown in the image: ')
            params["captcha-solution"] = vcode
            params["captcha-id"] = captcha.group(1)
            params["user_login"] = "Login"
            # submit the form again, this time with the CAPTCHA answer
            response = opener.open(loginurl, urllib.urlencode(params))
            # a successful login redirects to the home page
            if response.geturl() == "https://www.douban.com/":
                print 'login success!'
                print 'ready to post'
                addtopicurl = "http://www.douban.com/group/python/new_topic"
                res = opener.open(addtopicurl)
                html = res.read()
            else:
                print("Fail3")
        else:
            print("Fail2")
    else:
        print("Fail1")
else:
    print("Fail0")
Five, crawling JavaScript-rendered websites
# Use Selenium to drive a real browser, then parse the resulting HTML
#coding=utf-8
# 1. install python-pip:  sudo apt-get install python-pip
# 2. install selenium:    sudo pip install -U selenium
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.newsmth.net/nForum/#!article/Intern/206790')
# page_source returns the HTML after the browser has executed the page's JavaScript
html = driver.page_source.encode('utf-8', 'ignore')
print(html)
driver.close()

Six, a whole-web crawler
# Breadth-first crawl, simulating a crawl queue
#coding=utf-8
"""
Crawl all links, including external ones -- breadth first
"""
import urllib2
import re
from bs4 import BeautifulSoup
import time

# crawler start time
t = time.time()
# stop after this many pages have been crawled
N_STOP = 10
# urls that have already been crawled
CHECKED_URL = []
# urls waiting to be crawled
CHECKING_URL = []
# urls whose download failed
FAIL_URL = []
# urls that could not be connected to at all
ERROR_URL = []
# number of connection attempts allowed
RETRY = 3
# connection timeout in seconds
TIMEOUT = 20


class url_node:
    def __init__(self, url):
        """
        url node initialisation
        :param url: string, the current url
        """
        self.url = url
        self.content = ''

    def __is_connectable(self):
        """check whether the url can be connected to"""
        # try to connect within the allowed number of attempts
        for i in range(RETRY):
            try:
                # if urlopen does not raise, the url is reachable
                response = urllib2.urlopen(self.url, timeout=TIMEOUT)
                return True
            except:
                # if every allowed attempt fails, the url is unreachable
                if i == RETRY - 1:
                    return False

    def get_next(self):
        """collect all the other urls contained in this page"""
        soup = BeautifulSoup(self.content)
        # ****** here you can also parse whatever content you need from the page ******
        next_urls = soup.findAll('a')
        if len(next_urls) != 0:
            for link in next_urls:
                tmp_url = link.get('href')
                # if the url is neither crawled nor queued, queue it
                # (its validity is not checked here)
                if tmp_url not in CHECKED_URL and tmp_url not in CHECKING_URL:
                    CHECKING_URL.append(tmp_url)

    def run(self):
        if self.url:
            if self.__is_connectable():
                try:
                    # fetch the full content of the page
                    self.content = urllib2.urlopen(self.url, timeout=TIMEOUT).read()
                    # collect the urls found on this page
                    self.get_next()
                except:
                    # remember the urls whose download failed
                    FAIL_URL.append(self.url)
                    print('[!] Connect Failed')
            else:
                # remember the urls that could not be connected to
                ERROR_URL.append(self.url)
        else:
            print("There is a problem with the initial url given!")


if __name__ == '__main__':
    # put the initial url into the waiting list
    CHECKING_URL.append('http://www.36dsj.com/')
    # keep taking urls off the front of the waiting list and crawl them
    # (pop(0) makes the list behave as a breadth-first queue)
    i = 0
    while CHECKING_URL:
        url = CHECKING_URL.pop(0)
        # crawl this url
        url_node(url).run()
        # record it as crawled
        CHECKED_URL.append(url)
        i += 1
        if i == N_STOP:
            # print the url we stopped at; it can be used as the start url next time
            print url
            print("crawled list length: %d" % len(CHECKED_URL))
            print("waiting list length: %d" % len(CHECKING_URL))
            print("failed list length: %d" % len(FAIL_URL))
            print("unreachable list length: %d" % len(ERROR_URL))
            break
    print("time: %d s" % (time.time() - t))

Seven, crawling all the pages within a single site
# Restore abbreviated in-site URLs to their full form
#coding=utf-8
"""
Crawl all the urls of one site, excluding external links
"""
import urllib2
import re
from bs4 import BeautifulSoup
import time

t = time.time()

HOST = ''
CHECKED_URL = []
CHECKING_URL = []
RESULT = []
RETRY = 3
TIMEOUT = 20


class url_node:
    def __init__(self, url):
        """
        url node initialisation
        :param url: string, the current url
        """
        self.url = self.handle_url(url, is_next_url=False)
        self.next_url = []
        self.content = ''

    def handle_url(self, url, is_next_url=True):
        """normalise every url into a standard form"""
        global CHECKED_URL
        global CHECKING_URL
        # strip a trailing '/'
        url = url[0:len(url) - 1] if url.endswith('/') else url
        if url.find(HOST) == -1:
            if not url.startswith('http'):
                # an abbreviated in-site url: put the host back
                url = 'http://' + HOST + url if url.startswith('/') else 'http://' + HOST + '/' + url
            else:
                # it starts with http but its host is not the current host,
                # so it is an external link: return nothing
                return
        else:
            if not url.startswith('http'):
                url = 'http://' + url
        if is_next_url:
            # a url found on the page goes into the waiting list
            if url not in CHECKING_URL:
                CHECKING_URL.append(url)
        else:
            # for the url currently being checked, replace every parameter value with 1,
            # so that urls differing only in parameter values are checked just once
            rule = re.compile(r'=.*?\&|=.*?$')
            result = re.sub(rule, '=1&', url)
            if result in CHECKED_URL:
                return '[!] URL has been checked!'
            else:
                CHECKED_URL.append(result)
                RESULT.append(url)
        return url

    def __is_connectable(self):
        # check whether the url can be connected to
        retry = 3
        timeout = 2
        for i in range(retry):
            try:
                response = urllib2.urlopen(self.url, timeout=timeout)
                return True
            except:
                if i == retry - 1:
                    return False

    def get_next(self):
        # collect all the urls on the current page
        soup = BeautifulSoup(self.content)
        next_urls = soup.findAll('a')
        if len(next_urls) != 0:
            for link in next_urls:
                self.handle_url(link.get('href'))

    def run(self):
        if self.url:
            if self.__is_connectable():
                try:
                    self.content = urllib2.urlopen(self.url, timeout=TIMEOUT).read()
                    self.get_next()
                except:
                    print('[!] Connect Failed')


# class and helper that handle urls starting with https
class poc:
    def run(self, url):
        global HOST
        global CHECKING_URL
        url = check_url(url)
        # strip the scheme to obtain the host
        # (url[:8] in the original looked like a typo for url[8:])
        if not url.find('https'):
            HOST = url[8:]
        else:
            HOST = url[7:]
        for url in CHECKING_URL:
            print(url)
            url_node(url).run()


def check_url(url):
    url = 'http://' + url if not url.startswith('http') else url
    url = url[0:len(url) - 1] if url.endswith('/') else url
    for i in range(RETRY):
        try:
            response = urllib2.urlopen(url, timeout=TIMEOUT)
            return url
        except:
            raise Exception("Connect error")


if __name__ == '__main__':
    HOST = 'www.dataanswer.com'
    CHECKING_URL.append('http://www.dataanswer.com/')
    f = open('36 Big Data', 'w')
    # new in-site urls are appended to CHECKING_URL while we iterate over it,
    # so the loop keeps running until no new urls are found
    for url in CHECKING_URL:
        f.write(url + '\n')
        print(url)
        url_node(url).run()
    print RESULT
    print "url num: " + str(len(RESULT))
    print("time: %d s" % (time.time() - t))
Eight, multithreading
# Combining a queue with threads
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
A simple Python crawler that uses multiple threads to fetch the Douban Top 250 movies
"""
import urllib2, re, string
import threading, Queue, time
import sys

reload(sys)
sys.setdefaultencoding('utf8')

_DATA = []
FILE_LOCK = threading.Lock()
SHARE_Q = Queue.Queue()   # build an unbounded queue
_WORKER_THREAD_NUM = 3    # number of worker threads


class MyThread(threading.Thread):
    def __init__(self, func):
        super(MyThread, self).__init__()  # call the parent constructor
        self.func = func                  # the function the thread will run

    def run(self):
        self.func()


def worker():
    global SHARE_Q
    while not SHARE_Q.empty():
        url = SHARE_Q.get()      # take a task from the queue
        my_page = get_page(url)
        find_title(my_page)      # collect the movie titles on this page
        #write_into_file(temp_data)
        time.sleep(1)
        SHARE_Q.task_done()


def get_page(url):
    """
    Fetch the html of a page.
    Args:
        url: the url of the page to fetch
    Returns:
        the html of the whole page (unicode)
    Raises:
        URLError: raised by urllib2 on failure
    """
    my_page = ""
    try:
        my_page = urllib2.urlopen(url).read().decode("utf-8")
    except urllib2.URLError, e:
        if hasattr(e, "code"):
            print "The server couldn't fulfill the request."
            print "Error code: %s" % e.code
        elif hasattr(e, "reason"):
            print "We failed to reach a server. Please check your url and read the reason"
            print "Reason: %s" % e.reason
    return my_page


def find_title(my_page):
    """
    Match the movie titles in the page html with a regular expression.
    Args:
        my_page: the html text of the page
    """
    temp_data = []
    movie_items = re.findall(r'<span.*?class="title">(.*?)</span>', my_page, re.S)
    for index, item in enumerate(movie_items):
        # skip the second title span, which contains "&nbsp;"
        if item.find("&nbsp;") == -1:
            temp_data.append(item)
    _DATA.append(temp_data)


def main():
    global SHARE_Q
    threads = []
    douban_url = "http://movie.douban.com/top250?start={page}&filter=&type="
    # put the tasks into the queue; in real use tasks should be added continuously
    for index in xrange(10):
        SHARE_Q.put(douban_url.format(page=index * 25))
    for i in xrange(_WORKER_THREAD_NUM):
        thread = MyThread(worker)
        thread.start()  # the thread starts processing tasks
        print("thread %s started" % i)
        threads.append(thread)
    for thread in threads:
        thread.join()
    SHARE_Q.join()
    with open("movie.txt", "w+") as my_file:
        for page in _DATA:
            for movie_name in page:
                my_file.write(movie_name + "\n")
    print "Spider Successful!!!"


if __name__ == '__main__':
    main()
Nine, the crawler framework Scrapy

items.py: defines the fields you want to save; each variable is declared as a Field, which makes an item behave a bit like a Python dictionary
pipelines.py: processes the extracted items; what the processing does is defined according to your own needs (a minimal sketch follows this list)
spiders/: the directory where you define your own spiders
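For pipelines.py, a minimal sketch along the following lines writes every scraped item to a JSON-lines file; the class name and output file name are placeholders, and the pipeline still has to be enabled in settings.py (ITEM_PIPELINES).

# pipelines.py -- a minimal sketch; class name and file name are placeholders
import json

class TutorialPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # called for every item the spider yields; return it to keep it
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()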


There are several types of spiders:
1) Spider: the most basic spider; the other spider classes generally inherit from it. It requests the start URLs and, for each response that comes back, calls the parse method by default.
2) CrawlSpider: inherits from Spider and is the one used most in practice. You define Rule objects that decide which pages to follow and how to process them. Note: avoid using the name parse for a rule callback, because that would override the inherited Spider.parse method and cause errors. The most important part is writing the rules, which requires analysing the pages of the specific site; a short sketch follows this list.
3) XMLFeedSpider and CSVFeedSpider
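A minimal CrawlSpider sketch is shown below, using the import paths of recent Scrapy versions; the domain, link patterns, and extracted fields are placeholders rather than part of the original article.

# a minimal CrawlSpider sketch; domain, link patterns and fields are placeholders
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # follow listing pages, and send article pages to parse_item
    # (note that the callback is deliberately NOT called "parse")
    rules = (
        Rule(LinkExtractor(allow=r'/list/'), follow=True),
        Rule(LinkExtractor(allow=r'/article/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//title/text()').extract_first(),
            'url': response.url,
        }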

(1) Open a command line and execute: scrapy startproject tutorial (tutorial is the project name)
(2) scrapy.cfg is the project configuration file; the spiders you write go under the spiders/ directory
(3) Writing the spider: the name attribute is important, and different spiders cannot use the same name
start_urls is the list of pages the spider starts crawling from and can contain several URLs
parse is the default callback invoked when the spider has fetched a page, so avoid using that name for your own methods
When the spider has downloaded a URL, it calls parse and passes it a response argument; response contains the content of the fetched page, and inside parse you extract the data you want from it
(4) Start the crawl: go into the generated project root directory tutorial/ and execute scrapy crawl dmoz, where dmoz is the name of the spider
(5) Save objects: add classes to items.py that describe the data we want to save; a minimal spider that fills this item is sketched after the class below

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
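A minimal spider that fills DmozItem might look like the sketch below; the start URL and XPath expressions are placeholders and depend on the page actually being crawled.

# spiders/dmoz_spider.py -- a minimal sketch; URL and XPath expressions are placeholders
from scrapy.spiders import Spider
from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = 'dmoz'  # must be unique among the project's spiders
    start_urls = ['http://www.example.com/some-listing-page']

    def parse(self, response):
        # parse() is the default callback: build one item per entry on the page
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract_first()
            item['link'] = sel.xpath('a/@href').extract_first()
            item['desc'] = sel.xpath('text()').extract_first()
            yield item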
(6) To save the extracted items to a file, execute: scrapy crawl dmoz --set FEED_URI=items.json --set FEED_FORMAT=json
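On recent Scrapy versions the shorter form scrapy crawl dmoz -o items.json does the same thing.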
(7) Let Scrapy automatically crawl all the links on a page

Extract the links we need inside the parse method, construct Request objects for them, and return (or yield) them; Scrapy will then crawl those links automatically.
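For example, a parse method along these lines hands the discovered links back to Scrapy; the //a/@href XPath is just a placeholder for whatever link selection the site needs.

# a sketch of link following; parse() here is meant to be a method of your spider class
from scrapy import Request

def parse(self, response):
    # ... extract the items of the current page here ...
    for href in response.xpath('//a/@href').extract():
        # yielding a Request asks Scrapy to schedule and crawl the linked page as well
        yield Request(response.urljoin(href), callback=self.parse)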
