What is Scrapy?
Scrapy is an application framework written to crawl web sites and extract structured data. Put simply, it is a powerful crawler framework.
Why use this framework?
Because of its powerful features (a settings sketch follows this list):
- Uses Twisted to download pages, achieving concurrency
- Parses HTML with lxml
- Can set a proxy
- Can set a download delay
- Supports customizable deduplication
- Can crawl depth-first or breadth-first
- Can be combined with Redis to build a distributed crawler
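Most of these features map onto settings or pluggable components. Below is a minimal, illustrative sketch of how they might be switched on in settings.py; the values are examples, and the scrapy-redis scheduler is a separate package, not part of this project:

# settings.py - illustrative sketch only
DOWNLOAD_DELAY = 2              # delay between requests to the same site, in seconds
DEPTH_LIMIT = 3                 # maximum recursion depth

# breadth-first crawling (Scrapy defaults to depth-first), as described in the Scrapy FAQ
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# deduplication: RFPDupeFilter is the default; point this at your own class to customize it
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# proxies are usually set per request via request.meta['proxy'] (handled by HttpProxyMiddleware)

# distributed crawling with Redis requires the separate scrapy-redis package
# SCHEDULER = 'scrapy_redis.scheduler.Scheduler'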
Installation:
Linux:
pip3 install scrapy
Windows:
First download the matching Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/
For example Twisted-17.5.0-cp36-cp36m-win_amd64.whl: cp36 refers to the Python 3.6 interpreter and win_amd64 refers to a 64-bit Windows system, so download the version that matches your environment.
Then install it: pip install Twisted-17.5.0-cp36-cp36m-win_amd64.whl
Then there are two more pip installs:
pip install scrapy
pip install pypiwin32
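To check that the installation worked, running
scrapy version
in a terminal should print the installed Scrapy version.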
Its architecture diagram is as follows (the diagram shows the engine coordinating the scheduler, the downloader and the spiders, with items flowing to the item pipelines and with downloader and spider middlewares sitting in between):
How to create a crawler?
- Create a crawler project:
scrapy startproject sp2   (sp2 is the project name)
- Enter the project and create a spider (the generated skeleton is shown after these commands):
cd sp2
scrapy genspider chouti chouti.com   (chouti is the spider name, chouti.com is the domain the spider is restricted to)
- Run the spider:
scrapy crawl chouti   (chouti is the spider name)
- Usually we do not want to see the log:
scrapy crawl chouti --nolog
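For reference, the spider that scrapy genspider generates looks roughly like this (the exact template varies slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # the callback Scrapy invokes with each downloaded response
        pass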
Project structure diagram:
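The layout generated by startproject looks roughly like this (the author's screenshot also shows a custom helper script, 'set start url.py', which is not part of the standard template):

sp2/
    scrapy.cfg
    sp2/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            chouti.py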
The 'set start url.py' script (a helper the author added) is not required.
scrapy.cfg is a simple configuration file.
settings.py is the detailed configuration file.
items.py and pipelines.py are used for formatting and persisting the scraped data.
middlewares.py is used to write middleware.
The spiders folder holds the spider files, which parse the data and define the callback functions; through its two kinds of yield, a spider hands requests to the scheduler and items to the pipelines.
This is the simplest kind of crawler in the Scrapy framework.
Example:
Here is an example that crawls pictures of pretty girls from the web, to get a feel for the running flow of a simple Scrapy crawler:
1. The code in the spiders folder, in the spider file xiaohuar.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request
# import requests
# import urllib.request


class XiaohuarSpider(scrapy.Spider):
    name = 'xiaohuar'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        hxs = Selector(response=response)  # unlike BeautifulSoup, you do not pass response.text as the argument
        # XPath copied straight from the browser
        # // at the front of an expression searches the whole HTML document (all descendants);
        # / cannot appear at the front - in the middle it means direct children, and a trailing @attr or text() takes the attribute value or the text;
        # .// or *// at the front searches the descendants of the current node; a leading ./ or */ (or nothing) means the current node's direct children
        girl_list = hxs.xpath('//*[@id="list_img"]/div/div[1]/div')
        # img_list = []
        count = 1
        for girl in girl_list:  # 25 objects really are selected here
            print(count)  # prints 1 to 25, proving girl_list holds 25 objects - yet only the first 10 URLs are downloaded. Why?
            count += 1
            text = girl.xpath('div[1]/div[2]/span/a/text()').extract_first()  # the text found here is the girl's name
            # self.filename = text
            img = girl.xpath('div[1]/div[1]/a/img/@src').extract_first()
            # self.url = 'http://www.xiaohuar.com' + img
            url = 'http://www.xiaohuar.com' + img
            img_path = r'f:\crawler\%s.jpg' % text
            # res = requests.get(url).content
            # urllib.request.urlretrieve(url, img_path)
            # img_list.append(url)
            # print(text, img)
            from ..items import Sp1Item
            # yield Sp1Item(url=img_url, text=self.filename)
            yield Sp1Item(url=url, text=img_path)

        result = hxs.xpath('//*[@id="page"]/div/a/@href')
        # print(result)
        # print(result.extract_first())
        # print(result.extract())
        # yield Item(xxxx)  # pseudo-code: the item goes to items.py and then to pipelines.py for persistence
        # recursion
        result = result.extract()  # sure enough, once result is a list of strings the code below runs correctly (the video glossed over this)
        for url1 in result:  # if result were left as a selector object instead of a list, this loop and everything after it would not work
            # print(url1)
            yield Request(url=url1, callback=self.parse)  # the url is handed back to the scheduler, just like start_urls, and parse runs again for it
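A quick note on the selector API used above: xpath() returns a SelectorList, extract_first() returns the first match as a string (or None if nothing matched), and extract() turns the whole result into a list of strings, which is why the pagination loop only works after calling extract():

# inside parse(), reusing the hxs selector built above
links = hxs.xpath('//*[@id="page"]/div/a/@href')   # a SelectorList, not yet plain strings
first = links.extract_first()                      # the first href as a string, or None
hrefs = links.extract()                            # a list of strings - safe to iterate and feed into Request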
The spider is mainly divided into parsing plus the two yields.
Parsing uses the module:
from scrapy.selector import Selector
The two yields are used, respectively, for persisting the items and for recursively following the pagination links, as sketched below.
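A stripped-down sketch of that two-yield pattern (this spider is purely illustrative; the item values and the next-page path are placeholders):

import scrapy
from scrapy.http import Request
from ..items import Sp1Item   # works when placed inside the project, like the spider above


class TwoYieldSketch(scrapy.Spider):
    name = 'two_yield_sketch'
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        # 1st yield: an Item -> the engine routes it to the item pipelines for persistence
        yield Sp1Item(url='http://www.xiaohuar.com/some.jpg', text=r'f:\crawler\some.jpg')
        # 2nd yield: a Request -> the engine hands it to the scheduler; its callback parses the next page
        next_page = response.urljoin('page2.html')   # placeholder path, for illustration only
        yield Request(url=next_page, callback=self.parse)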
Next comes the code in items.py and pipelines.py:
import scrapy


class Sp1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # image_urls = scrapy.Field()
    # images = scrapy.Field()
    # image_path = scrapy.Field()
    # pass
    url = scrapy.Field()
    text = scrapy.Field()
    # print(url)
    # print(text)
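An Item instance behaves like a dict, which is what the pipeline below relies on when it reads item['url'] and item['text']. For example:

from ..items import Sp1Item   # inside the project

item = Sp1Item(url='http://www.xiaohuar.com/some.jpg', text=r'f:\crawler\some.jpg')
print(item['url'])   # dict-style access to the declared fields, exactly as the pipeline does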
# import urllib.request
# import requests


class Sp1Pipeline(object):
    def __init__(self):
        self.f = None
        # self.res = None

    def process_item(self, item, spider):
        import requests
        res = requests.get(item['url'])     # actually download the image (building a scrapy Request here would not fetch anything)
        self.f = open(item['text'], 'wb')   # item['text'] already holds the full path, e.g. f:\crawler\name.jpg
        self.f.write(res.content)
        self.f.close()
        print(item)
        # if spider.name == 'xiaohuarvideo':
        #     vname = r'f:\crawler\video\%s.mp4' % item['url']
        #     # urllib.request.urlretrieve(item['url'], vname)
        #     res = requests.get(item['url'])
        #     with open(vname, 'wb') as f:
        #         f.write(res.content)
        #     print('%s download complete' % item['url'])
        # pass
        return item

    def open_spider(self, spider):
        """
        Called when the spider starts.
        :param spider:
        :return:
        """
        print('spider start')
        # self.f = open('%s.jpg' % name, 'wb')

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        print('spider end')
        # self.f.close()
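The commented-out image_urls / images fields in items.py hint at an alternative: Scrapy's built-in ImagesPipeline, which downloads and stores images by itself (it requires Pillow). A minimal sketch of that approach, not used in this project:

# settings.py (sketch)
# ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# IMAGES_STORE = r'f:\crawler\images'

# items.py (sketch) - image_urls / images are the field names ImagesPipeline expects
import scrapy


class ImageItem(scrapy.Item):          # illustrative class name
    image_urls = scrapy.Field()        # list of URLs to download
    images = scrapy.Field()            # filled in by the pipeline with the download results

# in the spider, instead of downloading manually:
# yield ImageItem(image_urls=[url])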
Of course, the configuration file settings.py also needs to be set up:
# set the recursion depth of the crawl
DEPTH_LIMIT = 1

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'sp1 (+http://www.yourdomain.com)'

# whether to obey the robots.txt rules (the crawler protocol)
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# delay the download by this many seconds
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'sp1.middlewares.Sp1SpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'sp1.middlewares.Sp1DownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# register the persistence pipeline and its priority, typically 0 to 1000; the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,
}
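On the pipeline priority numbers: if a second pipeline were registered (the name below is purely hypothetical), the one with the smaller number would run first:

ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,      # runs first (smaller number = higher priority)
    'sp1.pipelines.BackupPipeline': 500,   # hypothetical second pipeline, runs afterwards
}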
In the end it crawled more than 1000 pictures of pretty girls (though actually they are all younger than me).
Of course, Scrapy also has many advanced features; this example is just a basic Scrapy crawler.
Learning experience:
The Scrapy framework is the most mainstream crawler framework, and studying it over these past few days has made me feel why a framework is necessary:
It is so powerful that even weak programming enthusiasts like me can build a small crawler that looks good, performs well, and, small as it is, has everything it needs.
Learning and using the framework also let me experience what good code is, for example high extensibility and the point of having a configuration file: with small changes to the configuration file according to your actual needs, you can achieve a different effect.
Of course, middleware and signals are even more powerful, and I still need to learn them. One last sentence: life is short, I use Python!