What is Scrapy?
Scrapy is an application framework written to crawl web sites and extract structured data. Put simply, it is a powerful crawler framework.
Why use this framework?
Because of its powerful features (a settings sketch follows this list):
- Uses Twisted to download pages, achieving concurrency
- Parses HTML with lxml
- Can set a proxy
- Can set a download delay
- Supports customizable deduplication
- Can crawl depth-first or breadth-first
- Can be combined with Redis to build a distributed crawler
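Most of these features map onto settings or pluggable components. Below is a minimal, illustrative sketch of how they might be switched on in settings.py; the values are examples, and the scrapy-redis scheduler is a separate package, not part of this project:

# settings.py - illustrative sketch only
DOWNLOAD_DELAY = 2              # delay between requests to the same site, in seconds
DEPTH_LIMIT = 3                 # maximum recursion depth

# breadth-first crawling (Scrapy defaults to depth-first), as described in the Scrapy FAQ
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# deduplication: RFPDupeFilter is the default; point this at your own class to customize it
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# proxies are usually set per request via request.meta['proxy'] (handled by HttpProxyMiddleware)

# distributed crawling with Redis requires the separate scrapy-redis package
# SCHEDULER = 'scrapy_redis.scheduler.Scheduler'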
Installation:
Linux:
pip3 install scrapy
Windows:
First download the matching Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/
For example Twisted-17.5.0-cp36-cp36m-win_amd64.whl: cp36 refers to the Python 3.6 interpreter and win_amd64 refers to a 64-bit Windows system, so download the version that matches your environment.
Then install it: pip install Twisted-17.5.0-cp36-cp36m-win_amd64.whl
Then there are two more pip installs:
pip install scrapy
pip install pypiwin32
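To check that the installation worked, running
scrapy version
in a terminal should print the installed Scrapy version.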
Its architecture diagram is as follows (the diagram shows the engine coordinating the scheduler, the downloader and the spiders, with items flowing to the item pipelines and with downloader and spider middlewares sitting in between):
How to create a crawler?
- Create a crawler project:
scrapy startproject sp2   (sp2 is the project name)
- Enter the project and create a spider (the generated skeleton is shown after these commands):
cd sp2
scrapy genspider chouti chouti.com   (chouti is the spider name, chouti.com is the domain the spider is restricted to)
- Run the spider:
scrapy crawl chouti   (chouti is the spider name)
- Usually we do not want to see the log:
scrapy crawl chouti --nolog
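For reference, the spider that scrapy genspider generates looks roughly like this (the exact template varies slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # the callback Scrapy invokes with each downloaded response
        pass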
Project structure diagram:
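The layout generated by startproject looks roughly like this (the author's screenshot also shows a custom helper script, 'set start url.py', which is not part of the standard template):

sp2/
    scrapy.cfg
    sp2/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            chouti.py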
The 'set start url.py' script (a helper the author added) is not required.
scrapy.cfg is a simple configuration file.
settings.py is the detailed configuration file.
items.py and pipelines.py are used for formatting and persisting the scraped data.
middlewares.py is used to write middleware.
The spiders folder holds the spider files, which parse the data and define the callback functions; through its two kinds of yield, a spider hands requests to the scheduler and items to the pipelines.
This is the simplest kind of crawler in the Scrapy framework.
Example:
Here is an example that crawls pictures of pretty girls from the web, to get a feel for the running flow of a simple Scrapy crawler:
1. The code in the spiders folder, in the spider file xiaohuar.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request
# import requests
# import urllib.request


class XiaohuarSpider(scrapy.Spider):
    name = 'xiaohuar'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        hxs = Selector(response=response)  # unlike BeautifulSoup, you do not pass response.text as the argument
        # XPath copied straight from the browser
        # // at the front of an expression searches the whole HTML document (all descendants);
        # / cannot appear at the front - in the middle it means direct children, and a trailing @attr or text() takes the attribute value or the text;
        # .// or *// at the front searches the descendants of the current node; a leading ./ or */ (or nothing) means the current node's direct children
        girl_list = hxs.xpath('//*[@id="list_img"]/div/div[1]/div')
        # img_list = []
        count = 1
        for girl in girl_list:  # 25 objects really are selected here
            print(count)  # prints 1 to 25, proving girl_list holds 25 objects - yet only the first 10 URLs are downloaded. Why?
            count += 1
            text = girl.xpath('div[1]/div[2]/span/a/text()').extract_first()  # the text found here is the girl's name
            # self.filename = text
            img = girl.xpath('div[1]/div[1]/a/img/@src').extract_first()
            # self.url = 'http://www.xiaohuar.com' + img
            url = 'http://www.xiaohuar.com' + img
            img_path = r'f:\crawler\%s.jpg' % text
            # res = requests.get(url).content
            # urllib.request.urlretrieve(url, img_path)
            # img_list.append(url)
            # print(text, img)
            from ..items import Sp1Item
            # yield Sp1Item(url=img_url, text=self.filename)
            yield Sp1Item(url=url, text=img_path)

        result = hxs.xpath('//*[@id="page"]/div/a/@href')
        # print(result)
        # print(result.extract_first())
        # print(result.extract())
        # yield Item(xxxx)  # pseudo-code: the item goes to items.py and then to pipelines.py for persistence
        # recursion
        result = result.extract()  # sure enough, once result is a list of strings the code below runs correctly (the video glossed over this)
        for url1 in result:  # if result were left as a selector object instead of a list, this loop and everything after it would not work
            # print(url1)
            yield Request(url=url1, callback=self.parse)  # the url is handed back to the scheduler, just like start_urls, and parse runs again for it
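A quick note on the selector API used above: xpath() returns a SelectorList, extract_first() returns the first match as a string (or None if nothing matched), and extract() turns the whole result into a list of strings, which is why the pagination loop only works after calling extract():

# inside parse(), reusing the hxs selector built above
links = hxs.xpath('//*[@id="page"]/div/a/@href')   # a SelectorList, not yet plain strings
first = links.extract_first()                      # the first href as a string, or None
hrefs = links.extract()                            # a list of strings - safe to iterate and feed into Request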
The spider is mainly divided into parsing plus the two yields.
Parsing uses the module:
from scrapy.selector import Selector
The two yields are used, respectively, for persisting the items and for recursively following the pagination links, as sketched below.
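A stripped-down sketch of that two-yield pattern (this spider is purely illustrative; the item values and the next-page path are placeholders):

import scrapy
from scrapy.http import Request
from ..items import Sp1Item   # works when placed inside the project, like the spider above


class TwoYieldSketch(scrapy.Spider):
    name = 'two_yield_sketch'
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        # 1st yield: an Item -> the engine routes it to the item pipelines for persistence
        yield Sp1Item(url='http://www.xiaohuar.com/some.jpg', text=r'f:\crawler\some.jpg')
        # 2nd yield: a Request -> the engine hands it to the scheduler; its callback parses the next page
        next_page = response.urljoin('page2.html')   # placeholder path, for illustration only
        yield Request(url=next_page, callback=self.parse)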
Next comes the code in items.py and pipelines.py:
import scrapy


class Sp1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # image_urls = scrapy.Field()
    # images = scrapy.Field()
    # image_path = scrapy.Field()
    # pass
    url = scrapy.Field()
    text = scrapy.Field()
    # print(url)
    # print(text)
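An Item instance behaves like a dict, which is what the pipeline below relies on when it reads item['url'] and item['text']. For example:

from ..items import Sp1Item   # inside the project

item = Sp1Item(url='http://www.xiaohuar.com/some.jpg', text=r'f:\crawler\some.jpg')
print(item['url'])   # dict-style access to the declared fields, exactly as the pipeline does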
# import urllib.request
# import requests


class Sp1Pipeline(object):
    def __init__(self):
        self.f = None
        # self.res = None

    def process_item(self, item, spider):
        import requests
        res = requests.get(item['url'])     # actually download the image (building a scrapy Request here would not fetch anything)
        self.f = open(item['text'], 'wb')   # item['text'] already holds the full path, e.g. f:\crawler\name.jpg
        self.f.write(res.content)
        self.f.close()
        print(item)
        # if spider.name == 'xiaohuarvideo':
        #     vname = r'f:\crawler\video\%s.mp4' % item['url']
        #     # urllib.request.urlretrieve(item['url'], vname)
        #     res = requests.get(item['url'])
        #     with open(vname, 'wb') as f:
        #         f.write(res.content)
        #     print('%s download complete' % item['url'])
        # pass
        return item

    def open_spider(self, spider):
        """
        Called when the spider starts.
        :param spider:
        :return:
        """
        print('spider start')
        # self.f = open('%s.jpg' % name, 'wb')

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        print('spider end')
        # self.f.close()
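The commented-out image_urls / images fields in items.py hint at an alternative: Scrapy's built-in ImagesPipeline, which downloads and stores images by itself (it requires Pillow). A minimal sketch of that approach, not used in this project:

# settings.py (sketch)
# ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# IMAGES_STORE = r'f:\crawler\images'

# items.py (sketch) - image_urls / images are the field names ImagesPipeline expects
import scrapy


class ImageItem(scrapy.Item):          # illustrative class name
    image_urls = scrapy.Field()        # list of URLs to download
    images = scrapy.Field()            # filled in by the pipeline with the download results

# in the spider, instead of downloading manually:
# yield ImageItem(image_urls=[url])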
Of course, the configuration file settings.py also needs to be set up:
# set the recursion depth of the crawl
DEPTH_LIMIT = 1

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'sp1 (+http://www.yourdomain.com)'

# whether to obey the robots.txt rules (the crawler protocol)
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# delay the download by this many seconds
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'sp1.middlewares.Sp1SpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'sp1.middlewares.Sp1DownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# register the persistence pipeline and its priority, typically 0 to 1000; the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,
}
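On the pipeline priority numbers: if a second pipeline were registered (the name below is purely hypothetical), the one with the smaller number would run first:

ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,      # runs first (smaller number = higher priority)
    'sp1.pipelines.BackupPipeline': 500,   # hypothetical second pipeline, runs afterwards
}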
In the end it crawled more than 1000 pictures of pretty girls (though actually they are all younger than me).
Of course, Scrapy also has many advanced features; this example is just a basic Scrapy crawler.
Learning experience:
The Scrapy framework is the most mainstream crawler framework, and studying it over these past few days has made me feel why a framework is necessary:
It is so powerful that even weak programming enthusiasts like me can build a small crawler that looks good, performs well, and, small as it is, has everything it needs.
Learning and using the framework also let me experience what good code is, for example high extensibility and the point of having a configuration file: with small changes to the configuration file according to your actual needs, you can achieve a different effect.
Of course, middleware and signals are even more powerful, and I still need to learn them. One last sentence: life is short, I use Python!