Python Learning, Part 3: The Scrapy Framework


What is Scrapy?

Scrapy is an application framework written to crawl websites and extract structured data; simply put, it is a powerful crawler framework.

Why use this framework?

Because of its powerful features (several of them are plain settings; see the sketch after this list):

- Built on Twisted, so page downloads run concurrently
- Parses HTML with lxml
- Proxies can be set
- Download delays can be set
- Request de-duplication can be customized
- Depth-first or breadth-first crawling can be chosen
- Combined with Redis, it can run as a distributed crawler
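
Most of these capabilities are exposed through settings or per-request options. The following is a minimal, illustrative sketch of how a few of them could be turned on in settings.py; the values and the choice of options are assumptions for demonstration, not part of the original post:

# settings.py -- illustrative values only
DOWNLOAD_DELAY = 2          # wait 2 seconds between requests to the same site
DEPTH_LIMIT = 2             # stop recursing past depth 2
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'   # default de-duplication; point this at your own class to customize

# switch from the default depth-first order to breadth-first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

A proxy is usually set per request through request.meta['proxy'], and the Redis-based distributed crawling comes from the separate scrapy-redis package, which replaces the scheduler and duplicate filter.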

Installation:

Linux:
pip3 install scrapy

Windows:

Download the matching Twisted wheel from: http://www.lfd.uci.edu/~gohlke/pythonlibs/

For example, for Twisted-17.5.0-cp36-cp36m-win_amd64.whl, "cp36" refers to the Python interpreter version (3.6) and "win_amd64" means a 64-bit Windows system; download the version that fits your setup.

Then install it: pip install Twisted-17.5.0-cp36-cp36m-win_amd64.whl

Then run two more pip installs:
pip install scrapy
pip install pypiwin32
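
If the installation succeeded, a quick sanity check (not part of the original instructions) is to ask Scrapy for its version:

scrapy version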

Its architecture diagram is not reproduced here; it shows the engine coordinating the scheduler, the downloader, the spiders, and the item pipelines, with the downloader middleware and spider middleware sitting in between.

How to create a crawler?

- Create a crawler project:
scrapy startproject sp2 (sp2 is the project name)

- Enter the project and create a spider:
cd sp2
scrapy genspider chouti chouti.com (chouti is the spider name; chouti.com is the domain the spider is allowed to crawl)

- Run the spider:
scrapy crawl chouti (chouti is the spider name)

In general, we do not want to see the log output, so we run:

scrapy crawl chouti --nolog
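
For reference, the spider file that scrapy genspider generates looks roughly like this (a sketch; the exact boilerplate varies slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # parsing logic and yield statements go here
        pass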

Project structure diagram (the screenshot is not reproduced here; a typical layout is sketched below):

The file 'set start url.py' that appears in the original screenshot is not part of the standard layout and is not required.
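
Assuming the project name sp2 from the commands above, a freshly generated project typically looks like this:

sp2/
    scrapy.cfg          # simple top-level configuration
    sp2/
        __init__.py
        items.py        # item definitions
        middlewares.py  # spider and downloader middleware
        pipelines.py    # item pipelines (persistence)
        settings.py     # detailed project settings
        spiders/
            __init__.py
            chouti.py   # spiders created with genspider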

scrapy.cfg is a simple, top-level configuration file.

settings.py is the detailed configuration file.

items.py and pipelines.py are used to structure and persist (serialize) the scraped data.

middlewares.py is where middleware is written.

The spiders folder holds the crawler files, which parse the data and define the callback functions; through the two kinds of yield they hand data to the scheduler and to the pipelines.

That is the simplest kind of crawler in the Scrapy framework.

Example:

Below is an example that crawls pictures of pretty girls from the web, to get a feel for the running flow of a simple Scrapy crawler:

1. The code in the spiders folder, in the crawler file xiaohuar.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request
from ..items import Sp1Item
# import requests
# import urllib.request


class XiaohuarSpider(scrapy.Spider):
    name = 'xiaohuar'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        hxs = Selector(response=response)  # unlike BeautifulSoup, Selector takes the response object, not response.text
        # XPath copied from the browser ("Copy XPath")
        girl_list = hxs.xpath('//*[@id="list_img"]/div/div[1]/div')
        # // at the front of an expression searches the whole HTML; elsewhere it searches descendants
        # / cannot appear at the front; in the middle it selects children; a trailing @attr or text() returns the attribute value or the text
        # .// or *// at the front searches the current node's descendants; a leading ./ or */ (or nothing at all) searches the current node's children
        # img_list = []
        count = 1
        for girl in girl_list:  # 25 objects really are selected here
            print(count)  # prints 1 to 25, proving girl_list holds 25 objects, yet only the first 10 URLs get downloaded -- why?
            count += 1
            text = girl.xpath('div[1]/div[2]/span/a/text()').extract_first()  # the girl's name
            # self.filename = text
            img = girl.xpath('div[1]/div[1]/a/img/@src').extract_first()
            # self.url = 'http://www.xiaohuar.com' + img
            url = 'http://www.xiaohuar.com' + img
            img_path = r'f:\crawler\%s.jpg' % text
            # res = requests.get(url).content
            # urllib.request.urlretrieve(url, img_path)
            # img_list.append(url)
            # print(text, img)

            # yield Sp1Item(url=img_url, text=self.filename)
            yield Sp1Item(url=url, text=img_path)  # goes through items.py and then pipelines.py for persistence

        result = hxs.xpath('//*[@id="page"]/div/a/@href')
        # print(result)
        # print(result.extract_first())
        # print(result.extract())
        # recursion: extract() turns the selector into a list of strings; without it the loop below never runs
        # (the teacher glossed over this in the video)
        result = result.extract()
        for url1 in result:  # if result were still a selector object rather than a list, this loop would never execute
            # the URL is scheduled just like the start_urls entries and parse() runs again on the response
            yield Request(url=url1, callback=self.parse)

The spider consists mainly of the parsing logic and the two yield statements.

Parsing uses the selector module:

from scrapy.selector import Selector

The two yields are used, respectively, to persist the items and to recurse over the picture pages.
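
In outline, the pattern looks like this (a minimal sketch; image_url, image_name and next_page_url are hypothetical placeholders, not variables from the post):

def parse(self, response):
    # persist: yielding an item hands it to the item pipelines
    yield Sp1Item(url=image_url, text=image_name)
    # recurse: yielding a Request hands a new URL back to the scheduler, with this method as the callback
    yield scrapy.Request(url=next_page_url, callback=self.parse)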

Next comes the code in items.py and pipelines.py.

import scrapy


class Sp1Item(scrapy.Item):
    # define the fields for your item here, like:
    # name = scrapy.Field()
    # image_urls = scrapy.Field()
    # images = scrapy.Field()
    # image_path = scrapy.Field()
    url = scrapy.Field()
    text = scrapy.Field()
# import urllib.request
# import requests


class Sp1Pipeline(object):
    def __init__(self):
        self.f = None
        # self.res = None

    def process_item(self, item, spider):
        import requests
        # download the image and write it to the path built in the spider (item['text'])
        res = requests.get(item['url'])
        self.f = open(item['text'], 'wb')
        self.f.write(res.content)
        self.f.close()
        print(item)
        # if spider.name == 'xiaohuarvideo':
        #     vname = r'f:\crawler\video\%s.mp4' % item['url']
        #     # urllib.request.urlretrieve(item['url'], vname)
        #     res = requests.get(item['url'])
        #     with open(vname, 'wb') as f:
        #         f.write(res.content)
        #     print('%s download complete' % item['url'])
        return item

    def open_spider(self, spider):
        """Called when the spider starts running."""
        print('spider started')
        # self.f = open('%s.jpg' % name, 'wb')

    def close_spider(self, spider):
        """Called when the spider shuts down."""
        print('spider finished')
        # self.f.close()

Of course, the settings.py configuration file also has to be set up:

# set the recursion depth of the crawl
DEPTH_LIMIT = 1

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'sp1 (+http://www.yourdomain.com)'

# whether to obey the robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# delay the download by this many seconds
# DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'sp1.middlewares.Sp1SpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'sp1.middlewares.Sp1DownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# register the persistence pipeline and its priority, typically 0 to 1000; the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,
}

In the end it downloaded more than 1,000 pictures of pretty young ladies (all of them younger than me, as it happens).

Of course, Scrapy has many more advanced features; this example is only a basic Scrapy crawler.

Learning experience:

Scrapy is the most mainstream crawler framework in Python, and these few days of study have shown me why such a framework is worth having:

It is powerful enough that even weak programming enthusiasts like me can build a small crawler that looks good, performs well, and is complete despite its size.

Learning and using the framework also gave me a feel for what good code looks like, for example high extensibility: once you know the configuration file exists, small changes to it, made according to your actual needs, are enough to produce a different behavior.

Of course, the middleware and the signals are even more powerful, and I still need to learn them. One last word: life is short, I use Python!

