Python Crawler Framework Scrapy Example (II)


Target task: use the Scrapy framework to crawl all the large categories on the Sina news guide page, the small categories under each of them, the article links inside each small category, and the news content behind each of those links, and finally save everything to local files.

The large categories and the small categories under them look like this:

Clicking one of the small categories (here, "Domestic") leads to a page like the following (partial view):

Inspecting the page elements reveals the sub-links inside the small category:

With one of these sub-links, a request can be sent to fetch the content of the corresponding news article; a quick way to verify the selectors is shown in the sketch below.
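Before writing the spider, the Scrapy shell can be pointed at the guide page to check the XPath selectors interactively. This is only an illustrative sketch; it assumes the category list still lives in a div with id "tab01", which is the same assumption the spider below makes.

# illustrative sketch: verify the selectors interactively
scrapy shell 'http://news.sina.com.cn/guide/'

# inside the shell prompt:
>>> response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()      # large-category titles
>>> response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()    # small-category URLs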

First, create the Scrapy project and the crawler:

# create the project
scrapy startproject sinaNews

# create the crawler (run inside the project directory)
scrapy genspider sina "sina.com.cn"
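After these two commands, the generated project skeleton typically looks roughly like this (the exact file list varies slightly across Scrapy versions):

sinaNews/
    scrapy.cfg                # deploy/project configuration
    sinaNews/
        __init__.py
        items.py              # item definitions (step I)
        pipelines.py          # item pipelines (step III)
        settings.py           # project settings (step IV)
        spiders/              # spider code (step II)
            __init__.py
            sina.py           # created by scrapy genspider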

I. Create the item file based on the fields to be crawled:

# -*- coding: utf-8 -*-
import scrapy
import sys

reload(sys)
sys.setdefaultencoding("utf-8")


class SinanewsItem(scrapy.Item):
    # title and URL of the large category
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()

    # title and URL of the small category
    subTitle = scrapy.Field()
    subUrls = scrapy.Field()

    # storage path of the small-category directory
    subFilename = scrapy.Field()

    # sub-links (article URLs) under the small category
    sonUrls = scrapy.Field()

    # article title and content
    head = scrapy.Field()
    content = scrapy.Field()
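A SinanewsItem behaves like a dictionary that only accepts the declared fields; the following toy lines (the values are made up) show the pattern the spider uses later:

# toy illustration; the values are made up
item = SinanewsItem()
item['parentTitle'] = u'sports'
item['parentUrls'] = 'http://sports.sina.com.cn/'
print(item['parentTitle'])
# assigning an undeclared field, e.g. item['foo'] = 1, raises KeyError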

II. Write the spider file

# -*- coding: utf-8 -*-
import scrapy
import os
import sys
from sinaNews.items import SinanewsItem

reload(sys)
sys.setdefaultencoding("utf-8")


class SinaSpider(scrapy.Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://news.sina.com.cn/guide/']

    def parse(self, response):
        items = []
        # URLs and titles of all large categories
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
        parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()
        # URLs and titles of all small categories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # walk through all large categories
        for i in range(0, len(parentTitle)):
            # path and directory name for the large-category directory
            parentFilename = "./data/" + parentTitle[i]
            # create the directory if it does not exist
            if not os.path.exists(parentFilename):
                os.makedirs(parentFilename)

            # walk through all small categories
            for j in range(0, len(subUrls)):
                item = SinanewsItem()

                # save the title and URL of the large category
                item['parentTitle'] = parentTitle[i]
                item['parentUrls'] = parentUrls[i]

                # check whether the small-category URL starts with the large-category URL,
                # e.g. sports.sina.com.cn and sports.sina.com.cn/nba
                if_belong = subUrls[j].startswith(item['parentUrls'])

                # if it belongs to this large category, put its storage directory inside it
                if if_belong:
                    subFilename = parentFilename + '/' + subTitle[j]
                    # create the directory if it does not exist
                    if not os.path.exists(subFilename):
                        os.makedirs(subFilename)

                    # store the small-category URL, title and directory name
                    item['subUrls'] = subUrls[j]
                    item['subTitle'] = subTitle[j]
                    item['subFilename'] = subFilename
                    items.append(item)

        # send a request for each small-category URL; the response is passed,
        # together with the meta data, to the callback second_parse()
        for item in items:
            yield scrapy.Request(url=item['subUrls'], meta={'meta_1': item},
                                 callback=self.second_parse)

    # handle the pages returned for the small categories and request their article links
    def second_parse(self, response):
        # take out the meta data carried by this response
        meta_1 = response.meta['meta_1']

        # take out all sub-links on the small-category page
        sonUrls = response.xpath('//a/@href').extract()

        items = []
        for i in range(0, len(sonUrls)):
            # keep only links that start with the large-category URL and end with .shtml
            if_belong = sonUrls[i].endswith('.shtml') and \
                sonUrls[i].startswith(meta_1['parentUrls'])

            # if the link belongs to this category, copy the fields into a new item for transfer
            if if_belong:
                item = SinanewsItem()
                item['parentTitle'] = meta_1['parentTitle']
                item['parentUrls'] = meta_1['parentUrls']
                item['subUrls'] = meta_1['subUrls']
                item['subTitle'] = meta_1['subTitle']
                item['subFilename'] = meta_1['subFilename']
                item['sonUrls'] = sonUrls[i]
                items.append(item)

        # send a request for each article link; the response is passed,
        # together with the meta data, to the callback detail_parse()
        for item in items:
            yield scrapy.Request(url=item['sonUrls'], meta={'meta_2': item},
                                 callback=self.detail_parse)

    # parse the article page and extract its title and content
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()').extract_first()
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()

        # merge the text of all <p> tags into one string
        for content_one in content_list:
            content += content_one

        item['head'] = head
        item['content'] = content
        yield item
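The key filter in second_parse() is the combination of endswith('.shtml') and startswith(parentUrls); as a standalone illustration with made-up URLs, it behaves like this:

# standalone illustration of the link filter; the URLs are made up
parent_url = 'http://sports.sina.com.cn/'
son_urls = [
    'http://sports.sina.com.cn/nba/doc-example.shtml',   # kept: same large category, ends with .shtml
    'http://sports.sina.com.cn/nba/',                    # dropped: does not end with .shtml
    'http://ent.sina.com.cn/doc-example.shtml',          # dropped: belongs to another large category
]
kept = [u for u in son_urls if u.endswith('.shtml') and u.startswith(parent_url)]
print(kept)   # only the first URL survives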

III. Write the pipelines file

# -*- coding: utf-8 -*-
from scrapy import signals
import sys

reload(sys)
sys.setdefaultencoding("utf-8")


class SinanewsPipeline(object):
    def process_item(self, item, spider):
        sonUrls = item['sonUrls']

        # the file name is the middle part of the article URL with '/' replaced by '_',
        # saved in .txt format
        filename = sonUrls[7:-6].replace('/', '_')
        filename += ".txt"

        fp = open(item['subFilename'] + '/' + filename, 'w')
        fp.write(item['content'])
        fp.close()
        return item
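To see what the sonUrls[7:-6] slice produces, here is a short worked example with a made-up article URL:

# worked example of the filename derivation; the URL is made up
sonUrls = 'http://sports.sina.com.cn/nba/doc-example.shtml'
# drop the leading 'http://' (7 characters) and the trailing '.shtml' (6 characters),
# then replace '/' with '_' so the result can be used as a file name
filename = sonUrls[7:-6].replace('/', '_') + '.txt'
print(filename)   # sports.sina.com.cn_nba_doc-example.txt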

IV. Configure the settings file

# set up the item pipeline
ITEM_PIPELINES = {
    'sinaNews.pipelines.SinanewsPipeline': 300,
}
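If more than one pipeline were registered, the integer is the priority that determines the order in which items pass through them (lower values run first, and the value must lie between 0 and 1000); a purely hypothetical example:

# hypothetical settings snippet; SomeOtherPipeline does not exist in this project
ITEM_PIPELINES = {
    'sinaNews.pipelines.SinanewsPipeline': 300,
    'sinaNews.pipelines.SomeOtherPipeline': 800,   # would run after SinanewsPipeline
}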

Execute the crawl command:

scrapy crawl sina
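Optionally, Scrapy's built-in feed export can also dump every item to a JSON file, independently of the .txt files written by the pipeline:

scrapy crawl sina -o items.json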

The effect is as follows:

Open the data directory under the working directory; it contains one folder per large category.

Opening a large-category folder shows the small-category folders inside it:

Opening a small-category folder shows the saved articles:
