Target task: use the Scrapy framework to crawl all the large categories on the Sina news guide page, the small categories under each large category, the sub-links inside each small category, and the news content of each sub-link page, and finally save everything locally.
The large categories and the small categories under them look like this:
Clicking a small category such as "Domestic" leads to a page like the following (partial view):
Inspecting the page elements shows the sub-links inside the small category:
With a sub-link, a request can be sent to fetch the content of the corresponding news article.
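Before writing the spider, it can help to confirm the XPath expressions against the live guide page in a scrapy shell session. A minimal sketch (these are the same selectors used in the spider below; the exact output depends on the current page layout):

scrapy shell "http://news.sina.com.cn/guide/"
# inside the shell:
>>> response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()    # large-category titles
>>> response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()     # large-category URLs
>>> response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()  # small-category URLs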
First, create the Scrapy project and the crawler:
# create the project
scrapy startproject sinaNews
# create the crawler
scrapy genspider sina "sina.com.cn"
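After these two commands the project has the standard Scrapy layout (the spider file name comes from genspider; newer Scrapy versions also generate a middlewares.py):

sinaNews/
    scrapy.cfg
    sinaNews/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            sina.py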
I. Create the item file with the fields to be crawled:
# -*- coding: utf-8 -*-
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


class SinanewsItem(scrapy.Item):
    # title and URL of the large category
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()

    # title and URL of the small category
    subTitle = scrapy.Field()
    subUrls = scrapy.Field()

    # storage path of the small-category directory
    subFilename = scrapy.Field()

    # sub-links under the small category
    sonUrls = scrapy.Field()

    # article title and content
    head = scrapy.Field()
    content = scrapy.Field()
II. Write the spider file
# -*- coding: utf-8 -*-
import os
import sys
import scrapy
from sinaNews.items import SinanewsItem
reload(sys)
sys.setdefaultencoding("utf-8")


class SinaSpider(scrapy.Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://news.sina.com.cn/guide/']

    def parse(self, response):
        items = []
        # URLs and titles of all large categories
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
        parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()
        # URLs and titles of all small categories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # iterate over all large categories
        for i in range(0, len(parentTitle)):
            # path and name of the large-category directory
            parentFilename = "./Data/" + parentTitle[i]
            # create the directory if it does not exist
            if not os.path.exists(parentFilename):
                os.makedirs(parentFilename)

            # iterate over all small categories
            for j in range(0, len(subUrls)):
                item = SinanewsItem()

                # save the title and URL of the large category
                item['parentTitle'] = parentTitle[i]
                item['parentUrls'] = parentUrls[i]

                # check whether the small-category URL starts with the large-category URL,
                # e.g. sports.sina.com.cn and sports.sina.com.cn/nba
                if_belong = subUrls[j].startswith(item['parentUrls'])

                # if it belongs to this large category, put its storage directory
                # inside the large-category directory
                if if_belong:
                    subFilename = parentFilename + '/' + subTitle[j]
                    # create the directory if it does not exist
                    if not os.path.exists(subFilename):
                        os.makedirs(subFilename)

                    # store the small-category URL, title and filename fields
                    item['subUrls'] = subUrls[j]
                    item['subTitle'] = subTitle[j]
                    item['subFilename'] = subFilename
                    items.append(item)

        # send a request for each small-category URL; the response is handed, together
        # with the meta data, to the callback second_parse
        for item in items:
            yield scrapy.Request(url=item['subUrls'], meta={'meta_1': item}, callback=self.second_parse)

    # for the small-category pages that were returned, issue further requests
    def second_parse(self, response):
        # extract the meta data of each response
        meta_1 = response.meta['meta_1']

        # take out all the sub-links in the small category
        sonUrls = response.xpath('//a/@href').extract()

        items = []
        for i in range(0, len(sonUrls)):
            # keep links that start with the large-category URL and end with .shtml
            if_belong = sonUrls[i].endswith('.shtml') and sonUrls[i].startswith(meta_1['parentUrls'])

            # if the link belongs to this category, copy the field values into a new item
            # so they can be passed along
            if if_belong:
                item = SinanewsItem()
                item['parentTitle'] = meta_1['parentTitle']
                item['parentUrls'] = meta_1['parentUrls']
                item['subUrls'] = meta_1['subUrls']
                item['subTitle'] = meta_1['subTitle']
                item['subFilename'] = meta_1['subFilename']
                item['sonUrls'] = sonUrls[i]
                items.append(item)

        # send a request for each sub-link; the response is handed, together with the
        # meta data, to the callback detail_parse
        for item in items:
            yield scrapy.Request(url=item['sonUrls'], meta={'meta_2': item}, callback=self.detail_parse)

    # parse the article page and get its title and content
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()').extract_first()
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()

        # concatenate the text of all <p> tags
        for content_one in content_list:
            content += content_one

        item['head'] = head
        item['content'] = content
        yield item
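One compatibility note: the reload(sys) / sys.setdefaultencoding("utf-8") lines are a Python 2 workaround for encoding errors. Under Python 3 and current Scrapy they can simply be removed; a sketch of the same spider header without them (an adaptation, not part of the original code):

# -*- coding: utf-8 -*-
# Python 3 variant: no reload(sys) / setdefaultencoding needed
import os
import scrapy
from sinaNews.items import SinanewsItem


class SinaSpider(scrapy.Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://news.sina.com.cn/guide/']
    # ... the parse, second_parse and detail_parse methods stay the same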
III. Write the pipelines file
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


class SinanewsPipeline(object):
    def process_item(self, item, spider):
        sonUrls = item['sonUrls']

        # the file name is the middle part of the sub-link URL with '/' replaced by '_',
        # saved in .txt format
        filename = sonUrls[7:-6].replace('/', '_')
        filename += ".txt"

        fp = open(item['subFilename'] + '/' + filename, 'w')
        fp.write(item['content'])
        fp.close()
        return item
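To make the slicing concrete: sonUrls[7:-6] strips the leading "http://" (7 characters) and the trailing ".shtml" (6 characters). A small illustration with a made-up sub-link URL (the URL itself is hypothetical):

# hypothetical sub-link URL, for illustration only
sonUrls = 'http://news.sina.com.cn/c/2017-01-01/doc-example.shtml'
middle = sonUrls[7:-6]                        # 'news.sina.com.cn/c/2017-01-01/doc-example'
filename = middle.replace('/', '_') + '.txt'  # 'news.sina.com.cn_c_2017-01-01_doc-example.txt'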
IV. Configure the settings file
# enable the item pipeline
ITEM_PIPELINES = {
    'sinaNews.pipelines.SinanewsPipeline': 300,
}
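Optionally, a couple of other settings can be added alongside the pipeline entry, for example to slow the crawl down and quiet the log (standard Scrapy settings, not part of the original tutorial):

DOWNLOAD_DELAY = 0.5   # wait half a second between requests
LOG_LEVEL = 'INFO'     # less verbose console output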
Execute the crawl command:
scrapy crawl sina
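For a quick test before the full crawl, the number of scraped items can be capped via the CloseSpider extension settings (an optional tip, not part of the original tutorial):

scrapy crawl sina -s CLOSESPIDER_ITEMCOUNT=10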
The effect is as follows:
Opening the Data directory under the working directory shows one folder per large category.
Opening a large-category folder shows the small-category folders.
Opening a small-category folder shows the saved articles.