Pages to be crawled are often organized in a tree-like structure: for example, you must first crawl a directory page before selecting a specific crawl target within it. Since the directory page and the target page are structured differently, the same parsing strategy cannot be applied to both.
From previous experience, Scrapy uses the Spider's parse() function as the entry point. It is natural to handle the directory page in parse(), extract the URLs of the targets there, and then crawl the target content in a further step.
The code is as follows:
```python
import scrapy

# the two item classes are assumed to be defined in the project's items.py
from ..items import ZhangzishiArtItem, ZhangzishiListItem


class ZhangzishiSpider(scrapy.Spider):
    name = 'zhangzishi'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['www.zhangzishi.cc']
    start_urls = ['http://www.zhangzishi.cc/category/welfare']

    def parse_art(self, response):
        # second level: extract the image URLs from a single article page
        imgs = ZhangzishiArtItem()
        imgs['img_urls'] = response.xpath('//article//img//@src').extract()
        yield imgs

    def parse(self, response):
        # first level: extract the article URLs from the directory page
        url_list_item = ZhangzishiListItem()
        url_list_item['art_urls'] = response.xpath(
            '//article[@class="excerpt"]//h2//a[@target="_blank"]//@href').extract()
        # yield url_list_item
        for url in url_list_item['art_urls']:
            if url:
                print('analysing article:\t' + url)
                yield scrapy.Request(url, callback=self.parse_art, dont_filter=True)
```
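The spider above references two Item classes that are not shown in the post. A minimal sketch of how they might be declared in the project's items.py, assuming field names that match the spider code (the file layout follows the standard Scrapy project convention):

```python
import scrapy


class ZhangzishiListItem(scrapy.Item):
    # URLs of the article pages found on the directory page
    art_urls = scrapy.Field()


class ZhangzishiArtItem(scrapy.Item):
    # image URLs extracted from a single article page
    img_urls = scrapy.Field()
```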
Each time parse() fills the list item, it issues a new Request through yield for every article URL it found; the specific crawl target is fetched and the corresponding response is processed in the parse_art() function.
In fact, the change from the previous version is not particularly large. What matters is a deeper understanding of these points:
- First, Scrapy fetches the URLs listed in start_urls, and the first response enters parse()
- When we yield Requests for the second-level URLs, we can specify the callback function that will handle each returned response
- Items thrown through yield are handed to the item pipeline for processing (see the sketch below)
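To illustrate the third point, here is a minimal sketch of an item pipeline that could receive the ZhangzishiArtItem yielded above. The class name and the logging behavior are hypothetical, not from the original post, and the pipeline would still need to be registered under ITEM_PIPELINES in settings.py:

```python
class ZhangzishiPipeline(object):
    """Hypothetical pipeline: logs each scraped image URL, then passes the item on."""

    def process_item(self, item, spider):
        # items yielded from parse_art() arrive here after the spider callbacks run
        for url in item.get('img_urls', []):
            spider.logger.info('image url: %s', url)
        return item
```

With the pipeline registered, `scrapy crawl zhangzishi` would run the whole chain: directory page, article requests, image items, pipeline.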