1. New Project
scrapy startproject book_project
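For reference, startproject creates a skeleton roughly like this (exact files vary by Scrapy version); the spider file under spiders/ is added by hand in step 3:

book_project/
    scrapy.cfg
    book_project/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py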
2. Writing the Items class
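The items.py listing is not shown in the original post; a minimal sketch consistent with the three fields the spider fills in (title, isbn, price) would be:

# book_project/items.py
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()  # book title from the allitebooks detail page
    isbn = scrapy.Field()   # ISBN taken from the same detail page
    price = scrapy.Field()  # price scraped from the Amazon search results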
3. Writing the Spider class
# -*- coding: utf-8 -*-
import scrapy
from book_project.items import BookItem


class BookInfoSpider(scrapy.Spider):
    name = "bookinfo"                                    # name of the crawler
    allowed_domains = ["allitebooks.com", "amazon.com"]  # crawl scope
    start_urls = [
        "http://www.allitebooks.com/security/",
    ]

    def parse(self, response):
        # Read the total page count from the pagination widget,
        # e.g. "Page 1 of 20 pages" -> 20
        num_pages = int(response.xpath(
            '//span[contains(@class, "pages")]/text()'
        ).extract_first().split()[-2])
        base_url = "http://www.allitebooks.com/security/page/{0}/"
        for page in range(1, num_pages + 1):  # +1 so the last page is included
            yield scrapy.Request(base_url.format(page),
                                 dont_filter=True,
                                 callback=self.parse_page)

    def parse_page(self, response):
        # Each book on a listing page sits in its own <article> element
        for sel in response.xpath('//div/article'):
            book_detail_url = sel.xpath('div/header/h2/a/@href').extract_first()
            yield scrapy.Request(book_detail_url, callback=self.parse_book_info)

    def parse_book_info(self, response):
        title = response.css('.single-title').xpath('text()').extract_first()
        isbn = response.xpath('//dd[2]/text()').extract_first()
        item = BookItem()
        item['title'] = title
        item['isbn'] = isbn
        # yield item  # yield here instead if the Amazon price is not needed
        amazon_search_url = ('https://www.amazon.com/s/ref=nb_sb_noss'
                             '?url=search-alias%3Daps&field-keywords=' + isbn)
        yield scrapy.Request(amazon_search_url,
                             headers={'User-Agent': 'Mozilla/5.0'},
                             callback=self.parse_price,
                             meta={'item': item})

    def parse_price(self, response):
        item = response.meta['item']
        # Grab the first price-like string on the page, e.g. "$39.99"
        item['price'] = response.xpath('//span/text()').re(r'\$[0-9]+\.[0-9]{2}')[0]
        yield item
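Two details worth noting. First, the partially filled item is handed from parse_book_info to parse_price through Request.meta, which is the usual Scrapy pattern for carrying state across chained requests. Second, the XPath expressions can be verified interactively before a full run, for example:

scrapy shell "http://www.allitebooks.com/security/"
>>> response.xpath('//span[contains(@class, "pages")]/text()').extract_first()
>>> response.xpath('//div/article/div/header/h2/a/@href').extract_first()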
4. Start the crawl
scrapy crawl bookinfo -o books.csv
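The -o flag serializes every yielded item; Scrapy infers the exporter from the file extension, so books.json or books.jl would work just as well. If the CSV column order matters, or the sites need gentler pacing, a couple of optional settings help (the values below are assumptions, not from the original post):

# book_project/settings.py (optional additions)
FEED_EXPORT_FIELDS = ['title', 'isbn', 'price']  # fixes the CSV column order
DOWNLOAD_DELAY = 0.5                             # throttle requests to both sites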
5. Running Results
The crawl takes quite a while to finish, about 25 minutes.
The scraped data in books.csv: