Web Crawler: Using the Scrapy Framework to Build a Crawler Service That Scrapes Book Information

Last week I learned the basics of BeautifulSoup and used it to write a web crawler (see the earlier post in this series, "Using Beautiful Soup to Write a Crawler"). BeautifulSoup is a very popular Python web-scraping library: it exposes the HTML structure as Python objects, is easy to understand, and handles HTML data well. Compared with Scrapy, however, BeautifulSoup has one big disadvantage: it is slow.
Scrapy is an open-source Python data-scraping framework. It is fast, powerful, and easy to use. A key reason it is so fast is that it is asynchronous: Scrapy does not wait for one request to complete before processing the next; it issues other requests in the meantime. Another benefit of asynchronous requests is that when one request fails, the other requests are not affected.
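To make the concurrency point concrete, here is a small settings.py excerpt. These are standard Scrapy settings; the values shown are simply the defaults, not tuned recommendations:

# Scrapy keeps several requests in flight at once instead of waiting on each one
CONCURRENT_REQUESTS = 16   # how many requests may be outstanding at the same time
RETRY_ENABLED = True       # a failed request is retried on its own...
RETRY_TIMES = 2            # ...without blocking or affecting the other requests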
Installation (Mac)

pip install scrapy
For other operating systems, see the complete installation guide: http://doc.scrapy.org/en/latest/intro/install.html
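A quick way to check that the install worked (assuming you run it in the same Python environment you installed into):

import scrapy
print(scrapy.__version__)  # prints the installed Scrapy version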
Several Scrapy concepts to understand
Spiders: the Spider class describes how to scrape a given website: which links to start from are listed in start_urls, and what data to extract is defined in the parse() method. When a Spider runs, it first issues a request for the first link in start_urls and then processes the returned data in the callback.
Items: the Item class provides structured data and can be thought of as a data model class.
Selectors: Scrapy's Selector class is built on the lxml library and provides HTML/XML conversion. A Selector instance created from a response object can extract node data through the instance's xpath() method. (A short standalone sketch of Items and Selectors follows this list.)
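To make Items and Selectors concrete, here is a minimal sketch you can run on its own (the HTML fragment and field name are made up for illustration; they are not part of this project):

import scrapy
from scrapy import Selector

# An Item behaves like a dict whose allowed keys are the declared Fields
class DemoItem(scrapy.Item):
    title = scrapy.Field()

item = DemoItem()
item['title'] = 'Some Book'     # fine: 'title' is a declared Field
# item['author'] = 'Anonymous'  # would raise KeyError: the field is not declared

# A Selector parses an HTML/XML fragment and answers xpath() queries on it
sel = Selector(text='<div><h1 class="single-title">Some Book</h1></div>')
print(sel.xpath('//h1/text()').extract_first())  # 'Some Book'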
Write a Web Crawler
Next, we will rewrite the book-information crawler from the earlier Beautiful Soup post (see "Using Beautiful Soup to Write a Crawler") in Scrapy.
Create a project
scrapy startproject book_project
This command creates a project named book_project.
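The generated project has roughly the following layout (the exact files vary a little between Scrapy versions; newer ones also add middlewares.py):

book_project/
    scrapy.cfg            # project configuration for the scrapy command
    book_project/
        __init__.py
        items.py          # the BookItem defined below goes here
        pipelines.py
        settings.py
        spiders/
            __init__.py   # our spider module will live in this package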
Define the Item class, i.e. the data model class. The code is as follows:
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    isbn = scrapy.Field()
    price = scrapy.Field()
Write the Spider class, setting the spider's name, the domains it is allowed to crawl, and the link it starts crawling from:
import scrapy
from book_project.items import BookItem

class BookInfoSpider(scrapy.Spider):
    name = "bookinfo"
    allowed_domains = ["allitebooks.com", "amazon.com"]
    start_urls = [
        "http://www.allitebooks.com/security/",
    ]

 

How to traverse paging data
def parse(self, response):
    num_pages = int(response.xpath('//a[contains(@title, "Last Page →")]/text()').extract_first())
    base_url = "http://www.allitebooks.com/security/page/{0}/"
    for page in range(1, num_pages + 1):  # + 1 so the last page is included
        yield scrapy.Request(base_url.format(page), dont_filter=True, callback=self.parse_page)

'//a' selects all <a> tags;
'//a[contains(@title, "Last Page →")]' narrows that to the <a> tag whose title attribute contains "Last Page →";
extract_first() returns the text of the first matching node, and int() converts that page number into the page count (see the standalone sketch below).
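To see the predicate in isolation, here is a small, self-contained sketch against a made-up HTML fragment (the fragment is invented for illustration; only the XPath mirrors the real code):

from scrapy import Selector

html = '''
<div class="pagination">
  <a href="/security/page/2/" title="Page 2">2</a>
  <a href="/security/page/8/" title="Last Page →">8</a>
</div>
'''

sel = Selector(text=html)
# contains(@title, "Last Page →") keeps only the link whose title holds that text
last_page = sel.xpath('//a[contains(@title, "Last Page →")]/text()').extract_first()
print(int(last_page))  # 8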

How to get book information from allitebooks.com
def parse_page(self, response):
    for sel in response.xpath('//div/article'):
        book_detail_url = sel.xpath('div/header/h2/a/@href').extract_first()
        yield scrapy.Request(book_detail_url, callback=self.parse_book_info)

def parse_book_info(self, response):
    title = response.css('.single-title').xpath('text()').extract_first()
    isbn = response.xpath('//dd[2]/text()').extract_first()
    item = BookItem()
    item['title'] = title
    item['isbn'] = isbn
    amazon_search_url = 'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=' + isbn
    # pass the partially filled item to the next callback via meta
    yield scrapy.Request(amazon_search_url, callback=self.parse_price, meta={'item': item})

 

How to get the book price from amazon.com
def parse_price(self, response):
    item = response.meta['item']
    item['price'] = response.xpath('//span/text()').re(r'\$[0-9]+\.[0-9]{2}?')[0]
    yield item
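A side note on the meta trick above: newer Scrapy versions (1.7 and later) recommend cb_kwargs for passing data between callbacks. A sketch of the same hand-off with cb_kwargs (same names as the code above; this is an alternative, not what the original code uses):

    # inside parse_book_info, replacing the meta-based request:
    yield scrapy.Request(amazon_search_url, callback=self.parse_price, cb_kwargs={'item': item})

def parse_price(self, response, item):
    item['price'] = response.xpath('//span/text()').re(r'\$[0-9]+\.[0-9]{2}?')[0]
    yield item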

Start the crawler
scrapy crawl bookinfo -o books.csv
The -o books.csv option tells Scrapy to export the scraped Items to a CSV file. Besides CSV, Scrapy also supports JSON and XML output; see http://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports for details.
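If you would rather not pass -o on the command line, newer Scrapy versions (2.1 and later) can configure exports in settings.py through the FEEDS setting; the file names below are placeholders:

FEEDS = {
    'books.csv': {'format': 'csv'},
    'books.json': {'format': 'json'},
}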
