Detailed description of the Python crawler framework Scrapy by example

Generating a project

Scrapy provides a tool to generate a project. Some files are preset in the generated project, and you need to add your own code to these files.

Open the command line and run scrapy startproject tutorial. The generated project has a structure similar to the following:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

scrapy.cfg is the configuration file of the project.

The spiders written by the user should be placed under the spiders directory. A spider looks similar to the following:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save the body of the downloaded page to a file named after
        # the second-to-last segment of the URL path.
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

The name attribute is very important; different spiders cannot use the same name.

start_urls is the list of starting points from which the spider crawls webpages; it can contain multiple URLs.

The parse method is called by default after the spider downloads a webpage, so do not use this name to define your own unrelated methods.

After the spider fetches the content of a URL, it calls the parse method and passes it a response parameter. The response contains the content of the downloaded webpage, and in the parse method you can parse data out of it. The above code simply saves the webpage content to a file.
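
For example, with the first start URL the file is named after the second-to-last segment of the URL path, which you can check directly in Python:

url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print url.split("/")[-2]   # prints: Books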


Start crawling

Open the command line, enter the generated project root directory tutorial/, and execute scrapy crawl dmoz, where dmoz is the name of the spider.


Parse webpage content

Scrapy provides a convenient way to parse data from a webpage, using HtmlXPathSelector:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

HtmlXPathSelector uses XPath expressions to parse data:

//ul/li selects the li tags under all ul tags.

a/@href selects the href attribute of all a tags.

a/text() selects the text of a tags.

a[@href="abc"] selects all a tags whose href attribute is abc.

We can save the parsed data in objects that scrapy understands; scrapy will then save these objects for us, instead of us writing the data to files ourselves. We need to add some classes to items.py to describe the data we want to save.

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Then, in the parse method of the spider, we save the parsed data in DmozItem objects:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

When executing scrapy on the command line, we can add two parameters so that scrapy outputs the items returned by the parse method to a JSON file:

scrapy crawl dmoz -o items.json -t json

items.json will be placed in the project root directory.
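
As a side note (a hedged sketch, not part of the original example), Scrapy versions that include the feed export extension also let the same output be configured once in settings.py instead of on the command line:

# Assumed feed export settings; roughly equivalent to
# passing -o items.json -t json on the command line.
FEED_URI = 'items.json'
FEED_FORMAT = 'json'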


Letting scrapy automatically crawl all links on a webpage

In the above example, scrapy only fetches the content of the two URLs in start_urls, but what we usually want is for scrapy to automatically discover all links on a webpage and then fetch the content of those links. To achieve this, we can extract the required links in the parse method, construct Request objects from them, and return them; scrapy will then automatically fetch these links. The code looks similar to this:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # collect `item_urls`
        for item_url in item_urls:
            yield Request(url=item_url, callback=self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        yield Request(url=item_details_url, meta={'item': item},
                      callback=self.parse_details)

    def parse_details(self, response):
        item = response.meta['item']
        # populate more `item` fields
        return item

parse is the default callback. It returns a list of Request objects, and scrapy automatically crawls webpages based on this list. When one of those webpages is downloaded, parse_item is called; parse_item in turn returns more Requests, which scrapy crawls before calling parse_details.

To make this work easier, scrapy provides another spider base class, CrawlSpider, with which we can easily implement automatic crawling of links.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+'])),
             Rule(SgmlLinkExtractor(allow=['/abc/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()  # TorrentItem is an Item class assumed to be defined in items.py
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//p[@id='description']").extract()
        torrent['size'] = x.select("//p[@id='info-left']/p[2]/text()[2]").extract()
        return torrent

Compared with BaseSpider, the new class has one more attribute, rules, a list that can contain multiple Rule objects. Each Rule describes which links need to be crawled and which do not. The Rule class is documented at http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule.

Each Rule may or may not have a callback. When there is no callback, scrapy simply follows all the links matched by the rule.
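
A minimal sketch of the two kinds of rules (the spider name, domains, and URL patterns below are illustrative, not from the original example):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = [
        # No callback: scrapy simply follows the links matched by this rule.
        Rule(SgmlLinkExtractor(allow=['/category/\d+']), follow=True),
        # With a callback: the downloaded page is also handed to parse_page.
        Rule(SgmlLinkExtractor(allow=['/item/\d+']), callback='parse_page'),
    ]

    def parse_page(self, response):
        # parse the item page here
        pass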

Use of pipelines.py

In pipelines.py, we can add classes that filter out the items we don't want and save the remaining items to a database.

from scrapy.exceptions import DropItem

class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            # the for/else branch runs when no forbidden word was found
            return item

If an item does not meet the requirements, an exception is raised and the item is not output to the JSON file.
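
The example above only filters; as a hedged sketch of the other use mentioned earlier (saving items to a database), a second pipeline could store items in SQLite. The database, table, and field names here are assumptions for illustration:

import sqlite3

class SQLiteStorePipeline(object):
    """Illustrative pipeline that writes each item into a local SQLite file."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('items.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS items (title TEXT, link TEXT, desc TEXT)')

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO items VALUES (?, ?, ?)',
            (unicode(item['title']), unicode(item['link']), unicode(item['desc'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

Such a pipeline would also need to be listed in ITEM_PIPELINES, as described below.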

To use pipelines, we also need to modify settings.py.

Add a line:

ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']

Now execute scrapy crawl dmoz -o items.json -t json again; items that do not meet the requirements are filtered out.
