Python Crawler Framework Scrapy Tutorial (1) - Getting Started


One requirement from a recent lab project was to crawl the article metadata (title, publication time, body, and so on) published on a number of websites. The problem is that these sites are both old and small, so of course they do not follow the microdata standards. Under these circumstances, a single set of default rules shared by all web pages cannot guarantee that the information is crawled correctly, while writing separate spider code for every site is impractical.

At this point I badly wanted a framework that could crawl the information on all of these sites automatically, with only one copy of spider code to write and only the crawl rules for each site to maintain. Thankfully, Scrapy can do exactly that. Given how little material there is on this topic, at home or abroad, I am sharing my experience and code from this project in these articles.

To keep things clear, I have split the write-up into three articles:

    1. Run a Scrapy spider programmatically
    2. Use Scrapy to build dynamically configurable crawlers
    3. Use Redis and SQLAlchemy to deduplicate and store Scrapy items

This article mainly describes how to run the Scrapy crawler programmatically.

Before reading this article, you need to be familiar with Scrapy and know the concepts of Item, Spider, Pipeline, and Selector. If you are new to Scrapy and want to learn how to start crawling a website with it, I recommend going through the official tutorial first.

A Scrapy crawler can be started from the command line (scrapy runspider myspider.py) or programmatically through the core API. To achieve greater customization and flexibility, we mainly use the latter approach.

We use the dmoz example from the official tutorial to show how to start a spider programmatically. Our spider file dmoz_spider.py looks like this:

    import scrapy

    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item

Next we need to write a script run.py to run the DmozSpider:

    from dmoz_spider import DmozSpider

    # Scrapy API
    from scrapy import signals, log
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings


    def spider_closing(spider):
        """Activates on spider closed signal"""
        log.msg("Closing reactor", level=log.INFO)
        reactor.stop()


    log.start(loglevel=log.DEBUG)
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT",
                 "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36")
    crawler = Crawler(settings)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)

    crawler.configure()
    crawler.crawl(DmozSpider())
    crawler.start()
    reactor.run()

Running python run.py starts our crawler, but because we do not store the crawled results anywhere, we cannot see them. You can write an item pipeline that stores the data in a database and enable it through the ITEM_PIPELINES setting via the settings.set interface; we will cover this in the third article, and a rough sketch is shown below. The next post will show how to crawl data from individual sites by maintaining crawl rules for multiple sites.
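
As an illustration only (this is not the storage approach used later in the series), a minimal item pipeline that writes each item to a JSON Lines file could look like the following; the module name pipelines.py, the class name JsonWriterPipeline, and the output file items.jl are assumptions made for this sketch:

    # pipelines.py - illustrative sketch, names are assumptions
    import json

    class JsonWriterPipeline(object):
        """Writes each scraped item as one line of JSON."""

        def open_spider(self, spider):
            self.file = open('items.jl', 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + '\n')
            return item

It would then be enabled from run.py, before the Crawler is created, with settings.set("ITEM_PIPELINES", {'pipelines.JsonWriterPipeline': 100}).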

You can see the full project for this article on GitHub.

Note: the Scrapy version used in this article is 0.24; the master branch on GitHub supports Scrapy 1.0.
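
If you are on Scrapy 1.0, the same spider can be run programmatically with CrawlerProcess, which starts and stops the Twisted reactor for you. The snippet below is a rough sketch of that approach, not the code used in this series (which targets 0.24):

    from scrapy.crawler import CrawlerProcess

    from dmoz_spider import DmozSpider

    # CrawlerProcess manages the Twisted reactor itself, so no
    # spider_closed handler or reactor.run() call is needed
    process = CrawlerProcess({
        "USER_AGENT": "Mozilla/5.0 (Windows NT 6.2; Win64; x64)",
    })
    process.crawl(DmozSpider)
    process.start()  # blocks until the crawl finishes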

Three articles in this series

    1. Python Crawler Framework Scrapy Tutorial (1) - Getting Started
    2. Python Crawler Framework Scrapy Tutorial (2) - Dynamically Configurable Crawlers
    3. Python Crawler Framework Scrapy Tutorial (3) - Using Redis and SQLAlchemy to Deduplicate and Store Scrapy Items
Resources
    • Running Scrapy Spider Programmatically
