Python crawler Framework Scrapy Tutorial (1)-Getting Started

Source: Internet
Author: User
Tags: xpath

A recent lab project requires crawling the metadata (title, time, body, and so on) of articles published by a number of sites. The problem is that these sites are old and small, and naturally none of them follow microdata standards. A single set of default rules shared by all the pages cannot guarantee that the information is crawled correctly, yet writing separate spider code for each site is impractical.

At this point, I badly wanted a framework that could crawl these sites automatically by writing the spider code only once and maintaining a set of crawl rules for multiple sites, and thankfully Scrapy can do exactly that. Since there is very little information available on this topic, I am sharing my experience and code from this project in this article.

To keep things clear, I have split the material into three articles:

    1. Run a Scrapy spider programmatically
    2. Use Scrapy to build dynamically configurable crawlers
    3. Use Redis and SQLAlchemy to deduplicate and store Scrapy items

This article mainly describes how to run the Scrapy crawler programmatically.

Before reading this article, you should be familiar with Scrapy and understand the concepts of Items, Spiders, Item Pipelines, and Selectors. If you are new to Scrapy and want to learn how to start crawling a website with it, it is recommended that you go through the official tutorial first.

A Scrapy crawler can be started from the command line (scrapy runspider myspider.py) or programmatically through the core API. To get more customization and flexibility, we mainly use the latter approach.

We use the Dmoz example from the official tutorial to show how to start a spider programmatically. Our spider file dmoz_spider.py looks like this:

    import scrapy


    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()


    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item

Next, we write a script named run.py to run DmozSpider:

    from dmoz_spider import DmozSpider

    # Scrapy API
    from scrapy import signals, log
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings


    def spider_closing(spider):
        """Activates on the spider_closed signal"""
        log.msg("Closing reactor", level=log.INFO)
        reactor.stop()


    log.start(loglevel=log.DEBUG)
    settings = Settings()

    # Crawl responsibly
    settings.set("USER_AGENT",
                 "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36")
    crawler = Crawler(settings)

    # Stop the reactor when the spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(DmozSpider())
    crawler.start()
    reactor.run()

Running python run.py starts our crawler, but because we don't do anything to store the crawled results, we can't see any output. You can write an item pipeline to store the data in a database and enable it through the settings.set interface by configuring ITEM_PIPELINES (a minimal sketch follows below); we'll cover this in the third article. The next post will show how to crawl data from individual sites by maintaining crawl rules for multiple sites.
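As a rough sketch (not part of the original project), an item pipeline that appends each item to a JSON Lines file might look like the following; the class name JsonWriterPipeline, the module path pipelines.JsonWriterPipeline, and the output file name are illustrative assumptions.

    # pipelines.py -- a hypothetical pipeline that appends each item to a JSON Lines file
    import json


    class JsonWriterPipeline(object):
        def open_spider(self, spider):
            self.file = open('items.jl', 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # serialize the item as one JSON object per line, then pass it on
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

It would then be enabled in run.py before the Crawler is created, e.g. settings.set("ITEM_PIPELINES", {"pipelines.JsonWriterPipeline": 100}), assuming pipelines.py sits next to run.py.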

You can find the full project for this article on GitHub.

Note: The Scrapy version used in this article is 0.24; the master branch on GitHub supports Scrapy 1.0.
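If you are on Scrapy 1.0 or later, the programmatic start becomes simpler because CrawlerProcess manages the Twisted reactor for you. The following is a minimal sketch assuming the same dmoz_spider.py as above; it is an alternative for newer versions, not the script used in this article's project.

    # run_1x.py -- a minimal Scrapy 1.0+ alternative to run.py (sketch, not from the original project)
    from scrapy.crawler import CrawlerProcess
    from dmoz_spider import DmozSpider

    process = CrawlerProcess({
        "USER_AGENT": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36",
    })
    process.crawl(DmozSpider)  # pass the spider class; CrawlerProcess instantiates it
    process.start()            # blocks until the crawl finishes, then stops the reactor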

Three articles in this series

    1. Python crawler Framework Scrapy Tutorial (1) - Getting Started
    2. Python crawler Framework Scrapy Tutorial (2) - Dynamically configurable
    3. Python crawler Framework Scrapy Tutorial (3) - Use Redis and SQLAlchemy to deduplicate and store Scrapy items
Resources
    • Running Scrapy Spider Programmatically
