One requirement of a recent lab project was to crawl the article metadata (title, publication time, body, and so on) published by a number of sites. The problem is that these sites are old and small, and naturally none of them follow microdata standards. Since the pages share no common set of rules, a single generic parser cannot reliably extract the information, yet writing separate spider code for every site is impractical.
At that point I badly wanted a framework that could crawl all of these sites with a single copy of spider code, while the crawl rules for each site are maintained separately as configuration. Thankfully, Scrapy can do exactly that. Since there is very little information on this topic, at home or abroad, I am sharing the experience and code from this project here.
To keep things clear, I have split the material into three articles:
- Run a Scrapy spider programmatically
- Use Scrapy to build dynamically configurable crawlers
- Use Redis and SQLAlchemy to deduplicate and store Scrapy items
This article mainly describes how to run a Scrapy crawler programmatically.
Before reading on, you should be familiar with Scrapy and know the concepts of Item, Spider, Pipeline, and Selector. If you are new to Scrapy and want to learn how to start crawling a website with it, take a look at the official tutorial first.
A Scrapy crawler can be started from the command line (`scrapy runspider myspider.py`) or programmatically via the core API. To achieve greater customization and flexibility, we mainly use the latter approach.
We use the Dmoz example from the official tutorial to show how to start a spider programmatically. Our spider file `dmoz_spider.py` looks like this:
```python
import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
```
Next we write a script `run.py` to run the DmozSpider:
```python
from dmoz_spider import DmozSpider

# Scrapy API
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings


def spider_closing(spider):
    """Activates on spider_closed signal"""
    log.msg("Closing reactor", level=log.INFO)
    reactor.stop()


log.start(loglevel=log.DEBUG)
settings = Settings()

# crawl responsibly
settings.set("USER_AGENT",
             "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36")
crawler = Crawler(settings)

# stop the reactor when the spider closes
crawler.signals.connect(spider_closing, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(DmozSpider())
crawler.start()
reactor.run()
```
Then `python run.py` starts our crawler. However, because we do not store the crawled results anywhere, we cannot see them yet. You can write an item pipeline to save the data to a database and register it under the `ITEM_PIPELINES` setting via the `settings.set` interface; we will cover this in detail in the third article (a minimal sketch follows below). The next post will show how to crawl data from individual sites by maintaining crawl rules for multiple sites.
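As a taste of what that looks like, here is a minimal sketch of a pipeline and its registration. The module and class names (`pipelines.JsonWriterPipeline`) are hypothetical, and it only appends items to a JSON lines file rather than writing to a database:

```python
# pipelines.py -- a minimal item pipeline that appends each item to a JSON lines file
import json


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called when the spider is opened; create the output file
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        # called when the spider is closed; release the file handle
        self.file.close()

    def process_item(self, item, spider):
        # called for every item yielded by the spider
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

In `run.py` it would then be registered before the `Crawler` is created, for example:

```python
# enable the pipeline; the number is its order among pipelines (lower runs first)
settings.set("ITEM_PIPELINES", {"pipelines.JsonWriterPipeline": 300})
```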
You can see the full project of this article on GitHub.
Note: the Scrapy version used in this article is 0.24; the master branch on GitHub supports Scrapy 1.0.
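If you are on Scrapy 1.0 or later, the same programmatic startup is usually done with the `CrawlerProcess` API, which manages the Twisted reactor for you. A minimal sketch, assuming the same `dmoz_spider.py` module as above:

```python
from scrapy.crawler import CrawlerProcess

from dmoz_spider import DmozSpider

# CrawlerProcess starts and stops the Twisted reactor itself,
# so no manual signal handling or reactor.run() is needed
process = CrawlerProcess({
    "USER_AGENT": ("Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36"),
})
process.crawl(DmozSpider)  # pass the spider class, not an instance
process.start()            # blocks until the crawl is finished
```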
Three articles in this series
- Python Crawler Framework Scrapy Tutorial (1): Getting Started
- Python Crawler Framework Scrapy Tutorial (2): Dynamically Configurable Crawlers
- Python Crawler Framework Scrapy Tutorial (3): Use Redis and SQLAlchemy to Deduplicate and Store Scrapy Items
Resources
- Running Scrapy Spider Programmatically