This introductory Scrapy tutorial walks you through installing Scrapy and creating a new project.
1. First, make sure the following software is installed:
Python 2.7
lxml
OpenSSL
pip or easy_install
2. Install prerequisite software
sudo apt-get install libevent-dev
sudo apt-get install python-dev
sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev
sudo apt-get install python-setuptools
3. Install Scrapy
sudo apt-get install scrapy
Create a project
As an example, we will crawl Mininova and extract the name, link, and size of ebook torrents.
1. Change to the directory where you want to store the code, and run the following command:
scrapy startproject mininova
This command creates a mininova directory with the following contents:
mininova/
    scrapy.cfg
    mininova/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These files are:
scrapy.cfg: the project's configuration file
mininova/: the project's Python module
mininova/items.py: the project's item definitions
mininova/pipelines.py: the project's pipelines file
mininova/settings.py: the project's settings file
mininova/spiders/: the directory where spider code is placed
2. Define Item
Edit the items.py file in the mininova directory:
import scrapy

class MininovaItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    size = scrapy.Field()
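To see why fields are declared up front: a scrapy.Item behaves like a dict that only accepts the declared keys. The class below is a rough plain-Python stand-in to illustrate that idea, not the real scrapy API:

```python
# Toy stand-in (NOT the real scrapy API): an Item acts like a dict
# restricted to its declared fields.
class ToyItem(dict):
    fields = ('title', 'link', 'size')  # mirrors the fields declared above

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%r is not a declared field' % key)
        dict.__setitem__(self, key, value)

item = ToyItem()
item['title'] = ['Example Book']   # allowed: 'title' is declared
print(item['title'])               # ['Example Book']
```

Assigning to an undeclared key (say, item['author']) would raise a KeyError, which catches typos early in real Scrapy projects as well.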
3. Write the first crawler (spider)
To create a spider, you must subclass scrapy.Spider and define three attributes:
name: identifies the spider. The name must be unique; different spiders cannot share the same name.
start_urls: a list of URLs the spider starts crawling from. The first pages fetched will come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs.
parse(): a method of the spider. It is called with the Response object generated when each initial URL finishes downloading, passed as its only argument. The method is responsible for parsing the response data, extracting it into items, and generating Request objects for URLs that need further processing.
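The parse() contract described above can be sketched with a toy generator (hypothetical names, not Scrapy code): a parse-style method does not return a list, it yields one result at a time, and the framework consumes them lazily:

```python
# Toy illustration (not Scrapy code): a parse-style method is a generator
# that yields one dict per extracted row.
def parse_like(rows):
    for row in rows:
        yield {'title': row}  # in Scrapy this would be an Item or a Request

results = list(parse_like(['Book A', 'Book B']))
print(results)  # [{'title': 'Book A'}, {'title': 'Book B'}]
```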
Here is our first spider's code, saved as mininova_spider.py in the mininova/spiders directory:
import scrapy
from mininova.items import MininovaItem

class MininovaSpider(scrapy.Spider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/sub/50/name/1']

    def parse(self, response):
        sites = response.xpath('//table[@class="maintable"]//tr')
        for site in sites:
            item = MininovaItem()
            item['title'] = site.xpath('td/a[not(@class="ti com")]/text()').extract()
            for url in site.xpath('td/a[@class="dl"]/@href').extract():
                item['link'] = 'http://www.mininova.org' + url
            for size in site.xpath('td[3]/text()').extract():
                size = size.encode('utf-8')
                item['size'] = size.replace('\xc2\xa0', '')
            yield item
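The size-cleaning step deserves a note: Mininova renders sizes like "12.3 MB" with a non-breaking space (U+00A0), which becomes the byte pair \xc2\xa0 after UTF-8 encoding, so the spider strips it out. A standalone sketch of that transformation (the sample value is made up, and the byte handling is shown in Python 3 form):

```python
# Illustrative only: the cleanup the spider performs on the td[3] size text.
raw_size = u'12.3\xa0MB'                  # hypothetical value from td[3]/text()
encoded = raw_size.encode('utf-8')        # the U+00A0 becomes b'\xc2\xa0'
clean = encoded.replace(b'\xc2\xa0', b'').decode('utf-8')
print(clean)  # 12.3MB
```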
4. Crawling
Enter the project's root directory and execute the following command to start the spider:
scrapy crawl mininova
5. Save the crawled data
scrapy crawl mininova -o items.json
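The -o items.json option serializes the crawled items as a JSON array, which you can load back with the standard json module. The sample content below is hypothetical, just to show the shape of the output:

```python
import json

# Hypothetical snippet of what items.json could look like after a crawl.
sample = '''[
  {"title": ["Example Book"], "link": "http://www.mininova.org/get/123", "size": "12.3MB"}
]'''

items = json.loads(sample)
print(items[0]['link'])  # http://www.mininova.org/get/123
```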
From: Pinterest/Stray Eagle