How to install Scrapy and create a new project

This introductory Scrapy tutorial walks you through installing Scrapy and creating a new project.

1. First, install the following software:
Python 2.7
lxml
OpenSSL
pip or easy_install

2. Install the prerequisite packages:
sudo apt-get install libevent-dev
sudo apt-get install python-dev
sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev
sudo apt-get install python-setuptools


3. Install Scrapy:
sudo apt-get install scrapy
(If your distribution's repositories do not provide a scrapy package, you can install it with pip instead: sudo pip install Scrapy.)
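
To confirm that the installation worked, you can import Scrapy from a Python 2.7 shell. A minimal sanity check (the exact version printed depends on the package your system installed):

# Run inside a Python 2.7 interpreter: if the import succeeds, Scrapy is installed.
import scrapy
print scrapy.__version__   # prints the installed Scrapy version string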

Create a project

As an example, we will crawl Mininova and extract the name, link, and size of ebook torrents.
1. Enter the directory where you want to store your code and run the following command:
scrapy startproject mininova
This command creates a mininova directory with the following contents:

mininova/
    scrapy.cfg
    mininova/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

scrapy.cfg: the project's configuration file

mininova/: the project's Python module

mininova/items.py: the project's item definitions

mininova/pipelines.py: the project's pipelines file

mininova/settings.py: the project's settings file

mininova/spiders/: the directory where spider code is placed

2. Define the Item
Edit the items.py file in the mininova directory:

import scrapy

class MininovaItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    size = scrapy.Field()
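
An Item behaves much like a Python dictionary: fields declared with Field() can be set and read by key, while assigning an undeclared key raises a KeyError. A quick illustration (the value below is made up):

# Items are used like dicts, but only declared fields are accepted.
item = MininovaItem()
item['title'] = ['Example ebook']   # 'title' was declared above, so this works
print item['title']
# item['author'] = 'someone' would raise KeyError, because 'author' is not a declared field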

3. Write the first crawler (spider)
To create a spider, you must subclass scrapy.Spider and define three attributes:

name: identifies the spider. The name must be unique; different spiders cannot share the same name.

start_urls: a list of URLs the spider starts crawling from. The first pages fetched come from this list; subsequent URLs are extracted from the data in those initial responses.

parse(): a method of the spider. It is called with the response object produced when each initial URL finishes downloading, passed as its only argument. The method is responsible for parsing the response data, extracting items from it, and generating request objects for any further URLs to process.

Here is our first spider, saved as mininova_spider.py in the mininova/spiders directory:

import scrapy
from mininova.items import MininovaItem

class MininovaSpider(scrapy.Spider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/sub/50/name/1']

    def parse(self, response):
        # Each row of the listing table describes one torrent.
        sites = response.xpath('//table[@class="maintable"]//tr')
        for site in sites:
            item = MininovaItem()
            item['title'] = site.xpath('td/a[not(@class="ti com")]/text()').extract()
            for url in site.xpath('td/a[@class="dl"]/@href').extract():
                item['link'] = 'http://www.mininova.org' + url
            for size in site.xpath('td[3]/text()').extract():
                size = size.encode('utf-8')
                # Strip the non-breaking space (UTF-8 bytes \xc2\xa0) from the size text.
                item['size'] = size.replace('\xc2\xa0', '')
            yield item
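
When writing XPath expressions like the ones above, it can help to try them interactively in Scrapy's shell before running the full crawl. A rough sketch, assuming the listing page used in start_urls is still reachable:

# From the project root, start the shell against the listing page:
#   scrapy shell http://www.mininova.org/sub/50/name/1
# The shell exposes a `response` object, so the spider's expressions
# can be tested one at a time:
rows = response.xpath('//table[@class="maintable"]//tr')
print len(rows)                                  # number of table rows matched
print rows.xpath('td/a/@href').extract()[:5]     # a few of the extracted hrefs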

4. Crawling
Go to the project's root directory and run the following command to start the spider:
scrapy crawl mininova
5. Save the crawled data
scrapy crawl mininova -o items.json
This command exports the scraped items to items.json in JSON format.
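
Once the crawl finishes, the exported file can be inspected from Python. A minimal sketch, assuming the command above wrote an items.json file to the project root:

import json

# The feed export writes a JSON array of the scraped items.
with open('items.json') as f:
    items = json.load(f)

print len(items)            # number of items scraped
print items[0]['title']     # the 'title' field of the first item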

