In this tutorial you will: create a Scrapy project, define an Item, write a spider to crawl a site and extract Items, and write an Item Pipeline to store the extracted Items (i.e. the data).
Scrapy is written in Python. If this is your first contact with the language and you are wondering about both Python and the details of Scrapy, we recommend Learn Python the Hard Way for programmers who already know other languages and want to pick up Python quickly; for beginners who want to start by learning Python itself, a list of Python learning materials for non-programmers will be a better choice.

Create a project
Before you start crawling, you must create a new Scrapy project. Enter the directory where you intend to store the code, and run the following command:
scrapy startproject tutorial
The command will create a tutorial directory that contains the following:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These files are:

scrapy.cfg: the project's configuration file.
tutorial/: the project's Python module. You will add your code here later.
tutorial/items.py: the project's Item file.
tutorial/pipelines.py: the project's pipelines file.
tutorial/settings.py: the project's settings file.
tutorial/spiders/: the directory where spider code is placed.

Define Item
An Item is the container that holds the crawled data. It is used like a Python dictionary, and it additionally protects against errors caused by setting undefined fields (for example, through spelling mistakes).
Much as you would in an ORM, you define an item by creating a subclass of scrapy.Item and defining its attributes as scrapy.Field objects. (If you don't know ORMs, don't worry; you will find this step very simple.)
First, we model the item after the data we want to capture from dmoz.org: the name, URL, and description of each site. For this, we define the corresponding fields in the item. Edit the items.py file in the tutorial directory:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
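The dictionary-like behavior and field protection described above can be illustrated with a minimal stand-in class (this sketch only mimics the behavior for illustration; it is not Scrapy's actual implementation, and FakeItem is an invented name):

```python
# Minimal stand-in mimicking scrapy.Item's field protection (not Scrapy's real code).
class FakeItem(dict):
    fields = {"title", "link", "desc"}  # declared fields, like the scrapy.Field attributes

    def __setitem__(self, key, value):
        # Reject any field that was not declared, as scrapy.Item does.
        if key not in self.fields:
            raise KeyError(f"FakeItem does not support field: {key}")
        super().__setitem__(key, value)

item = FakeItem()
item["title"] = "Python Books"   # a declared field works like a normal dict entry
print(item["title"])             # -> Python Books

try:
    item["titel"] = "oops"       # a misspelled field is rejected
except KeyError as e:
    print(e)                     # prints the KeyError message
```

This is the "additional protection" a plain dictionary would not give you: a typo in a field name fails loudly instead of silently creating a new key.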
This may seem a bit complicated at first, but defining the item lets you easily use other Scrapy facilities, which need to know how your item is defined.

Write the first spider (Spider)
A spider is a class that you write to crawl data from a single website (or a group of websites).
It contains an initial list of URLs to download, rules for following links in the pages, and methods for parsing page contents to extract items.
To create a spider, you must subclass scrapy.Spider and define the following three attributes:

name: identifies the spider. The name must be unique; you may not set the same name for different spiders.
start_urls: a list of URLs the spider crawls at startup. The first pages fetched will come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs.
parse(): a method of the spider. When called, the Response object generated for each downloaded initial URL is passed to it as its only argument. The method is responsible for parsing the returned data (response), extracting data (generating items), and generating Request objects for URLs that need further processing.
Here is the code of our first spider. Save it in a file named dmoz_spider.py in the tutorial/spiders directory:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)
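The filename expression in parse() takes the second-to-last path segment of the URL; because the URL ends with a slash, the last element of the split is an empty string. This logic can be checked standalone, using one of the start URLs above:

```python
# How parse() derives a filename from the response URL.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

parts = url.split("/")
# The trailing slash makes the last element an empty string,
# so index [-2] is the final path segment.
print(parts[-2])  # -> Books
```

So running this spider writes two files, one per start URL, each named after the last path segment of the page's address.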
Crawling
Go to the root of the project and start the spider by executing the following command:
scrapy crawl dmoz
crawl dmoz starts the spider that crawls dmoz.org. You will get output similar to:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23