Python web crawler framework Scrapy: instructions for use

Source: Internet
Author: User
Tags: xpath, python, web crawler

1 Creating a Project
scrapy startproject tutorial
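This command generates a project skeleton roughly like the following (layout per the standard Scrapy project template):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings
        spiders/          # spider code goes here (step 3)
            __init__.py
```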

2 Defining the item
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
After the parser saves extracted data into the item, the item is passed on to the pipeline.
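To make the dict-like behavior of an item concrete, here is a minimal stdlib sketch (not Scrapy's actual implementation) of how an Item restricts assignment to its declared fields:

```python
class Field(dict):
    """Placeholder for per-field metadata, mirroring scrapy.Field."""

class Item(dict):
    """Dict-like container that only accepts declared fields."""
    def __init__(self, **kwargs):
        super().__init__()
        # Collect the Field attributes declared on the subclass
        self.fields = {k: v for k, v in type(self).__dict__.items()
                       if isinstance(v, Field)}
        for k, v in kwargs.items():
            self[k] = v

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

item = DmozItem(title="Python Books")
item['link'] = "http://example.com"
print(dict(item))  # {'title': 'Python Books', 'link': 'http://example.com'}
```

Assigning to a key that was not declared as a Field raises a KeyError, which is also how real Scrapy items catch typos early.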

3 Write the first crawler (spider). Save it as dmoz_spider.py in the tutorial/spiders directory; the crawler is started by the name defined in its name attribute.
import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

start_urls sets the list of URLs to crawl.
The parse member function is called after each page has been crawled; it extracts information from the page and saves it into the item dictionary defined earlier. Note that DmozItem is the class defined in step 2.
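To illustrate what the three XPath expressions select, here is a hedged sketch using only the stdlib xml.etree module on a simplified fragment of one list entry (the markup and values are made up for the example; real Scrapy selectors are more capable):

```python
import xml.etree.ElementTree as ET

# A simplified fragment like one <li> entry on the listing page
li = ET.fromstring(
    '<li><a href="http://example.com/book">Example Book</a> - short description</li>'
)

# 'a/text()'  -> the text inside the <a> element
title = li.find('a').text
# 'a/@href'   -> the link target attribute
link = li.find('a').get('href')
# 'text()'    -> text directly inside the <li> (the tail of <a> here)
desc = li.find('a').tail.strip(' -')

print(title, link, desc)  # Example Book http://example.com/book short description
```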

4 Pipeline
When an item has been collected in the spider, it is passed to the Item Pipeline, where a sequence of components processes it in a defined order. The pipeline order is configured in settings.py.
Each pipeline processes the data and decides whether to pass the item on to the next pipeline.
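Concretely, the order is set in settings.py via the ITEM_PIPELINES dict; lower numbers run first, and the conventional range is 0–1000. The class path below assumes the project is named tutorial and the pipeline from the next listing lives in pipelines.py:

```python
# settings.py — pipelines run in ascending order of their value (0-1000)
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWriterPipeline': 800,
}
```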

import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
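The pipeline writes one JSON object per line (the JSON Lines format), so the output file can be read back line by line with the stdlib json module. A quick sketch using hypothetical sample items:

```python
import json
import os
import tempfile

# Two made-up items, in the list-valued shape .extract() produces
items = [{'title': ['A'], 'link': ['http://a']}, {'title': ['B']}]

# Write them in JSON Lines format, one JSON object per line,
# mirroring what JsonWriterPipeline produces
path = os.path.join(tempfile.mkdtemp(), 'items.jl')
with open(path, 'w') as f:
    for it in items:
        f.write(json.dumps(it) + "\n")

# Read the file back: each line is an independent JSON document
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded == items)  # True
```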

5 Starting Crawler
scrapy crawl dmoz
