Using Scrapy to implement website crawling: an example and the steps to write a web crawler (spider)


The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem

class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.jb51.net']

    rules = (
        # follow every link that looks like an article page and hand it to parse_page
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # record the page title and the URL of the article page
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
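The allow argument of the Rule above is an ordinary regular expression. As a rough illustration (the URLs below are hypothetical, not from the original article), it is meant to match article pages and skip everything else:

import re

# illustration only: the kind of URLs the '/articles/.*\.htm' pattern matches
pattern = re.compile(r'/articles/.*\.htm')
print(bool(pattern.search('http://www.cnbeta.com/articles/123456.htm')))  # True
print(bool(pattern.search('http://www.cnbeta.com/topics/9876.htm')))      # False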


Steps for implementing a spider (crawler)

Step 1: The primary goal of this example is to capture the article list from a website's list pages and store it in a database. The database holds the article title, link, and time.

First, generate a project: scrapy startproject fjsen
Next, define the items. Open items.py:

We start by modeling the project: we want to capture the title, link, and time of each article, so we define these three attributes as fields. We edit items.py, which is found in the newly created project directory. Our item looks like this:

The code is as follows:
from scrapy.item import Item, Field

class FjsenItem(Item):
    # define the fields for your item here, like:
    # name = Field()
    title = Field()
    link = Field()
    addtime = Field()
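As a quick sanity check (the values below are hypothetical, not part of the original tutorial), an item defined this way behaves like a dictionary whose keys are restricted to the declared fields:

from fjsen.items import FjsenItem

item = FjsenItem()
item['title'] = u'test title'          # declared fields can be assigned
item['link'] = u'/j/node_94962.htm'
item['addtime'] = u'2012-05-04'
print(item['title'])
# item['author'] = u'x'                # would raise KeyError: not a declared field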

Step 2: Define a spider, the crawler class (note that it goes under the spiders folder of the project). The spider determines the initial list of URLs to download, how to follow links, and how to parse the page contents to extract items. (What we want to crawl are the titles, links, and times on all ten pages of the list at http://www.fjsen.com/j/node_94962.htm.)
Create a new fjsen_spider.py with the following content:

The code is as follows:
# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from fjsen.items import FjsenItem

class FjsenSpider(BaseSpider):
    name = "fjsen"
    allowed_domains = ["fjsen.com"]
    # the ten list pages: node_94962.htm plus node_94962_2.htm .. node_94962_10.htm
    start_urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm'
                  for x in range(2, 11)] + ['http://www.fjsen.com/j/node_94962.htm']

    def parse(self, response):
        # pick up every <li> under a <ul> on the list page and extract
        # the article title, link and publication time from each one
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FjsenItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['addtime'] = site.select('span/text()').extract()
            items.append(item)
        return items
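For reference, the start_urls expression above is just a list comprehension. Assuming the ten list pages are node_94962.htm plus node_94962_2.htm through node_94962_10.htm (the article only says there are ten pages), it expands to a plain list of URLs, which you can print to verify:

# sketch: print the ten list-page URLs produced by the comprehension
urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm'
        for x in range(2, 11)] + ['http://www.fjsen.com/j/node_94962.htm']
for u in urls:
    print(u)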

name: the name of the spider. It must be unique; you cannot give the same name to different spiders.
allowed_domains: the domain names the crawler is allowed to visit; crawling is restricted to the domains in this list.
start_urls: a list of URLs from which the spider starts crawling, so the first pages to download are listed here. Subsequent URLs are generated from the data contained in these starting pages; in this example I simply list the ten list pages directly.
parse(): a method of the spider. It is called with the Response object of each start URL once that URL has been downloaded.
Here I capture the data under each <li> inside <ul> on every list page, including the title, link, and time, and append it to a list of items; a sketch of the resulting item follows below.
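To make that concrete, here is a hypothetical sketch (the values are invented for illustration) of what one of these items looks like after parse() has run. Note that extract() returns a list of strings, which is why the pipeline in the next step takes element [0] of each field:

from fjsen.items import FjsenItem

item = FjsenItem()
item['title'] = [u'Example article title']
item['link'] = [u'2012-05/04/content_123456.htm']  # hypothetical relative link
item['addtime'] = [u'2012-05-04 10:30']            # hypothetical time string
print(item['title'][0])                            # -> Example article title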


Step 3: Save the captured data to the database. This is done by modifying the pipelines.py file.
The code is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import sqlite3
from os import path

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FjsenPipeline(object):
    filename = 'data.sqlite'  # sqlite database file, created in the working directory

    def __init__(self):
        self.conn = None
        # open the database when the engine starts, commit and close when it stops
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, item, spider):
        # extract() returned lists, so take the first element of each field
        self.conn.execute('insert into fjsen values (?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.jb51.net/' + item['link'][0],
                           item['addtime'][0]))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("create table fjsen (id integer primary key autoincrement, "
                     "title text, link text, addtime text)")
        conn.commit()
        return conn

I won't explain the pipeline in detail for now; let's get the spider running first.

Step 4: Modify the settings.py file by adding the following line.
The code is as follows:
ITEM_PIPELINES = ['fjsen.pipelines.FjsenPipeline']

Run the following command:
The code is as follows:
scrapy crawl fjsen

A database file named data.sqlite will be generated in the current directory, and all the captured data will be stored in it.
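To check the results, you can open the file with Python's sqlite3 module (a quick sketch, querying the fjsen table created by the pipeline above):

import sqlite3

conn = sqlite3.connect('data.sqlite')
for row in conn.execute('select id, title, link, addtime from fjsen limit 5'):
    print(row)
conn.close()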
