The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from cnbeta.items import CnbetaItem

class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.jb51.net']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
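The spider above imports CnbetaItem from cnbeta.items; that module is not shown here, but a minimal sketch of it, assuming only the two fields the spider actually fills in (title and url), could look like this:

# cnbeta/items.py -- hypothetical sketch, not shown in the original
from scrapy.item import Item, Field

class CnbetaItem(Item):
    title = Field()  # the page title, taken from //title/text()
    url = Field()    # the response URL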
Steps to implement the spider
1. The goal of this example: grab the list of articles from a site's list pages and save them in a database, including each article's title, link, and time.
First, create a project: scrapy startproject fjsen
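For reference, this command generates a project skeleton roughly like the following (the layout Scrapy creates by default; file names may vary slightly between versions):

fjsen/
    scrapy.cfg
    fjsen/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py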
Next, define the items; open items.py:
We start by modeling what we want to crawl from the site: the title, the link, and the time of each article, so we define a field for each of these three attributes. To do this, we edit items.py, found in the project directory. Our item looks like this:
The code is as follows:
from scrapy.item import Item, Field

class FjsenItem(Item):
    # define the fields for your item here:
    title = Field()
    link = Field()
    addtime = Field()
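A Scrapy Item behaves like a dictionary that only accepts the declared fields. A quick sketch (with made-up values) of how FjsenItem is used:

from fjsen.items import FjsenItem

item = FjsenItem()
item['title'] = [u'Sample headline']      # declared fields can be assigned
item['link'] = [u'/j/sample-article.htm']
item['addtime'] = [u'2012-04-19']
print(item['title'])                      # [u'Sample headline']
# item['author'] = u'x'                   # would raise KeyError: not a declared field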
Step two: define a spider, the crawler itself (note: it lives in the project's spiders folder). Spiders determine the initial list of URLs to download, how to follow links, and how to parse page content to extract items. The site we are going to crawl is http://www.fjsen.com/j/node_94962.htm, a list spread over 10 pages, and we want the links and times from all of them.
Create a new fjsen_spider.py that reads as follows:
The code is as follows:
# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from fjsen.items import FjsenItem

class FjsenSpider(BaseSpider):
    name = "fjsen"
    allowed_domains = ["fjsen.com"]
    # the 10 list pages: node_94962.htm plus node_94962_2.htm ... node_94962_10.htm
    start_urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm' for x in range(2, 11)] \
                 + ['http://www.fjsen.com/j/node_94962.htm']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FjsenItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['addtime'] = site.select('span/text()').extract()
            items.append(item)
        return items
name: determines the spider's name. It must be unique; you cannot give different spiders the same name.
allowed_domains: the domains the crawler is allowed to visit; crawling is restricted to the domains in this list.
start_urls: the list of URLs from which the spider starts crawling, so the first pages downloaded are the ones listed here; subsequent URLs are generated from data contained in those starting pages. Here I simply list all 10 list pages directly.
parse(): a spider method that is executed on the response object returned for each start URL.
In it, I extract the data under each <li> inside the <ul> of every list page: the title, the link, and the time (a sketch of the assumed markup follows below).
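To make those XPath expressions concrete, here is a small standalone sketch. The HTML fragment is a hypothetical approximation of what each <li> on the fjsen list pages is assumed to look like (an <a> carrying the title and link, and a <span> carrying the time); the values are made up for illustration:

# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# hypothetical markup assumed by the spider's XPath expressions
sample_body = """
<ul>
  <li><a href="/j/2012-04/19/content_123.htm">Sample article title</a><span>2012-04-19</span></li>
</ul>
"""
response = HtmlResponse(url='http://www.fjsen.com/j/node_94962.htm',
                        body=sample_body, encoding='utf-8')
hxs = HtmlXPathSelector(response)
for li in hxs.select('//ul/li'):
    print(li.select('a/text()').extract())    # ['Sample article title']
    print(li.select('a/@href').extract())     # ['/j/2012-04/19/content_123.htm']
    print(li.select('span/text()').extract()) # ['2012-04-19']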
Step three: store the crawled data in a database. For this we modify pipelines.py:
The code is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import sqlite3
from os import path
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FjsenPipeline(object):
    filename = 'data.sqlite'  # the SQLite file the crawled data is written to

    def __init__(self):
        self.conn = None
        # open the database when the engine starts and close it when it stops
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values(?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.jb51.net/' + item['link'][0],
                           item['addtime'][0]))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("""create table fjsen
            (id integer primary key autoincrement, title text, link text, addtime text)""")
        conn.commit()
        return conn
I will not explain this in detail for now; let's first get the spider running.
Step four: modify settings.py by adding the following line:
The code is as follows:
ITEM_PIPELINES = ['fjsen.pipelines.FjsenPipeline']
Then run it; the command is as follows:
scrapy crawl fjsen
A data.sqlite database file will now be generated, and all the crawled data will be stored in it.
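To verify the result, you can open the generated file directly with Python's built-in sqlite3 module; a minimal check, assuming the data.sqlite file and the fjsen table created above, looks like this:

import sqlite3

conn = sqlite3.connect('data.sqlite')
# each row is (id, title, link, addtime)
for row in conn.execute('select * from fjsen limit 5'):
    print(row)
conn.close()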