Using Scrapy to crawl a site: an example and the steps to implement a web crawler (spider)

The code is as follows:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem


class CbSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.bitsCN.com']

    # follow every article link and hand each page to parse_page
    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
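The CnbetaItem imported above is not shown in the original post. A minimal sketch of what cnbeta/items.py might contain, assuming only the two fields that parse_page fills in, would be:

from scrapy.item import Item, Field


class CnbetaItem(Item):
    # assumed definition: only the fields used by parse_page above
    title = Field()
    url = Field()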



Steps to implement a spider crawler

Step one. The example's primary goal: grab the list of articles from a site's list page and save it to a database, including each article's title, link, and time.

First create a project: scrapy startproject fjsen
Then define the items by opening items.py:

We start by modeling the item. We want to capture the title, link, and time from the site, so we define fields for these three attributes. To do this, we edit items.py, found in the project directory. Our item looks like this:

The code is as follows:


from scrapy.item import Item, Field


class FjsenItem(Item):
    # define the fields for your item here, like:
    # name = Field()
    title = Field()
    link = Field()
    addtime = Field()
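As a quick illustration (not from the original article), a Scrapy Item behaves like a dictionary restricted to its declared fields; the values below are hypothetical:

from fjsen.items import FjsenItem

item = FjsenItem()
item['title'] = 'Example headline'      # hypothetical values
item['link'] = '/j/2012/example.htm'
item['addtime'] = '2012-05-10'
print(item['title'])                    # 'Example headline'
# item['author'] = '...'                # would raise KeyError: 'author' is not a declared field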

Step two: define a spider, which is the crawler itself (note that it lives in the project's spiders folder). It determines the initial list of URLs to download, how to follow links, and how to parse page content to extract items. (The site we are crawling, http://www.fjsen.com/j/node_94962.htm, is a list spread across 10 pages, each entry carrying a link and a time.)
Create a new fjsen_spider.py with the following content:

The code is as follows:


# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from fjsen.items import FjsenItem


class FjsenSpider(BaseSpider):
    name = "fjsen"
    allowed_domains = ["fjsen.com"]
    start_urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm'
                  for x in range(2, 11)] + ['http://www.fjsen.com/j/node_94962.htm']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FjsenItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['addtime'] = site.select('span/text()').extract()
            items.append(item)
        return items

name: the spider's name. It must be unique; you cannot give the same name to different spiders.
allowed_domains: the permitted domain names; the crawl is restricted to the domains in this list.
start_urls: the list of URLs from which the spider starts crawling, so the first pages to be downloaded are listed here; subsequent requests are generated from these starting URLs. Here I simply list all 10 list pages directly.
parse(): a spider method that is executed on the response object returned for each downloaded URL.
Inside parse(), on every list page I extract the data under each <li> inside the <ul>: the title, link, and time, and append each item to a list (see the sketch below for the kind of markup this expects).
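To make the XPath above concrete, here is a small, self-contained sketch using a hypothetical fragment of list-page markup; the real page structure may differ, and the old-Scrapy HtmlXPathSelector API from the spider above is assumed:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# assumed markup: one <li> per article, with a link and a <span> holding the time
body = """
<ul>
  <li><a href="/j/2012/05/article1.htm">First headline</a><span>2012-05-10</span></li>
  <li><a href="/j/2012/05/article2.htm">Second headline</a><span>2012-05-11</span></li>
</ul>
"""
response = HtmlResponse(url='http://www.fjsen.com/j/node_94962.htm', body=body)
hxs = HtmlXPathSelector(response)
for site in hxs.select('//ul/li'):
    print(site.select('a/text()').extract(),    # title
          site.select('a/@href').extract(),     # link
          site.select('span/text()').extract()) # time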


Step three: store the captured data in a database. This is done in the pipelines.py file.

The code is as follows:


# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import sqlite3
from os import path

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class FjsenPipeline(object):
    # the database file the pipeline writes to (generated when the spider runs)
    filename = 'data.sqlite'

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values (?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.bitsCN.com/' + item['link'][0],
                           item['addtime'][0]))
        return item

    def initialize(self):
        # open the database if it exists, otherwise create it first
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("""create table fjsen(id integer primary key autoincrement,
                        title text, link text, addtime text)""")
        conn.commit()
        return conn

I won't explain this in detail for now; let's keep going and get the spider running.
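One practical note not in the original: scrapy.xlib.pydispatch and the engine_started/engine_stopped signal hookup used above were removed in later Scrapy releases. A rough equivalent of the same pipeline using the open_spider/close_spider hooks (assuming the same data.sqlite filename) might look like this sketch:

import sqlite3
from os import path


class FjsenPipeline(object):
    filename = 'data.sqlite'

    def open_spider(self, spider):
        # called once when the spider starts; open or create the database
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = sqlite3.connect(self.filename)
            self.conn.execute("create table fjsen("
                              "id integer primary key autoincrement,"
                              "title text, link text, addtime text)")
            self.conn.commit()

    def close_spider(self, spider):
        # called once when the spider finishes; flush and close
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values (?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.bitsCN.com/' + item['link'][0],
                           item['addtime'][0]))
        return item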

Step four: modify the settings.py file by adding the following line.

The code is as follows:


ITEM_PIPELINES = ['fjsen.pipelines.FjsenPipeline']
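Note that later Scrapy versions require ITEM_PIPELINES to be a dict mapping each pipeline class path to an order value; on such a version the equivalent setting would be something like:

ITEM_PIPELINES = {
    'fjsen.pipelines.FjsenPipeline': 300,   # the number is the pipeline's priority order
}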

Then run the spider by executing the following:

The code is as follows:


scrapy crawl fjsen


At this point a data.sqlite database file will be generated, and all the fetched data will be stored in it.
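To verify the results (not part of the original article), you can query the generated database with a small sqlite3 snippet such as:

import sqlite3

conn = sqlite3.connect('data.sqlite')
for row in conn.execute('select id, title, link, addtime from fjsen limit 5'):
    print(row)
conn.close()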