Using Scrapy to Crawl a Website: an Example and the Steps to Implement a Web Crawler (Spider) in Python


Copy the code as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem

class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.jb51.net']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
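To try this CrawlSpider (assuming it sits inside a Scrapy project named cnbeta that defines the CnbetaItem used above), you would run it by its name, with the same scrapy crawl command used in step four below:

Copy the code as follows:

scrapy crawl cnbeta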



Steps to implement the spider crawler

1. The primary goal of this example: grab the list of articles from a list page of a website and save them in a database, including each article's title, link, and time.

First, create a project: scrapy startproject fjsen
Then define the items. Open items.py:

We start by modeling what we want to crawl: the title, address, and time of each article on the site, so we define fields for these three attributes. To do that, we edit items.py, found in the project directory. Our item looks like this:

Copy the code as follows:

from scrapy.item import Item, Field

class FjsenItem(Item):
    # define the fields for your item here:
    title = Field()
    link = Field()
    addtime = Field()
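As a quick aside (not part of the original walkthrough), a declared Item behaves like a dict, but it only accepts the fields defined above, which catches typos early. A minimal sketch:

Copy the code as follows:

from fjsen.items import FjsenItem

item = FjsenItem()
item['title'] = u'Sample article title'
item['link'] = u'/j/2011/sample-article.htm'
item['addtime'] = u'2011-12-01'
print(item['title'])
# item['author'] = u'x'  # would raise KeyError, since 'author' is not a declared field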

Step two: define a spider, the crawler itself (note: it goes in the project's spiders folder). The spider determines the initial list of URLs to download, how to follow links, and how to parse page content to extract items. (The site we are going to crawl is http://www.fjsen.com/j/node_94962.htm, a list spread over 10 pages; we want the links and times from all of them.)
Create a new fjsen_spider.py that reads as follows:

Copy the code as follows:

# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from fjsen.items import FjsenItem

class FjsenSpider(BaseSpider):
    name = "fjsen"
    allowed_domains = ["fjsen.com"]
    start_urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm' for x in range(2, 11)] + ['http://www.fjsen.com/j/node_94962.htm']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FjsenItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['addtime'] = site.select('span/text()').extract()
            items.append(item)
        return items

name: the spider's name. It must be unique; you cannot give two different spiders the same name.
allowed_domains: the allowed domain names; the crawler is restricted to the domains in this list.
start_urls: the list of URLs where the spider starts crawling, so the first pages downloaded are the ones listed here. Subsequent URLs are generated from the data contained in those starting pages. Here I simply list all 10 list pages directly.
parse(): a spider method that is called on the response object returned for each start URL.

In it, I extract the data under each <li> inside the <ul> of every list page, including the title, link, and time (see the small sketch below).
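To make those XPath expressions concrete, here is a minimal sketch (not from the original article) that runs them against a hypothetical HTML fragment shaped like the list pages the spider expects. It uses the Selector class already imported in the first example, assuming a Scrapy version whose Selector accepts a text= argument:

Copy the code as follows:

from scrapy.selector import Selector

# a hypothetical fragment mimicking one entry of the list page
html = """
<ul>
  <li><a href="/j/2011/sample-article.htm">Sample article title</a>
      <span>2011-12-01</span></li>
</ul>
"""

sel = Selector(text=html)
for li in sel.xpath('//ul/li'):
    print(li.xpath('a/text()').extract())     # ['Sample article title']
    print(li.xpath('a/@href').extract())      # ['/j/2011/sample-article.htm']
    print(li.xpath('span/text()').extract())  # ['2011-12-01']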


Step three: store the crawled data in a database. Here we modify pipelines.py:

Copy the code as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import sqlite3
from os import path
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FjsenPipeline(object):

    filename = 'data.sqlite'  # the database file created in the project directory

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values(?,?,?,?)',
                          (None, item['title'][0], 'http://www.jb51.net/' + item['link'][0], item['addtime'][0]))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        # four columns to match the four values inserted in process_item
        conn.execute("""create table fjsen
            (id integer primary key autoincrement, title text, link text, addtime text)""")
        conn.commit()
        return conn

I won't explain this in detail for now; let's just get the spider running first.

Step four: modify settings.py by adding the following line:

Copy the code as follows:

ITEM_PIPELINES = ['fjsen.pipelines.FjsenPipeline']
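Note that newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping each pipeline class to an order number; if the list form above is rejected, something along these lines should work instead:

Copy the code as follows:

ITEM_PIPELINES = {
    'fjsen.pipelines.FjsenPipeline': 300,
}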

Then run it by executing:

Copy the code as follows:

scrapy crawl fjsen

This will generate a data.sqlite database file; all the crawled data will be stored there.
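To verify the results, a quick check (not part of the original article) is to open the generated data.sqlite with the standard sqlite3 module and print a few of the rows stored by the pipeline above:

Copy the code as follows:

import sqlite3

conn = sqlite3.connect('data.sqlite')
for row in conn.execute('select id, title, link, addtime from fjsen limit 5'):
    print(row)
conn.close()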
