Using Scrapy to Crawl a Website: an Example and the Steps to Implement a Web Crawler (Spider) in Python


Copy the code as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem

class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.jb51.net']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
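To try this CrawlSpider (assuming it sits inside a Scrapy project named cnbeta that defines the CnbetaItem used above), you would run it by its name, with the same scrapy crawl command used in step four below:

Copy the code as follows:

scrapy crawl cnbeta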



Steps to implement the spider crawler

1. The primary goal of this example: grab the list of articles from a list page of a website and save them in a database, including each article's title, link, and time.

First, create a project: scrapy startproject fjsen
Then define the items. Open items.py:

We start by modeling what we want to crawl: the title, address, and time of each article on the site, so we define fields for these three attributes. To do that, we edit items.py, found in the project directory. Our item looks like this:

Copy the code as follows:

from scrapy.item import Item, Field

class FjsenItem(Item):
    # define the fields for your item here:
    title = Field()
    link = Field()
    addtime = Field()
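As a quick aside (not part of the original walkthrough), a declared Item behaves like a dict, but it only accepts the fields defined above, which catches typos early. A minimal sketch:

Copy the code as follows:

from fjsen.items import FjsenItem

item = FjsenItem()
item['title'] = u'Sample article title'
item['link'] = u'/j/2011/sample-article.htm'
item['addtime'] = u'2011-12-01'
print(item['title'])
# item['author'] = u'x'  # would raise KeyError, since 'author' is not a declared field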

Step two: define a spider, the crawler itself (note: it goes in the project's spiders folder). The spider determines the initial list of URLs to download, how to follow links, and how to parse page content to extract items. (The site we are going to crawl is http://www.fjsen.com/j/node_94962.htm, a list spread over 10 pages; we want the links and times from all of them.)
Create a new fjsen_spider.py that reads as follows:

Copy the code as follows:

# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from fjsen.items import FjsenItem

class FjsenSpider(BaseSpider):
    name = "fjsen"
    allowed_domains = ["fjsen.com"]
    start_urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm' for x in range(2, 11)] + ['http://www.fjsen.com/j/node_94962.htm']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FjsenItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['addtime'] = site.select('span/text()').extract()
            items.append(item)
        return items

name: the spider's name. It must be unique; you cannot give two different spiders the same name.
allowed_domains: the allowed domain names; the crawler is restricted to the domains in this list.
start_urls: the list of URLs where the spider starts crawling, so the first pages downloaded are the ones listed here. Subsequent URLs are generated from the data contained in those starting pages. Here I simply list all 10 list pages directly.
parse(): a spider method that is called on the response object returned for each start URL.

In it, I extract the data under each <li> inside the <ul> of every list page, including the title, link, and time (see the small sketch below).
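To make those XPath expressions concrete, here is a minimal sketch (not from the original article) that runs them against a hypothetical HTML fragment shaped like the list pages the spider expects. It uses the Selector class already imported in the first example, assuming a Scrapy version whose Selector accepts a text= argument:

Copy the code as follows:

from scrapy.selector import Selector

# a hypothetical fragment mimicking one entry of the list page
html = """
<ul>
  <li><a href="/j/2011/sample-article.htm">Sample article title</a>
      <span>2011-12-01</span></li>
</ul>
"""

sel = Selector(text=html)
for li in sel.xpath('//ul/li'):
    print(li.xpath('a/text()').extract())     # ['Sample article title']
    print(li.xpath('a/@href').extract())      # ['/j/2011/sample-article.htm']
    print(li.xpath('span/text()').extract())  # ['2011-12-01']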


Step three: store the crawled data in a database. Here we modify pipelines.py:

Copy the code as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import sqlite3
from os import path
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FjsenPipeline(object):

    filename = 'data.sqlite'  # the database file created in the project directory

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values(?,?,?,?)',
                          (None, item['title'][0], 'http://www.jb51.net/' + item['link'][0], item['addtime'][0]))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        # four columns to match the four values inserted in process_item
        conn.execute("""create table fjsen
            (id integer primary key autoincrement, title text, link text, addtime text)""")
        conn.commit()
        return conn

I won't explain this in detail for now; let's just get the spider running first.

Step four: modify settings.py by adding the following line:

Copy the code as follows:

ITEM_PIPELINES = ['fjsen.pipelines.FjsenPipeline']
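Note that newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping each pipeline class to an order number; if the list form above is rejected, something along these lines should work instead:

Copy the code as follows:

ITEM_PIPELINES = {
    'fjsen.pipelines.FjsenPipeline': 300,
}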

Then run it by executing:

Copy the code as follows:

scrapy crawl fjsen

This will generate a data.sqlite database file; all the crawled data will be stored there.
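To verify the results, a quick check (not part of the original article) is to open the generated data.sqlite with the standard sqlite3 module and print a few of the rows stored by the pipeline above:

Copy the code as follows:

import sqlite3

conn = sqlite3.connect('data.sqlite')
for row in conn.execute('select id, title, link, addtime from fjsen limit 5'):
    print(row)
conn.close()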
