Using Scrapy to crawl a site: an example and the steps to implement a web crawler (spider)

The code is as follows:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem


class CbSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.bitsCN.com']

    # follow every article link and hand each page to parse_page
    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
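The CnbetaItem imported above is not shown in the original post. A minimal sketch of what cnbeta/items.py might contain, assuming only the two fields that parse_page fills in, would be:

from scrapy.item import Item, Field


class CnbetaItem(Item):
    # assumed definition: only the fields used by parse_page above
    title = Field()
    url = Field()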



Steps to implement a spider crawler

Step one. The example's primary goal: grab the list of articles from a site's list page and save it to a database, including each article's title, link, and time.

First create a project: scrapy startproject fjsen
Then define the items by opening items.py:

We start by modeling the item. We want to capture the title, link, and time from the site, so we define fields for these three attributes. To do this, we edit items.py, found in the project directory. Our item looks like this:

The code is as follows:


from scrapy.item import Item, Field


class FjsenItem(Item):
    # define the fields for your item here, like:
    # name = Field()
    title = Field()
    link = Field()
    addtime = Field()
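As a quick illustration (not from the original article), a Scrapy Item behaves like a dictionary restricted to its declared fields; the values below are hypothetical:

from fjsen.items import FjsenItem

item = FjsenItem()
item['title'] = 'Example headline'      # hypothetical values
item['link'] = '/j/2012/example.htm'
item['addtime'] = '2012-05-10'
print(item['title'])                    # 'Example headline'
# item['author'] = '...'                # would raise KeyError: 'author' is not a declared field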

Step two: define a spider, which is the crawler itself (note that it lives in the project's spiders folder). It determines the initial list of URLs to download, how to follow links, and how to parse page content to extract items. (The site we are crawling, http://www.fjsen.com/j/node_94962.htm, is a list spread across 10 pages, each entry carrying a link and a time.)
Create a new fjsen_spider.py with the following content:

The code is as follows:


# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from fjsen.items import FjsenItem


class FjsenSpider(BaseSpider):
    name = "fjsen"
    allowed_domains = ["fjsen.com"]
    start_urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm'
                  for x in range(2, 11)] + ['http://www.fjsen.com/j/node_94962.htm']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FjsenItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['addtime'] = site.select('span/text()').extract()
            items.append(item)
        return items

name: the spider's name. It must be unique; you cannot give the same name to different spiders.
allowed_domains: the permitted domain names; the crawl is restricted to the domains in this list.
start_urls: the list of URLs from which the spider starts crawling, so the first pages to be downloaded are listed here; subsequent requests are generated from these starting URLs. Here I simply list all 10 list pages directly.
parse(): a spider method that is executed on the response object returned for each downloaded URL.
Inside parse(), on every list page I extract the data under each <li> inside the <ul>: the title, link, and time, and append each item to a list (see the sketch below for the kind of markup this expects).
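To make the XPath above concrete, here is a small, self-contained sketch using a hypothetical fragment of list-page markup; the real page structure may differ, and the old-Scrapy HtmlXPathSelector API from the spider above is assumed:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# assumed markup: one <li> per article, with a link and a <span> holding the time
body = """
<ul>
  <li><a href="/j/2012/05/article1.htm">First headline</a><span>2012-05-10</span></li>
  <li><a href="/j/2012/05/article2.htm">Second headline</a><span>2012-05-11</span></li>
</ul>
"""
response = HtmlResponse(url='http://www.fjsen.com/j/node_94962.htm', body=body)
hxs = HtmlXPathSelector(response)
for site in hxs.select('//ul/li'):
    print(site.select('a/text()').extract(),    # title
          site.select('a/@href').extract(),     # link
          site.select('span/text()').extract()) # time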


Step three: store the captured data in a database. This is done in the pipelines.py file.

The code is as follows:


# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import sqlite3
from os import path

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class FjsenPipeline(object):
    # the database file the pipeline writes to (generated when the spider runs)
    filename = 'data.sqlite'

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values (?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.bitsCN.com/' + item['link'][0],
                           item['addtime'][0]))
        return item

    def initialize(self):
        # open the database if it exists, otherwise create it first
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("""create table fjsen(id integer primary key autoincrement,
                        title text, link text, addtime text)""")
        conn.commit()
        return conn

I won't explain this in detail for now; let's keep going and get the spider running.
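One practical note not in the original: scrapy.xlib.pydispatch and the engine_started/engine_stopped signal hookup used above were removed in later Scrapy releases. A rough equivalent of the same pipeline using the open_spider/close_spider hooks (assuming the same data.sqlite filename) might look like this sketch:

import sqlite3
from os import path


class FjsenPipeline(object):
    filename = 'data.sqlite'

    def open_spider(self, spider):
        # called once when the spider starts; open or create the database
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = sqlite3.connect(self.filename)
            self.conn.execute("create table fjsen("
                              "id integer primary key autoincrement,"
                              "title text, link text, addtime text)")
            self.conn.commit()

    def close_spider(self, spider):
        # called once when the spider finishes; flush and close
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values (?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.bitsCN.com/' + item['link'][0],
                           item['addtime'][0]))
        return item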

Step four: modify the settings.py file by adding the following line.

The code is as follows:


ITEM_PIPELINES = ['fjsen.pipelines.FjsenPipeline']
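Note that later Scrapy versions require ITEM_PIPELINES to be a dict mapping each pipeline class path to an order value; on such a version the equivalent setting would be something like:

ITEM_PIPELINES = {
    'fjsen.pipelines.FjsenPipeline': 300,   # the number is the pipeline's priority order
}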

Then run the spider by executing the following:

The code is as follows:


scrapy crawl fjsen


At this point a data.sqlite database file will be generated, and all the fetched data will be stored in it.
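To verify the results (not part of the original article), you can query the generated database with a small sqlite3 snippet such as:

import sqlite3

conn = sqlite3.connect('data.sqlite')
for row in conn.execute('select id, title, link, addtime from fjsen limit 5'):
    print(row)
conn.close()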