Using Scrapy to crawl a website: an example and the steps for implementing a spider

Source: Internet
Author: User

The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem

class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']  # list page to start crawling from

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
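
The example above imports CnbetaItem from the project's items.py, which is not shown here. A minimal sketch of what that item class might look like (the field names are assumed from the spider code above) is:

from scrapy.item import Item, Field

class CnbetaItem(Item):
    title = Field()  # page title extracted from //title/text()
    url = Field()    # the response URL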


Steps for implementing a spider crawler

Step 1: The primary goal of this example is to capture the article list from a website's list pages and store it in a database, recording each article's title, link, and time.

First, generate a project: scrapy startproject fjsen
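This creates a project skeleton roughly like the following (the exact files may vary slightly between Scrapy versions):

fjsen/
    scrapy.cfg
    fjsen/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py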
Next, define the items. We start by modeling the data we want to capture: the title, link, and time of each article, so we define these three fields on the item. Edit items.py, which you will find in the project directory just created. Our item looks like this:

The code is as follows:

from scrapy.item import Item, Field

class FjsenItem(Item):
    # define the fields for your item here, like:
    # name = Field()
    title = Field()
    link = Field()
    addtime = Field()

Step 2: Define a spider (note that it goes under the project's spiders folder). A spider determines the initial list of URLs to download, how to follow links, and how to parse the page contents to extract items. (What we want to crawl are the titles, links, and times on all ten pages of the list at http://www.fjsen.com/j/node_94962.htm.)
Create a new fjsen_spider.py with the following content:

The code is as follows:

# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from fjsen.items import FjsenItem

class FjsenSpider(BaseSpider):
    name = "fjsen"
    allowed_domains = ["fjsen.com"]
    # the ten list pages: node_94962.htm plus node_94962_2.htm .. node_94962_10.htm
    # (the exact page range is assumed here)
    start_urls = ['http://www.fjsen.com/j/node_94962_' + str(x) + '.htm'
                  for x in range(2, 11)] + ['http://www.fjsen.com/j/node_94962.htm']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FjsenItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['addtime'] = site.select('span/text()').extract()
            items.append(item)
        return items

name: the name of the spider. It must be unique; you cannot give the same name to different spiders.
allowed_domains: the allowed domain names; crawling is restricted to the domains in this list.
start_urls: a list of URLs from which the spider starts crawling. The first pages downloaded are the ones listed here; subsequent URLs are generated from the data found in these starting pages. Here I simply list the ten list pages directly.
parse(): a method of the spider. It is called with the Response object returned for each of the starting URLs once it has been downloaded.
Here, I capture the data under each <li> under <ul> on every list page, namely the title, link, and time, and append it to a list.
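
If you want to check that these XPath expressions actually match the page before running the full crawl, the Scrapy shell is handy (shown here against the first list page; the shell provides the response object for you):

scrapy shell http://www.fjsen.com/j/node_94962.htm

# then, inside the shell:
from scrapy.selector import HtmlXPathSelector
hxs = HtmlXPathSelector(response)             # response is supplied by the shell
hxs.select('//ul/li/a/text()').extract()      # article titles
hxs.select('//ul/li/a/@href').extract()       # article links
hxs.select('//ul/li/span/text()').extract()   # times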


Step 3: Save the captured data to the database. This is done by modifying the pipelines.py file.
The code is as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import sqlite3
from os import path

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FjsenPipeline(object):

    filename = 'data.sqlite'  # the SQLite file the crawl results are written to

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values (?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.fjsen.com/' + item['link'][0],  # prepend the site base URL to the relative link
                           item['addtime'][0]))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("create table fjsen "
                     "(id integer primary key autoincrement, title text, link text, addtime text)")
        conn.commit()
        return conn

I will not explain this in detail for now; let's get the spider running first.
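
As an aside, the signal-dispatcher wiring above reflects older Scrapy releases; newer versions let a pipeline open and close its resources through the open_spider and close_spider hooks instead. A minimal sketch of the same idea using those hooks (same table and filename as above):

import sqlite3

class FjsenPipeline(object):
    filename = 'data.sqlite'

    def open_spider(self, spider):
        # called once when the spider is opened
        self.conn = sqlite3.connect(self.filename)
        self.conn.execute("create table if not exists fjsen "
                          "(id integer primary key autoincrement, title text, link text, addtime text)")

    def close_spider(self, spider):
        # called once when the spider is closed
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute('insert into fjsen values (?,?,?,?)',
                          (None, item['title'][0],
                           'http://www.fjsen.com/' + item['link'][0],
                           item['addtime'][0]))
        return item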

Step 4: Modify the settings.py file by adding the following line.
The code is as follows:
ITEM_PIPELINES = ['fjsen.pipelines.FjsenPipeline']
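
Note that recent Scrapy versions expect ITEM_PIPELINES to be a dict mapping each pipeline class path to an order number rather than a list, so on a newer release the same setting would be written as:

ITEM_PIPELINES = {'fjsen.pipelines.FjsenPipeline': 300}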

Run the following command:
The code is as follows:
scrapy crawl fjsen

A data.sqlite database file will be generated in the current directory, and all of the captured data will be stored there.
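
To verify that the crawl worked, you can open the file with Python's sqlite3 module and inspect a few rows (a quick check, assuming the table created above):

import sqlite3

conn = sqlite3.connect('data.sqlite')
for row in conn.execute('select title, link, addtime from fjsen limit 5'):
    print(row)
conn.close()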
