Scrapy detailed example: crawl Baidu Tieba data and save it to a file and a database

Source: Internet
Author: User
Tags: instance, method, XPath

Scrapy is an application framework for crawling web site data and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data. Using a framework to crawl data saves a lot of effort: you do not have to download pages yourself, and most of the data plumbing is already written, so you only need to focus on the crawl rules. Scrapy is one of the more popular Python crawling frameworks, so today we will use it to crawl the "black intermediary" (shady rental agent) Baidu Tieba forum. Don't ask me why I chose that forum; I have had a personal experience with them. Ahem, back to the point: let's talk about how to crawl the forum data.

Note: You will need to install Python and the Scrapy framework yourself first.

1. Create the project

scrapy startproject <custom project name>

scrapy startproject baidutieba

This command creates a baidutieba directory with the following contents:

baidutieba/
    scrapy.cfg
    baidutieba/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
scrapy.cfg: the project configuration file.
baidutieba/: the Python module of the project. You will add your code here.
baidutieba/items.py: the item file of the project.
baidutieba/pipelines.py: the pipelines file of the project.
baidutieba/settings.py: the settings file of the project.
baidutieba/spiders/: the directory where the spider code is placed.

2. Create the crawler file

To write a crawler, we first have to create a spider.
We create a file myspider.py in the baidutieba/spiders/ directory. The file contains a MySpider class that must inherit from scrapy.Spider. It must also define three attributes:
1. name: used to distinguish the spider. The name must be unique; you may not set the same name for different spiders.
2. start_urls: the list of URLs the spider crawls at startup. The first pages to be fetched come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs.
3. parse(): a method of the spider. When invoked, the Response object generated after each initial URL finishes downloading is passed to this method as its only argument. The method is responsible for parsing the returned data (response data), extracting data (generating items), and generating Request objects for URLs that need further processing.

After it is created, the myspider.py code looks like this:

# import the required module
import scrapy

class MySpider(scrapy.Spider):
    # used to distinguish this spider
    name = "myspider"
    # domains the spider is allowed to visit
    allowed_domains = []
    # URLs to crawl
    start_urls = []
    # crawl method
    def parse(self, response):
        pass

3. Define Item

The main goal of crawling is to extract structured data from unstructured data sources, such as Web pages. Scrapy provides the Item class to meet this requirement.
The Item object is a simple container that holds the crawled data. It provides a dictionary-like API and a convenient syntax for declaring its fields.

Here, let's decide which data fields we want to crawl.

You can see there is an items.py file in the project directory; we can modify this file or create a new file to define our item.
Here, we create a new item file, tbitems.py, at the same level:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class TbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    user_info = scrapy.Field()      # poster information
    title = scrapy.Field()          # post title
    url = scrapy.Field()            # post detail URL
    short_content = scrapy.Field()  # post summary
    imgs = scrapy.Field()           # post images


As above, we set up a TbItem container to hold the captured information: user_info is the poster information, title is the post title, url is the post detail address, short_content is the post summary, and imgs holds the post images.

Common methods are as follows:

# define an item
info = TbItem()
# assign a value
info['title'] = "language"
# read a value
info['title']
info.get('title')
# get all keys
info.keys()
# get all key/value pairs
info.items()
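One more operation worth noting, because the JSON pipeline later in this article relies on it: a populated item can be converted to a plain dictionary, which makes it easy to serialize. A minimal sketch:

# convert the item to an ordinary dict, then serialize it to JSON
import json
info = TbItem()
info['title'] = "language"
print(json.dumps(dict(info)))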

4. Complete the crawler main program (version 1)

# coding=utf-8
import scrapy
from baidutieba.tbitems import TbItem


class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/f?ie=utf-8&kw=%e9%bb%91%e4%b8%ad%e4%bb%8b&fr=search']

    def parse(self, response):
        item = TbItem()
        boxs = response.xpath("//li[contains(@class, 'j_thread_list')]")
        for box in boxs:
            item['user_info'] = box.xpath('./@data-field').extract()[0]
            item['title'] = box.xpath(".//div[contains(@class, 'threadlist_title')]/a/text()").extract()[0]
            item['url'] = box.xpath(".//div[contains(@class, 'threadlist_title')]/a/@href").extract()[0]
            item['short_content'] = box.xpath(".//div[contains(@class, 'threadlist_abs')]/text()").extract()[0]
            if box.xpath('.//img/@src'):
                item['imgs'] = box.xpath('.//img/@src').extract()[0]
            else:
                item['imgs'] = []
            yield item
Note: XPath is used here to extract the page information. It is not covered in detail in this article; you can refer to an online XPath tutorial to learn it.


The XPath expressions above were debugged with the XPath Helper extension for the Google Chrome browser (xpath-helper_v2.0.2). Chrome itself can also give you an element's XPath from the developer tools (right-click the element, choose Inspect, then Copy XPath).
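If you prefer to test selectors from the terminal, Scrapy's interactive shell can also be used to try XPath expressions against the live page before putting them into the spider. A minimal sketch, using the same URL and class names as the spider above:

scrapy shell "https://tieba.baidu.com/f?ie=utf-8&kw=%e9%bb%91%e4%b8%ad%e4%bb%8b&fr=search"

# inside the shell, `response` is already bound to the downloaded page
boxs = response.xpath("//li[contains(@class, 'j_thread_list')]")
print(len(boxs))   # how many thread entries the selector matched
print(boxs[0].xpath(".//div[contains(@class, 'threadlist_title')]/a/text()").extract())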



Note that the extraction is performed in the parse() method of the MySpider class.
The parse() method is responsible for processing the response and returning the processed data.

This method, like other request callbacks, must return an iterable of Request objects and/or items (hence the yield item; for details, see an explanation of yield in Python).

(In the Scrapy framework you can use a variety of selectors to find information. XPath is used here, but you can also plug in libraries such as BeautifulSoup or lxml; the framework itself provides a mechanism, called selectors, to help users extract information. Since this article is only an introduction, we will not go further into it.)
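For comparison only, and assuming the same Tieba class names used above, the XPath queries could also be written with Scrapy's built-in CSS selectors; a sketch:

# equivalent queries using response.css instead of response.xpath
boxs = response.css('li.j_thread_list')
for box in boxs:
    title = box.css('div.threadlist_title a::text').extract_first()
    url = box.css('div.threadlist_title a::attr(href)').extract_first()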

cd into the project folder, and then run from the command line:

scrapy crawl <your spider name>

scrapy crawl myspider

You can see that we have run successfully and obtained data. However, if you run it you will notice that we have only crawled one page of data, while we want to crawl all of the paginated data.

    def parse(self, response):
        item = TbItem()
        boxs = response.xpath("//li[contains(@class, 'j_thread_list')]")
        for box in boxs:
            item['user_info'] = box.xpath('./@data-field').extract()[0]
            item['title'] = box.xpath(".//div[contains(@class, 'threadlist_title')]/a/text()").extract()[0]
            item['url'] = box.xpath(".//div[contains(@class, 'threadlist_title')]/a/@href").extract()[0]
            item['short_content'] = box.xpath(".//div[contains(@class, 'threadlist_abs')]/text()").extract()[0]
            if box.xpath('.//img/@src'):
                item['imgs'] = box.xpath('.//img/@src').extract()[0]
            else:
                item['imgs'] = []
            yield item

        # URL follow-up starts here
        # get the URL of the next page
        url = response.xpath('//*[@id="frs_list_pager"]/a[10]/@href').extract()

        if url:
            page = 'https:' + url[0]
            # yield a request for the next page
            yield scrapy.Request(page, callback=self.parse)
        # URL follow-up ends here


You can see that the URL follow-up block is a sibling of the for loop. That is, after the for loop completes (after the current page's data has been crawled), the address behind the "next page" button is retrieved and yielded as a Request, so that the spider goes on to crawl the paginated data.
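A small aside: the code above prepends 'https:' by hand because the next-page link comes back protocol-relative. Scrapy's response.urljoin() handles protocol-relative and relative links in one call; a sketch of the follow-up block rewritten that way, with the same selector:

# alternative next-page handling using response.urljoin
next_href = response.xpath('//*[@id="frs_list_pager"]/a[10]/@href').extract_first()
if next_href:
    # urljoin resolves '//tieba.baidu.com/...' against the current page's scheme
    yield scrapy.Request(response.urljoin(next_href), callback=self.parse)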


5. Save the crawled data

After an item has been collected in the spider, it is passed to the item pipeline, where several components process it in a defined order.
Each item pipeline component (sometimes simply called an "item pipeline") is a Python class that implements a few simple methods. It receives the item, performs some action on it, and also decides whether the item continues through the pipeline or is dropped and no longer processed.
Here are some typical applications of an item pipeline:
(1) Cleaning up HTML data
(2) Validating crawled data (checking that the item contains certain fields)
(3) Checking for duplicates (and dropping them)
(4) Saving the crawled results to a file or a database
A minimal sketch of points (2) and (3) follows right after this list; the rest of this section covers point (4).
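The sketch below is purely illustrative and not part of the original project; it assumes the TbItem fields defined earlier and relies on Scrapy's DropItem exception to discard bad or duplicate items:

from scrapy.exceptions import DropItem

class ValidateAndDedupePipeline(object):
    """Drop items that have no title and drop posts whose URL was already seen."""
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('missing title: %s' % item)          # validation
        if item['url'] in self.seen_urls:
            raise DropItem('duplicate post: %s' % item['url'])  # deduplication
        self.seen_urls.add(item['url'])
        return item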

1. Save the data to a file

First, create a baidutiebapipeline.py file in the project directory, at the same level as pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# set the system default character set (Python 2)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import codecs
import json
from logging import log


class JsonWithEncodingPipeline(object):
    """Pipeline class that saves items to a file.
       1. Configure it in the settings.py file.
       2. yield the item in your spider class; the pipeline then runs automatically."""
    def __init__(self):
        self.file = codecs.open('info.json', 'w', encoding='utf-8')  # save as a JSON file

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # convert the item to JSON
        self.file.write(line)                 # write it to the file
        return item

    def spider_closed(self, spider):  # close the file when the spider finishes
        self.file.close()

With this, the item pipeline that saves our data to a file is written. To use it, we need to register our pipeline:

In the same directory there is a settings.py file; open it, find ITEM_PIPELINES, and register our pipeline there.

The format is: project name.pipeline file name.pipeline class name.

The integer value after it marks the execution priority, in the range 1~1000; the smaller the value, the earlier the pipeline executes.

# Configure item pipelines
# http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'baidutieba.baidutiebapipeline.JsonWithEncodingPipeline': 300,  # priority; the original value was cut off, 300 is an example
}
Then we run it:

scrapy crawl myspider


2. Save the data to the database

Also add our database-saving pipeline in settings.py, and add the database configuration, as follows:

# Configure item pipelines
# http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'baidutieba.baidutiebapipeline.JsonWithEncodingPipeline': 300,   # priorities; the original values were cut off,
    'baidutieba.baidutiebapipeline.WebcrawlerScrapyPipeline': 301,   # 300/301 are example values
}


# MySQL database connection settings
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'test'        # database name, please modify
MYSQL_USER = 'homestead'     # database account, please modify
MYSQL_PASSWD = 'secret'      # database password, please modify
MYSQL_PORT = 3306            # database port, used in DBHelper



Modify the baidutiebapipeline.py file as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# set the system default character set (Python 2)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors
import codecs
import json
from logging import log


class JsonWithEncodingPipeline(object):
    """Pipeline class that saves items to a file.
       1. Configure it in the settings.py file.
       2. yield the item in your spider class; the pipeline then runs automatically."""
    def __init__(self):
        self.file = codecs.open('info.json', 'w', encoding='utf-8')  # save as a JSON file

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # convert the item to JSON
        self.file.write(line)                 # write it to the file
        return item

    def spider_closed(self, spider):  # close the file when the spider finishes
        self.file.close()


class WebcrawlerScrapyPipeline(object):
    """Pipeline class that saves items to the database.
       1. Configure it in the settings.py file.
       2. yield the item in your spider class; the pipeline then runs automatically."""
    def __init__(self, dbpool):
        self.dbpool = dbpool
        # The commented-out code below hard-codes the connection pool in the code;
        # reading the parameters from the settings file (see from_settings) is more flexible.
        # self.dbpool = adbapi.ConnectionPool('MySQLdb', host='127.0.0.1',
        #                                     db='crawlpicturesdb', user='root',
        #                                     passwd='123456',
        #                                     cursorclass=MySQLdb.cursors.DictCursor,
        #                                     charset='utf8', use_unicode=False)

    @classmethod
    def from_settings(cls, settings):
        """1. @classmethod declares a class method; what we usually write are instance methods.
           2. The first parameter of a class method is cls (short for class), while the first
              parameter of an instance method is self, which refers to an instance of the class.
           3. A class method can be called through the class itself, like C.f(), which is
              similar to a static method in Java."""
        dbparams = dict(
            host=settings['MYSQL_HOST'],  # read the configuration from settings
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',  # set the encoding, otherwise Chinese text may come out garbled
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=False,
        )
        # ** expands the dict into keyword arguments, equivalent to host=xxx, db=yyy, ...
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)
        return cls(dbpool)  # dbpool is handed to this class, so self can access it

    # called by the pipeline machinery by default
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)  # call the insert method
        query.addErrback(self._handle_error, item, spider)                  # call the error-handling method
        return item

    # write into the database
    def _conditional_insert(self, tx, item):
        sql = "insert into test(name, url) values(%s, %s)"
        print item["title"]  # debug output
        params = (item["title"].encode('utf-8'), item["url"])
        tx.execute(sql, params)

    # error-handling method
    def _handle_error(self, failue, item, spider):
        print '--------------database operation exception!!-----------------'
        print '--------------------------------------------------------------'
        print failue
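One thing the insert statement above takes for granted is that a test table with name and url columns already exists in the configured database. The original article does not show how that table was created; the following one-off script is only a sketch of a compatible schema (the column sizes are assumptions):

# create the table assumed by _conditional_insert (illustrative only)
import MySQLdb

conn = MySQLdb.connect(host='127.0.0.1', user='homestead',
                       passwd='secret', db='test', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS test (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255),
        url VARCHAR(512)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()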



This is the introductory example I wanted to share. If you have any questions, please leave a message or contact me directly. Also note that because Baidu Tieba keeps being upgraded, the crawl rules the program relies on may need to be adjusted over time, but the overall approach stays the same; you just need to tweak the selectors yourself.

Program completion date: 2017.7.18

Program code
