Using Scrapy to crawl my own blog posts

Commonly used libraries for writing crawlers in Python include urllib2 and requests; they are fine for most simple scenarios or for learning. I have previously written an example of scraping popular Baidu Music songs with urllib2 + BeautifulSoup; take a look if you are interested.

This article describes how to use Scrapy to crawl the list of my posts on cnblogs. It captures only four simple fields (post title, publish date, read count, and comment count) in order to illustrate the most basic usage of Scrapy with a small example.

Environment configuration instructions

Operating System: Ubuntu 14.04.2 LTS
Python: Python 2.7.6
Scrapy: Scrapy 1.0.3

Note: Scrapy 1.0 differs from earlier versions; the namespaces of some classes have changed.

Create a project

Run the following command to create a Scrapy project:

scrapy startproject scrapy_cnblogs

After the project is created, its directory structure looks like this:

scrapy_cnblogs
├── botcnblogs
│   ├── __init__.py
│   ├── items.py        # defines the structure of the content to capture
│   ├── pipelines.py    # pipelines that process the captured items
│   ├── settings.py     # configuration parameters required by the crawler
│   └── spiders
│       └── __init__.py
└── scrapy.cfg          # project configuration file, can usually be left at its defaults

The directory containing scrapy.cfg is the project root. This file is the project's configuration file; after the project is created you can mostly ignore its content, which is as follows:

[settings]
default = botcnblogs.settings

[deploy]
#url = http://localhost:6800/
project = botcnblogs

The items.py file defines the data structure abstracted from the captured web page content. Since we need the post title, publish date, read count, and comment count, the Item is defined as follows:

from scrapy import Item, Field   # import Item and Field

class BotcnblogsItem(Item):
    # define the fields for your item here like:
    title = Field()          # post title
    publishDate = Field()    # publish date
    readCount = Field()      # read count
    commentCount = Field()   # comment count

pipelines.py processes the information crawled by the spider (here, the Item objects defined above). The typical use cases listed in the official documentation are:

  • Cleansing HTML data
  • Validating scraped data (checking that the items contain certain fields)
  • Checking for duplicates (and dropping them)
  • Storing the scraped items in a database

Defining a pipeline is also very simple: you only need to implement the process_item method, which takes two parameters: item, the Item object to be processed, and spider, the spider that crawled it.

There are also optional open_spider and close_spider methods, which are callbacks invoked when the spider starts and when it finishes.

In this example the processing is very simple: the pipeline just writes the received Item objects to a JSON file. The __init__ method opens (or creates) item.json in "w+" mode, and process_item serializes each item to a string and writes it to that file. The code is as follows:

# -*- coding: utf-8 -*-
import json

class BotcnblogsPipeline(object):
    def __init__(self):
        self.file = open("item.json", "w+")

    def process_item(self, item, spider):
        # ensure_ascii=False is needed when the content contains Chinese characters,
        # otherwise the output may be garbled
        record = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(record)
        return item

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        self.file.close()

settings.py is the crawler's configuration file. Here you need to set the ITEM_PIPELINES parameter, which configures the pipelines enabled in the project and their execution order. It is a dictionary of the form {"pipeline class path": execution order integer}.

The configuration in this example is as follows:

SPIDER_MODULES = ['botcnblogs.spiders']
NEWSPIDER_MODULE = 'botcnblogs.spiders'

ITEM_PIPELINES = {
    'botcnblogs.pipelines.BotcnblogsPipeline': 1,
}

The preparation is complete, but where is the crawler itself implemented? The spiders directory in the project contains only an __init__.py file; you need to create the spider file yourself. In this directory, create a file named botspider.py. The spider class defined in this example inherits from CrawlSpider.

To define a Spider, the following variables and methods are required:

name: the spider's name. It should be unique, and it is the name used when running the crawler.

allowed_domains: the list of domain names the spider is allowed to crawl. For example, to crawl a cnblogs blog, this should contain cnblogs.com.

start_urls: the list of entry URLs that the spider crawls first.

rules: used when the pages to crawl are not just one or a few fixed URLs but follow a pattern, for example a blog with many consecutive pages. If you define rules, you need to specify the crawling rule (as a regular expression in a LinkExtractor) and a callback function to handle the matching pages.

The spider code is as follows:

# -*- coding: utf-8 -*-
__author__ = 'linuxfengzheng'

from scrapy.spiders import Spider, Rule
from scrapy.selector import Selector
from botcnblogs.items import BotcnblogsItem
from scrapy.linkextractors import LinkExtractor
import re
from scrapy.spiders import CrawlSpider


class BotSpider(CrawlSpider):
    name = "cnblogsSpider"             # set the spider name
    allowed_domains = ["cnblogs.com"]  # set the allowed domain names
    start_urls = [
        "http://www.cnblogs.com/fengzheng/default.html?page=3",  # page to start crawling from
    ]
    rules = (
        Rule(LinkExtractor(allow=('fengzheng/default.html\?page\=([\d]+)', )),
             callback='parse_item', follow=True),
    )  # define the rule

    def parse_item(self, response):
        sel = response.selector
        posts = sel.xpath('//div[@id="mainContent"]/div[@class="day"]')
        items = []
        for p in posts:
            # content = p.extract()
            # self.file.write(content.encode("UTF-8"))
            item = BotcnblogsItem()

            publishDate = p.xpath('div[@class="dayTitle"]/a/text()').extract_first()
            item["publishDate"] = (publishDate is not None and [publishDate.encode("UTF-8")] or [""])[0]

            title = p.xpath('div[@class="postTitle"]/a/text()').extract_first()
            item["title"] = (title is not None and [title.encode("UTF-8")] or [""])[0]

            # the postDesc text looks like: posted @ 2015-11-03 风的姿态 阅读(n) 评论(n)
            readcount = p.xpath('div[@class="postDesc"]/text()').re_first(u"阅读\(\d+\)")    # "阅读" = read
            regReadCount = re.search(r"\d+", readcount)
            if regReadCount is not None:
                readcount = regReadCount.group()
            item["readCount"] = (readcount is not None and [readcount.encode("UTF-8")] or [0])[0]

            commentcount = p.xpath('div[@class="postDesc"]/text()').re_first(u"评论\(\d+\)")  # "评论" = comment
            regCommentCount = re.search(r"\d+", commentcount)
            if regCommentCount is not None:
                commentcount = regCommentCount.group()
            item["commentCount"] = (commentcount is not None and [commentcount.encode("UTF-8")] or [0])[0]

            items.append(item)
        return items

Because package locations changed between Scrapy 1.0 and earlier versions, the differences relevant to this example are listed here:

Class          Scrapy 1.0              Previous versions
Spider         scrapy.spiders          scrapy.spider
CrawlSpider    scrapy.spiders          scrapy.contrib.spiders
LinkExtractor  scrapy.linkextractors   scrapy.contrib.linkextractors
Rule           scrapy.spiders          scrapy.contrib.spiders

The idea behind the crawler:

First open my blog list page, for example http://www.cnblogs.com/fengzheng/default.html?page=2. The URL shows that the site uses the page parameter to indicate the page number, so the crawling rule is very simple: fengzheng/default.html\?page\=([\d]+). Of course, if there are only a few pages, you could simply list every page to crawl in start_urls; but as the number of posts grows, a list like the following quickly becomes unmanageable:

start_urls = [
    "http://www.cnblogs.com/fengzheng/default.html?page=1",
    "http://www.cnblogs.com/fengzheng/default.html?page=2",
    "http://www.cnblogs.com/fengzheng/default.html?page=3",
]

When the pages to crawl follow a pattern like this, inheriting from the plain Spider class is no longer enough; the spider should inherit from CrawlSpider and use a rule definition (rules). When rules are used, if you want to process the matched pages rather than simply collect their URLs, you need to define a callback function that is called for every page matching the rule, and set follow=True. The definition is as follows:

rules = (
    Rule(LinkExtractor(allow=('fengzheng/default.html\?page\=([\d]+)', )),
         callback='parse_item', follow=True),
)

The callback function is named parse_item. In the parse_item method we analyze the HTML page and extract the required content. First observe the page to see where the needed information is located.

Then inspect the page source and work out the XPath expressions for the elements you need.


Use the following code to find all the divs whose class is day; each of them is one blog block:

posts = sel.xpath('//div[@id="mainContent"]/div/div[@class="day"]')
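If you want to try expressions like this before wiring them into the spider, the interactive Scrapy shell is convenient. A quick sketch, assuming the blog page is still reachable at the URL used above:

scrapy shell "http://www.cnblogs.com/fengzheng/default.html?page=3"
>>> response.xpath('//div[@id="mainContent"]/div/div[@class="day"]')
>>> response.xpath('//div[@class="postTitle"]/a/text()').extract()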

Next, traverse the set to obtain the required content. Pay attention to the following points:

  • The extracted text must be encoded with encode("UTF-8") because it contains Chinese characters.
  • The read count and comment count are embedded in one descriptive string, so the numbers have to be pulled out with a regular expression, as in the sketch below.
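To illustrate the second point on its own, here is a minimal sketch of the extraction. It assumes the postDesc text has the shape hinted at in the spider's comment ("posted @ 2015-11-03 风的姿态 阅读(n) 评论(n)"); the concrete values below are made up:

# -*- coding: utf-8 -*-
import re

# a hypothetical postDesc string, values invented for illustration
desc = u"posted @ 2015-11-03 风的姿态 阅读(296) 评论(1)"

read_match = re.search(u"阅读\((\d+)\)", desc)       # "阅读" means "read"
comment_match = re.search(u"评论\((\d+)\)", desc)    # "评论" means "comment"

read_count = read_match.group(1) if read_match else "0"
comment_count = comment_match.group(1) if comment_match else "0"

print read_count, comment_count   # prints: 296 1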

The simple crawler is now complete. To run it, cd into the project directory and execute the following command:

scrapy crawl cnblogsSpider

This prints the crawling process information to the console.
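As an aside, Scrapy can also export scraped items to JSON without a custom pipeline, through its built-in feed exports; assuming the -o option is available in your version, something like the following works:

scrapy crawl cnblogsSpider -o items_feed.json

In this article, though, the custom pipeline above is kept so that the writing logic stays explicit.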

Afterwards an item.json file (written by our pipeline) appears in the project root; cat its content and you can see that the information has been extracted.
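Each line of item.json is one serialized item. The exact content depends on the blog; a line will look roughly like the following (all values here are invented for illustration):

{"publishDate": "2015-11-03", "title": "Some post title", "readCount": "296", "commentCount": "1"}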


The full source code is available on GitHub.
