Python Crawler with Scrapy (Part 2): Getting Started Case

Source: Internet
Author: User
Tags: xpath, python, scrapy

This chapter works through a getting-started case for the Python Scrapy framework. For more information, see the Python Learning Guide.

Goals of the getting-started case study
    • Create a Scrapy project
    • Define the structured data to extract (Item)
    • Write a spider to crawl the site and extract the structured data (Items)
    • Write item pipelines to store the extracted Items (i.e., the structured data)
I. Creating a New Project (scrapy startproject)
    • Before you start crawling, you must create a new Scrapy project. Change into the directory where you want the project to live and run the following command:
scrapy startproject cnblogSpider
    • Here cnblogSpider is the project name. You will see that a cnblogSpider folder is created with the following directory structure:

scrapy.cfg: the project's deployment configuration file
cnblogSpider/: the project's Python module; you will add your code here
cnblogSpider/items.py: the Item definitions for the project
cnblogSpider/pipelines.py: the pipelines file for the project
cnblogSpider/settings.py: the settings file for the project
cnblogSpider/spiders/: the directory where spider code is placed

II. Defining Clear Objectives (cnblogSpider/items.py)

We intend to crawl the blog URL, title, creation time, and body text from the site "http://www.cnblogs.com/miqi1992/default.html?page=2".

    1. Open items.py in the cnblogSpider directory.

    2. Item defines structured data fields used to hold the crawled data. It behaves somewhat like a Python dict, but provides some additional protection against errors.

    3. You define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field (this can be understood as a mapping, much like an ORM).

    4. Next, create a CnblogspiderItem class and build the item model:
      "' Python
      Import Scrapy

Class Cnblogspideritem (Scrapy. Item):
# define the fields for your item here is like:
url = scrapy. Field ()
Time = Scrapy. Field ()
title = Scrapy. Field ()
Content = Scrapy. Field ()
```
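As a quick aside, the "additional protection" mentioned above mainly means that an Item only accepts the fields you declared. A minimal sketch, not part of the original tutorial (values are made up, and the import path assumes the module name created by scrapy startproject):

```python
# Minimal sketch; values are made up.
from cnblogSpider.items import CnblogspiderItem  # assumed module path

item = CnblogspiderItem()
item['title'] = 'Hello Scrapy'   # fine: 'title' is a declared Field
print(item['title'])             # -> Hello Scrapy
print(dict(item))                # -> {'title': 'Hello Scrapy'}

item['author'] = 'someone'       # raises KeyError: 'author' was never declared
```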

III. Making the Spider (spiders/cnblogsspider.py)

The crawler's work is divided into two main steps:

1. Crawling the data
    • In the project directory, enter the following command to create a spider named cnblog under the cnblogSpider/spiders directory and set the domain scope of the crawl:

      scrapy genspider cnblog "cnblogs.com"
    • Open cnblog.py under the cnblogSpider/spiders directory; the following code has already been added by default:
      "' Python

      --Coding:utf-8--

      Import Scrapy

Class Cnblogspider (Scrapy. Spider):
name = ' Cnblog '
Allowed_domains = [' cnblogs.com ']
Start_urls = [' http://cnblogs.com/']

def parse(self, response):    pass

```

In fact, we could also create cnblog.py by hand and write the code above ourselves; using the command simply saves us the trouble of typing this boilerplate.

To build a spider, you subclass scrapy.Spider and define three mandatory attributes and one method.

    • name = "": The name of the crawler must be unique and different names must be defined in different crawlers.
    • allow_domains=[]: is the domain name range of the search, that is, the crawler's constrained area, which specifies that the crawler crawl only the page under this domain name, the non-existent URL will be ignored.
    • start_urls=(): Crawled URL tuple/list. The crawler starts crawling data from here, so the first downloaded data will start with these URLs. Other sub-URLs will be generated from these starting URLs for inheritance.
    • parse(self, response): Parsing method, each initial URL completes the download will be called, when the call passed from each URL returned to the response object as the only parameter, the main role is as follows:
      1. Responsible for parsing the returned web page data (respose.body), extracting structured data (Generate item)
      2. Generate a URL request that requires the next page
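For the second role, a hedged sketch of what following a "next page" link could look like is shown below. The XPath for the next-page link is an assumption, since the tutorial does not describe the pagination markup:

```python
import scrapy

class CnblogSpider(scrapy.Spider):
    name = 'cnblog'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://www.cnblogs.com/miqi1992/default.html?page=2']

    def parse(self, response):
        # ... extract items from the response here (shown later in this chapter) ...

        # hypothetical selector for a "next page" link; the real expression
        # depends on the page's actual markup
        next_page = response.xpath('//a[@id="nav_next_page"]/@href').extract_first()
        if next_page:
            # schedule another request; Scrapy will call parse() on its response too
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```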

Change the value of start_urls to the first URL that needs to be crawled:

start_urls=("http://www.cnblogs.com/miqi1992/default.html?page=2")

Modify the parse() method:

    def parse(self, response):
        filename = "cnblog.html"
        with open(filename, 'w') as f:
            f.write(response.body)

Then run it and see; in the cnblogSpider directory, execute:

scrapy crawl cnblog

Yes, it is cnblog. Looking at the code above, it is the name attribute of the CnblogSpider class, i.e. the unique spider name we gave to the scrapy genspider command.

After the run, if the printed log contains [scrapy] INFO: Spider closed (finished), execution has completed. A cnblog.html file then appears in the current folder, containing the full source code of the page we just crawled.

    # Note: Python 2.x's default encoding is ASCII; when it does not match the encoding of
    # the retrieved data, the result may be garbled text.
    # We can specify the encoding used when saving the content. Usually we add the following
    # at the very top of the code:
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    # These three lines are the universal fix for Chinese encoding problems in Python 2.x.
    # After years of complaints, Python 3 finally learned its lesson: the default encoding is Unicode.
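Under Python 3, which current Scrapy releases target, this trick is unnecessary (and reload(sys) no longer works there). In Python 3, response.body is a bytes object, so a hedged Python 3 version of the save step would be:

```python
def parse(self, response):
    # Python 3: response.body is bytes, so open the file in binary mode
    with open("cnblog.html", "wb") as f:
        f.write(response.body)
    # alternatively, response.text is the already-decoded text of the page:
    # with open("cnblog.html", "w", encoding="utf-8") as f:
    #     f.write(response.text)
```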
2. Extracting the data
    • Now that the whole page has been crawled, the next step is extraction. First, look at the page source code:
    <div class="day">
        <div class="dayTitle">...</div>
        <div class="postTitle">...</div>
        <div class="postCon">...</div>
    </div>
    • The XPath expressions are as follows:
      • All articles: .//*[@class='day']
      • Article publication date: .//*[@class='dayTitle']/a/text()
      • Article title: .//*[@class='postTitle']/a/text()
      • Article summary: .//*[@class='postCon']/div/text()
      • Article link: .//*[@class='postTitle']/a/@href

Clear at a glance? Then let's extract the data with these XPath expressions (you can also try them out first in the Scrapy shell, as sketched below).
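One convenient way to check such expressions before writing them into the spider is Scrapy's interactive shell. A short, hedged example session (output omitted):

```
scrapy shell "http://www.cnblogs.com/miqi1992/default.html?page=2"

# then, at the shell's Python prompt:
>>> response.xpath(".//*[@class='day']")                               # the day blocks
>>> response.xpath(".//*[@class='postTitle']/a/text()").extract()      # all post titles
>>> response.xpath(".//*[@class='postTitle']/a/@href").extract_first() # first post link
```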

    • We already defined a CnblogspiderItem class in cnblogSpider/items.py; we import it here:

      from cnblogSpider.items import CnblogspiderItem
    • We then wrap the data we extract in CnblogspiderItem objects, each of which holds the properties of one blog post:

```python
from cnblogSpider.items import CnblogspiderItem

def parse(self, response):
    # print(response.body)
    # filename = "cnblog.html"
    # with open(filename, 'w') as f:
    #     f.write(response.body)

    # collection that holds the blog items
    items = []
    for each in response.xpath(".//*[@class='day']"):
        item = CnblogspiderItem()
        url = each.xpath('.//*[@class="postTitle"]/a/@href').extract()[0]
        title = each.xpath('.//*[@class="postTitle"]/a/text()').extract()[0]
        time = each.xpath('.//*[@class="dayTitle"]/a/text()').extract()[0]
        content = each.xpath('.//*[@class="postCon"]/div/text()').extract()[0]
        item['url'] = url
        item['title'] = title
        item['time'] = time
        item['content'] = content
        items.append(item)
    # return all the data at once
    return items
```

    • We will not wire up the pipeline for now; it is described in detail later (a small orientation sketch follows).
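Purely for orientation (the tutorial's own pipeline comes later), here is a minimal hedged sketch of what a pipeline in pipelines.py could look like, simply appending each item to a JSON Lines file. The module path in the ITEM_PIPELINES setting is an assumption based on the project's module name:

```python
# cnblogSpider/pipelines.py -- a minimal sketch, not the tutorial's final version
import json

class CnblogspiderPipeline(object):
    def open_spider(self, spider):
        self.file = open('cnblog_items.jl', 'w')

    def process_item(self, item, spider):
        # every item yielded by the spider passes through here
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

# enable it in settings.py (the number is the pipeline's priority, 0-1000):
# ITEM_PIPELINES = {'cnblogSpider.pipelines.CnblogspiderPipeline': 300}
```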
IV. Saving the Data

The simplest way to save the scraped information with Scrapy is one of four output formats: the -o option writes a file in the specified format. The commands are as follows:

    # JSON format, Unicode-encoded by default
    scrapy crawl cnblog -o cnblog.json
    # JSON Lines format, Unicode-encoded by default
    scrapy crawl cnblog -o cnblog.jsonl
    # CSV (comma-separated values), can be opened with Excel
    scrapy crawl cnblog -o cnblog.csv
    # XML format
    scrapy crawl cnblog -o cnblog.xml
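One practical note: the JSON exports above escape non-ASCII text by default. On reasonably recent Scrapy versions you can make the feed exporter write readable UTF-8 instead by adding one setting to settings.py; a hedged example:

```python
# cnblogSpider/settings.py
# make -o exports (JSON, CSV, ...) write readable UTF-8 instead of \uXXXX escapes
FEED_EXPORT_ENCODING = 'utf-8'
```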
Consider: if you change the code to the following form, the result is exactly the same. Think about what yield does here:
    from cnblogSpider.items import CnblogspiderItem

    def parse(self, response):
        # print(response.body)
        # filename = "cnblog.html"
        # with open(filename, 'w') as f:
        #     f.write(response.body)

        # collection that held the blog items
        # items = []
        for each in response.xpath(".//*[@class='day']"):
            item = CnblogspiderItem()
            url = each.xpath('.//*[@class="postTitle"]/a/@href').extract()[0]
            title = each.xpath('.//*[@class="postTitle"]/a/text()').extract()[0]
            time = each.xpath('.//*[@class="dayTitle"]/a/text()').extract()[0]
            content = each.xpath('.//*[@class="postCon"]/div/text()').extract()[0]
            item['url'] = url
            item['title'] = title
            item['time'] = time
            item['content'] = content
            # items.append(item)
            # hand the extracted data over to the pipelines
            yield item
        # return all the data at once, bypassing the pipelines
        # return items
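What yield changes, in short: parse() becomes a generator, so Scrapy can pull one item at a time and push each through the item pipelines as soon as it is produced, instead of waiting for the whole list to be built. A tiny standalone sketch of the same idea (nothing Scrapy-specific):

```python
def build_list():
    results = []
    for i in range(3):
        results.append(i * i)
    return results            # nothing is returned until the whole list exists

def build_lazily():
    for i in range(3):
        yield i * i           # each value is handed out as soon as it is ready

print(build_list())            # [0, 1, 4]
print(list(build_lazily()))    # [0, 1, 4] -- same values, produced one at a time
```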

