This chapter begins with a case study of the Python Scrapy framework. For more information, see: Python Learning Guide
Goals of the getting-started case study
- Create a Scrapy Project
- Define the structured data to extract (Item)
- Write a spider that crawls a site and extracts the structured data (Item)
- Write item pipelines to store the extracted items (that is, the structured data)
I. Create a new project (scrapy startproject)
- Before you begin a crawl, you must create a new Scrapy project. Go to your chosen project directory and run the following command:
scrapy startproject cnblogSpider
- Here cnblogSpider is the project name. You can see that a cnblogSpider folder is created with the following directory structure:
scrapy.cfg: the project deployment file
cnblogSpider/: the project's Python module; you will add your code here
cnblogSpider/items.py: the item file of the project
cnblogSpider/pipelines.py: the pipelines file of the project
cnblogSpider/settings.py: the settings file of the project
cnblogSpider/spiders/: the directory where the spider code is placed
II. Define the target (cnblogSpider/items.py)
We intend to crawl the blog address, title, creation time, and text from the site "http://www.cnblogs.com/miqi1992/default.html?page=2".
Open items.py in the cnblogSpider directory.
An Item defines the structured data fields used to hold the crawled data. It behaves somewhat like a Python dict, but provides some additional protection against errors.
You define an item by creating a subclass of scrapy.Item and declaring class attributes of type scrapy.Field (this can be understood as a mapping similar to an ORM model).
Next, create a CnblogspiderItem class to model the item:
```python
import scrapy

class CnblogspiderItem(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()
    time = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```
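To see what the "additional protection" means in practice: a scrapy.Item only accepts keys that were declared as Field attributes, and assigning to any other key raises a KeyError. Below is a minimal pure-Python sketch of that behavior (a simplified stand-in for illustration, not Scrapy's actual implementation):

```python
class FieldCheckedItem(dict):
    """Dict that only accepts keys declared in `fields`, like scrapy.Item."""
    fields = ('url', 'time', 'title', 'content')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (self.__class__.__name__, key))
        dict.__setitem__(self, key, value)

item = FieldCheckedItem()
item['title'] = 'First post'      # declared field: accepted
print(item['title'])              # -> First post
try:
    item['author'] = 'nobody'     # undeclared field: rejected
except KeyError as e:
    print('rejected:', e)
```

This is the error protection the dict comparison refers to: a plain dict would silently accept the misspelled or undeclared key.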
III. Make the crawler (spiders/cnblogsspider.py)
The crawler work is divided into two main steps:
1. Fetching the data
Enter the following command in the project directory to create a crawler named cnblog under cnblogSpider/spiders and set the domain scope of the crawl to "cnblogs.com":
scrapy genspider cnblog "cnblogs.com"
Open cnblog.py under the cnblogSpider/spiders directory; the following code has been added by default:
```python
# -*- coding: utf-8 -*-
import scrapy

class CnblogSpider(scrapy.Spider):
    name = 'cnblog'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass
```
In fact, we could also create cnblog.py ourselves and write the above code by hand, but using the command saves the trouble of writing this fixed boilerplate.
To build a spider, you must create a subclass of scrapy.Spider and define three mandatory attributes and one method.
name = ""
: The name of the crawler. It must be unique; different crawlers must define different names.
allowed_domains = []
: The domain scope of the crawl, that is, the crawler's constrained area. The crawler only crawls pages under these domain names; URLs outside them are ignored.
start_urls = ()
: The tuple/list of URLs to crawl. The crawler starts crawling data from here, so the first data downloaded will come from these URLs; other sub-URLs are derived from these starting URLs.
parse(self, response)
: The parsing method. It is called once each initial URL finishes downloading, and receives the response object returned from that URL as its only argument. Its main jobs are:
- Parsing the returned page data (response.body) and extracting structured data (generating items)
- Generating the URL requests for the next pages to crawl
Change the value of start_urls to the first URL that needs to be crawled (note: a one-element tuple would require a trailing comma, so a list is safer):
start_urls = ["http://www.cnblogs.com/miqi1992/default.html?page=2"]
Modify the parse() method:
def parse(self, response):
    filename = "cnblog.html"
    with open(filename, 'w') as f:
        f.write(response.body)
Then run it and take a look; in the cnblogSpider directory, run:
scrapy crawl cnblog
Yes, it is cnblog: look at the code above, it is the name attribute of the CnblogSpider class, the unique crawler name given to the scrapy genspider command.
After running, if the printed log shows [scrapy] INFO: Spider closed (finished), the run completed. A cnblog.html file then appears in the current folder, containing the full source code of the page we just crawled.
# Note: the default encoding in Python 2.x is ASCII; when it does not match the encoding of the fetched data, the saved content may be garbled.
# We can specify the encoding used to save the content. Generally, we add the following at the top of the code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# These three lines were the universal key to Chinese encoding problems in Python 2.x. After years of complaints, Python 3 learned its lesson: the default encoding is now Unicode.
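Under Python 3 the problem takes a different form: response.body is a bytes object, so the file should be opened in binary mode and decoded explicitly when read back. A minimal sketch (the byte string below stands in for response.body; no Scrapy needed):

```python
# stand-in for response.body, which is bytes in Python 3
body = "中文内容 Chinese content".encode("utf-8")

# binary mode writes the bytes exactly as received, with no re-encoding
with open("cnblog.html", "wb") as f:
    f.write(body)

# decode explicitly when reading the file back as text
with open("cnblog.html", encoding="utf-8") as f:
    print(f.read())  # -> 中文内容 Chinese content
```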
2. Extracting the data
- Having fetched the entire page, the next step is extraction. First, observe the page source code:
<div class="day">
    <div class="dayTitle"> ... </div>
    <div class="postTitle"> ... </div>
    <div class="postCon"> ... </div>
</div>
- The XPath expressions are as follows:
- All articles: .//*[@class='day']
- Article date: .//*[@class='dayTitle']/a/text()
- Article title: .//*[@class='postTitle']/a/text()
- Article summary: .//*[@class='postCon']/div/text()
- Article link: .//*[@class='postTitle']/a/@href
Clear at a glance? Now start extracting the data directly with these XPath expressions.
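You can sanity-check expressions like these on a miniature copy of the page structure before wiring them into the spider. The sketch below uses the standard library's xml.etree.ElementTree, whose path syntax covers the attribute predicates used here (Scrapy's own selectors additionally support the text() and @href steps, so .text and .get() stand in for them); the sample markup is hypothetical:

```python
import xml.etree.ElementTree as ET

# hypothetical miniature of the page structure shown above
html = """
<body>
  <div class="day">
    <div class="dayTitle"><a>2018/05</a></div>
    <div class="postTitle"><a href="http://example.com/post1">First post</a></div>
    <div class="postCon"><div>Summary of the first post</div></div>
  </div>
</body>
"""

root = ET.fromstring(html)
for day in root.findall(".//*[@class='day']"):               # all articles
    date = day.find(".//*[@class='dayTitle']/a").text         # text() step -> .text
    title = day.find(".//*[@class='postTitle']/a").text
    link = day.find(".//*[@class='postTitle']/a").get("href") # @href step -> .get()
    summary = day.find(".//*[@class='postCon']/div").text
    print(date, title, link, summary)
```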
We previously defined a CnblogspiderItem class in cnblogSpider/items.py. Import it here:
from cnblogSpider.items import CnblogspiderItem
We then encapsulate the data we extract into CnblogspiderItem objects, each holding the properties of one blog post:
```python
from cnblogSpider.items import CnblogspiderItem

def parse(self, response):
    # print(response.body)
    # filename = "cnblog.html"
    # with open(filename, 'w') as f:
    #     f.write(response.body)

    # collection holding the blog items
    items = []
    for each in response.xpath(".//*[@class='day']"):
        item = CnblogspiderItem()
        url = each.xpath('.//*[@class="postTitle"]/a/@href').extract()[0]
        title = each.xpath('.//*[@class="postTitle"]/a/text()').extract()[0]
        time = each.xpath('.//*[@class="dayTitle"]/a/text()').extract()[0]
        content = each.xpath('.//*[@class="postCon"]/div/text()').extract()[0]
        item['url'] = url
        item['title'] = title
        item['time'] = time
        item['content'] = content
        items.append(item)
    # return all the collected data at once
    return items
```
- We will not deal with the pipelines for the time being; they are described in detail later.
Save the data
Scrapy offers four main simple ways to save the scraped information, using -o to output a file in the specified format; the commands are as follows:
# JSON format, Unicode-encoded by default
scrapy crawl cnblog -o cnblog.json
# JSON Lines format, Unicode-encoded by default
scrapy crawl cnblog -o cnblog.jsonl
# CSV (comma-separated), can be opened with Excel
scrapy crawl cnblog -o cnblog.csv
# XML format
scrapy crawl cnblog -o cnblog.xml
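In the JSON Lines export, each line is one item serialized as a JSON object, which makes the file easy to process record by record. A quick sketch with a hypothetical line (the field names match the item defined above; real values come from the crawl):

```python
import json

# hypothetical line from cnblog.jsonl
line = ('{"url": "http://example.com/post1", "title": "First post", '
        '"time": "2018/05", "content": "Summary"}')
record = json.loads(line)   # each line parses independently
print(record["title"])      # -> First post
```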
Consider changing the code to the following form; the result is exactly the same. Think about the effect of yield here:

from cnblogSpider.items import CnblogspiderItem

def parse(self, response):
    # print(response.body)
    # filename = "cnblog.html"
    # with open(filename, 'w') as f:
    #     f.write(response.body)

    # items = []  # the collection of blog posts is no longer needed
    for each in response.xpath(".//*[@class='day']"):
        item = CnblogspiderItem()
        item['url'] = each.xpath('.//*[@class="postTitle"]/a/@href').extract()[0]
        item['title'] = each.xpath('.//*[@class="postTitle"]/a/text()').extract()[0]
        item['time'] = each.xpath('.//*[@class="dayTitle"]/a/text()').extract()[0]
        item['content'] = each.xpath('.//*[@class="postCon"]/div/text()').extract()[0]
        # items.append(item)
        # hand each piece of extracted data to the pipelines as it is produced
        yield item
    # return items  # would return the final data directly, not through the pipelines
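The effect of yield can be seen without Scrapy at all: a function containing yield becomes a generator, so parse hands items over one at a time instead of building the whole list in memory first. A minimal sketch:

```python
def parse_with_return():
    items = []
    for i in range(3):
        items.append({"n": i})   # everything is collected before anything is returned
    return items

def parse_with_yield():
    for i in range(3):
        yield {"n": i}           # each item is handed over immediately

# same data either way; the generator just delivers it lazily
print(parse_with_return())        # -> [{'n': 0}, {'n': 1}, {'n': 2}]
print(list(parse_with_yield()))   # -> [{'n': 0}, {'n': 1}, {'n': 2}]
```

This laziness is why yield scales better when a page yields many items: the framework can process each one as soon as it is produced.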