This chapter begins with a case study of the Python Scrapy framework. For more information, see: Python Learning Guide
Goals of the getting-started case study
- Create a Scrapy Project
- Define the structured data to extract (Item)
- Write a spider that crawls a site and extracts the structured data (Item)
- Write item pipelines to store the extracted items (that is, the structured data)
I. Create a new project (scrapy startproject)
- Before you begin a crawl, you must create a new Scrapy project. Go to your chosen project directory and run the following command:
scrapy startproject cnblogSpider
- Here cnblogSpider is the project name. You can see that a cnblogSpider folder is created with the following directory structure:
scrapy.cfg: the project deployment file
cnblogSpider/: the project's Python module; you will add your code here
cnblogSpider/items.py: the item file of the project
cnblogSpider/pipelines.py: the pipelines file of the project
cnblogSpider/settings.py: the settings file of the project
cnblogSpider/spiders/: the directory where the spider code is placed
II. Define the target (cnblogSpider/items.py)
We intend to crawl the blog address, title, creation time, and text from the site "http://www.cnblogs.com/miqi1992/default.html?page=2".
Open items.py in the cnblogSpider directory.
An Item defines the structured data fields used to hold the crawled data. It behaves somewhat like a Python dict, but provides some additional protection against errors.
You define an item by creating a subclass of scrapy.Item and declaring class attributes of type scrapy.Field (this can be understood as a mapping similar to an ORM model).
Next, create a CnblogspiderItem class to model the item:
```python
import scrapy

class CnblogspiderItem(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()
    time = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```
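To see what the "additional protection" means in practice: a scrapy.Item only accepts keys that were declared as Field attributes, and assigning to any other key raises a KeyError. Below is a minimal pure-Python sketch of that behavior (a simplified stand-in for illustration, not Scrapy's actual implementation):

```python
class FieldCheckedItem(dict):
    """Dict that only accepts keys declared in `fields`, like scrapy.Item."""
    fields = ('url', 'time', 'title', 'content')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (self.__class__.__name__, key))
        dict.__setitem__(self, key, value)

item = FieldCheckedItem()
item['title'] = 'First post'      # declared field: accepted
print(item['title'])              # -> First post
try:
    item['author'] = 'nobody'     # undeclared field: rejected
except KeyError as e:
    print('rejected:', e)
```

This is the error protection the dict comparison refers to: a plain dict would silently accept the misspelled or undeclared key.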
III. Make the crawler (spiders/cnblogsspider.py)
The crawler work is divided into two main steps:
1. Fetching the data
Enter the following command in the project directory to create a crawler named cnblog under cnblogSpider/spiders and set the domain scope of the crawl to "cnblogs.com":
scrapy genspider cnblog "cnblogs.com"
Open cnblog.py under the cnblogSpider/spiders directory; the following code has been added by default:
```python
# -*- coding: utf-8 -*-
import scrapy

class CnblogSpider(scrapy.Spider):
    name = 'cnblog'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass
```
In fact, we could also create cnblog.py ourselves and write the above code by hand, but using the command saves the trouble of writing this fixed boilerplate.
To build a spider, you must create a subclass of scrapy.Spider and define three mandatory attributes and one method.
name = ""
: The name of the crawler. It must be unique; different crawlers must define different names.
allowed_domains = []
: The domain scope of the crawl, that is, the crawler's constrained area. The crawler only crawls pages under these domain names; URLs outside them are ignored.
start_urls = ()
: The tuple/list of URLs to crawl. The crawler starts crawling data from here, so the first data downloaded will come from these URLs; other sub-URLs are derived from these starting URLs.
parse(self, response)
: The parsing method. It is called once each initial URL finishes downloading, and receives the response object returned from that URL as its only argument. Its main jobs are:
- Parsing the returned page data (response.body) and extracting structured data (generating items)
- Generating the URL requests for the next pages to crawl
Change the value of start_urls to the first URL that needs to be crawled (note: a one-element tuple would require a trailing comma, so a list is safer):
start_urls = ["http://www.cnblogs.com/miqi1992/default.html?page=2"]
Modify the parse() method:
def parse(self, response):
    filename = "cnblog.html"
    with open(filename, 'w') as f:
        f.write(response.body)
Then run it and take a look; in the cnblogSpider directory, run:
scrapy crawl cnblog
Yes, it is cnblog: look at the code above, it is the name attribute of the CnblogSpider class, the unique crawler name given to the scrapy genspider command.
After running, if the printed log shows [scrapy] INFO: Spider closed (finished), the run completed. A cnblog.html file then appears in the current folder, containing the full source code of the page we just crawled.
# Note: the default encoding in Python 2.x is ASCII; when it does not match the encoding of the fetched data, the saved content may be garbled.
# We can specify the encoding used to save the content. Generally, we add the following at the top of the code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# These three lines were the universal key to Chinese encoding problems in Python 2.x. After years of complaints, Python 3 learned its lesson: the default encoding is now Unicode.
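Under Python 3 the problem takes a different form: response.body is a bytes object, so the file should be opened in binary mode and decoded explicitly when read back. A minimal sketch (the byte string below stands in for response.body; no Scrapy needed):

```python
# stand-in for response.body, which is bytes in Python 3
body = "中文内容 Chinese content".encode("utf-8")

# binary mode writes the bytes exactly as received, with no re-encoding
with open("cnblog.html", "wb") as f:
    f.write(body)

# decode explicitly when reading the file back as text
with open("cnblog.html", encoding="utf-8") as f:
    print(f.read())  # -> 中文内容 Chinese content
```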
2. Extracting the data
- Having fetched the entire page, the next step is extraction. First, observe the page source code:
<div class="day">
    <div class="dayTitle"> ... </div>
    <div class="postTitle"> ... </div>
    <div class="postCon"> ... </div>
</div>
- The XPath expressions are as follows:
- All articles: .//*[@class='day']
- Article date: .//*[@class='dayTitle']/a/text()
- Article title: .//*[@class='postTitle']/a/text()
- Article summary: .//*[@class='postCon']/div/text()
- Article link: .//*[@class='postTitle']/a/@href
Clear at a glance? Now start extracting the data directly with these XPath expressions.
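You can sanity-check expressions like these on a miniature copy of the page structure before wiring them into the spider. The sketch below uses the standard library's xml.etree.ElementTree, whose path syntax covers the attribute predicates used here (Scrapy's own selectors additionally support the text() and @href steps, so .text and .get() stand in for them); the sample markup is hypothetical:

```python
import xml.etree.ElementTree as ET

# hypothetical miniature of the page structure shown above
html = """
<body>
  <div class="day">
    <div class="dayTitle"><a>2018/05</a></div>
    <div class="postTitle"><a href="http://example.com/post1">First post</a></div>
    <div class="postCon"><div>Summary of the first post</div></div>
  </div>
</body>
"""

root = ET.fromstring(html)
for day in root.findall(".//*[@class='day']"):               # all articles
    date = day.find(".//*[@class='dayTitle']/a").text         # text() step -> .text
    title = day.find(".//*[@class='postTitle']/a").text
    link = day.find(".//*[@class='postTitle']/a").get("href") # @href step -> .get()
    summary = day.find(".//*[@class='postCon']/div").text
    print(date, title, link, summary)
```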
We previously defined a CnblogspiderItem class in cnblogSpider/items.py. Import it here:
from cnblogSpider.items import CnblogspiderItem
We then encapsulate the data we extract into CnblogspiderItem objects, each holding the properties of one blog post:
```python
from cnblogSpider.items import CnblogspiderItem

def parse(self, response):
    # print(response.body)
    # filename = "cnblog.html"
    # with open(filename, 'w') as f:
    #     f.write(response.body)

    # collection holding the blog items
    items = []
    for each in response.xpath(".//*[@class='day']"):
        item = CnblogspiderItem()
        url = each.xpath('.//*[@class="postTitle"]/a/@href').extract()[0]
        title = each.xpath('.//*[@class="postTitle"]/a/text()').extract()[0]
        time = each.xpath('.//*[@class="dayTitle"]/a/text()').extract()[0]
        content = each.xpath('.//*[@class="postCon"]/div/text()').extract()[0]
        item['url'] = url
        item['title'] = title
        item['time'] = time
        item['content'] = content
        items.append(item)
    # return all the collected data at once
    return items
```
- We will not deal with the pipelines for the time being; they are described in detail later.
Save the data
Scrapy offers four main simple ways to save the scraped information, using -o to output a file in the specified format; the commands are as follows:
# JSON format, Unicode-encoded by default
scrapy crawl cnblog -o cnblog.json
# JSON Lines format, Unicode-encoded by default
scrapy crawl cnblog -o cnblog.jsonl
# CSV (comma-separated), can be opened with Excel
scrapy crawl cnblog -o cnblog.csv
# XML format
scrapy crawl cnblog -o cnblog.xml
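In the JSON Lines export, each line is one item serialized as a JSON object, which makes the file easy to process record by record. A quick sketch with a hypothetical line (the field names match the item defined above; real values come from the crawl):

```python
import json

# hypothetical line from cnblog.jsonl
line = ('{"url": "http://example.com/post1", "title": "First post", '
        '"time": "2018/05", "content": "Summary"}')
record = json.loads(line)   # each line parses independently
print(record["title"])      # -> First post
```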
Consider changing the code to the following form; the result is exactly the same. Think about the effect of yield here:

from cnblogSpider.items import CnblogspiderItem

def parse(self, response):
    # print(response.body)
    # filename = "cnblog.html"
    # with open(filename, 'w') as f:
    #     f.write(response.body)

    # items = []  # the collection of blog posts is no longer needed
    for each in response.xpath(".//*[@class='day']"):
        item = CnblogspiderItem()
        item['url'] = each.xpath('.//*[@class="postTitle"]/a/@href').extract()[0]
        item['title'] = each.xpath('.//*[@class="postTitle"]/a/text()').extract()[0]
        item['time'] = each.xpath('.//*[@class="dayTitle"]/a/text()').extract()[0]
        item['content'] = each.xpath('.//*[@class="postCon"]/div/text()').extract()[0]
        # items.append(item)
        # hand each piece of extracted data to the pipelines as it is produced
        yield item
    # return items  # would return the final data directly, not through the pipelines
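The effect of yield can be seen without Scrapy at all: a function containing yield becomes a generator, so parse hands items over one at a time instead of building the whole list in memory first. A minimal sketch:

```python
def parse_with_return():
    items = []
    for i in range(3):
        items.append({"n": i})   # everything is collected before anything is returned
    return items

def parse_with_yield():
    for i in range(3):
        yield {"n": i}           # each item is handed over immediately

# same data either way; the generator just delivers it lazily
print(parse_with_return())        # -> [{'n': 0}, {'n': 1}, {'n': 2}]
print(list(parse_with_yield()))   # -> [{'n': 0}, {'n': 1}, {'n': 2}]
```

This laziness is why yield scales better when a page yields many items: the framework can process each one as soon as it is produced.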