After installing Scrapy, I believe everyone is tempted to write their own crawler. I'm no exception, so here is a detailed record of the steps needed to set up a Scrapy project. If you have not installed Scrapy yet, or the installation gave you a headache, you can refer to the previous article, "Installing the Python crawler Scrapy: the pits I stepped in and thoughts beyond programming". Let's take the cnblogs blog site as an example: crawl the blog list and save it to a JSON file.
Environment: CentOS 6.0 Virtual machine
Scrapy (if it is not installed yet, refer to "Installing the Python crawler Scrapy: the pits I stepped in and thoughts beyond programming")
1. Create the project cnblogs
# scrapy startproject cnblogs
[scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {}
New Scrapy project 'cnblogs' created in:
    /mnt/hgfs/share/cnblogs

You can start your first spider with:
    cd cnblogs
    scrapy genspider example example.com
2. View the structure of the project
# tree cnblogs
cnblogs/
├── cnblogs
│   ├── __init__.py
│   ├── items.py        # defines the structure of the data to extract from pages
│   ├── pipelines.py    # processes the extracted data
│   ├── settings.py     # crawler settings file
│   └── spiders
│       └── __init__.py
└── scrapy.cfg          # project configuration file
3. Define the structure of the data to extract from cnblogs pages by modifying items.py
Here we extract four fields:
- Article title
- Article link
- URL of the list page where the post appears
- Summary
# vi cnblogs/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    listURL = scrapy.Field()
    pass
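Scrapy Items behave much like Python dictionaries once their fields are declared. A quick illustrative sketch (run from the project root; the values are made up):

from cnblogs.items import CnblogsItem

item = CnblogsItem()
item['title'] = 'Some post title'                                   # hypothetical value
item['link'] = 'http://www.cnblogs.com/rwxwsblog/p/0000000.html'    # hypothetical value
print dict(item)    # Items convert cleanly to plain dicts, which the JSON pipeline later relies on

Assigning to an undeclared key (for example item['author']) raises a KeyError, which is exactly why the fields are declared up front.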
4. Create Spider
# vi cnblogs/spiders/cnblogs_spider.py

#coding=utf-8
import re
import json
from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
from cnblogs.items import *


class CnblogsSpider(CrawlSpider):
    # define the name of the spider
    name = "CnblogsSpider"
    # define the allowed domains; URLs outside this list are discarded
    allowed_domains = ["cnblogs.com"]
    # define the entry URL of the crawl
    start_urls = [
        "http://www.cnblogs.com/rwxwsblog/default.html?page=1"
    ]
    # define the rules for following URLs and specify parse_item as the callback
    rules = [
        Rule(sle(allow=("/rwxwsblog/default.html\?page=\d{1,}")),
             # note: the ? has to be escaped when copying the URL into the pattern
             follow=True,
             callback='parse_item')
    ]
    # print "**********CnblogsSpider**********"

    # the callback: extract the data into items, mainly with XPath and CSS selectors
    def parse_item(self, response):
        # print "-----------------"
        items = []
        sel = Selector(response)
        base_url = get_base_url(response)
        postTitle = sel.css('div.day div.postTitle')
        # print "=============length======="
        postCon = sel.css('div.postCon div.c_b_p_desc')
        # the title, url and summary share a rather loose structure; this can be improved later
        for index in range(len(postTitle)):
            item = CnblogsItem()
            item['title'] = postTitle[index].css("a").xpath('text()').extract()[0]
            # print item['title'] + "***************\r\n"
            item['link'] = postTitle[index].css('a').xpath('@href').extract()[0]
            item['listURL'] = base_url
            item['desc'] = postCon[index].xpath('text()').extract()[0]
            # print base_url + "********\n"
            items.append(item)
            # print repr(item).decode("unicode-escape") + '\n'
        return items
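A side note on the escaped ? in the rule above: the allow pattern is a regular expression, so an unescaped ? would act as a quantifier (making the preceding 'l' optional) instead of matching the literal ? in the URL. A small standalone sketch, using an example URL, that shows the difference:

import re

url = "http://www.cnblogs.com/rwxwsblog/default.html?page=2"    # example list-page URL
escaped = r"/rwxwsblog/default.html\?page=\d{1,}"               # the pattern used in the rule
unescaped = r"/rwxwsblog/default.html?page=\d{1,}"              # ? not escaped: 'l' becomes optional

print re.search(escaped, url) is not None      # True: the literal ? in the URL is matched
print re.search(unescaped, url) is not None    # False: the pattern no longer matches the real URL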
Attention:
The first line must be set to #coding=utf-8 or # -*- coding: utf-8 -*-, otherwise you will get an error:
Non-ASCII character '\xe5' in file ..., but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
The name of the spider is CnblogsSpider; it will be used later.
5. Modify the pipelines.py file
# vi cnblogs/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import signals
import json
import codecs


class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()
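Note in passing: as written, spider_closed is just an ordinary method; Scrapy will only call it if it is connected to the spider_closed signal (or if it is renamed close_spider, which Scrapy calls automatically when the spider finishes). A hedged sketch of one way to wire it up, added here for illustration and not part of the original code; it would go inside the pipeline class:

    @classmethod
    def from_crawler(cls, crawler):
        # connect spider_closed to the corresponding Scrapy signal
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline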
Note that the class name is JsonWithEncodingCnblogsPipeline; it will be referenced in settings.py.
6. Modify settings.py and add the following two configuration items
ITEM_PIPELINES = {
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,   # 300 is a typical priority value (0-1000)
}
LOG_LEVEL = 'INFO'
7. Run the spider: scrapy crawl <spider name> (the name defined in cnblogs_spider.py)
# scrapy crawl CnblogsSpider
8. View the results with more cnblogs.json (the file name defined in pipelines.py)
# more cnblogs.json
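The pipeline writes one JSON object per item on its own line (json.dumps(dict(item), ensure_ascii=False)), so each line of cnblogs.json looks roughly like this (the values below are made up for illustration):

{"title": "Some post title", "link": "http://www.cnblogs.com/rwxwsblog/p/0000000.html", "listURL": "http://www.cnblogs.com/rwxwsblog/default.html?page=1", "desc": "A short summary of the post..."}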
9. If you need to convert the results into plain text format, you can refer to another article, "Python converts JSON-formatted data into text-formatted data or SQL files".
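As a minimal sketch of that conversion (assuming the cnblogs.json produced above, one JSON object per line; the output file name cnblogs.txt is just an example):

# -*- coding: utf-8 -*-
# minimal sketch: turn the JSON-lines output into a tab-separated text file
import json
import codecs

src = codecs.open('cnblogs.json', 'r', encoding='utf-8')
dst = codecs.open('cnblogs.txt', 'w', encoding='utf-8')
for line in src:
    if not line.strip():
        continue
    post = json.loads(line)
    # keep just the title and the link, one post per line
    dst.write(post['title'] + '\t' + post['link'] + '\n')
src.close()
dst.close()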
Source code can be downloaded here: https://github.com/jackgitgz/CnblogsSpider
10. You may still wonder: can we save the data directly to a database? The answer is yes, and the next article will cover it, so stay tuned.
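To give a rough idea in the meantime, here is only a sketch using sqlite3 from the standard library (the table layout and class name are my own, not what the follow-up article uses):

# -*- coding: utf-8 -*-
# rough sketch only: a pipeline that stores items in a local SQLite database
import sqlite3


class SQLiteCnblogsPipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('cnblogs.db')
        # 'summary' is used as the column name because DESC is an SQL keyword
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS posts (title TEXT, link TEXT, listURL TEXT, summary TEXT)')

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO posts VALUES (?, ?, ?, ?)',
            (item['title'], item['link'], item['listURL'], item['desc']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

Like the JSON pipeline, it would have to be registered in ITEM_PIPELINES before it takes effect.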
Resources:
http://doc.scrapy.org/en/master/
http://blog.csdn.net/HanTangSongMing/article/details/24454453
Scrapy Crawler Growth Diary Creation project-extract data-save data in JSON format