For the Scrapy documentation, please see http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/install.html
1. Preparatory work
Install Python, Spyder, and Scrapy. If you want the data to go directly into MySQL, you also need to install Python's MySQLdb dependency package.
I ran into some minor problems installing MySQLdb on the Mac; in the end it came down to reinstalling OpenSSL.
Spyder is the IDE used to write the Python code.
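A minimal sketch of the install commands, assuming pip is already available; the MySQLdb module is provided by the MySQL-python package on PyPI:

    pip install scrapy
    pip install MySQL-python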
2. New Project
cd /usr/local/var/www/python
Run scrapy startproject myblog to create a new project called myblog; once it finishes, a myblog folder will appear under your python folder.
cnblog_spider.py is the file I added afterwards. Files with the .pyc suffix are compiled files produced when Python runs; the rest are generated automatically when the project is created.
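For reference, the layout that scrapy startproject myblog generates looks roughly like this (cnblog_spider.py is the spider file added afterwards under spiders/):

    myblog/
        scrapy.cfg             # deploy configuration
        myblog/
            __init__.py
            items.py           # item (field) definitions
            pipelines.py       # data storage pipeline
            settings.py        # project settings
            spiders/
                __init__.py
                cnblog_spider.py   # the spider written in step 3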
3. Write the crawler script cnblog_spider.py
Analyze the cnblogs site with the Scrapy shell: scrapy shell http://www.cnblogs.com/threemore/
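For example, the selectors can be tried out interactively in the shell before putting them into the spider (the selectors here are the ones used in the spider code below):

    scrapy shell http://www.cnblogs.com/threemore/
    # inside the shell that opens:
    >>> sel.css('div.postTitle').xpath('a/@href').extract()        # article links on the list page
    >>> sel.xpath('//*[@id="nav_next_page"]/a/@href').extract()    # link to the next list page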
Use Chrome's developer tools to find the data you want to crawl. Without further ado, here is the code; I grab each cnblogs article's title, link, post time, article ID, and body content.
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from myblog.items import MyblogItem
import scrapy
import re

#site_url = 'http://www.cnblogs.com/threemore/'

# crawl articles from cnblogs
class CnblogSpider(Spider):
    # the spider name used on the command line: scrapy crawl cnblog
    name = 'cnblog'
    allowed_domains = ["cnblogs.com"]
    # define the URLs to start crawling from
    start_urls = [
        'http://www.cnblogs.com/threemore/'
    ]

    # entry point: parse a list page
    def parse(self, response):
        sel = Selector(response)
        self.log("begins %s" % response.url)
        article_list = sel.css('div.postTitle').xpath('a')

        # loop over the article list and crawl each article's content page
        for article in article_list:
            url = article.xpath('@href').extract()[0]
            self.log("list article url: %s" % url)
            # continue with the content page
            yield scrapy.Request(url, callback=self.parse_content)

        # if there is a next page, keep crawling
        next_pages = sel.xpath('//*[@id="nav_next_page"]/a/@href')
        if next_pages:
            next_page = next_pages.extract()[0]
            # print next_page
            self.log("next_page: %s" % next_page)
            # recurse by requesting the next list page, like a recursive function call in PHP
            yield scrapy.Request(next_page, callback=self.parse)

    # parse a content page
    def parse_content(self, response):
        self.log("detail view: %s" % response.url)

        # the item's fields are defined in the items file
        item = MyblogItem()

        # use XPath to pick the data to crawl out of the page
        item['link'] = response.url

        # a regex pulls the article id out of the cnblogs URL
        m = re.search(r"([0-9])+", item['link'])
        if m:
            item['aid'] = m.group(0)
        else:
            item['aid'] = 0
        item['title'] = response.xpath('//*[@id="cb_post_title_url"]/text()').extract()[0]
        item['content'] = response.xpath('//*[@id="cnblogs_post_body"]').extract()[0]
        item['date'] = response.xpath('//*[@id="post-date"]').extract()
        # print item['content']
        yield item
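The post doesn't show items.py, so here is a minimal sketch of what it needs to contain, reconstructed from the fields the spider and the pipeline use (my reconstruction, not the original file):

    # -*- coding: utf-8 -*-
    # myblog/items.py (reconstructed sketch): one Field per value the spider fills in
    import scrapy

    class MyblogItem(scrapy.Item):
        aid = scrapy.Field()      # article id parsed out of the URL
        title = scrapy.Field()    # article title
        link = scrapy.Field()     # article URL
        content = scrapy.Field()  # article body HTML
        date = scrapy.Field()     # post date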
4. Writing the data to the database
Next, write the pipeline program pipelines.py. The pipeline is where the crawler stores its data: the item produced by yield item at the end of the spider is handed to pipelines.py.
For convenience I configured two sets of MySQL login credentials, one for the test environment and one for production.
Before each run, the target table is truncated to prevent duplicate records. The code follows.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
#
# You need to configure ITEM_PIPELINES in settings.py to point at this file;
# the current pipeline is configured as 'myblog.pipelines.MyblogPipeline'

import MySQLdb
import datetime

DEBUG = True

# MySQL settings for the test environment and the production environment
if DEBUG:
    dbuser = 'root'
    dbpass = 'root'
    dbname = 'test'
    dbhost = '127.0.0.1'
    dbport = '3306'
else:
    dbuser = 'root'
    dbpass = 'root'
    dbname = 'test'
    dbhost = '127.0.0.1'
    dbport = '3306'


class MyblogPipeline(object):

    # initialize the database connection
    def __init__(self):
        self.conn = MySQLdb.connect(user=dbuser, passwd=dbpass, db=dbname,
                                    host=dbhost, charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()
        # empty the table before each run to avoid duplicate records
        self.cursor.execute('TRUNCATE TABLE test_cnbog')
        self.conn.commit()

    # run the INSERT statement for every item the spider yields
    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                """INSERT INTO test_cnbog (title, link, aid, content, date)
                   VALUES (%s, %s, %s, %s, %s)""",
                (item['title'].encode('utf-8'),
                 item['link'].encode('utf-8'),
                 item['aid'],
                 item['content'].encode('utf-8'),
                 datetime.datetime.now(),))
            self.conn.commit()
        except MySQLdb.Error, e:
            print u'Error %d: %s' % (e.args[0], e.args[1])
        return item
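The post doesn't show the table definition either. Below is a small sketch that creates a test_cnbog table compatible with the INSERT above; the column types are my assumptions:

    # create_table.py (sketch, not from the original post): set up the table the pipeline writes to
    import MySQLdb

    conn = MySQLdb.connect(user='root', passwd='root', db='test', host='127.0.0.1', charset='utf8')
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS test_cnbog (
            id      INT AUTO_INCREMENT PRIMARY KEY,
            title   VARCHAR(255),
            link    VARCHAR(255),
            aid     INT,
            content LONGTEXT,
            date    DATETIME
        ) DEFAULT CHARSET=utf8
    """)
    conn.commit()
    conn.close()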
5. Configure settings.py
Enable the pipeline so items actually get written to the database.
Find ITEM_PIPELINES and remove the comment; the code is shown below. Follow the link in the comment above it to see what it is about: the official example appears to write the data into MongoDB.
I'm not familiar with MongoDB, so I went straight to MySQL. The general idea is that pipelines.py is where you tell Scrapy where the collected data should go.
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'myblog.pipelines.MyblogPipeline': 300,
}
6. Run the crawl
Run the crawl from inside the project folder: scrapy crawl cnblog (the name after crawl is the name attribute defined in cnblog_spider.py).
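If you just want to sanity-check the spider without touching MySQL, Scrapy can also dump the items straight to a file via its feed export option:

    scrapy crawl cnblog -o items.json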
Out of curiosity I looked up "crawl" in Baidu Translate to see what the word originally means: to creep or crawl along.
Finally, here is the collected data.
Record 15 has an aid of None; the program just pulls the aid out with a regex and handles it rather casually.
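If you want a slightly less casual way to get the aid, one option (my suggestion, not the author's code) is to anchor the regex on the /p/<id>.html part of a cnblogs article URL:

    import re

    def extract_aid(link):
        # cnblogs article URLs look like http://www.cnblogs.com/<user>/p/<id>.html
        m = re.search(r"/p/([0-9]+)", link)
        return m.group(1) if m else 0

    # illustrative URL, not a real record from the crawl
    print extract_aid('http://www.cnblogs.com/threemore/p/1234567.html')   # 1234567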
Crawling cnblogs article content with the Python Scrapy framework