Grab Cnblogs article content using the Python Scrapy framework


For the Scrapy documentation, see http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/install.html

1. Preparatory work

Install Python, Spyder, and Scrapy. If you want the data to go straight into MySQL, you also need Python's MySQLdb package.

I ran into some minor problems installing MySQLdb on Mac OS; in the end the fix was reinstalling OpenSSL.
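A minimal install sketch, assuming pip is available (MySQL-python is the package that provides the MySQLdb module):

pip install Scrapy
pip install MySQL-python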

Spyder is the IDE used here for writing the Python code.

2. New Project

cd /usr/local/var/www/python

Execute scrapy startproject myblog to create a new project called myblog. When the command finishes, a myblog folder appears under the python directory.

cnblog_spider.py is the spider file I added afterwards. The .pyc files are compiled files produced when the Python code runs; everything else is generated automatically when the project is created.
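For reference, a freshly generated project looks roughly like this (cnblog_spider.py is the file added by hand afterwards):

myblog/
    scrapy.cfg              # deploy configuration
    myblog/
        __init__.py
        items.py            # item (field) definitions
        pipelines.py        # item pipelines, used for data storage below
        settings.py         # project settings
        spiders/
            __init__.py
            cnblog_spider.py   # the spider written in the next step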

3. Write the crawler script cnblog_spider.py

Analyze the Cnblogs site with scrapy shell http://www.cnblogs.com/threemore/
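A quick illustrative shell session, using the same selectors as the spider below, to confirm they return data:

scrapy shell http://www.cnblogs.com/threemore/

# inside the shell prompt:
>>> sel.css('div.postTitle').xpath('a/@href').extract()      # article links on the list page
>>> sel.xpath('//*[@id="nav_next_page"]/a/@href').extract()  # link to the next list page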

Use the Chrome developer tools to locate the data you want to crawl. Not much more to say, straight to the code: I grab each article's title, link, post date, article ID, and body content.

# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from myblog.items import MyblogItem
import scrapy
import re

# site_url = 'http://www.cnblogs.com/threemore/'


# crawl the articles on Cnblogs
class CnblogSpider(Spider):
    # the name used on the command line: scrapy crawl cnblog
    name = 'cnblog'
    allowed_domains = ["cnblogs.com"]

    # the URLs the crawl starts from
    start_urls = [
        'http://www.cnblogs.com/threemore/'
    ]

    # entry point: parse the article list page
    def parse(self, response):
        sel = Selector(response)
        self.log("begins %s" % response.url)
        article_list = sel.css('div.postTitle').xpath('a')

        # loop over the list and crawl each article's content page
        for article in article_list:
            url = article.xpath('@href').extract()[0]
            self.log("list article url: %s" % url)
            # continue with the content page
            yield scrapy.Request(url, callback=self.parse_content)

        # if there is a next page, keep crawling
        next_pages = sel.xpath('//*[@id="nav_next_page"]/a/@href')
        if next_pages:
            next_page = next_pages.extract()[0]
            # print next_page
            self.log("next_page: %s" % next_page)
            # like calling a function recursively in PHP
            yield scrapy.Request(next_page, callback=self.parse)

    # parse a content page
    def parse_content(self, response):
        self.log("detail view: %s" % response.url)

        # the item's fields are defined in the items file
        item = MyblogItem()

        # use XPath to pick the data to crawl out of the page
        item['link'] = response.url

        # regex to match the article id in the Cnblogs URL
        m = re.search(r"([0-9])+", item['link'])
        if m:
            item['aid'] = m.group(0)
        else:
            item['aid'] = 0
        item['title'] = response.xpath('//*[@id="cb_post_title_url"]/text()').extract()[0]
        item['content'] = response.xpath('//*[@id="cnblogs_post_body"]').extract()[0]
        item['date'] = response.xpath('//*[@id="post-date"]').extract()
        # print item['content']
        yield item
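The items.py file is not shown in the original post; a minimal definition consistent with the fields used above (title, link, aid, content, date) would look something like this:

# -*- coding: utf-8 -*-
import scrapy


class MyblogItem(scrapy.Item):
    # fields filled in by parse_content() in the spider
    title = scrapy.Field()
    link = scrapy.Field()
    aid = scrapy.Field()      # numeric article id extracted from the link
    content = scrapy.Field()  # article body HTML
    date = scrapy.Field()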

4. Storing the data

Write the pipeline program pipelines.py. The pipeline is where the crawler stores its data: the item yielded at the end of the spider is handed to this file.

For convenience I configured two sets of MySQL connection settings, one for testing and one for the production environment.

Before each run the target table is emptied to prevent duplicate rows. Straight to the code:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
#
# ITEM_PIPELINES must be set in settings.py to point at this file;
# the current pipeline is configured as 'myblog.pipelines.MyblogPipeline'

import MySQLdb
import datetime

DEBUG = True

# MySQL settings for the test and production environments
if DEBUG:
    dbuser = 'root'
    dbpass = 'root'
    dbname = 'test'
    dbhost = '127.0.0.1'
    dbport = '3306'
else:
    dbuser = 'root'
    dbpass = 'root'
    dbname = 'test'
    dbhost = '127.0.0.1'
    dbport = '3306'


class MyblogPipeline(object):
    # initialize the database connection
    def __init__(self):
        self.conn = MySQLdb.connect(user=dbuser, passwd=dbpass, db=dbname,
                                    host=dbhost, charset="utf8",
                                    use_unicode=True)
        self.cursor = self.conn.cursor()
        # empty the table before each run to avoid duplicate rows
        self.cursor.execute('TRUNCATE TABLE test_cnbog')
        self.conn.commit()

    # insert every item handed over by the spider
    def process_item(self, item, spider):
        try:
            self.cursor.execute("""INSERT INTO test_cnbog (title, link, aid, content, date)
                                   VALUES (%s, %s, %s, %s, %s)""",
                                (item['title'].encode('utf-8'),
                                 item['link'].encode('utf-8'),
                                 item['aid'],
                                 item['content'].encode('utf-8'),
                                 datetime.datetime.now(),))
            self.conn.commit()
        except MySQLdb.Error, e:
            print u'Error %d: %s' % (e.args[0], e.args[1])
        return item
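The post does not show the definition of the test_cnbog table. A one-off helper consistent with the INSERT above could look like this; the column types are my assumption, only the column names come from the pipeline code:

# -*- coding: utf-8 -*-
# One-off script to create the table used by the pipeline (assumed schema).
import MySQLdb

conn = MySQLdb.connect(user='root', passwd='root', db='test',
                       host='127.0.0.1', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS test_cnbog (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        link VARCHAR(255) NOT NULL,
        aid INT NOT NULL,
        content LONGTEXT NOT NULL,
        date DATETIME NOT NULL
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()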

5. Configure settings.py

Enable the pipeline configuration.

Find ITEM_PIPELINES and remove the comment in front of it, as in the code below. The example in the official docs apparently writes the data to MongoDB.

I am not familiar with MongoDB, so I go straight to MySQL. The general idea is that pipelines.py is the file that tells Scrapy where to put the collected data.

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'myblog.pipelines.MyblogPipeline': 300,
}

6. Run the crawl

In the project's folder, execute: scrapy crawl cnblog (the name defined in the spider, not the project name).
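As an aside (not part of the original workflow), Scrapy's built-in feed export can dump the items to a file, which is handy for sanity-checking the spider before wiring up MySQL:

scrapy crawl cnblog -o items.json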

Out of curiosity I put the word "crawl" into Baidu Translate to see what it literally means: to creep or crawl.

Finally, a look at the collected data.

About 15 rows came out with an aid of None; the regex for the article ID was handled rather casually and could use a bit more care, as in the sketch below.
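The spider's regex just grabs the first run of digits anywhere in the link, which can pick up the wrong number or nothing at all. A more targeted pattern (my suggestion, not from the original post) matches the /p/<id>.html part of Cnblogs URLs:

# -*- coding: utf-8 -*-
import re

def extract_aid(link):
    # match the numeric post id in links like
    # http://www.cnblogs.com/threemore/p/1234567.html
    m = re.search(r"/p/(\d+)\.html", link)
    return int(m.group(1)) if m else 0

print extract_aid('http://www.cnblogs.com/threemore/p/1234567.html')  # prints 1234567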
