Grab Cnblogs article content using the Python Scrapy framework


For the Scrapy documentation, see http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/install.html

1. Preparatory work

Install Python, Spyder, and Scrapy. If you want the data to go straight into MySQL, you also need Python's MySQLdb package.

I ran into some minor problems installing MySQLdb on Mac OS; in the end the fix was reinstalling OpenSSL.
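A minimal install sketch, assuming pip is available (MySQL-python is the package that provides the MySQLdb module):

pip install Scrapy
pip install MySQL-python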

Spyder is the IDE used here for writing the Python code.

2. New Project

cd /usr/local/var/www/python

Execute scrapy startproject myblog to create a new project called myblog. When the command finishes, a myblog folder appears under the python directory.

cnblog_spider.py is the spider file I added afterwards. The .pyc files are compiled files produced when the Python code runs; everything else is generated automatically when the project is created.
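For reference, a freshly generated project looks roughly like this (cnblog_spider.py is the file added by hand afterwards):

myblog/
    scrapy.cfg              # deploy configuration
    myblog/
        __init__.py
        items.py            # item (field) definitions
        pipelines.py        # item pipelines, used for data storage below
        settings.py         # project settings
        spiders/
            __init__.py
            cnblog_spider.py   # the spider written in the next step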

3. Write the crawler script cnblog_spider.py

Analyze the Cnblogs site with scrapy shell http://www.cnblogs.com/threemore/
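A quick illustrative shell session, using the same selectors as the spider below, to confirm they return data:

scrapy shell http://www.cnblogs.com/threemore/

# inside the shell prompt:
>>> sel.css('div.postTitle').xpath('a/@href').extract()      # article links on the list page
>>> sel.xpath('//*[@id="nav_next_page"]/a/@href').extract()  # link to the next list page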

Use the Chrome developer tools to locate the data you want to crawl. Not much more to say, straight to the code: I grab each article's title, link, post date, article ID, and body content.

# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from myblog.items import MyblogItem
import scrapy
import re

# site_url = 'http://www.cnblogs.com/threemore/'


# crawl the articles on Cnblogs
class CnblogSpider(Spider):
    # the name used on the command line: scrapy crawl cnblog
    name = 'cnblog'
    allowed_domains = ["cnblogs.com"]

    # the URLs the crawl starts from
    start_urls = [
        'http://www.cnblogs.com/threemore/'
    ]

    # entry point: parse the article list page
    def parse(self, response):
        sel = Selector(response)
        self.log("begins %s" % response.url)
        article_list = sel.css('div.postTitle').xpath('a')

        # loop over the list and crawl each article's content page
        for article in article_list:
            url = article.xpath('@href').extract()[0]
            self.log("list article url: %s" % url)
            # continue with the content page
            yield scrapy.Request(url, callback=self.parse_content)

        # if there is a next page, keep crawling
        next_pages = sel.xpath('//*[@id="nav_next_page"]/a/@href')
        if next_pages:
            next_page = next_pages.extract()[0]
            # print next_page
            self.log("next_page: %s" % next_page)
            # like calling a function recursively in PHP
            yield scrapy.Request(next_page, callback=self.parse)

    # parse a content page
    def parse_content(self, response):
        self.log("detail view: %s" % response.url)

        # the item's fields are defined in the items file
        item = MyblogItem()

        # use XPath to pick the data to crawl out of the page
        item['link'] = response.url

        # regex to match the article id in the Cnblogs URL
        m = re.search(r"([0-9])+", item['link'])
        if m:
            item['aid'] = m.group(0)
        else:
            item['aid'] = 0
        item['title'] = response.xpath('//*[@id="cb_post_title_url"]/text()').extract()[0]
        item['content'] = response.xpath('//*[@id="cnblogs_post_body"]').extract()[0]
        item['date'] = response.xpath('//*[@id="post-date"]').extract()
        # print item['content']
        yield item
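The items.py file is not shown in the original post; a minimal definition consistent with the fields used above (title, link, aid, content, date) would look something like this:

# -*- coding: utf-8 -*-
import scrapy


class MyblogItem(scrapy.Item):
    # fields filled in by parse_content() in the spider
    title = scrapy.Field()
    link = scrapy.Field()
    aid = scrapy.Field()      # numeric article id extracted from the link
    content = scrapy.Field()  # article body HTML
    date = scrapy.Field()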

4. Storing the data

Write the pipeline program pipelines.py. The pipeline is where the crawler stores its data: the item yielded at the end of the spider is handed to this file.

For convenience I configured two sets of MySQL connection settings, one for testing and one for the production environment.

Before each run the target table is emptied to prevent duplicate rows. Straight to the code:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
#
# ITEM_PIPELINES must be set in settings.py to point at this file;
# the current pipeline is configured as 'myblog.pipelines.MyblogPipeline'

import MySQLdb
import datetime

DEBUG = True

# MySQL settings for the test and production environments
if DEBUG:
    dbuser = 'root'
    dbpass = 'root'
    dbname = 'test'
    dbhost = '127.0.0.1'
    dbport = '3306'
else:
    dbuser = 'root'
    dbpass = 'root'
    dbname = 'test'
    dbhost = '127.0.0.1'
    dbport = '3306'


class MyblogPipeline(object):
    # initialize the database connection
    def __init__(self):
        self.conn = MySQLdb.connect(user=dbuser, passwd=dbpass, db=dbname,
                                    host=dbhost, charset="utf8",
                                    use_unicode=True)
        self.cursor = self.conn.cursor()
        # empty the table before each run to avoid duplicate rows
        self.cursor.execute('TRUNCATE TABLE test_cnbog')
        self.conn.commit()

    # insert every item handed over by the spider
    def process_item(self, item, spider):
        try:
            self.cursor.execute("""INSERT INTO test_cnbog (title, link, aid, content, date)
                                   VALUES (%s, %s, %s, %s, %s)""",
                                (item['title'].encode('utf-8'),
                                 item['link'].encode('utf-8'),
                                 item['aid'],
                                 item['content'].encode('utf-8'),
                                 datetime.datetime.now(),))
            self.conn.commit()
        except MySQLdb.Error, e:
            print u'Error %d: %s' % (e.args[0], e.args[1])
        return item
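The post does not show the definition of the test_cnbog table. A one-off helper consistent with the INSERT above could look like this; the column types are my assumption, only the column names come from the pipeline code:

# -*- coding: utf-8 -*-
# One-off script to create the table used by the pipeline (assumed schema).
import MySQLdb

conn = MySQLdb.connect(user='root', passwd='root', db='test',
                       host='127.0.0.1', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS test_cnbog (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        link VARCHAR(255) NOT NULL,
        aid INT NOT NULL,
        content LONGTEXT NOT NULL,
        date DATETIME NOT NULL
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()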

5. Configure settings.py

Enable the pipeline configuration.

Find ITEM_PIPELINES and remove the comment in front of it, as in the code below. The example in the official docs apparently writes the data to MongoDB.

I am not familiar with MongoDB, so I go straight to MySQL. The general idea is that pipelines.py is the file that tells Scrapy where to put the collected data.

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'myblog.pipelines.MyblogPipeline': 300,
}

6. Run the crawl

In the project's folder, execute: scrapy crawl cnblog (the name defined in the spider, not the project name).
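As an aside (not part of the original workflow), Scrapy's built-in feed export can dump the items to a file, which is handy for sanity-checking the spider before wiring up MySQL:

scrapy crawl cnblog -o items.json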

Out of curiosity I put the word "crawl" into Baidu Translate to see what it literally means: to creep or crawl.

Finally, a look at the collected data.

About 15 rows came out with an aid of None; the regex for the article ID was handled rather casually and could use a bit more care, as in the sketch below.
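The spider's regex just grabs the first run of digits anywhere in the link, which can pick up the wrong number or nothing at all. A more targeted pattern (my suggestion, not from the original post) matches the /p/<id>.html part of Cnblogs URLs:

# -*- coding: utf-8 -*-
import re

def extract_aid(link):
    # match the numeric post id in links like
    # http://www.cnblogs.com/threemore/p/1234567.html
    m = re.search(r"/p/(\d+)\.html", link)
    return int(m.group(1)) if m else 0

print extract_aid('http://www.cnblogs.com/threemore/p/1234567.html')  # prints 1234567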
