Python Crawler for CSDN, Part III

Last updated: 2015-04-11. Source: Internet

Tags: python crawler, csdn blog crawler

By 白熊花田 (http://blog.csdn.net/whiterbear). Please credit the source when reposting. Thanks.

Overview:

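In the previous post we managed to collect the links to all of a user's articles, so the natural next step in this installment is to download those posts.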

Analysis:

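With the links in hand, downloading the articles is not hard. But how should the downloaded data be handled? Every article carries formatting such as line breaks, so we need to save the corresponding HTML (note that formatted blog posts and other rich text we edit are stored as HTML). Where should it go? A database is one option, but each article is fairly long and a database is unnecessary here; saving to files is more convenient.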

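For each downloaded post I extract the article's div, add the necessary HTML header and footer around it to turn it into a complete HTML document, and save the result as an HTML file. One thing to watch: the HTML must include <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />, otherwise the page is displayed as garbled text.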
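The wrapping step described above can be sketched on its own. This is a minimal Python 3 illustration, not the code used in this post: the names `PAGE_TEMPLATE`, `wrap_fragment`, and `save_page` are made up for the example, and the sample title and fragment are placeholder content.

```python
import html
import os
import tempfile

# Template for a complete page; the meta tag declares UTF-8 so browsers
# do not have to guess the encoding.
PAGE_TEMPLATE = (
    '<html><head><title>{title}</title>'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    '</head><body>{body}</body></html>'
)


def wrap_fragment(title, fragment):
    """Wrap an HTML fragment (e.g. the article div) in a full page."""
    # Escape the title because it is plain text; the fragment is already HTML.
    return PAGE_TEMPLATE.format(title=html.escape(title), body=fragment)


def save_page(path, page):
    """Write the page as UTF-8 so the bytes match the declared charset."""
    with open(path, 'w', encoding='utf-8') as fobj:
        fobj.write(page)


# Example with made-up content, written to a temporary directory.
sample_page = wrap_fragment('測試 & demo',
                            '<div id="article_content"><p>你好</p></div>')
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.htm')
save_page(sample_path, sample_page)
```

The point of the sketch is that the declared charset and the encoding used when writing the file have to agree; if either one is missing or different, the browser may show garbled text.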

The core code was already covered in the previous two posts, and it is not very difficult.

Code:

    # -*- coding: utf-8 -*-
    # Note: this is Python 2 code (urllib2, reload(sys)).
    import sys
    import os
    import codecs
    import urllib2
    import re

    from bs4 import BeautifulSoup
    from article import CsdnArticle  # local article module, not shown in this post

    reload(sys)
    sys.setdefaultencoding('utf-8')


    class CsdnCrawler(object):

        def __init__(self, author='whiterbear'):
            self.author = author
            self.domain = 'http://blog.csdn.net/'
            self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
            self.articles = []

        # Given a list-page url, collect every article listed on it.
        def getArticleLists(self, url=None):
            req = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(req)
            soup = BeautifulSoup(response.read())
            listitem = soup.find(id='article_list').find_all(attrs={'class': r'list_item article_item'})
            href_regex = r'href="(.*?)"'
            for item in listitem:
                link = item.find(attrs={'class': 'link_title'}).contents[0]
                enitem = link.contents[0]
                href = re.search(href_regex, str(link)).group(1)
                art = CsdnArticle()
                art.author = self.author
                art.title = enitem.lstrip()
                # href starts with '/', so drop it before joining with the domain.
                art.href = (self.domain + href[1:]).lstrip()
                self.articles.append(art)

        # Walk through every list page of the author's blog.
        def getPageLists(self, url=None):
            url = url if url else 'http://blog.csdn.net/%s?viewmode=list' % self.author
            req = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(req)
            soup = BeautifulSoup(response.read())
            num_regex = r'[1-9]\d*'
            pagelist = soup.find(id='papelist')
            self.getArticleLists(url)
            if pagelist:
                # The second number in the pager text is taken as the page count.
                pagenum = int(re.findall(num_regex, pagelist.contents[1].contents[0])[1])
                for i in range(2, pagenum + 1):
                    self.getArticleLists(self.domain + self.author + '/article/list/%s' % i)

        def getAllArticles(self):
            # Create a folder named after the author to hold the articles.
            if not os.path.exists(self.author):
                os.mkdir(self.author)
            for subarticle in self.articles:
                articleurl = subarticle.href
                # Open and download each article in turn.
                req = urllib2.Request(articleurl, headers=self.headers)
                response = urllib2.urlopen(req)
                soup = BeautifulSoup(response.read())
                article_content = soup.find(id='article_content')
                # Keep the title as unicode; codecs.open handles the encoding on write.
                title = subarticle.title.rstrip()
                # Wrap the extracted content into a complete html string.
                article = u'<html><head><title>%s</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>%s</body></html>' % (title, article_content)
                fobj = codecs.open(u'%s/%s.htm' % (self.author, title), 'w', 'utf-8')
                fobj.write(article)
                fobj.close()


    def main():
        # e.g. 'pongba'; pass in whichever blogger you want to download.
        csdn = CsdnCrawler(author='whiterbear')
        csdn.getPageLists()
        csdn.getAllArticles()


    if __name__ == '__main__':
        main()
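The crawler imports `CsdnArticle` from a local `article` module that this post does not show. Judging only from how the crawler uses it (it sets `author`, `title`, and `href` after construction), a stand-in could look like the sketch below. This is an assumption, not the author's actual class, and the URL in the example is a placeholder.

```python
class CsdnArticle(object):
    """Hypothetical stand-in for the class imported from article.py."""

    def __init__(self, author='', title='', href=''):
        self.author = author  # blog owner's CSDN user name
        self.title = title    # article title as shown on the list page
        self.href = href      # absolute URL of the article

    def __repr__(self):
        return 'CsdnArticle(author=%r, title=%r, href=%r)' % (
            self.author, self.title, self.href)


# The crawler fills the fields in after construction, like this
# (the URL below is a made-up placeholder):
art = CsdnArticle()
art.author = 'whiterbear'
art.title = 'Example title'
art.href = 'http://blog.csdn.net/whiterbear/article/details/0'
```

Saving this as article.py next to the crawler script should be enough for the import to resolve, assuming the real class carries no extra behavior the crawler depends on.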

Result:

The script creates a folder named after the blogger, and that folder contains all of the blogger's articles.
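Because each file is named after its article title, a title containing characters such as `/` or `?` may fail to save on some systems. The original code does not handle this. The sketch below shows one possible way to clean a title first; `safe_filename` is a hypothetical helper and the character set is an assumption based on common Windows and Unix restrictions.

```python
import re

# Characters that are not allowed in file names on Windows, plus '/'
# which also separates directories on Unix-like systems.
_UNSAFE = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_filename(title, replacement='_', max_length=100):
    """Turn an article title into a name that is safe to use as a file name."""
    name = _UNSAFE.sub(replacement, title).strip().rstrip('.')
    # Fall back to a fixed name if nothing usable is left.
    return name[:max_length] or 'untitled'


print(safe_filename('C/C++: what is "undefined behavior"?'))
```

In the crawler, the cleaned name would replace the raw title only in the file path, while the `<title>` element can keep the original text.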

Opening any of the saved articles shows that it renders very well.

Reflections:

1> Chinese encoding problems. Even though I have learned several ways to deal with encoding issues, I still regularly get stuck on them.

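2> Keep the code orthogonal. I haven't worked on a large project yet, but I can already feel the benefit: when two modules are more orthogonal, a change to one does not affect the normal operation of the other. This forces you to think through a clear architecture instead of writing a tangled mess.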

3> A common illusion: I always think "this is simple, I can finish it today", and then run into one problem after another. I still lack experience.

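4> Other: keep the code clean and work iteratively, starting from a small piece of code and adding new code bit by bit. Always keep two versions, one of which contains lots of output to help you confirm what happens at each step.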
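On point 4, one alternative to maintaining two copies of the script is to keep a single version and switch the amount of output with Python's standard `logging` module. This is a different technique from the two-version approach described above, shown only as a sketch; the logger name, `configure_logging`, and `build_page_urls` are made up for the example, and the URL pattern is the list-page pattern used in the crawler.

```python
import logging

logger = logging.getLogger('csdn_crawler')


def configure_logging(verbose=False):
    """Switch between quiet and chatty output without keeping two scripts."""
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))
        logger.addHandler(handler)


def build_page_urls(author, page_count):
    """Build the list-page URLs for an author, logging each step."""
    urls = []
    for page in range(1, page_count + 1):
        url = 'http://blog.csdn.net/%s/article/list/%d' % (author, page)
        logger.debug('page %d -> %s', page, url)
        urls.append(url)
    return urls


configure_logging(verbose=True)
page_urls = build_page_urls('whiterbear', 3)
```

With `verbose=False` the same code runs silently, so the step-by-step output can be turned back on whenever something needs to be traced.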

The next series will probably be a Weibo crawler, which will involve related data processing and analysis. I hope it goes smoothly.
