[Python Crawler] A High-Concurrency cnblogs Blog Backup Tool (Extensible to a Parallel Version)


A small exercise in writing a concurrent crawler.

Paste the code into a local file, save it with a .py extension, and run it. The command-line argument is the name of the user whose blog you want to crawl; it defaults to this blog's author.

The output is a directory named after the user, containing the blog's posts as HTML files.
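For example, assuming the script is saved as cnblogs_backup.py (the file name here is hypothetical):

python cnblogs_backup.py kirai

This creates a directory ./kirai/ with one HTML file per post, each named after the post's numeric id (for example 1234567.html; the id here is made up).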

It is intended only for studying multithreaded programming in Python; it will later be rewritten as a parallel (multi-process) crawler. A sketch of one possible direction follows the code below.

The crawler code is as follows (it targets Python 2 and depends on the third-party pyquery and threadpool packages):

# -*- coding:utf-8 -*-
from multiprocessing.managers import BaseManager
from pyquery import PyQuery
import os, sys, urllib
import re, random, logging, time
import Queue, threading, multiprocessing, threadpool

USER_NAME = 'kirai'
TOTAL_PAGE_NUMBER = 0
INT_REGEXP = re.compile('([\d]+)')
BASE_URL = 'http://www.cnblogs.com/'+USER_NAME+'/p/?page='
ARTICLE_REGEXP = re.compile('href=\"(http://www.cnblogs.com/'+USER_NAME+'/p/[\d]+.html)\"')
THREAD_NUMBER = multiprocessing.cpu_count() * 2
ARTICLE_URLS_MUTEX = threading.Lock()
ARTICLE_URLS = []

class ListWithLinkExtend(list):
    # list.extend returns None; this subclass returns self so that
    # extend calls can be chained inside reduce().
    def extend(self, value):
        super(ListWithLinkExtend, self).extend(value)
        return self

def get_total_page_number():
    doc = PyQuery(url=BASE_URL)
    return int(INT_REGEXP.findall(
        doc.find('.pager .Pager').text())[0].encode('ascii'))

def get_page_url():
    global TOTAL_PAGE_NUMBER
    return map(lambda page: BASE_URL+str(page),
               [i for i in range(1, TOTAL_PAGE_NUMBER+1)])

def get_article_url(idx):
    url = PAGE_URLS[idx]
    doc = PyQuery(url=url)
    article_urls = ARTICLE_REGEXP.findall(str(doc.find('.PostList .postTitl2')))
    return article_urls

def handle_result(request, result):
    # threadpool callback: collect each list page's URLs under the lock.
    global ARTICLE_URLS_MUTEX, ARTICLE_URLS
    try:
        ARTICLE_URLS_MUTEX.acquire()
        ARTICLE_URLS.append(result)
    finally:
        ARTICLE_URLS_MUTEX.release()

class KiraiManager(BaseManager):
    # Manager subclass for the planned distributed version
    # (the original snippet referenced this name without defining it).
    pass

def cluster_process():
    # Unfinished stub for the distributed/parallel rewrite.
    global ARTICLE_URLS
    # list : urls
    task_queue = Queue.Queue()
    # str : path
    result_queue = Queue.Queue()
    KiraiManager.register('get_task_queue', callable=lambda: task_queue)
    KiraiManager.register('get_result_queue', callable=lambda: result_queue)
    manager = KiraiManager(address=('', 6969), authkey='whosyourdaddy')
    manager.start()
    manager.shutdown()
    # article_flag, article_urls = get_article_url()

# a simple way.
def get_article(url):
    html = urllib.urlopen(url).read()
    return html, INT_REGEXP.findall(url)[0]

def save_article(request, result):
    content = result[0]
    file_name = result[1]
    path = './' + USER_NAME + '/' + file_name + '.html'
    fp = file(path, 'w')
    try:
        fp.writelines(content)
    finally:
        fp.close()

def thread_process():
    # Second pass: fetch every article with a thread pool and save it to disk.
    global ARTICLE_URLS
    os.mkdir(USER_NAME)
    thread_pool = threadpool.ThreadPool(THREAD_NUMBER)
    requests = threadpool.makeRequests(get_article, ARTICLE_URLS, save_article)
    [thread_pool.putRequest(req) for req in requests]
    thread_pool.wait()

def __main__(argv):
    global ARTICLE_URLS, TOTAL_PAGE_NUMBER, USER_NAME, BASE_URL, ARTICLE_REGEXP, PAGE_URLS
    if len(argv) == 2:
        USER_NAME = argv[1]
    BASE_URL = 'http://www.cnblogs.com/'+USER_NAME+'/p/?page='
    ARTICLE_REGEXP = re.compile('href=\"(http://www.cnblogs.com/'+USER_NAME+'/p/[\d]+.html)\"')
    TOTAL_PAGE_NUMBER = get_total_page_number()
    PAGE_URLS = get_page_url()
    # First pass: collect article URLs from every list page, one page per task.
    thread_pool = threadpool.ThreadPool(THREAD_NUMBER)
    requests = threadpool.makeRequests(
        get_article_url,
        [i for i in range(0, TOTAL_PAGE_NUMBER)],
        handle_result)
    [thread_pool.putRequest(req) for req in requests]
    thread_pool.wait()
    # Flatten the list of per-page URL lists into a single list.
    ARTICLE_URLS = list(reduce(lambda a, b: ListWithLinkExtend(a).extend(ListWithLinkExtend(b)),
                               ARTICLE_URLS))
    thread_process()

if __name__ == '__main__':
    __main__(sys.argv)
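The cluster_process function above is only an unfinished stub for the planned distributed version. As a rough sketch of the parallel rewrite mentioned in the introduction (not the author's final design), the second-stage thread pool could be replaced with a process pool, reusing get_article and save_article from the code above:

import multiprocessing

def parallel_process():
    # Hypothetical replacement for thread_process(): one worker process
    # per CPU core fetches posts; results are saved as they arrive.
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    try:
        # imap_unordered yields (html, post_id) tuples as workers finish.
        for result in pool.imap_unordered(get_article, ARTICLE_URLS):
            save_article(None, result)  # the request argument is unused
    finally:
        pool.close()
        pool.join()

As in thread_process, os.mkdir(USER_NAME) would still need to run before any post is saved.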

 

A brief explanation of the global variables:

USER_NAME: the name of the user to crawl; defaults to kirai.

TOTAL_PAGE_NUMBER: updated at runtime to the total number of article-list pages on the blog.

INT_REGEXP: a regular expression for matching integers (used to extract the page count and post ids).

BASE_URL: the base URL of the article-list pages.

ARTICLE_REGEXP: a regular expression that extracts the URLs of individual blog posts from each pyquery-processed article-list page.
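For illustration, a quick check of what this regexp extracts from a sample anchor tag (the markup and post id below are made up):

import re

USER_NAME = 'kirai'
ARTICLE_REGEXP = re.compile('href=\"(http://www.cnblogs.com/'+USER_NAME+'/p/[\d]+.html)\"')
sample = '<a href="http://www.cnblogs.com/kirai/p/1234567.html">a post title</a>'
print ARTICLE_REGEXP.findall(sample)
# ['http://www.cnblogs.com/kirai/p/1234567.html']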

THREAD_NUMBER: the number of threads; set by default to twice the number of CPU cores on the machine.

ARTICLE_URLS_MUTEX: a lock guarding ARTICLE_URLS, ensuring that only one thread mutates it at a time.

ARTICLE_URLS: holds all collected article URLs.
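One non-obvious detail: after the first thread-pool pass, ARTICLE_URLS is a list of per-page URL lists, and __main__ flattens it with reduce. The built-in list.extend returns None and so cannot be chained, which is the whole reason ListWithLinkExtend exists. A minimal illustration (the URL strings are placeholders):

class ListWithLinkExtend(list):
    def extend(self, value):
        super(ListWithLinkExtend, self).extend(value)
        return self

nested = [['url1', 'url2'], ['url3']]  # the shape ARTICLE_URLS has after the first pass
flat = list(reduce(lambda a, b: ListWithLinkExtend(a).extend(ListWithLinkExtend(b)), nested))
print flat  # ['url1', 'url2', 'url3']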

