1. How it works:
This script batch-scrapes a given CSDN blog: it collects the blogger's profile information and the mapping between article titles and their links, and saves everything to a mulu.txt file in the current directory.
2. Code:
```python
# -*- coding: cp936 -*-
import urllib.request
import re          # was commented out in the original, but re is used throughout

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) '
                  'Gecko/20091201 Firefox/3.5.6'
}
url_end = []
title_end = []     # was commented out in the original, but appended to below

# Walk the first two pages of the blog's article list,
# collecting article URLs and titles
for n in range(2):
    req = urllib.request.Request(
        url='http://blog.csdn.net/wangquannetwork/article/list/' + str(n + 1),
        headers=headers
    )
    content = urllib.request.urlopen(req).read()
    content = content.decode('utf-8')
    content = re.sub(re.compile('\r\n'), '', content)
    # Article URL: the text between link_title"><a href=" and the closing quote
    url_str = re.findall('((?<=(link_title"><a href=")).*?(?="))', content)
    for i in range(len(url_str)):
        url_end.append('blog.csdn.net' + url_str[i][0])
    # Article title: the text between the numeric id"> and </a></span>
    title_str = re.findall('((?<=([0-9][0-9][0-9][0-9][0-9]">)).*?(?=(</a></span>)))', content)
    for i in range(len(title_str)):
        title_end.append(title_str[i][0][8:])

# Fetch the last-requested page once more to read the profile sidebar
content = urllib.request.urlopen(req).read()
content = content.decode('utf-8')
span_str = re.findall(r'(?<=<li>).+?(?=</li>)', content)         # profile statistics
title_str = re.findall(r'(((?<=(k">)).*?(?=(</a>))))', content)  # blogger name

sName = './mulu.txt'
with open(sName, 'w') as file:
    file.write('This is the blog of ' + title_str[0][0])
    file.write('\n')
    file.write('Blog information:')
    file.write('\n')
    for x in range(0, 5):
        file.write(span_str[x])
        file.write('\n')
    file.write('\n')
    file.write('There are ' + str(len(url_end)) + ' articles in total')
    file.write('\n')
    file.write('\n')
    for i in range(len(url_end)):
        file.write(str(i + 1) + '.')
        file.write(title_end[i])
        file.write('\n')
        file.write(url_end[i])
        file.write('\n')
```
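The core extraction trick above is a regex with a fixed-width lookbehind anchored on the HTML that precedes a value, a non-greedy match, and a lookahead on what follows it. This can be exercised offline on a small snippet; the sample HTML and variable names here are hypothetical, chosen only to mimic one entry of a CSDN article-list page:

```python
import re

# Hypothetical fragment mimicking one entry in a CSDN article list page
sample = ('<span class="link_title"><a href="/wangquannetwork/article/details/45832109">'
          'Python crawler notes</a></span>')

# Lookbehind anchors on the markup before the value, non-greedy .*? grabs it,
# lookahead stops at the markup after it, so neither anchor appears in the match.
link = re.findall(r'(?<=link_title"><a href=").*?(?=")', sample)
title = re.findall(r'(?<=details/45832109">).*?(?=</a></span>)', sample)

print(link[0])   # /wangquannetwork/article/details/45832109
print(title[0])  # Python crawler notes
```

Note that Python's `re` requires lookbehind patterns to be fixed-width, which is why the script anchors on literal markup (and on exactly five digits for titles) rather than on a variable-length pattern.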
3. Result of running the Python code:
Note: all of the above is original work. When reposting, please credit the source: http://blog.csdn.net/wangquannetwork/article/details/45832109
Python Crawler: Scraping a CSDN Blog's Article Directory with Python