Crawling the CSDN blog channel with Python
Python is easy to write, and PyCharm with Python 3.4 makes a convenient development setup.
Some Python modules depend on other supporting modules in order to install. Install wheel first:
pip install wheel
Then you can download a .whl file and install it directly:
pip install lxml-3.5.0-cp34-none-win32.whl
Define a class to hold the data saved for each article:
class CnblogArticle:
    def __init__(self):
        self.num = ''
        self.category = ''
        self.title = ''
        self.author = ''
        self.postTime = ''
        self.articleComment = ''
        self.articleView = ''
Because the CSDN blog channel has only 18 pages, only those 18 pages are parsed. The main method contains both a multi-threaded version (commented out) and a plain sequential version.
Note: each item is a div with class="blog_list". Most items contain a link with class="category", but a few do not. This case must be handled, otherwise an error is raised.
<div class="blog_list">
The Beautiful Soup 4.2.0 documentation can be read directly on the official website.
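The missing-category case noted above can also be handled defensively. The following is a minimal sketch, not the approach used in this post: `text_or_default` is a hypothetical helper name. It relies on two documented Beautiful Soup behaviors: `find()` returns None when nothing matches, and `.string` is None when a tag has more than one child.

```python
def text_or_default(tag, default="None"):
    """Return the stripped text of a tag, or `default` if it is unavailable.

    `tag` is whatever find() returned, so it may be None.
    (text_or_default is a hypothetical helper, not part of Beautiful Soup.)
    """
    if tag is None:
        # find() found no matching element, e.g. an item without a category
        return default
    text = tag.string
    if text is None:
        # .string is None for tags with several children; try get_text()
        get_text = getattr(tag, "get_text", None)
        text = get_text() if callable(get_text) else None
    return text.strip() if text else default


# Usage inside the parsing loop (itemSingle is one blog_list div):
#   category = text_or_default(itemSingle.find('a', 'category'))
#   author = text_or_default(itemSingle.find('a', 'user_name'))
```

With this helper, an item that lacks a category link yields the string "None" instead of raising AttributeError on `.string`.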
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import os
import sys
import time
import threading


class CnblogUtils(object):
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'}
        self.contentAll = set()

    def getPage(self, url=None):
        # Download one page and parse it with the lxml parser
        request = urllib.request.Request(url, headers=self.headers)
        response = urllib.request.urlopen(request)
        soup = BeautifulSoup(response.read(), "lxml")
        return soup

    def parsePage(self, url=None, page_num=None):
        soup = self.getPage(url)
        itemBlog = soup.find_all('div', 'blog_list')
        for i, itemSingle in enumerate(itemBlog):
            # One CnblogArticle (the class defined above) per item
            cnArticle = CnblogArticle()
            cnArticle.num = i
            cnArticle.author = itemSingle.find('a', 'user_name').string
            cnArticle.postTime = itemSingle.find('span', 'time').string
            cnArticle.articleComment = itemSingle.find('a', 'comment').string
            cnArticle.articleView = itemSingle.find('a', 'view').string
            # If the first link has a class attribute, the item has a category
            if itemSingle.find('a').has_attr('class'):
                cnArticle.category = itemSingle.find('a', 'category').string
                cnArticle.title = itemSingle.find('a', attrs={'name': True}).string
            else:
                cnArticle.category = "None"
                cnArticle.title = itemSingle.find('a').string
            self.contentAll.add(str(cnArticle.author))
            self.writeFile(page_num, cnArticle.num, cnArticle.author, cnArticle.postTime,
                           cnArticle.articleComment, cnArticle.articleView,
                           cnArticle.category, cnArticle.title)

    def writeFile(self, page_num, num, author, postTime, articleComment, articleView, category, title):
        # Append one tab-separated record per article
        with open("a.txt", 'a+') as f:
            f.write('page_num is {}'.format(page_num) + '\t' + str(num) + '\t' + str(author) + '\t'
                    + str(postTime) + '\t' + str(articleComment) + '\t' + str(articleView) + '\t'
                    + str(category) + '\t' + str(title) + '\n')


def main(thread_num):
    # time.clock() was removed in Python 3.8; perf_counter() exists since 3.3
    start = time.perf_counter()
    cnblog = CnblogUtils()
    '''
    # Multi-threaded version (disabled)
    thread_list = list()
    for i in range(0, thread_num):
        thread_list.append(threading.Thread(target=cnblog.parsePage, args=('http://blog.csdn.net/?&page={}'.format(i), i + 1,)))
    for thread in thread_list:
        thread.start()
    for thread in thread_list:
        thread.join()
    print(cnblog.contentAll)
    '''
    # Sequential version
    for i in range(0, 18):
        cnblog.parsePage('http://blog.csdn.net/?&page={}'.format(i), i + 1)
    end = time.perf_counter()
    print('time = {}'.format(end - start))


if __name__ == '__main__':
    main(18)
Program output:
page_num is 1 0 foruok 18 minutes ago comments (0) read (0) [programming language] in Windows SKIA
page_num is 1 1 u013467442 31 minutes ago comments (0) read (3) [programming language] Cubieboard learning resource
page_num is 1 2 tuke_tuke 32 minutes ago comments (0) read (15) [mobile development] AdapterView of UI components and its subclass relationship, adapter interface and its implementation class relation
page_num is 1 3 xiaominghimi 53 minutes ago comments (0) read (51) [mobile development] [COCOS2D-X Remarks] ASSETMANAGEREX use exception to solve the remarks-> CHECK_JNI/CC 'JAVA.LANG.NOCLASSDEFFOUNDERROR'
page_num is 1 4 shinian1987 1 hour ago comments (0) read (64) [comprehensive] Python: scikit-image canny edge detection
page_num is 1 5 u010579068 1 hour ago comments (0) read (90) None comparison of STL_algorithm for_each and transform
page_num is 1 6 u013467442 1 hour ago comments (0) read (94) [programming language] OpenGLES2.0 coloring language glsl
page_num is 1 7 u013467442 1 hour ago comments (0) read (89) [programming language] OpenGl Coordinate Transformation
page_num is 1 8 javasongzk 1 hour ago comments (0) read (95) [programming language] bzoj4390 [Usaco2015 Dec] Max Flow
page_num is 1 9 running ongzk 1 hour ago comments (0) read (95) [programming language] bzoj1036 [ZJOI2008] Count Count
page_num is 1 10 danhuang2012 1 hour ago comments (0) read (90) [programming language] how does Node.js handle robustness
page_num is 1 11 EbowTang 1 hour ago comments (0) read (102) [programming language] <LeetCode OJ> 121. Best Time to Buy and Sell Stock
page_num is 1 12 cartzhang 2 hours ago comments (0) read (98) [architecture design] Add the memory tracking function
page_num is 1 13 u013595419 2 hours ago comments (0) read (93) [comprehensive] Chapter 1 exercise Question 3 basic operation of the shared stack
page_num is 1 14 ghostbear 2 hours ago comments (0) read (2nd) [system O & M] Dynamics CRM 1st Series: overview
page_num is 1 15 u014723529 2 hours ago comments (0) read (116) [programming language] restores the string of the Date object returned by the BeanUtils getProperty Method to the object
page_num is 1 16 Evankaka 2 hours ago comments (1) read (142) [architecture design] Jenkins detailed installation and build deployment tutorial
page_num is 1 17 Evankaka 2 hours ago comments (0) read (141) [programming language] install and configure JDK, Tomcat, and SVN servers in Ubuntu
The multi-threaded version may raise errors when the network is slow.
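One possible way to make the threaded crawl more tolerant of a slow network is to limit the number of worker threads and retry failed requests. The sketch below is an assumption, not the code used in this post: `fetch_with_retry` and `crawl_pages` are hypothetical names, and `fetch` stands for any function that downloads one URL. It catches OSError, which is the base class of `urllib.error.URLError` and of socket timeouts.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on OSError retry up to `retries` attempts in total."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)  # wait a little longer each time


def crawl_pages(fetch, urls, max_workers=4, retries=3, delay=1.0):
    """Fetch all urls in a small thread pool.

    Returns a dict mapping each url to its result, or to the exception
    raised on the final attempt, so one bad page does not stop the crawl.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            url: pool.submit(fetch_with_retry, fetch, url, retries, delay)
            for url in urls
        }
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except OSError as exc:
                results[url] = exc
    return results
```

With the class from this post, `fetch` could be `cnblog.getPage`. Passing a `timeout` argument to `urllib.request.urlopen` inside `getPage` keeps a stalled connection from blocking a worker indefinitely.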
Once the data has been collected, you can analyze it, or crawl further, for example by using each author name to fetch that author's blog.
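As a simple example of such analysis, the tab-separated records written by writeFile can be counted per author or per category. This is a minimal sketch: `count_by_field` is a hypothetical helper, and the column positions follow the order in which writeFile writes its fields. The sample records are taken from the output shown earlier.

```python
from collections import Counter

# Column positions in each record, in the order writeFile writes them:
# 0 page label, 1 num, 2 author, 3 postTime, 4 comments, 5 views, 6 category, 7 title
AUTHOR, CATEGORY = 2, 6


def count_by_field(lines, field_index):
    """Count how often each value appears in one tab-separated column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > field_index:  # skip blank or truncated lines
            counts[fields[field_index]] += 1
    return counts


sample = [
    "page_num is 1\t0\tforuok\t18 minutes ago\tcomments (0)\tread (0)\t[programming language]\tin Windows SKIA\n",
    "page_num is 1\t6\tu013467442\t1 hour ago\tcomments (0)\tread (94)\t[programming language]\tOpenGLES2.0 coloring language glsl\n",
    "page_num is 1\t7\tu013467442\t1 hour ago\tcomments (0)\tread (89)\t[programming language]\tOpenGl Coordinate Transformation\n",
]

print(count_by_field(sample, AUTHOR).most_common(1))  # [('u013467442', 2)]
```

To analyze a real crawl, pass the open a.txt file object instead of `sample`, since iterating over a file yields its lines.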