Crawling the CSDN blog channel with Python

Source: Internet
Author: User

Python is quick to write, and PyCharm together with Python 3.4 makes a convenient development setup.

Some Python modules depend on other packages at install time. Install wheel first so that pip can handle prebuilt .whl files:

pip install wheel

You can then download a .whl file and install it directly:

pip install lxml-3.5.0-cp34-none-win32.whl

Define a class to hold each article record:

class CnblogArticle:
    def __init__(self):
        self.num = ''
        self.category = ''
        self.title = ''
        self.author = ''
        self.postTime = ''
        self.articleComment = ''
        self.articleView = ''

The CSDN blog channel has only 18 pages, so the script parses all 18. The main function contains two variants: a multi-threaded version (commented out) and a plain sequential loop.

Note: each entry on the page is a div with class="blog_list". Most entries contain a link with class="category", but a few do not. The parser has to handle the missing case, otherwise an error is raised.
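The missing-category case from the note can be shown on a small scale. This is only a sketch: the HTML fragment and the extract_category helper are made up for illustration, and it uses the built-in "html.parser" backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second entry has no class="category" link.
SAMPLE_HTML = """
<div class="blog_list">
  <a class="category" href="#">[Programming language]</a>
  <a name="1" href="#">First post</a>
</div>
<div class="blog_list">
  <a name="2" href="#">Second post</a>
</div>
"""


def extract_category(item, default="None"):
    """Return the category text of one entry, or `default` if it has none."""
    link = item.find("a", class_="category")
    # find() returns None when nothing matches, so check before reading text.
    return link.get_text(strip=True) if link is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
categories = [extract_category(item)
              for item in soup.find_all("div", class_="blog_list")]
print(categories)
```

Checking the result of find() for None avoids the AttributeError that reading .string on a missing element would raise.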

<div class="blog_list">

The Beautiful Soup 4.2.0 documentation can be read directly on the official website.
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import time
import threading


class CnblogUtils(object):
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'}
        self.contentAll = set()

    def getPage(self, url=None):
        # Fetch the page with a browser-like User-Agent and parse it with lxml.
        request = urllib.request.Request(url, headers=self.headers)
        response = urllib.request.urlopen(request)
        soup = BeautifulSoup(response.read(), "lxml")
        return soup

    def parsePage(self, url=None, page_num=None):
        soup = self.getPage(url)
        itemBlog = soup.find_all('div', 'blog_list')
        for i, itemSingle in enumerate(itemBlog):
            # One CnblogArticle record (defined above) per entry, so that
            # threads do not overwrite each other's fields.
            cnArticle = CnblogArticle()
            cnArticle.num = i
            cnArticle.author = itemSingle.find('a', 'user_name').string
            cnArticle.postTime = itemSingle.find('span', 'time').string
            cnArticle.articleComment = itemSingle.find('a', 'comment').string
            cnArticle.articleView = itemSingle.find('a', 'view').string
            # If the first link carries a class attribute it is the category link;
            # otherwise the entry has no category and the first link is the title.
            if itemSingle.find('a').has_attr('class'):
                cnArticle.category = itemSingle.find('a', 'category').string
                cnArticle.title = itemSingle.find('a', attrs={'name': True}).string
            else:
                cnArticle.category = "None"
                cnArticle.title = itemSingle.find('a').string
            self.contentAll.add(str(cnArticle.author))
            self.writeFile(page_num, cnArticle.num, cnArticle.author, cnArticle.postTime,
                           cnArticle.articleComment, cnArticle.articleView,
                           cnArticle.category, cnArticle.title)

    def writeFile(self, page_num, num, author, postTime, articleComment, articleView, category, title):
        # Append one tab-separated record per article to a.txt.
        with open("a.txt", 'a+') as f:
            f.write('page_num is {}'.format(page_num) + '\t' + str(num) + '\t' + str(author) + '\t'
                    + str(postTime) + '\t' + str(articleComment) + '\t' + str(articleView) + '\t'
                    + str(category) + '\t' + str(title) + '\n')


def main(thread_num):
    # time.clock() was removed in Python 3.8; perf_counter() works on 3.3+.
    start = time.perf_counter()
    cnblog = CnblogUtils()
    '''
    thread_list = []
    for i in range(0, thread_num):
        thread_list.append(threading.Thread(target=cnblog.parsePage, args=('http://blog.csdn.net/?&page={}'.format(i), i + 1,)))
    for thread in thread_list:
        thread.start()
    for thread in thread_list:
        thread.join()
    print(cnblog.contentAll)
    '''
    for i in range(0, 18):
        cnblog.parsePage('http://blog.csdn.net/?&page={}'.format(i), i + 1)
    end = time.perf_counter()
    print('time = {}'.format(end - start))


if __name__ == '__main__':
    main(18)
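The commented-out thread block in main could also be written with a thread pool from the standard library. The following is a rough sketch, not the author's code: parse_page_stub is a hypothetical stand-in for CnblogUtils.parsePage so the example runs without network access, and crawl_all and max_workers=4 are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'http://blog.csdn.net/?&page={}'


def parse_page_stub(url, page_num):
    """Hypothetical stand-in for CnblogUtils.parsePage; does no network I/O."""
    return page_num, url


def crawl_all(parse, page_count=18, max_workers=4):
    """Run parse(url, page_num) for every page on a small pool of threads."""
    urls = [BASE_URL.format(i) for i in range(page_count)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(parse, url, i + 1) for i, url in enumerate(urls)]
        # result() re-raises any exception from the worker, so failures
        # are not silently lost as they can be with bare threads.
        return [future.result() for future in futures]


results = crawl_all(parse_page_stub)
print(len(results), results[0])
```

A pool caps the number of simultaneous requests, whereas the original block starts one thread per page.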

 

Program output:

page_num is 1 0 foruok 18 minutes ago comments (0) read (0) [programming language] in Windows SKIA
page_num is 1 1 u013467442 31 minutes ago comments (0) read (3) [programming language] Cubieboard learning resource
page_num is 1 2 tuke_tuke 32 minutes ago comments (0) read (15) [mobile development] AdapterView of UI components and its subclass relationship, adapter interface and its implementation class relation
page_num is 1 3 xiaominghimi 53 minutes ago comments (0) read (51) [Mobile development] [COCOS2D-X Remarks] ASSETMANAGEREX use exception to solve the remarks-> CHECK_JNI/CC 'JAVA.LANG.NOCLASSDEFFOUNDERROR'
page_num is 1 4 shinian1987 1 hour ago comments (0) read (64) [comprehensive] Python: scikit-image canny edge detection
page_num is 1 5 u010579068 1 hour ago comments (0) read (90) None comparison of STL_algorithm for_each and transform
page_num is 1 6 u013467442 1 hour ago comments (0) read (94) [programming language] OpenGLES2.0 coloring language glsl
page_num is 1 7 u013467442 1 hour ago comments (0) read (89) [programming language] OpenGl Coordinate Transformation
page_num is 1 8 javasongzk 1 hour ago comments (0) read (95) [programming language] bzoj4390 [Usaco2015 Dec] Max Flow
page_num is 1 9 running ongzk 1 hour ago comments (0) read (95) [programming language] bzoj1036 [ZJOI2008] Count Count
page_num is 1 10 danhuang2012 1 hour ago comments (0) read (90) [programming language] Node. how does js handle robustness
page_num is 1 11 EbowTang 1 hour ago comments (0) read (102) [programming language] <LeetCode OJ> 121. Best Time to Buy and Sell Stock
page_num is 1 12 cartzhang 2 hours ago comments (0) read (98) [architecture design] Add the memory tracking function
page_num is 1 13 u013595419 2 hours ago comments (0) read (93) [comprehensive] Chapter 1 exercise Question 3 basic operation of the shared stack
page_num is 1 14 ghostbear 2 hours ago comments (0) read (2nd) [system O & M] Dynamics CRM 1st Series: overview
page_num is 1 15 u014723529 2 hours ago comments (0) read (116) [programming language] restores the string of the Date object returned by the BeanUtils getProperty Method to the object
page_num is 1 16 Evankaka 2 hours ago comments (1) read (142) [architecture design] Jenkins detailed installation and build deployment tutorial
page_num is 1 17 Evankaka 2 hours ago comments (0) read (141) [programming language] install and configure JDK, Tomcat, and SVN servers in Ubuntu

The multi-threaded version may fail with errors when the network connection is slow.
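One common way to cope with this is to retry a failed request after a short pause. The article does not say which error occurs, so the sketch below assumes a timeout or connection failure (both are OSError subclasses in Python 3.3+, as is urllib.error.URLError). The with_retries helper and the flaky_fetch function are hypothetical and only simulate a slow network.

```python
import time


def with_retries(func, attempts=3, delay=0.01):
    """Call func(); on an OSError, pause and try again up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * attempt)  # wait a little longer each time


calls = {'count': 0}


def flaky_fetch():
    """Hypothetical fetch that fails twice, then succeeds."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated slow network')
    return 'page content'


result = with_retries(flaky_fetch)
print(result, calls['count'])
```

In the crawler, the wrapped call would be the page fetch. Passing a timeout to urllib.request.urlopen also keeps a stalled request from blocking a thread indefinitely.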

Once the data has been collected, you can analyze it, or crawl further, for example by using the collected author names to fetch each author's blog.
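As a small example of such analysis, the records can be grouped by author. This is a minimal sketch that assumes the tab-separated layout written by writeFile (author in the third field); the sample lines are taken from the output shown earlier, and count_authors is a name made up for illustration.

```python
from collections import Counter

# A few records in the a.txt layout: page, num, author, time,
# comments, views, category, title (tab-separated).
SAMPLE_LINES = [
    'page_num is 1\t0\tforuok\t18 minutes ago\tcomments (0)\tread (0)\t[programming language]\tin Windows SKIA\n',
    'page_num is 1\t1\tu013467442\t31 minutes ago\tcomments (0)\tread (3)\t[programming language]\tCubieboard learning resource\n',
    'page_num is 1\t6\tu013467442\t1 hour ago\tcomments (0)\tread (94)\t[programming language]\tOpenGLES2.0 coloring language glsl\n',
    'page_num is 1\t7\tu013467442\t1 hour ago\tcomments (0)\tread (89)\t[programming language]\tOpenGl Coordinate Transformation\n',
]


def count_authors(lines):
    """Count how many records each author has; skip malformed lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 8:
            continue
        counts[fields[2]] += 1
    return counts


author_counts = count_authors(SAMPLE_LINES)
print(author_counts.most_common(1))
```

To run it on real data, pass the open a.txt file object in place of SAMPLE_LINES.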
