Python crawler CSDN Series II



By Bear Flower (http://blog.csdn.net/whiterbear). Please indicate the source when reprinting, thank you.


Note:

In the previous article, we learned that as long as the program disguises itself as a browser, it can access CSDN web pages. In this article, we will try to get the links to all the articles of a given CSDN user.

Analysis:

When you open a CSDN user's blog, you can choose between a directory view (e.g., http://blog.csdn.net/whiterbear?viewmode=contents) and an abstract view (e.g., http://blog.csdn.net/whiterbear?viewmode=list). Both views display the user's article list.

Note: here we select the abstract view rather than the directory view; the end of this article explains why.

Open the abstract view and look at the page source. Inside the div with the id 'article_list', each sub-div represents one article:



Each sub-div contains an article's title, link, read count, originality flag, comment count, and other information. We only need the title and the link, and a regular expression to extract them is not hard to write. We can use an array to save all the article titles and links on the page.
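As a minimal illustration (the HTML snippet below is a hypothetical, simplified version of CSDN's 'list_item article_item' markup described above), the extraction could look like this:

# -*- coding: utf-8 -*-
import re

# Hypothetical sub-div, simplified from CSDN's actual article_list markup.
item_html = '''<div class="list_item article_item">
  <div class="article_title">
    <h1><span class="link_title">
      <a href="/whiterbear/article/details/12345">Sample article title</a>
    </span></h1>
  </div>
</div>'''

# One regex pulls out the link, another the title text.
href_regex = r'href="(.*?)"'
title_regex = r'<a href=".*?">\s*(.*?)\s*</a>'

href = re.search(href_regex, item_html).group(1)
title = re.search(title_regex, item_html, re.S).group(1)
print href, title  # /whiterbear/article/details/12345 Sample article title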

But what if the blog is paginated? We then also need to get the article links on the following pages.

I tried two methods. The first sets up an article_list dictionary whose entries map a next-page link to whether it has been visited; initially it holds only the home page link. Each time a page is processed, that link's value is set to visited, and the page is searched for next-page links; any link not yet in the dictionary is added with its visited flag set to False.

For example, the initial dictionary is article_list = {'/pongba/article/list/1': False}. When the /pongba/article/list/1 page is processed, its value is set to True. Suppose we then find /pongba/article/list/2 and /pongba/article/list/3 on that page: has_key() tells us these two keys are not in the dictionary yet, so we add them with the value False. We then traverse keys(); if any link still has the value False, we visit it and repeat the process, as sketched below.
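Here is a minimal sketch of method 1; get_next_page_links() is a hypothetical helper standing in for the actual page fetching and parsing:

# -*- coding: utf-8 -*-

def get_next_page_links(url):
    # Hypothetical helper: fetch `url` and return the next-page links
    # found in its pagination bar, e.g. ['/pongba/article/list/2'].
    return []

def crawl_all_pages(start):
    # key: page link, value: whether the page has been visited
    article_list = {start: False}
    while True:
        # pick the pages that have not been visited yet
        pending = [link for link, visited in article_list.items() if not visited]
        if not pending:
            break
        for link in pending:
            article_list[link] = True  # mark as visited
            for found in get_next_page_links(link):
                if not article_list.has_key(found):  # Python 2 style, as in the article
                    article_list[found] = False
    return article_list.keys()

print crawl_all_pages('/pongba/article/list/1')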

Method 2: the total page count is present in the page's HTML. We extract the page count pagenum and combine it with /pongba/article/list/num, where num is a page number in [1, pagenum]. These links are enough to retrieve all of the author's articles.
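A minimal sketch of method 2; the pagination text below is hypothetical, modeled on the page bar in CSDN's 'papelist' div:

# -*- coding: utf-8 -*-
import re

# Hypothetical pagination text, roughly "126 records, 7 pages in total".
page_text = u'126条数据 共7页'

# The second number in the text is the total page count.
pagenum = int(re.findall(r'[1-9]\d*', page_text)[1])

# Build one list URL per page; together they cover every article.
urls = ['http://blog.csdn.net/pongba/article/list/%s' % i
        for i in range(1, pagenum + 1)]
print pagenum, urls[0], urls[-1]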

The code below uses the second method; I had tried the first method as well.


Code introduction:

The CsdnArticle class (article.py) encapsulates a single article, so that an article's attributes are easy to save and access.

I have overridden the __str__() method to make printing convenient.

# -*- coding: utf-8 -*-

class CsdnArticle(object):
    def __init__(self):
        # blog author
        self.author = ''
        # blog post title
        self.title = ''
        # blog post link
        self.href = ''
        # blog post content
        self.body = ''

    # stringify, so the object can be printed directly
    def __str__(self):
        return self.author + '\t' + self.title + '\t' + self.href + '\t' + self.body
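For instance, once the class is defined, printing an article object is straightforward (the field values below are made up for illustration):

art = CsdnArticle()
art.author = 'whiterbear'
art.title = 'Sample title'
art.href = 'http://blog.csdn.net/whiterbear/article/details/12345'
print art  # prints author, title, link and body, separated by tabs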

The CsdnCrawler class encapsulates the operations for crawling all article links of a CSDN blog.

# -*- coding: utf-8 -*-
import sys
import urllib
import urllib2
import re
from bs4 import BeautifulSoup
from article import CsdnArticle

reload(sys)
sys.setdefaultencoding('utf-8')

class CsdnCrawler(object):
    # defaults to visiting my blog
    def __init__(self, author='whiterbear'):
        self.author = author
        self.domain = 'http://blog.csdn.net/'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
        # array that stores the article objects
        self.articles = []

    # get all article titles and links on the given page
    def getArticleLists(self, url=None):
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(''.join(response.read()))
        listitem = soup.find(id='article_list').find_all(attrs={'class': r'list_item article_item'})
        # regular expression that matches an article link
        href_regex = r'href="(.*?)"'
        for i, item in enumerate(listitem):
            enitem = item.find(attrs={'class': 'link_title'}).contents[0].contents[0]
            href = re.search(href_regex, str(item.find(attrs={'class': 'link_title'}).contents[0])).group(1)
            # wrap the article information in an object and store it in the array
            art = CsdnArticle()
            art.author = self.author
            art.title = enitem.lstrip()
            art.href = (self.domain + href[1:]).lstrip()
            self.articles.append(art)

    def getPageLists(self, url=None):
        url = 'http://blog.csdn.net/%s?viewmode=list' % self.author
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(''.join(response.read()))
        num_regex = r'[1-9]\d*'
        pagelist = soup.find(id='papelist')
        self.getArticleLists(url)
        # if the author has many posts, the list spans several pages
        if pagelist:
            pagenum = int(re.findall(num_regex, pagelist.contents[1].contents[0])[1])
            for i in range(2, pagenum + 1):
                self.getArticleLists(self.domain + self.author + '/article/list/%s' % i)

    def mytest(self):
        for i, url in enumerate(self.articles):
            print i, url

def main():
    # change 'pongba' to your own blog name, or omit it to visit my blog by default
    csdn = CsdnCrawler(author='pongba')
    csdn.getPageLists()
    csdn.mytest()

if __name__ == '__main__':
    main()

Result:



126 data records are output.

 

Description of "abstract View": when a user's article contains multiple pages, the next page link on the directory view page is redirected to the next page link on the summary view. You may not understand this. For example.

Take the blog of Liu Weipeng (whom I greatly admire): http://blog.csdn.net/pongba. He has many articles, so his list is paginated. Select the directory view on his page and look at the pagination links. For example:

The pagination link value is:



We can see that the next-page link is: http://blog.csdn.net + /pongba/article/list/2.

When we enter this URL in the browser and press enter, this result is displayed:



We can see that the first article is "The Stockdale Paradox and bottom-line thinking".

However, when we open the same URL from our program, the result is different:



The result is the same as the second page of the abstract view:


So if you crawl via the directory view, the pages you get are not the ones you expect. I couldn't understand why, and for a long time I worried that my program was at fault; eventually I switched to the abstract view.
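A small diagnostic sketch (not part of the crawler itself) that would have revealed this: urllib2 follows redirects automatically, and response.geturl() returns the final URL, so printing it shows which view the server actually served to the program.

# -*- coding: utf-8 -*-
import urllib2

# same browser-like headers as in the crawler
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
url = 'http://blog.csdn.net/pongba/article/list/2'

req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
# geturl() returns the URL after any redirects,
# revealing which view was actually served
print response.geturl()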

 

To be continued.

