Python Crawler CSDN Series II




By whiterbear (http://blog.csdn.net/whiterbear). Please indicate the source when reprinting, thank you.


Description:

In the previous article, we learned that CSDN pages can be accessed as long as the program disguises itself as a browser. In this article, we will try to get the links to all the articles of a CSDN user.

Analysis:

Open one of a CSDN user's blogs; you can choose either the catalog view (for example: http://blog.csdn.net/whiterbear?viewmode=contents) or the summary view (for example: http://blog.csdn.net/whiterbear?viewmode=list). Both views display the list of the user's articles.

Note: here we choose the summary view, not the catalog view; the reason is explained at the end of this article.

Opening the summary view and inspecting the page source, we find that inside the div with id 'article_list', each child div represents one article.



Each child div contains an article's title, link, view count, whether it is original, number of comments, and other information; we only need to extract the title and the link. Extracting them is not difficult for anyone who has learned regular expressions. We then use an array to save all the article names and links found on the page.
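As a minimal sketch of this extraction (assuming the CSDN markup of the time, where each article sits in a div with class 'list_item article_item' and its title anchor inside a 'link_title' element):

import re
from bs4 import BeautifulSoup

def extract_titles_and_links(html):
    # Each child div of the 'article_list' div represents one article
    soup = BeautifulSoup(html)
    items = soup.find(id='article_list').find_all(attrs={'class': 'list_item article_item'})
    results = []
    for item in items:
        link_title = item.find(attrs={'class': 'link_title'})
        title = link_title.contents[0].contents[0].strip()
        # A regular expression pulls the href out of the anchor tag
        href = re.search(r'href="(.*?)"', str(link_title.contents[0])).group(1)
        results.append((title, href))
    return results

The full crawler code below does the same thing, wrapping each title and link in a CsdnArticle object instead of a tuple.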

It is important to note that if a blog is paginated, we also need to follow the next-page link and collect the article links there as well.

I tried two methods. The first was to maintain an article_list dictionary whose entries are 'next-page link → visited flag' key-value pairs. Initially it holds only the first page's link. While processing each page, we set that page's value to visited, then look for next-page links on it; any link not already in the dictionary is added with its visited flag set to False.

For example, the initial dictionary is article_list = {'/pongba/article/list/1': False}. While processing the page /pongba/article/list/1 we set its value to True and find /pongba/article/list/2 and /pongba/article/list/3. We check (with has_key()) that the dictionary does not contain these two keys, add them, and set their values to False. After traversing the dictionary's keys(), if any link still has the value False, we visit it and repeat, as sketched below.
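A minimal sketch of this first method, assuming a hypothetical helper get_next_page_links(url) that downloads a page and returns the pagination links found on it (the helper is not part of the original code):

def crawl_all_pages(start_page):
    article_list = {start_page: False}  # page link -> visited flag
    # Repeat until every known page has been visited
    while False in article_list.values():
        for page in article_list.keys():  # keys() returns a copy in Python 2
            if not article_list[page]:
                article_list[page] = True  # mark this page as visited
                for link in get_next_page_links(page):  # hypothetical helper
                    if link not in article_list:  # has_key() in older code
                        article_list[link] = False
    return article_list.keys()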

The second method: the pagination HTML gives the total number of pages. We extract that number, pagenum, and combine it with /pongba/article/list/num, where num is the page number and takes values in [1, pagenum]. These links let us retrieve all of the author's articles.
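A sketch of this second method, assuming the pagination text has the form CSDN used at the time, with the page count as the second number in it (which is why the full crawler code below takes index [1] of the regex matches):

import re

def build_page_links(pagination_text, author='pongba'):
    # The second number in the pagination text is the total page count
    pagenum = int(re.findall(r'[1-9]\d*', pagination_text)[1])
    # Page numbers run from 1 to pagenum inclusive
    return ['http://blog.csdn.net/%s/article/list/%d' % (author, num)
            for num in range(1, pagenum + 1)]

For example, with a pagination string like u'157条数据 共11页' (157 entries, 11 pages in total; values made up for illustration), it would return the links for pages 1 through 11.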

I used the second method in the code; I also tried the first, which works as well.


Code Description:

The CsdnArticle class (article.py) encapsulates an article as an object, which makes it convenient to save and access an article's properties.

I overrode the __str__() method for easy printing.

# -*- coding:utf-8 -*-
class CsdnArticle(object):
    def __init__(self):
        # Author
        self.author = ''
        # Blog article title
        self.title = ''
        # Blog link
        self.href = ''
        # Blog content
        self.body = ''

    # Stringify
    def __str__(self):
        return self.author + '\t' + self.title + '\t' + self.href + '\t' + self.body
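A quick usage check (all field values here are made up purely to show the tab-separated __str__() output):

art = CsdnArticle()
art.author = 'whiterbear'  # hypothetical values, for illustration only
art.title = 'Some article title'
art.href = 'http://blog.csdn.net/whiterbear/article/details/12345'
print art  # prints author, title, href and body joined by tabs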

The CsdnCrawler class encapsulates the actions needed to crawl all article links of a CSDN blog.

# -*- coding:utf-8 -*-
import sys
import urllib
import urllib2
import re
from bs4 import BeautifulSoup
from article import CsdnArticle

reload(sys)
sys.setdefaultencoding('utf-8')

class CsdnCrawler(object):
    # Visit my blog by default
    def __init__(self, author='whiterbear'):
        self.author = author
        self.domain = 'http://blog.csdn.net/'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
        # Array that stores the article objects
        self.articles = []

    # Given a url, get all the articles listed on that page
    def getArticleLists(self, url=None):
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(''.join(response.read()))
        listitem = soup.find(id='article_list').find_all(attrs={'class': r'list_item article_item'})
        # Regular expression for the link; matches the href attribute
        href_regex = r'href="(.*?)"'
        for i, item in enumerate(listitem):
            enitem = item.find(attrs={'class': 'link_title'}).contents[0].contents[0]
            href = re.search(href_regex, str(item.find(attrs={'class': 'link_title'}).contents[0])).group(1)
            # Wrap the fetched article info into an object, then save it in the array
            art = CsdnArticle()
            art.author = self.author
            art.title = enitem.lstrip()
            art.href = (self.domain + href[1:]).lstrip()
            self.articles.append(art)

    def getPageLists(self, url=None):
        url = 'http://blog.csdn.net/%s?viewmode=list' % self.author
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(''.join(response.read()))
        num_regex = '[1-9]\d*'
        pagelist = soup.find(id='papelist')
        self.getArticleLists(url)
        # If the author has many posts, the list is paginated
        if pagelist:
            pagenum = int(re.findall(num_regex, pagelist.contents[1].contents[0])[1])
            for i in range(2, pagenum + 1):
                self.getArticleLists(self.domain + self.author + '/article/list/%s' % i)

    def mytest(self):
        for i, url in enumerate(self.articles):
            print i, url

def main():
    # You can replace pongba with your own blog name, or leave it out to visit my blog by default
    csdn = CsdnCrawler(author='pongba')  # 'pongba'
    csdn.getPageLists()
    csdn.mytest()

if __name__ == '__main__':
    main()
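Note that getPageLists() first crawls the summary-view page itself and only then, if the 'papelist' pagination div is present, loops from page 2 to pagenum, calling getArticleLists() on each /article/list/num link. This is exactly the second method described above.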

Results:



The program output 126 entries.

Why the summary view was chosen: when a user's articles are paginated, clicking the next-page link on a catalog-view page jumps to the next page of the summary view. If that sounds confusing, here is an example.

I use the blog of Liu Weipeng (a senior I admire) as the example, at http://blog.csdn.net/pongba. He has many articles, so they are paginated. After selecting the catalog view on his page, look at the page-turning links, such as:

The paging link values are:



You can see that the next-page link is the relative path /pongba/article/list/2, which combined with http://blog.csdn.net gives http://blog.csdn.net/pongba/article/list/2.

When we enter this URL in a browser, this result appears:



You can see that the first article is "The Stockdale Paradox and the bottom-line thinking method".

However, when our program opens the same URL, the result is:



And this result is the same as the second page of the summary view:


So if the program accesses the link http://blog.csdn.net/pongba/article/list/2, the result is not the catalog-view result. I did not understand why, and puzzled for a long time over why the program was "wrong"; later I simply switched to the summary view.

To be continued.

