Python Crawler CSDN Series III
by white Shinhuata (http://blog.csdn.net/whiterbear). Please indicate the source when reprinting, thank you.
Description:
In the previous post we already managed to collect the links to all of a user's articles, so the natural next step in this one is to download those posts.
Analysis:
Downloading an article once we have its link is not difficult. The real question is what to do with the data we get. Each article carries its formatting, and we naturally want to store that formatting along with the text (remember that a formatted blog post, like other rich text we edit, is ultimately stored as HTML markup). So how do we save it? We could use a database, but each article is just a long piece of text, so a database is unnecessary; storing plain files is more convenient.
Here I take the div that holds the article body out of each downloaded page, add the necessary HTML head and tail around that section to turn it into a complete HTML document, and then save it as an .html file. One thing to watch out for: the HTML must contain <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>, otherwise the saved page will display as garbled characters.
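To make the idea concrete before the full listing, here is a minimal sketch of just this wrap step. It assumes CSDN serves the article body in a div with id article_content (as it did when this was written); the article URL is only a placeholder, not a real post.

#-*- coding:utf-8 -*-
# Minimal sketch: fetch one article, keep only its body div, and wrap it so
# the result is a complete HTML document. The URL below is a placeholder.
import urllib2
from bs4 import BeautifulSoup

url = 'http://blog.csdn.net/whiterbear/article/details/12345678'  # placeholder
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(req).read())

content = soup.find(id='article_content')  # just the div holding the article body

# The charset meta tag is what prevents garbled Chinese when the file is opened
page = (u'<html><head>'
        u'<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>'
        u'</head><body>%s</body></html>') % unicode(content)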
The core code was already covered in the first two posts of this series, so the difficulty here is not great.
Code:
#-*- coding:utf-8 -*-
import sys
import os
import codecs
import urllib
import urllib2
import cookielib
import MySQLdb
import re
from bs4 import BeautifulSoup
from article import CsdnArticle

reload(sys)
sys.setdefaultencoding('utf-8')

class CsdnCrawler(object):
    def __init__(self, author='whiterbear'):
        self.author = author
        self.domain = 'http://blog.csdn.net/'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
        self.articles = []

    # Given a list-page url, collect every article on that page
    def getArticleLists(self, url=None):
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(''.join(response.read()))
        listitem = soup.find(id='article_list').find_all(attrs={'class': r'list_item article_item'})
        href_regex = r'href="(.*?)"'
        for i, item in enumerate(listitem):
            enitem = item.find(attrs={'class': 'link_title'}).contents[0].contents[0]
            href = re.search(href_regex, str(item.find(attrs={'class': 'link_title'}).contents[0])).group(1)
            art = CsdnArticle()
            art.author = self.author
            art.title = enitem.lstrip()
            art.href = (self.domain + href[1:]).lstrip()
            self.articles.append(art)

    def getPageLists(self, url=None):
        url = url if url else 'http://blog.csdn.net/%s?viewmode=list' % self.author
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(''.join(response.read()))
        num_regex = '[1-9]\d*'
        pagelist = soup.find(id='papelist')
        self.getArticleLists(url)
        if pagelist:
            pagenum = int(re.findall(num_regex, pagelist.contents[1].contents[0])[1])
            for i in range(2, pagenum + 1):
                self.getArticleLists(self.domain + self.author + '/article/list/%s' % i)

    def getAllArticles(self):
        # Create a folder named after the author to hold the downloaded articles
        if not os.path.exists(self.author):
            os.mkdir(self.author)
        for subarticle in self.articles:
            articleurl = subarticle.href
            # Open each article in turn and download it
            req = urllib2.Request(articleurl, headers=self.headers)
            response = urllib2.urlopen(req)
            soup = BeautifulSoup(''.join(response.read()))
            article_content = soup.find(id='article_content')
            title = subarticle.title.rstrip().encode('utf-8')
            # Wrap the extracted content into an HTML-formatted string
            article = u'
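The listing breaks off right where the article body is wrapped into HTML. Going by the Analysis section, the missing tail presumably builds the full HTML string (with the charset meta tag) and writes it into the author's folder; a sketch of that step, plus a possible entry point, might look like the following. The exact template string, file naming, and the entry-point calls are assumptions, not the original code.

            # (assumed continuation) wrap article_content into a complete page;
            # the charset meta tag keeps Chinese text from displaying garbled
            article = (u'<html><head>'
                       u'<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>'
                       u'</head><body>%s</body></html>') % unicode(article_content)
            # save it under the author's folder, named after the article title
            with codecs.open(os.path.join(self.author, title + '.html'), 'w', 'utf-8') as f:
                f.write(article)

# A possible way to drive the class (also not shown in the listing):
if __name__ == '__main__':
    crawler = CsdnCrawler(author='whiterbear')
    crawler.getPageLists()    # collect article links from every list page
    crawler.getAllArticles()  # download and save each article as .html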
Results:
A folder named after the blogger is generated, containing all of that blogger's articles. For example:
Opening any article at random, the result looks like this (the effect is quite good):
Thoughts:
1> Chinese encoding problems. Even though I have worked through several encoding issues before, I still get stuck on this one again and again (a small generic example follows after this list).
2> Keep the code orthogonal. I have not worked on a large project yet, but I can already feel that when two modules are kept orthogonal, changing one does not break the normal operation of the other. It also forces you to think about a clean structure instead of writing tangled code.
3> A common illusion: it always feels simple, as if it could be finished today, and yet some unexpected problem always comes up. I still lack experience.
4> Other: keep the code neat; work iteratively, starting from a small piece of code and accumulating new code on top of it; and always keep two versions (one with plenty of output to help you see what happens at each step).
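As a generic illustration of the encoding pitfall in point 1> (not code taken from the crawler): in Python 2, utf-8 byte strings and unicode objects mix silently until some operation forces an implicit conversion and raises UnicodeDecodeError, which is why the listing above falls back on reload(sys) and sys.setdefaultencoding('utf-8'); decoding and encoding explicitly is the safer habit.

#-*- coding:utf-8 -*-
# Generic example of the usual Python 2 pitfall:
# str (utf-8 bytes) and unicode objects get mixed until something breaks.
byte_title = '中文标题'                           # str: raw utf-8 bytes, e.g. read from a page
uni_title = byte_title.decode('utf-8')            # decode explicitly to unicode
file_name = uni_title.encode('utf-8') + '.html'   # encode back to bytes for a filename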
The next series will probably be a Weibo crawler, which will involve some data processing and analysis. I hope it goes more smoothly.