Python3 Crawler (8) -- Crawling the CSDN Blog Again with BeautifulSoup


In Python3 Crawler (5) we completed the task of crawling all the posts of my CSDN blog using basic urllib functions and regular expressions (see: Python3 Crawler (5) -- single-threaded crawl of all my CSDN blog posts). Since then we have learned about BeautifulSoup, an excellent Python library that deserves to be put to good use, so in this post we will use BeautifulSoup4 to re-implement the CSDN blog crawl.
Since I changed my blog configuration, the homepage theme has changed, so we will work with the page under the new theme, as shown below:


As before, we first confirm the information to be extracted and the total number of pages in the blog.
Analyzing the page source. The URL and the request-header settings are the same as before, so I won't repeat them here; the focus is on how to use BeautifulSoup4 to obtain the target information. First, look at the relevant parts of the current page source. The blog-post information module:
The page-list (pagination) module:


Extracting the total number of pages
    # Get the total number of pages
    def getPages(self):
        req = urllib.request.Request(url=self.url, headers=self.headers)
        page = urllib.request.urlopen(req)
        # The content fetched from my CSDN blog homepage is gzip-compressed, so decompress it first
        data = page.read()
        data = ungzip(data)
        data = data.decode('utf-8')
        # Build the BeautifulSoup object
        soup = BeautifulSoup(data, 'html5lib')
        # Count the total number of pages of my posts
        tag = soup.find('div', 'pagelist')
        pagesData = tag.span.get_text()
        # The pager text reads "392条  共20页" ("392 posts, 20 pages in total"); extract the number
        pagesNum = re.findall(re.compile(pattern=r'共(.*?)页'), pagesData)[0]
        return pagesNum
As the code above shows, once we have read the page data and built a BeautifulSoup object, a single find() call locates the pager text "392条 共20页" ("392 posts, 20 pages in total"). The number 20 is what we want, and a regular expression extracts it.
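To see that pattern on its own, here is a minimal self-contained sketch; the pager HTML below is a hand-written stand-in for the real CSDN markup, and the standard-library 'html.parser' is used instead of html5lib so it runs without extra dependencies:

    import re
    from bs4 import BeautifulSoup

    # Hand-written stand-in for CSDN's pager markup (for illustration only)
    pager_html = '<div class="pagelist"><span>392条  共20页</span><a href="/fly_yr/article/list/2">下一页</a></div>'

    soup = BeautifulSoup(pager_html, 'html.parser')  # stdlib parser; the crawler itself uses html5lib
    tag = soup.find('div', 'pagelist')               # the second argument filters on the CSS class
    pagesData = tag.span.get_text()                  # "392条  共20页"
    pagesNum = re.findall(r'共(.*?)页', pagesData)[0]
    print(pagesNum)                                  # prints: 20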
Extracting post information
    # Read post information
    def readData(self):
        ret = []
        req = urllib.request.Request(url=self.url, headers=self.headers)
        res = urllib.request.urlopen(req)
        # The content fetched from my CSDN blog homepage is gzip-compressed, so decompress it first
        data = res.read()
        data = ungzip(data)
        data = data.decode('utf-8')
        soup = BeautifulSoup(data, "html5lib")
        # Find all post blocks
        items = soup.find_all('div', "list_item article_item")
        for item in items:
            # Title, link, date, read count, comment count
            title = item.find('span', "link_title").a.get_text()
            link = item.find('span', "link_title").a.get('href')
            writeTime = item.find('span', "link_postdate").get_text()
            readers = re.findall(re.compile(r'\((.*?)\)'), item.find('span', "link_view").get_text())[0]
            comments = re.findall(re.compile(r'\((.*?)\)'), item.find('span', "link_comments").get_text())[0]
            ret.append('Date: ' + writeTime + '\nTitle: ' + title + '\nLink: http://blog.csdn.net' + link
                       + '\nRead: ' + readers + '\tComments: ' + comments + '\n')
        return ret
As the code shows, every piece of information can be extracted with straightforward BeautifulSoup calls, without constructing complex regular expressions over the whole page, which greatly simplifies the work.
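To make that concrete, the sketch below parses a single hand-written post block shaped like CSDN's list_item markup; the HTML, the link and the numbers in it are made up for demonstration and are not taken from the live page:

    import re
    from bs4 import BeautifulSoup

    # Hypothetical stand-in for one post block in CSDN's post list
    item_html = '''
    <div class="list_item article_item">
      <span class="link_title"><a href="/fly_yr/article/details/0000000">Sample post title</a></span>
      <span class="link_postdate">2016-06-01 10:00</span>
      <span class="link_view">阅读(1024)</span>
      <span class="link_comments">评论(8)</span>
    </div>
    '''

    item = BeautifulSoup(item_html, 'html.parser').find('div', 'list_item article_item')
    title = item.find('span', 'link_title').a.get_text()        # "Sample post title"
    link = item.find('span', 'link_title').a.get('href')        # "/fly_yr/article/details/0000000"
    writeTime = item.find('span', 'link_postdate').get_text()   # "2016-06-01 10:00"
    readers = re.findall(r'\((.*?)\)', item.find('span', 'link_view').get_text())[0]      # "1024"
    comments = re.findall(r'\((.*?)\)', item.find('span', 'link_comments').get_text())[0] # "8"
    print(title, link, writeTime, readers, comments)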
Complete code. The other operations are the same as before and are not repeated; the full program follows:
"PROGRAM:CSDN Blog crawler 2function: Using BeautifulSoup technology to achieve the date, subject, number of visits, comments on my CSDN home page All blog posts crawl Version:python 3.5.1time:2016/ 06/01autuor:yr ' Import urllib.request,re,time,random,gzipfrom BS4 import beautifulsoup# definition save file Function def saveFile (data,i ): Path = "E:\\projects\\spider\\06_csdn2\\papers\\paper_" +str (i+1) + ". txt" file = open (path, ' WB ') page = ' current page: ' +s        TR (i+1) + ' \ n ' file.write (Page.encode (' GBK ')) #将博文信息写入文件 (utf-8 saved file declared as GBK) for d in data:d = str (d) + ' \ n ' File.write (D.encode (' GBK ')) file.close () #解压缩数据def ungzip (data): Try: #print ("extracting ...") data = g Zip.decompress (data) #print ("Decompression complete ...") except:print ("Uncompressed, no decompression ...") return DATA#CSDN Reptile class Csdnspid Er:def __init__ (self,pageidx=1,url= "HTTP://BLOG.CSDN.NET/FLY_YR/ARTICLE/LIST/1"): #默认当前页 self.pageidx = Pageidx Self.url = Url[0:url.rfind ('/') + 1] + str (PAGEIDX) self.headers = {"Connection": "keep- Alive "," user-agent ":" Mozilla/5.0 (Windows NT 10.0; WOW64) applewebkit/537.36 "" (Khtml, like Gecko) chrome/51.0.2704.63 safari/537.36 "," Accept ":" text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 "," accept-encoding ":" gzip    , deflate, sdch "," Accept-language ":" zh-cn,zh;q=0.8 "," Host ":" Blog.csdn.net "} #求总页数 def getpages (self): req = Urllib.request.Request (url=self.url, headers=self.headers) page = urllib.request.u Rlopen (req) # from my csdn blog home crawl content is compressed content, first extract data = Page.read () data = ungzip (data) data = data. Decode (' Utf-8 ') # gets BeautifulSoup Object soup = beautifulsoup (data, ' Html5lib ') # count my posts total pages tag = so Up.find (' div ', ' pagelist ') Pagesdata = Tag.span.get_text () #输出392条 A total of 20 pages, find the number Pagesnum = Re.findall (Re.compile (. *?) (PATTERN=R) Page '), pagesdata) [0] return pagesnum #设置要抓取的博文页面 def setpage (SELF,IDX): SelF.url = Self.url[0:self.url.rfind ('/') +1]+str (idx) #读取博文信息 def readdata (self): ret=[] req = urllib.req Uest.         Request (Url=self.url, headers=self.headers) res = Urllib.request.urlopen (req) # The content crawled from my CSDN blog homepage is compressed content, first decompressed data = Res.read () data = ungzip (data) data = Data.decode (' utf-8 ') soup=beautifulsoup (data, "HT            Ml5lib ") #找到所有的博文代码模块 items = soup.find_all (' div '," List_item article_item ") for item in items: #标题, link, date, number of reads, number of comments title = Item.find (' span ', ' link_title '). A.get_text () link = item.find (' span ' , "Link_title"). A.get (' href ') Writetime = Item.find (' span ', "link_postdate"). Get_text () Readers = Re.f Indall (Re.compile (. *?) \), item.find (' span ', "Link_view"). Get_text ()) [0] comments = Re.findall (Re.compile ((. *)                       \), item.find (' span ', "link_comments"). Get_text ()) [0] Ret.append (' Date: ' +writetime+ ' \ n title: ' +title + ' \ n ChainAnswer: http://blog.csdn.net ' +link + ' \ n ' + ' read: ' +readers+ ' \ t comment: ' +comments+ ' \ n ') return ret# definition Crawler cs = Csdnspider () #求取pagesNum = Int (cs.getpages ()) Print ("Total pages of Posts:", pagesnum) for IDX in range (Pagesnum): cs.setpage (IDX) PR Int ("Current page:", idx+1) #读取当前页的所有博文, the result is list type papers = Cs.readdata () saveFile (PAPERS,IDX)

The complete code is also available on GitHub -- please click here.

