Python3 Crawler (8) -- Crawling the CSDN Blog Again with BeautifulSoup


In Python3 Crawler (5) we completed the task of crawling all the posts of my CSDN blog using basic urllib functions and regular expressions (see: Python3 Crawler (5) -- single-threaded crawl of all my CSDN blog posts). Since then we have learned about BeautifulSoup, an excellent Python library that deserves to be put to good use, so in this post we will use BeautifulSoup4 to re-implement the CSDN blog crawl.
Since I changed my blog configuration, the homepage theme has changed, so we will work with the page under the new theme, as shown below:


As before, we first confirm the information to be extracted and the total number of pages in the blog.
Analyzing the page source. The URL and the request-header settings are the same as before, so I won't repeat them here; the focus is on how to use BeautifulSoup4 to obtain the target information. First, look at the relevant parts of the current page source. The blog-post information module:
The page-list (pagination) module:


Extracting the total number of pages
    # Get the total number of pages
    def getPages(self):
        req = urllib.request.Request(url=self.url, headers=self.headers)
        page = urllib.request.urlopen(req)
        # The content fetched from my CSDN blog homepage is gzip-compressed, so decompress it first
        data = page.read()
        data = ungzip(data)
        data = data.decode('utf-8')
        # Build the BeautifulSoup object
        soup = BeautifulSoup(data, 'html5lib')
        # Count the total number of pages of my posts
        tag = soup.find('div', 'pagelist')
        pagesData = tag.span.get_text()
        # The pager text reads "392条  共20页" ("392 posts, 20 pages in total"); extract the number
        pagesNum = re.findall(re.compile(pattern=r'共(.*?)页'), pagesData)[0]
        return pagesNum
As the code above shows, once we have read the page data and built a BeautifulSoup object, a single find() call locates the pager text "392条 共20页" ("392 posts, 20 pages in total"). The number 20 is what we want, and a regular expression extracts it.
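To see that pattern on its own, here is a minimal self-contained sketch; the pager HTML below is a hand-written stand-in for the real CSDN markup, and the standard-library 'html.parser' is used instead of html5lib so it runs without extra dependencies:

    import re
    from bs4 import BeautifulSoup

    # Hand-written stand-in for CSDN's pager markup (for illustration only)
    pager_html = '<div class="pagelist"><span>392条  共20页</span><a href="/fly_yr/article/list/2">下一页</a></div>'

    soup = BeautifulSoup(pager_html, 'html.parser')  # stdlib parser; the crawler itself uses html5lib
    tag = soup.find('div', 'pagelist')               # the second argument filters on the CSS class
    pagesData = tag.span.get_text()                  # "392条  共20页"
    pagesNum = re.findall(r'共(.*?)页', pagesData)[0]
    print(pagesNum)                                  # prints: 20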
Extracting post information
    # Read post information
    def readData(self):
        ret = []
        req = urllib.request.Request(url=self.url, headers=self.headers)
        res = urllib.request.urlopen(req)
        # The content fetched from my CSDN blog homepage is gzip-compressed, so decompress it first
        data = res.read()
        data = ungzip(data)
        data = data.decode('utf-8')
        soup = BeautifulSoup(data, "html5lib")
        # Find all post blocks
        items = soup.find_all('div', "list_item article_item")
        for item in items:
            # Title, link, date, read count, comment count
            title = item.find('span', "link_title").a.get_text()
            link = item.find('span', "link_title").a.get('href')
            writeTime = item.find('span', "link_postdate").get_text()
            readers = re.findall(re.compile(r'\((.*?)\)'), item.find('span', "link_view").get_text())[0]
            comments = re.findall(re.compile(r'\((.*?)\)'), item.find('span', "link_comments").get_text())[0]
            ret.append('Date: ' + writeTime + '\nTitle: ' + title + '\nLink: http://blog.csdn.net' + link
                       + '\nRead: ' + readers + '\tComments: ' + comments + '\n')
        return ret
As the code shows, every piece of information can be extracted with straightforward BeautifulSoup calls, without constructing complex regular expressions over the whole page, which greatly simplifies the work.
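To make that concrete, the sketch below parses a single hand-written post block shaped like CSDN's list_item markup; the HTML, the link and the numbers in it are made up for demonstration and are not taken from the live page:

    import re
    from bs4 import BeautifulSoup

    # Hypothetical stand-in for one post block in CSDN's post list
    item_html = '''
    <div class="list_item article_item">
      <span class="link_title"><a href="/fly_yr/article/details/0000000">Sample post title</a></span>
      <span class="link_postdate">2016-06-01 10:00</span>
      <span class="link_view">阅读(1024)</span>
      <span class="link_comments">评论(8)</span>
    </div>
    '''

    item = BeautifulSoup(item_html, 'html.parser').find('div', 'list_item article_item')
    title = item.find('span', 'link_title').a.get_text()        # "Sample post title"
    link = item.find('span', 'link_title').a.get('href')        # "/fly_yr/article/details/0000000"
    writeTime = item.find('span', 'link_postdate').get_text()   # "2016-06-01 10:00"
    readers = re.findall(r'\((.*?)\)', item.find('span', 'link_view').get_text())[0]      # "1024"
    comments = re.findall(r'\((.*?)\)', item.find('span', 'link_comments').get_text())[0] # "8"
    print(title, link, writeTime, readers, comments)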
Complete code. The other operations are the same as before and are not repeated; the full program follows:
"PROGRAM:CSDN Blog crawler 2function: Using BeautifulSoup technology to achieve the date, subject, number of visits, comments on my CSDN home page All blog posts crawl Version:python 3.5.1time:2016/ 06/01autuor:yr ' Import urllib.request,re,time,random,gzipfrom BS4 import beautifulsoup# definition save file Function def saveFile (data,i ): Path = "E:\\projects\\spider\\06_csdn2\\papers\\paper_" +str (i+1) + ". txt" file = open (path, ' WB ') page = ' current page: ' +s        TR (i+1) + ' \ n ' file.write (Page.encode (' GBK ')) #将博文信息写入文件 (utf-8 saved file declared as GBK) for d in data:d = str (d) + ' \ n ' File.write (D.encode (' GBK ')) file.close () #解压缩数据def ungzip (data): Try: #print ("extracting ...") data = g Zip.decompress (data) #print ("Decompression complete ...") except:print ("Uncompressed, no decompression ...") return DATA#CSDN Reptile class Csdnspid Er:def __init__ (self,pageidx=1,url= "HTTP://BLOG.CSDN.NET/FLY_YR/ARTICLE/LIST/1"): #默认当前页 self.pageidx = Pageidx Self.url = Url[0:url.rfind ('/') + 1] + str (PAGEIDX) self.headers = {"Connection": "keep- Alive "," user-agent ":" Mozilla/5.0 (Windows NT 10.0; WOW64) applewebkit/537.36 "" (Khtml, like Gecko) chrome/51.0.2704.63 safari/537.36 "," Accept ":" text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 "," accept-encoding ":" gzip    , deflate, sdch "," Accept-language ":" zh-cn,zh;q=0.8 "," Host ":" Blog.csdn.net "} #求总页数 def getpages (self): req = Urllib.request.Request (url=self.url, headers=self.headers) page = urllib.request.u Rlopen (req) # from my csdn blog home crawl content is compressed content, first extract data = Page.read () data = ungzip (data) data = data. Decode (' Utf-8 ') # gets BeautifulSoup Object soup = beautifulsoup (data, ' Html5lib ') # count my posts total pages tag = so Up.find (' div ', ' pagelist ') Pagesdata = Tag.span.get_text () #输出392条 A total of 20 pages, find the number Pagesnum = Re.findall (Re.compile (. *?) (PATTERN=R) Page '), pagesdata) [0] return pagesnum #设置要抓取的博文页面 def setpage (SELF,IDX): SelF.url = Self.url[0:self.url.rfind ('/') +1]+str (idx) #读取博文信息 def readdata (self): ret=[] req = urllib.req Uest.         Request (Url=self.url, headers=self.headers) res = Urllib.request.urlopen (req) # The content crawled from my CSDN blog homepage is compressed content, first decompressed data = Res.read () data = ungzip (data) data = Data.decode (' utf-8 ') soup=beautifulsoup (data, "HT            Ml5lib ") #找到所有的博文代码模块 items = soup.find_all (' div '," List_item article_item ") for item in items: #标题, link, date, number of reads, number of comments title = Item.find (' span ', ' link_title '). A.get_text () link = item.find (' span ' , "Link_title"). A.get (' href ') Writetime = Item.find (' span ', "link_postdate"). Get_text () Readers = Re.f Indall (Re.compile (. *?) \), item.find (' span ', "Link_view"). Get_text ()) [0] comments = Re.findall (Re.compile ((. *)                       \), item.find (' span ', "link_comments"). Get_text ()) [0] Ret.append (' Date: ' +writetime+ ' \ n title: ' +title + ' \ n ChainAnswer: http://blog.csdn.net ' +link + ' \ n ' + ' read: ' +readers+ ' \ t comment: ' +comments+ ' \ n ') return ret# definition Crawler cs = Csdnspider () #求取pagesNum = Int (cs.getpages ()) Print ("Total pages of Posts:", pagesnum) for IDX in range (Pagesnum): cs.setpage (IDX) PR Int ("Current page:", idx+1) #读取当前页的所有博文, the result is list type papers = Cs.readdata () saveFile (PAPERS,IDX)

The complete code is also available on GitHub -- please click here.

