Crawling blog information with Python

Source: Internet
Author: User

I recently started blogging and wanted to know how my posts' read counts trend over time, but unfortunately CSDN does not provide this feature. I could have checked manually now and then and logged the numbers into an Excel sheet, but as a programmer I'd rather write a program to automate the manual routine. Since I have been studying Python lately, I decided to write a Python crawler to collect this information. (My Python is still weak: instead of efficient idioms I fall back on crude string splitting, so some of the syntax here is quite low-level; I should be able to clean up the ugly parts as my Python improves, so treat this as a rough sketch. Reference article (its crawler rules no longer apply to the new version of the CSDN page, but it provides a good starting idea): "Python script" - a crawler that gets CSDN blog article visit and comment counts.)

#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib import request
from bs4 import BeautifulSoup

account = "u011552404"
baseUrl = 'http://blog.csdn.net'

'''Use urllib to fetch an HTML page'''
def getPage(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}  # disguise the request as a browser visit
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)
    page = response.read()
    return page

'''Get the number of article list pages:
1. so that the per-page crawl below knows how many pages to fetch
2. on CSDN the last number shown in the page-link display area is the total page count'''
def getPageCount(url):
    page = getPage(url)
    # BeautifulSoup parses the HTML into a tree of tag objects
    soup = BeautifulSoup(page, 'html.parser', from_encoding='utf-8')
    papelist = soup.find_all(class_="page-link")  # page-number display area
    numberList = papelist[-2]  # data of the page-number display area
    res = str(numberList).split('<')[-2].split('>')[-1]  # extract the page count
    return res

'''Extract every article's title and read count:
1. see the BeautifulSoup tutorial for background
2. the core idea is fetch + split + keyword lookup; to find the keywords, inspect the
   page in the browser developer tools (F12) and note the position and element
   characteristics of each piece of information (some front-end HTML experience helps)
3. the extracted data is written to a txt file'''
def getArticleDetails():
    myUrl = baseUrl + '/' + account
    page_sum_number = getPageCount(myUrl)
    print("pageNumber", page_sum_number)
    cur_page_num = 1
    linkList = []
    titleList = []
    dateList = []
    readList = []
    while cur_page_num <= int(page_sum_number):
        url = myUrl + '/article/list/' + str(cur_page_num)  # CSDN per-page blog URL format
        myPage = getPage(url)
        soup = BeautifulSoup(myPage, 'html.parser', from_encoding='utf-8')  # parse the current page
        print(soup)
        for blog_list in soup.find_all(class_="blog-unit"):
            # on CSDN list pages every blog entry lives in an element with class
            # "blog-unit", so this keyword locates the blog information list
            print("blog_list", blog_list.contents)
            link_elment = blog_list.contents[1]  # the first element holds the link address
            link = link_elment['href'].strip()  # extract the link address
            print("link", link)
            name_elment = link_elment.contents[1]  # the link element's first child is the title element
            name = str(name_elment).split('\n')[-1].split('\t')[-3]  # extract the post title
            print("name", name)
            linkList.append(link)
            titleList.append(name)
        '''The date and read count behaved unexpectedly here: although their element area
        is clearly separate from the title area above, they are not parsed as children
        of blog_list. Their keyword is class_="floatL left-dis-24", but that keyword
        matches three items per post: date, read count and comment count. A for-in loop
        therefore does not fit; instead an index walks the matched list, taking the
        items in groups of three (this demo only extracts two of the three).'''
        list = soup.find_all(class_="floatL left-dis-24")
        i = 0
        while i < len(list):
            date_info = list[i]
            print("date_info", type(date_info), date_info)
            date = str(date_info).split('<')[1].split('>')[-1]
            print("date", date)
            dateList.append(date)
            i = i + 1
            read_info = list[i]
            print("read_info", read_info)
            read_span = read_info.find('span')
            read = str(read_span).split('>')[1].split('<')[0]
            print("read", read)
            readList.append(read)
            i = i + 2  # skip the comment-count item
        cur_page_num = cur_page_num + 1
    f = open("./read_count.txt", "a+")
    for i in range(0, len(titleList)):
        # string = titleList[i] + '\t' + linkList[i] + '\t' + dateList[i] + '\t' + readList[i] + '\n'
        string = readList[i] + '\n'
        f.write(str(string))
        f.write('\n')
    f.close()

if __name__ == "__main__":
    getArticleDetails()
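A note on the extraction style: the str(...).split('<') chains above are brittle because they re-parse serialized HTML as plain text. BeautifulSoup already exposes tag attributes and text directly, so the same fields can be pulled out without string surgery. A minimal sketch of that alternative; the HTML fragment below is illustrative, shaped like the old CSDN layout the script targets, not a live page:

```python
from bs4 import BeautifulSoup

# Illustrative fragment shaped like the old CSDN list layout described above
html = '''
<div class="blog-unit">
  <a href="https://blog.csdn.net/u011552404/article/details/51130936">
    <span>netstat command output Analysis</span>
  </a>
</div>
<span class="floatL left-dis-24">2016-04-12 11:15:28</span>
<span class="floatL left-dis-24"><span>5916</span></span>
'''

soup = BeautifulSoup(html, 'html.parser')
unit = soup.find(class_="blog-unit")
link = unit.find('a')['href']                # attribute access instead of split('<')
title = unit.find('a').get_text(strip=True)  # get_text() instead of split('\n')/split('\t')
info = soup.find_all(class_="floatL left-dis-24")
date = info[0].get_text(strip=True)
reads = info[1].get_text(strip=True)
print(link, title, date, reads)
```

The same two-items-per-post stepping would still be needed on a real list page; only the per-element extraction changes.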

Detailed comments are included in the code and should be easy to follow; the source is also on GitHub: Github/bloginfospider. One more thing to note: you need to adapt the crawling rules to the specific element structure of the page you want to crawl. Below are partial crawl results (title, address, time, read count):

netstat command output Analysis  https://blog.csdn.net/u011552404/article/details/51130936  2016-04-12 11:15:28  5916
64-bit and 32-bit difference  https://blog.csdn.net/u011552404/article/details/50942783  2016-03-21 10:02:22  510
Routing related knowledge  https://blog.csdn.net/u011552404/article/details/50917408  2016-03-17 21:45:48  422
4G module ZhongXing ME3760 debug record  https://blog.csdn.net/u011552404/article/details/50865836  2016-03-12 17:03:29  5433
select() function and fd_set  https://blog.csdn.net/u011552404/article/details/50828029  2016-03-08 16:12:54  264
Understanding of the socket function  https://blog.csdn.net/u011552404/article/details/50827968  2016-03-08 16:07:59  254
The principle of RTP H264  https://blog.csdn.net/u011552404/article/details/50814532  2016-03-06 18:30:53  205
RTP timestamp  https://blog.csdn.net/u011552404/article/details/50814433  2016-03-06 18:03:09  262
Raspberry Pi using USB camera  https://blog.csdn.net/u011552404/article/details/50807741  2016-03-05 11:22:26  8425
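Once the commented-out four-column format in the script (title, link, date, read count, tab-separated) is written to read_count.txt instead of the bare read count, the file is easy to post-process for the trend statistics this post is after. A small sketch under that assumption, using two sample lines from the results above:

```python
# Parse tab-separated "title<TAB>link<TAB>date<TAB>reads" lines, as produced by
# the commented-out four-column format in the script, and sum the read counts.
lines = [
    "netstat command output Analysis\thttps://blog.csdn.net/u011552404/article/details/51130936\t2016-04-12 11:15:28\t5916",
    "64-bit and 32-bit difference\thttps://blog.csdn.net/u011552404/article/details/50942783\t2016-03-21 10:02:22\t510",
]

total = 0
for line in lines:
    title, link, date, reads = line.split('\t')  # one record per line
    total += int(reads)
print("total reads:", total)  # 5916 + 510 = 6426
```

Re-running the crawler periodically and diffing totals like this gives the reading trend without any manual Excel logging.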
