Crawl CSDN Blog Traffic with Python


I recently learned Python and web crawling and wanted to write a program to practice, so I thought of something bloggers tend to care about: the visit counts of their own blog. This program uses Python to fetch my blog's visit counts. It is part of a larger project I will continue in later posts, where I will analyze the traffic and present it with line charts, pie charts, and other visualizations, so I can get a clearer picture of which posts draw attention. Blog experts, please don't flame me: I am not an expert, and I hear that experts get this statistics feature built in.

I. Website analysis

Open your own blog home page; the URL is http://blog.csdn.net/xingjiarong. The pattern is obvious: the CSDN domain plus your personal CSDN login account. Next, let's look at the URL of the second page.

The address of page two is http://blog.csdn.net/xingjiarong/article/list/2.
The number at the end indicates which page of the list this is, and checking the other pages confirms it. The first page, however, is not shown as http://blog.csdn.net/xingjiarong/article/list/1. If we type http://blog.csdn.net/xingjiarong/article/list/1 into the browser, it does turn out to be the first page: http://blog.csdn.net/xingjiarong is actually redirected to http://blog.csdn.net/xingjiarong/article/list/1, so both URLs reach the first page. The pattern is now clear:
http://blog.csdn.net/xingjiarong/article/list/ + page number
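
As a quick sanity check, here is a minimal sketch of that URL pattern (list_page_url is a hypothetical helper, not part of the final script):

def list_page_url(account, page_num):
    # Build the URL of one page of a user's article list, per the pattern above
    return 'http://blog.csdn.net/' + account + '/article/list/' + str(page_num)

print(list_page_url('xingjiarong', 2))
# http://blog.csdn.net/xingjiarong/article/list/2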

II. How to get the titles

Right-click the page and view its source. Searching the source, we can see that every title sits inside a tag like this:

<span class="link_title"><a href="/xingjiarong/article/details/50651235">

So we can use the following regular expression to match the title:

<span class="link_title"><a href=".*?">(.*?)</a></span>

III. How to get the visit counts

Having got the titles, we also need the corresponding visit counts. After analyzing the source code, I found that the visit-count markup looks like this:

<span class="link_view" title="阅读次数">   <a href="/xingjiarong/article/details/50651235" title="阅读次数">阅读</a>(1140)</span>

The number in parentheses is the visit count, and we can match it with the following regular expression:

<span class="link_view".*?><a href=".*?" title="阅读次数">阅读</a>\((.*?)\)</span>

IV. How to determine the last page

Next we need to determine whether the current page is the last page; otherwise we cannot tell when to stop. I searched the source for the '尾页' (last page) label and found the following structure:

<a href="/xingjiarong/article/list/2">下一页</a><a href="/xingjiarong/article/list/7">尾页</a>

So we can use the following regular expression: if it matches, the current page is not the last page; otherwise it is.

<a href=".*?">尾页</a>

V. Programming implementation

The following is the complete code implementation:

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on February 13, 2016
@author: xingjiarong

Use Python to crawl CSDN personal blog visit counts, mainly for practice.
'''
import urllib2
import re

# Current page number of the blog article list
page_num = 1
# Non-empty while the current page is not the last one
notLast = 1

account = str(raw_input("Enter CSDN's login account: "))

while notLast:
    # Home page address
    baseUrl = 'http://blog.csdn.net/' + account
    # Append the page number to form the URL of the page to crawl
    myUrl = baseUrl + '/article/list/' + str(page_num)

    # Pretend to be a browser; CSDN refuses direct access otherwise
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}

    # Build the request and fetch the page
    req = urllib2.Request(myUrl, headers=headers)
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()

    # Look for the '尾页' (last page) label to decide whether this is the last page
    notLast = re.findall('<a href=".*?">尾页</a>', myPage, re.S)

    print '-----------------------------page %d---------------------------------' % (page_num,)

    # Use a regular expression to extract the blog titles
    title = re.findall('<span class="link_title"><a href=".*?">(.*?)</a></span>', myPage, re.S)
    titleList = []
    for items in title:
        titleList.append(str(items).lstrip().rstrip())

    # Use a regular expression to extract the visit counts
    view = re.findall('<span class="link_view".*?><a href=".*?" title="阅读次数">阅读</a>\((.*?)\)</span>', myPage, re.S)
    viewList = []
    for items in view:
        viewList.append(str(items).lstrip().rstrip())

    # Print the results
    for n in range(len(titleList)):
        print 'Traffic: %s  Title: %s' % (viewList[n].zfill(4), titleList[n])

    # Move on to the next page
    page_num = page_num + 1

Here are some of the results:

Enter CSDN's login account: xingjiarong
-----------------------------page 1---------------------------------
Traffic: 1821  Title: Python programming common templates summary
Traffic: 1470  Title: Design patterns UML (i): class diagrams and inter-class relationships (generalization, realization, dependency, association, aggregation, composition)
Traffic: 0714  Title: Install and crack MyEclipse2014 on Ubuntu 14.04
Traffic: 1040  Title: Configure tomcat8 on Ubuntu 14.04
Traffic: 1355  Title: Summary of methods for calling python from Java
Traffic: 0053  Title: Java multithreading: Callable and Future
Traffic: 1265  Title: Learn assembly with me (iii): registers and the formation of physical addresses
Traffic: 1083  Title: Learn assembly with me (ii): setting up Wang Shuang's assembly environment
Traffic: 0894  Title: Learn assembly with me (i): basic knowledge
Traffic: 2334  Title: Java multithreading (i): the Race Condition phenomenon and its causes
Traffic: 0700  Title: Matlab matrix basics
Traffic: 0653  Title: Matlab variables, branch statements, and loop statements
Traffic: 0440  Title: Matlab string processing
Traffic: 0514  Title: Matlab operators and operations
Traffic: 0533  Title: Matlab data types
-----------------------------page 2---------------------------------
Traffic: 0518  Title: OpenStack design and implementation (v): RESTful API and WSGI
Traffic: 0540  Title: Solving the problem of Android SDK Manager downloading too slowly
Traffic: 0672  Title: OpenStack design and implementation (iv): the message bus (AMQP)
Traffic: 0570  Title: Distributed file storage FastDFS (v): summary of common FastDFS commands
Traffic: 0672  Title: Distributed file storage FastDFS (iv): configuring fastdfs-apache-module
Traffic: 0979  Title: Distributed file storage FastDFS (i): getting to know FastDFS
Traffic: 0738  Title: Distributed file storage FastDFS (iii): FastDFS configuration
Traffic: 0682  Title: Distributed file storage FastDFS (ii): FastDFS installation
Traffic: 0511  Title: OpenStack design and implementation (iii): an analysis of KVM and QEMU
Traffic: 0593  Title: OpenStack design and implementation (ii): Libvirt introduction and implementation principles
Traffic: 0562  Title: OpenStack design and implementation (i): virtualization
Traffic: 0685  Title: Revelations from buying food in the dining hall
Traffic: 0230  Title: UML sequence diagrams explained in detail
Traffic: 0890  Title: Design patterns: the difference between the bridge pattern and the strategy pattern
Traffic: 1258  Title: Design patterns (12): the chain of responsibility pattern
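
The script above targets Python 2 (urllib2, raw_input, print statements). For readers on Python 3, a minimal sketch of the same logic might look like this; it assumes the 2016-era CSDN page structure and UTF-8 pages:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Python 3 sketch of the same crawler; assumes CSDN's 2016-era HTML structure
import re
import urllib.request

account = input("Enter CSDN's login account: ")
page_num = 1
not_last = True

while not_last:
    url = 'http://blog.csdn.net/' + account + '/article/list/' + str(page_num)
    # CSDN rejects requests without a browser-like User-Agent
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    page = urllib.request.urlopen(req).read().decode('utf-8')  # assumes UTF-8 pages

    # The '尾页' (last page) link is absent on the final page
    not_last = bool(re.findall(r'<a href=".*?">尾页</a>', page, re.S))

    print('-----------------------------page %d---------------------------------' % page_num)
    titles = [t.strip() for t in re.findall(r'<span class="link_title"><a href=".*?">(.*?)</a></span>', page, re.S)]
    views = [v.strip() for v in re.findall(r'<span class="link_view".*?><a href=".*?" title="阅读次数">阅读</a>\((.*?)\)</span>', page, re.S)]
    for view, title in zip(views, titles):
        print('Traffic: %s  Title: %s' % (view.zfill(4), title))
    page_num += 1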

Summary:

From writing this crawler in Python, I personally summarize the following steps:

1. Analyze the features of the URLs to be crawled and determine how to generate the URL of each page; if you are only crawling a single page, this step can be omitted.

2. View the source code of the Web page and analyze the characteristics of the tag that you want to crawl.

3. Use regular expressions to pull out the part you want from the source.

4. Implement the program.
