Using Python to crawl CSDN blog visit counts


I recently started learning Python and web crawling, and I wanted a small project to practice on. So I thought of something every blogger cares about: blog visit counts. I used Python to fetch the visit counts of my own blog. This is also the first part of a larger project I plan to carry out later: analysing my blog's traffic and presenting it visually, for example with line charts and pie charts, so that I can see more clearly which posts readers are most interested in. This is not aimed at blog experts; I am not an expert myself, and I hear that experts already have this feature built in.

I. URL analysis

Open your own blog's home page. The URL is easy to read: the CSDN domain plus your CSDN login account. Now look at the URL of the next page of the article list.

The address of the second page is: http://blog.csdn.net/xingjiarong/article/list/2
The number at the end indicates which page of the list you are currently on. Checking the other pages confirms the pattern:
http://blog.csdn.net/xingjiarong/article/list/ + page number
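For example, the list-page URLs can be generated like this (a small sketch; the account name and the range of page numbers are just illustrative values):

# Build the URL of each article-list page from the account name and a page number.
base_url = 'http://blog.csdn.net/' + 'xingjiarong'
for page_num in range(1, 4):  # pages 1 to 3, purely as an illustration
    page_url = base_url + '/article/list/' + str(page_num)
    print(page_url)
# http://blog.csdn.net/xingjiarong/article/list/1
# http://blog.csdn.net/xingjiarong/article/list/2
# http://blog.csdn.net/xingjiarong/article/list/3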

II. How to obtain the title

Right-click and view the page source. In the source of the article list we can find the entry for each post: the title sits inside an <a> tag whose href points to the article's details page, wrapped in a <span class="link_title"> element.

Therefore, we can match the titles with a regular expression along these lines:

<span class="link_title"><a href="http://blog.csdn.net/xingjiarong/article/details/.*?">(.*?)</a></span>
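As a quick check, the pattern can be tried on a small hand-written fragment that imitates the structure described above (it is not real CSDN markup, and the article id is made up):

import re

# Hand-written fragment imitating the structure described above
# (not real CSDN markup; the article id is made up).
sample = ('<span class="link_title">'
          '<a href="http://blog.csdn.net/xingjiarong/article/details/50651235">'
          'python programming common template summary</a></span>')

titles = re.findall(
    '<span class="link_title"><a href="http://blog.csdn.net/xingjiarong/article/details/.*?">(.*?)</a></span>',
    sample, re.S)
print(titles)  # ['python programming common template summary']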

III. How to obtain the visit count

After obtaining the titles, we also want the corresponding visit counts. Looking at the source again, the count appears right next to each title in this form:

阅读(1140)  (that is, "Reading (1140)")

The number in the brackets is the visit count. In the source, the word 阅读 is itself a link to the article, so the count follows the link's closing tag, and we can capture it with a regular expression like:

阅读</a>\((.*?)\)
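Again, a quick check on an imitation fragment (not real CSDN markup):

# -*- coding: utf-8 -*-
import re

# Imitation fragment: on the list page the word '阅读' (Reading) is itself a link,
# and the count follows its closing tag in brackets.
sample = '<a href="http://blog.csdn.net/xingjiarong/article/details/50651235">阅读</a>(1140)'

views = re.findall(r'阅读</a>\((.*?)\)', sample, re.S)
print(views)  # ['1140']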

IV. How to determine whether it is the last page

Next, we need to know whether the current page is the last one; otherwise the crawler cannot tell when to stop. Searching the source, I found that the pagination bar contains a 尾页 ("last page") link on every page except the last one.

Therefore, we can simply match the string 尾页: if the match succeeds, the current page is not the last page; if it fails, we have reached the last page.
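A small sketch of that check, using imitation pagination fragments (not real CSDN markup):

# -*- coding: utf-8 -*-
import re

# Imitation pagination fragments: the '尾页' (last page) link appears
# on every list page except the last one.
middle_page = '<a href="/xingjiarong/article/list/13">尾页</a>'
last_page = '<span>13</span>'

print(bool(re.findall('尾页', middle_page)))  # True  -> not the last page yet
print(bool(re.findall('尾页', last_page)))    # False -> this is the last page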

V. Programming implementation

The complete code implementation is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 2016-02-13
@author: xingjiarong

Use python to crawl the visit counts of a csdn personal blog.
Written mainly as a practice project.
'''

import urllib2
import re

# current page number of the blog article list
page_num = 1
# flag: 1 (true) while the current page is not yet the last page
notLast = 1

account = str(raw_input('Enter the csdn login account: '))

while notLast:
    # home page address of the blog
    baseUrl = 'http://blog.csdn.net/' + account
    # append the page number to build the url of the page to crawl
    myUrl = baseUrl + '/article/list/' + str(page_num)

    # pretend to be a browser, otherwise csdn rejects the request
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}

    # build the request and fetch the page
    req = urllib2.Request(myUrl, headers=headers)
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()

    # check whether the '尾页' (last page) link is still present;
    # if not, this is the last page and the loop ends after this pass
    notLast = re.findall('尾页', myPage, re.S)

    print '----------------------------- Page %d -----------------------------' % (page_num,)

    # extract the blog titles with a regular expression
    title = re.findall('<span class="link_title"><a href="http://blog.csdn.net/xingjiarong/article/details/.*?">(.*?)</a></span>', myPage, re.S)
    titleList = []
    for items in title:
        titleList.append(str(items).lstrip().rstrip())

    # extract the visit counts with a regular expression
    view = re.findall('阅读</a>\((.*?)\)', myPage, re.S)
    viewList = []
    for items in view:
        viewList.append(str(items).lstrip().rstrip())

    # print the results of the current page
    for n in range(len(titleList)):
        print 'Visits: %s Title: %s' % (viewList[n].zfill(4), titleList[n])

    # move on to the next page
    page_num = page_num + 1
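The script above targets Python 2 (urllib2, raw_input, print statements). For anyone on Python 3, a rough equivalent sketch using urllib.request would look like this; the regular expressions are the same as above and, like them, assume the 2016-era CSDN page structure:

# Rough Python 3 sketch of the same crawler, using urllib.request instead of urllib2.
import re
import urllib.request

account = input('Enter the csdn login account: ')
base_url = 'http://blog.csdn.net/' + account
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

page_num = 1
not_last = True
while not_last:
    my_url = base_url + '/article/list/' + str(page_num)
    req = urllib.request.Request(my_url, headers=headers)
    page = urllib.request.urlopen(req).read().decode('utf-8')

    # the '尾页' (last page) link is missing on the last page
    not_last = bool(re.findall('尾页', page))

    print('----------------------------- Page %d -----------------------------' % page_num)

    titles = re.findall(
        '<span class="link_title"><a href="http://blog.csdn.net/xingjiarong/article/details/.*?">(.*?)</a></span>',
        page, re.S)
    views = re.findall(r'阅读</a>\((.*?)\)', page, re.S)

    for view, title in zip(views, titles):
        print('Visits: %s Title: %s' % (view.strip().zfill(4), title.strip()))

    page_num += 1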

The following are some results:

Enter the csdn login account: xingjiarong
----------------------------- Page 1 -----------------------------
Visits: 1821 Title: python programming common template summary
Visits: 1470 Title: Design patterns UML (1): class diagrams and inter-class relationships (generalization, realization, dependency, association, aggregation, composition)
Visits: 0714 Title: ubuntu14.04 install and crack MyEclipse2014
Visits: 1040 Title: ubuntu14.04 configure atat8
Visits: 1355 Title: Summary of methods for calling python from java
Visits: 0053 Title: Callable and Future in Java multithreading
Visits: 1265 Title: Generation of registers and physical addresses
Visits: 1083 Title: Assembly (2): setting up Wang Shuang's assembly environment
Visits: 0894 Title: Assembly (1): basic knowledge
Visits: 2334 Title: Java multithreading (1): the race condition phenomenon and its cause
Visits: 0700 Title: Matlab matrix basics
Visits: 0653 Title: Matlab variables, branch statements, and loop statements
Visits: 0440 Title: Matlab string processing
Visits: 0514 Title: Matlab operators and operations
Visits: 0533 Title: Matlab data types
----------------------------- Page 2 -----------------------------
Visits: 0518 Title: OpenStack design and implementation (5): RESTful API and WSGI
Visits: 0540 Title: Solving the slow download problem of Android SDK Manager
Visits: 0672 Title: OpenStack design and implementation (4): message bus (AMQP)
Visits: 0570 Title: Distributed file storage FastDFS (5): summary of common FastDFS commands
Visits: 0672 Title: Distributed file storage FastDFS (4): configuring fastdfs-apache-module
Visits: 0979 Title: Distributed file storage FastDFS (1): a first look at FastDFS
Visits: 0738 Title: Distributed file storage FastDFS (3): FastDFS configuration
Visits: 0682 Title: Distributed file storage FastDFS (2): FastDFS installation
Visits: 0511 Title: OpenStack design and implementation (3): KVM and QEMU analysis
Visits: 0593 Title: OpenStack design and implementation (2): Libvirt overview and implementation principle
Visits: 0562 Title: OpenStack design and implementation (1): virtualization
Visits: 0685 Title: Inspiration from the dining hall
Visits: 0230 Title: UML sequence diagrams in detail
Visits: 0890 Title: Design patterns: Bridge pattern and different access modes
Visits: 1258 Title: Design patterns (12): Chain of Responsibility pattern

Summary:

From writing this crawler in Python, I would summarize the steps as follows:

1. Analyze the URL pattern of the site to be crawled, so that you can generate the URLs of the relevant pages. If only a single page is crawled, this step can be skipped.

2. View the page source and work out which tags wrap the content you want to extract.

3. Use regular expressions to extract the desired parts from the source.

4. Write the program.
