Motivation
Today a friend wrote to me saying that the access statistics shown on his WordPress blog did not look right, and asked whether I could generate some traffic for him so that he could compare the numbers. Concretely, he wanted about 100 different blog pages to be opened within a short period of time, so that he could see how the reported statistics change as the traffic changes.
As a programmer, I have the urge to automate any repetitive task, so even though opening 100 pages by hand is not complicated, I rejected the manual approach from the start. I had also been meaning to learn Python for a long time, so I took this small opportunity to do so, and to record the results of my first encounter with Python.
This article uses Python 2.7.3 to implement a script that automatically accesses a blog, and touches on the following technical points:
- Language basics
- Containers (lists, dictionaries)
- Branches and loops
- Formatted console output
- HTTP client network programming
- Handling HTTP requests
- Using an HTTP proxy server
- Python regular expressions
General overview
Automatically visiting blog pages is essentially what a web crawler does; the basic flow is as follows:
Figure 1 How the blog auto-accessor works
- Step 1: Give the accessor a starting position (for example, the URL of the blog home page).
- Step 2: The accessor crawls the page that the URL points to (fetching a page is itself equivalent to opening it in a browser).
- Step 3: The page fetched in step 2 is handed to a parser, which extracts the URLs it contains and adds them to the URL library, i.e. the set of URLs still to be visited. The next URL to visit is then taken out of the URL library.
- Step 4: Repeat steps 2 and 3 until a termination condition is reached, then exit. A minimal sketch of this loop follows the list.
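Below is a rough sketch of that loop in Python 2, just to make the flow concrete. The helper names fetch_page and extract_urls are placeholders of my own; the real versions are developed in the following sections, and the actual script later uses two dictionaries instead of a list for the pending URLs.

# placeholder helpers -- stand-ins for the functions built later in the article
def fetch_page(url):
    return ''            # would return the HTML of the page at url

def extract_urls(page):
    return []            # would return the URLs found in the page

# step 1: the starting position
to_visit = ['http://myblog.wordpress.com/']
seen = {'http://myblog.wordpress.com/': True}

# step 4: loop until a termination condition is reached (here: nothing left to visit)
while to_visit:
    url = to_visit.pop(0)
    page = fetch_page(url)              # step 2: crawl the page the URL points to
    for link in extract_urls(page):     # step 3: parse out new URLs
        if link not in seen:            # only keep URLs we have not seen before
            seen[link] = True
            to_visit.append(link)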
At the very beginning, all we have is the URL of the blog's home page. The question that follows is how to programmatically fetch the page that this URL points to; only fetching the page counts as a visit to the blog, and only then can we extract more URLs from it and crawl further.
This raises the question of how to obtain the page content from a URL. Solving it requires HTTP client programming, which is the subject of the next section.
urllib2: HTTP Client Programming
Several Python libraries support HTTP client programming, such as httplib, urllib and urllib2. urllib2 is used here because it is powerful, simple to use, and makes it easy to go through an HTTP proxy.
Using urllib2 to create an HTTP connection and fetch a page is very simple and takes only three steps:
import urllib2
opener = urllib2.build_opener()
file = opener.open(url)
content = file.read()
content is then the body of the HTTP response, i.e. the HTML code of the web page. If you need to go through a proxy, pass an extra argument when calling build_opener():
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': "localhost:8580"}))
The ProxyHandler constructor accepts a dictionary argument, where the key is the protocol name and the value is the host and port of the proxy. Proxies that require authentication are also supported; see the official documentation for details.
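For example, following the same ProxyHandler dictionary convention, a proxy that requires authentication can be specified by embedding the credentials in the proxy URL. This is only a sketch; user, password, proxyhost and 8080 below are placeholders, not values from the original setup.

import urllib2

# placeholders: replace user, password, proxyhost and 8080 with real values
auth_proxy = urllib2.ProxyHandler({'http': 'http://user:password@proxyhost:8080'})
opener = urllib2.build_opener(auth_proxy)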
The next problem is to pull the URLs out of the fetched page, and that calls for regular expressions.
Regular expressions
The regular-expression functions live in Python's re module, so you need to import re before using them.
The findall function returns all substrings of a string that match a given regular expression:
aelems = re.findall('<a href=".*<\/a>', content)
The first argument of findall is the regular expression, the second is the string to search, and the return value is a list containing every substring of content that matches the pattern. The code above returns every substring that starts with <a href=" and ends with </a>, that is, all the <a> tags. Applying this filter to the page's HTML code yields all of its hyperlinks. If you need a more accurate filter, for example to keep only links that point to this blog (assuming the blog lives at http://myblog.wordpress.com) and whose URLs are absolute, you can use a more precise pattern such as '<a href="http:\/\/myblog\.wordpress\.com.*<\/a>'.
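As a small illustration, here is the stricter filter applied directly with findall. The two links in content are made up for this example, and the non-greedy .*? is used (as in the full script later) so that each <a> tag is matched separately:

import re

content = ('<a href="http://myblog.wordpress.com/about">About</a> '
           '<a href="http://othersite.example.com/">Elsewhere</a>')

# keep only <a> tags whose href starts with the blog's own address
blog_aelems = re.findall('<a href="http:\/\/myblog\.wordpress\.com.*?<\/a>', content)
print blog_aelems   # only the first link survives the filter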
Once you have the <a> tags, you still need to extract the URL from each of them; the match function is recommended for this. match returns the part of a string that satisfies a given regular expression. Take the following string as an example (assume it is stored in the variable aelem):
<a href="http://myblog.wordpress.com/rss">RSS feed</a>
To extract the URL (that is, http://myblog.wordpress.com/rss), you can call match as follows:
matches = re.match('<a href="(.*)"', aelem)
When the match succeeds, match returns a MatchObject; otherwise it returns None. On a MatchObject you can call the groups() method to get all captured groups, or group(index) to get a single one. Note that group(0) is the entire match, so the captured groups start at index 1.
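As a quick check, running the match above on the RSS link and printing the groups (aelem is the example string from before):

import re

aelem = '<a href="http://myblog.wordpress.com/rss">RSS feed</a>'
matches = re.match('<a href="(.*)"', aelem)
if matches is not None:
    print matches.group(0)   # the whole matched text: <a href="http://myblog.wordpress.com/rss"
    print matches.group(1)   # just the captured URL: http://myblog.wordpress.com/rss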
The example above only works when href is the only attribute of the <a> element. If other attributes follow href, the match call above will return an incorrect value because of greedy (longest) matching:
<a href="http://myblog.wordpress.com/rss" alt="RSS feed - yet another wordpress Blog">RSS feed</a>
will be matched as: http://myblog.wordpress.com/rss" alt="RSS feed - yet another wordpress Blog
At present I don't have a particularly good solution to this; my workaround is to split the string on spaces before matching. Since href is usually the first attribute of <a>, it can be handled simply as follows:
splits = aelem.split(' ')
# element 0 is '<a', element 1 is 'href="http://myblog.wordpress.com/"'
aelem = splits[1]
# the regular expression has to change accordingly
matches = re.match('href="(.*)"', aelem)
Of course, this approach is not guaranteed to be 100% correct. The best practice would be to use an HTML parser, but I was too lazy to do that here.
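For reference, here is a minimal sketch of what the HTML-parser approach could look like, using the HTMLParser module from the Python 2 standard library. The LinkExtractor class name is my own, and the string fed to the parser stands in for the page HTML fetched earlier:

from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.urls.append(value)

parser = LinkExtractor()
parser.feed('<a href="http://myblog.wordpress.com/rss">RSS feed</a>')  # or feed the real page HTML
print parser.urls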
After a URL has been extracted, it is added to the URL library. To avoid visiting the same page repeatedly, the URLs have to be de-duplicated, which brings us to the dictionary discussed in the next section.
Dictionary
A dictionary is an associative container that stores key-value pairs, comparable to stl::hash_map in C++, java.util.HashMap in Java, and Dictionary in C#. Because keys are unique, a dictionary can be used for de-duplication. Of course, a set would also do; many set implementations simply wrap a map, such as java.util.HashSet and stl::hash_set.
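As a tiny illustration of de-duplication with a dictionary (the URLs are made up for this example):

seen = {}
urls = ['http://myblog.wordpress.com/',
        'http://myblog.wordpress.com/rss',
        'http://myblog.wordpress.com/']      # duplicate on purpose

for url in urls:
    if not seen.has_key(url):   # Python 2 idiom; 'url not in seen' also works
        seen[url] = True

print seen.keys()   # each URL appears exactly once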
To build the URL library with dictionaries, we first need to consider what the URL library must do:
- URL de-duplication: after a URL is extracted from the HTML code, add it to the URL library only if it is not already there.
- Fetching new URLs: take a URL that has not been visited yet out of the URL library for the next crawl.
To satisfy both requirements at the same time, there are two obvious approaches:
- Use a single dictionary with the URL as the key and a visited flag (True/False) as the value;
- Use two dictionaries, one for de-duplication and another holding the URLs that have not been visited yet.
For simplicity, the second approach is used:
# the starting URL
starturl = 'http://myblog.wordpress.com/'
# totalurl holds every URL seen so far, for de-duplication
totalurl[starturl] = True
# unusedurl maintains the list of URLs not visited yet
unusedurl[starturl] = True

# ... several lines omitted here ...

# take out an unvisited URL
nexturl = unusedurl.keys()[0]
# and remove it from unusedurl
del unusedurl[nexturl]
# fetch the page content
content = get_file_content(nexturl)
# extract URLs from the page
urllist = extract_url(content)
# for each extracted URL
for url in urllist:
    # if the URL is not yet in totalurl
    if not totalurl.has_key(url):
        # it cannot be a repeat, so add it to totalurl
        totalurl[url] = True
        # and add it to the to-visit list as well
        unusedurl[url] = True
Finally, here is the complete code:
import urllib2
import time
import re

totalurl = {}
unusedurl = {}

# create a ProxyHandler object
def get_proxy():
    return urllib2.ProxyHandler({'http': "localhost:8580"})

# create a URL opener that uses the given proxy
def get_proxy_http_opener(proxy):
    return urllib2.build_opener(proxy)

# fetch the page the given URL points to, using the two functions above
def get_file_content(url):
    opener = get_proxy_http_opener(get_proxy())
    content = opener.open(url).read()
    opener.close()
    # remove line breaks to make regex matching easier
    return content.replace('\r', '').replace('\n', '')

# extract the page title from the HTML code
def extract_title(content):
    titleelem = re.findall('<title>.*<\/title>', content)[0]
    return re.match('<title>(.*)<\/title>', titleelem).group(1).strip()

# extract the URLs of all <a> tags from the HTML code
def extract_url(content):
    urllist = []
    aelems = re.findall('<a href=".*?<\/a>', content)
    for aelem in aelems:
        splits = aelem.split(' ')
        if len(splits) != 1:
            aelem = splits[1]
        ##print aelem
        matches = re.match('href="(.*)"', aelem)
        if matches is not None:
            url = matches.group(1)
            if re.match('http:\/\/myblog\.wordpress\.com.*', url) is not None:
                urllist.append(url)
    return urllist

# return the current time as a formatted string
def get_localtime():
    return time.strftime("%H:%M:%S", time.localtime())

# main function
def begin_access():
    starturl = 'http://myblog.wordpress.com/'
    totalurl[starturl] = True
    unusedurl[starturl] = True
    print 'seq\ttime\ttitle\turl'

    i = 0
    while i < 150:
        nexturl = unusedurl.keys()[0]
        del unusedurl[nexturl]
        content = get_file_content(nexturl)
        title = extract_title(content)
        urllist = extract_url(content)

        for url in urllist:
            if not totalurl.has_key(url):
                totalurl[url] = True
                unusedurl[url] = True

        print '%d\t%s\t%s\t%s' % (i, get_localtime(), title, nexturl)
        i = i + 1
        time.sleep(2)

# call the main function
begin_access()
First Python program: a blog auto-access script