Motivation
Today a friend wrote to me saying that the access statistics shown in his WordPress blog looked abnormal, and asked whether I could generate some access traffic for him to compare against. Concretely, he wanted 100 different blog pages opened within a short period of time, so that he could make sense of the blog's access data from the resulting traffic changes.
As a developer, I feel the urge to automate any repetitive task. So although manually opening 100 web pages is not complicated, I rejected that option from the start. I had also been meaning to learn Python for a long time, so I took this opportunity to learn it, and to record my first results with it.
This article uses Python 2.7.3 to implement an automatic blog-access script, covering the following technical points:
- Language basics
- Containers (lists and dictionaries)
- Conditional branches and loops
- Formatted console output
- HTTP client network programming
- Handling HTTP requests
- Using an HTTP proxy server
- Python regular expressions
Overview
Automatically accessing blog pages works in much the same way as a web crawler. The basic workflow is as follows:
Figure 1: How the blog auto-accessor works
1. Give the accessor a starting point (for example, the blog homepage URL).
2. The accessor fetches the web page that the URL points to (fetching the page is itself equivalent to opening the page in a browser).
3. The fetched page is handed to the analyzer, which extracts the URLs it contains and adds them to the list of URLs to be visited, that is, the URL library; the next URL to crawl is then taken from the URL library.
4. Repeat steps 2 and 3 until a termination condition is reached, then exit.
At the very beginning we have nothing but a single blog homepage URL. The first problem to solve is how to programmatically fetch the page that this URL points to. Fetching the page both counts as a visit to the blog and gives us the page content, from which more URLs can be extracted for further crawling.
This reduces the problem to obtaining page content from a URL, which requires HTTP client programming and is the topic of the next section.
urllib2: HTTP Client Programming
Several Python libraries can be used for HTTP client programming, such as httplib, urllib, and urllib2. urllib2 is used here because it is powerful, easy to use, and makes it easy to go through an HTTP proxy.
Using urllib2 to create an HTTP connection and fetch a page is very simple and takes only three steps:
import urllib2
opener = urllib2.build_opener()
file = opener.open(url)
content = file.read()
content now holds the body of the HTTP response, that is, the HTML code of the page. If you need to set a proxy, pass an extra argument when calling build_opener:
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': "localhost:8580"}))
ProxyHandler takes a dictionary argument whose keys are protocol names and whose values are host-and-port strings. Proxies that require authentication are also supported; see the official documentation for details.
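For an authenticated proxy, it is usually enough to embed the credentials in the proxy URL. A minimal sketch; the host, port, and credentials below are placeholders, not real settings:

import urllib2

# Placeholder credentials and address -- replace with your own proxy settings
proxy = urllib2.ProxyHandler({'http': 'http://user:password@localhost:8580'})
opener = urllib2.build_opener(proxy)
content = opener.open('http://myblog.wordpress.com/').read()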
The next problem is extracting the URLs from the fetched page, which calls for regular expressions.
Regular Expression
Python's regular expression functions live in the re module; before using them, run import re.
The findall function returns all substrings of a string that match a given regular expression:
aelems = re.findall('<a href=".*<\/a>', content)
The first parameter of findall is the regular expression, the second is the string to search, and the return value is a list of all substrings in content that match the expression. The code above returns every substring that starts with <a href=" and ends with </a>, that is, every <a> tag. Applying this filter to a page's HTML code yields all of its hyperlinks. If you need more precise filtering, for example only links that point to this blog (assuming the blog is at http://myblog.wordpress.com) and are written as absolute addresses, you can use a stricter regular expression such as '<a href="http:\/\/myblog\.wordpress\.com.*<\/a>'.
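As a quick illustration, here is findall applied to a made-up HTML fragment containing a single link:

import re

# A made-up HTML fragment with one hyperlink
content = '<p>Welcome</p><a href="http://myblog.wordpress.com/rss">RSS feed</a><p>Bye</p>'
aelems = re.findall('<a href=".*<\/a>', content)
print aelems
# ['<a href="http://myblog.wordpress.com/rss">RSS feed</a>']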
Once the <a> tags are obtained, the URL must be extracted from each of them. The match function is useful here: it returns the part of a substring that matches the specified regular expression. For example, suppose the variable aelem holds the following string:
<a href="http://myblog.wordpress.com/rss">RSS feed</a>
To extract the URL (http://myblog.wordpress.com/rss), you can call match as follows:
matches = re.match('<a href="(.*)"', aelem)
If the match succeeds, match returns a MatchObject; otherwise it returns None. On a MatchObject you can call the groups() method to get all captured groups, or group(index) to get a single one. Note that group() indices start at 1.
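A short, self-contained example of match and group, reusing the RSS link from above:

import re

aelem = '<a href="http://myblog.wordpress.com/rss">RSS feed</a>'
matches = re.match('<a href="(.*)"', aelem)
if matches is not None:
    # group(0) would be the whole matched text; captured groups start at 1
    print matches.group(1)   # http://myblog.wordpress.com/rss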
The example above is only valid for <a> elements whose only attribute is href. If the <a> element has further attributes after href, then because regular expressions match greedily (longest match), the match call above returns a wrong value. For example, for:
<a href="http://myblog.wordpress.com/rss" alt="RSS Feed - yet another WordPress blog">RSS feed</a>
the match will capture: http://myblog.wordpress.com/rss" alt="RSS Feed - yet another WordPress blog
There is no particularly elegant fix for this. My approach is to split the tag on spaces before matching. Since href is usually the first attribute of an <a> tag, it can be handled like this:
splits = aelem.split(' ')
# Element 0 is '<a', element 1 is 'href="http://myblog.wordpress.com/"'
aelem = splits[1]
# The regular expression changes accordingly
matches = re.match('href="(.*)"', aelem)
Of course, this method is not guaranteed to be 100% correct either. The best approach is to use an HTML parser; I was too lazy to do that here.
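Another workaround, sketched below, is to make the quantifier non-greedy so the capture stops at the first closing quote instead of the last one; this sidesteps the longest-match problem for this case, though an HTML parser remains the more robust choice:

import re

aelem = '<a href="http://myblog.wordpress.com/rss" alt="RSS Feed - yet another WordPress blog">RSS feed</a>'
# (.*?) is non-greedy, so it stops at the first '"' after the URL
matches = re.match('<a href="(.*?)"', aelem)
if matches is not None:
    print matches.group(1)   # http://myblog.wordpress.com/rss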
Once extracted, a URL must be added to the URL library. To avoid visiting the same page repeatedly, the URLs need to be deduplicated, which brings us to the dictionary discussed in the next section.
Dictionary
A dictionary is an associative container that stores key-value pairs, corresponding to hash_map in the C++ STL, java.util.HashMap in Java, and Dictionary in C#. Keys are unique, so a dictionary can be used to remove duplicates. A set would also do; many set implementations are thin wrappers around a map, such as java.util.HashSet and hash_set in the STL.
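As a tiny illustration of deduplication with a dictionary (the URLs here are made up; has_key is the Python 2 idiom used throughout this article):

seen = {}
for url in ['http://myblog.wordpress.com/',
            'http://myblog.wordpress.com/rss',
            'http://myblog.wordpress.com/']:
    if not seen.has_key(url):
        seen[url] = True
print len(seen)   # 2 -- the duplicate homepage URL is stored only once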
To build the URL library with a dictionary, first consider what the URL library needs to do:
- URL deduplication: after a URL is extracted from the HTML code, it is added to the URL library only if it is not already there.
- Getting a new URL: take a URL that has not been visited yet from the URL library for the next crawl.
To satisfy both requirements at the same time, there are two intuitive approaches:
- Use a single dictionary with the URL as the key and a visited flag (True/False) as the value;
- Use two dictionaries: one for deduplication, and one holding only the URLs that have not been visited yet.
For simplicity, the second approach is used below:
# Starting URL
starturl = 'http://myblog.wordpress.com/'

# All known URLs, used for URL deduplication
totalurl[starturl] = True
# Unvisited URLs, used to maintain the list of URLs not yet accessed
unusedurl[starturl] = True

# ... (code omitted)

# Take the next unvisited URL
nexturl = unusedurl.keys()[0]
# Remove it from unusedurl
del unusedurl[nexturl]

# Fetch the page content
content = get_file_content(nexturl)
# Extract URLs from the page
urllist = extract_url(content)

# For each extracted URL:
for url in urllist:
    # if the URL is not yet in totalurl,
    if not totalurl.has_key(url):
        # it is not a duplicate, so add it to totalurl
        totalurl[url] = True
        # and add it to the list of URLs to visit
        unusedurl[url] = True
Finally, here is the complete code:
import urllib2
import time
import re

totalurl = {}
unusedurl = {}

# Build a ProxyHandler object
def get_proxy():
    return urllib2.ProxyHandler({'http': "localhost:8580"})

# Build a URL opener that goes through the proxy
def get_proxy_http_opener(proxy):
    return urllib2.build_opener(proxy)

# Fetch the web page pointed to by the given URL; uses the two functions above
def get_file_content(url):
    opener = get_proxy_http_opener(get_proxy())
    content = opener.open(url).read()
    opener.close()
    # Remove line breaks to make regular expression matching easier
    return content.replace('\r', '').replace('\n', '')

# Extract the page title from the page's HTML code
def extract_title(content):
    titleelem = re.findall('<title>.*<\/title>', content)[0]
    return re.match('<title>(.*)<\/title>', titleelem).group(1).strip()

# Extract the URLs of all <a> tags from the page's HTML code
def extract_url(content):
    urllist = []
    aelems = re.findall('<a href=".*?<\/a>', content)
    for aelem in aelems:
        splits = aelem.split(' ')
        if len(splits) != 1:
            aelem = splits[1]
        # print aelem
        matches = re.match('href="(.*)"', aelem)
        if matches is not None:
            url = matches.group(1)
            if re.match('http:\/\/myblog\.wordpress\.com.*', url) is not None:
                urllist.append(url)
    return urllist

# Return the current time as a formatted string
def get_localtime():
    return time.strftime("%H:%M:%S", time.localtime())

# Main function
def begin_access():
    starturl = 'http://myblog.wordpress.com/'
    totalurl[starturl] = True
    unusedurl[starturl] = True
    print 'Seq\tTime\tTitle\tURL'

    i = 0
    while i < 150:
        # Take the next unvisited URL and remove it from unusedurl
        nexturl = unusedurl.keys()[0]
        del unusedurl[nexturl]
        content = get_file_content(nexturl)
        title = extract_title(content)
        urllist = extract_url(content)
        for url in urllist:
            if not totalurl.has_key(url):
                totalurl[url] = True
                unusedurl[url] = True
        print '%d\t%s\t%s\t%s' % (i, get_localtime(), title, nexturl)
        i = i + 1
        time.sleep(2)

# Call the main function
begin_access()
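If the script is saved to a file (for example auto_access.py, a name chosen here purely for illustration) and run with python auto_access.py while the proxy configured above is available, it prints one line per visited page with the sequence number, time, page title, and URL, visiting up to 150 pages and pausing two seconds between requests.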