Python: A Blog Auto-Access Script

This article implements a script for automatically visiting blog pages in Python 2.7.3, covering the following techniques:

  • Language basics

    • Containers (lists, dictionaries)
    • Logical branches and loops
    • Formatted console output
  • HTTP client network programming
    • Issuing HTTP requests
    • Using an HTTP proxy server
  • Python regular expressions
Overview

Automatically visiting blog pages works much like a web crawler. The basic process is as follows:

Figure 1: How the blog auto-accessor works

  1. Give the accessor a starting position (for example, the blog homepage URL).
  2. The accessor fetches the page the URL points to (fetching the page is equivalent to opening it in a browser).
  3. The fetched page is handed to the analyzer, which extracts the URLs it contains and adds them to the list of URLs to be visited, i.e., the URL library. The next URL to fetch is then taken from the URL library.
  4. Repeat steps 2 and 3 until a termination condition is reached, then exit.
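The steps above can be sketched as a short loop. This is a minimal illustration, not the final script: fetch_page and extract_urls are hypothetical stand-ins for the fetching and parsing pieces developed in the following sections.

```python
# Minimal sketch of the crawl loop described above.
# fetch_page and extract_urls are hypothetical callables supplied by the caller.
def crawl(start_url, fetch_page, extract_urls, max_pages=10):
    seen = {start_url: True}   # URL library: every URL ever added
    pending = [start_url]      # URLs not yet visited
    visited = []
    while pending and len(visited) < max_pages:
        url = pending.pop(0)          # step 1: take the next URL
        page = fetch_page(url)        # step 2: fetch the page
        visited.append(url)
        for u in extract_urls(page):  # step 3: extract URLs from the page
            if u not in seen:         # deduplicate before adding
                seen[u] = True
                pending.append(u)
    return visited                    # step 4: stop at the termination condition
```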

At the start we have nothing but the blog homepage URL. The first problem to solve is how to fetch the page that URL points to. Once a page is fetched, we have its content, from which more URLs can be extracted for further crawling.

This reduces to the problem of obtaining page content from a URL. Solving it requires HTTP client programming, which is the topic of the next section.

urllib2: HTTP Client Programming

Several Python libraries implement HTTP client programming, including httplib, urllib, and urllib2. urllib2 is used here because it is powerful, easy to use, and makes it easy to go through an HTTP proxy.

Using urllib2 to create an HTTP connection and fetch a page is very simple; only three steps are required:

import urllib2
opener = urllib2.build_opener()
file = opener.open(url)
content = file.read()

content is the body of the HTTP response, i.e., the HTML source of the page. If you need to use a proxy, pass a parameter when calling build_opener:

opener = urllib2.build_opener(urllib2.ProxyHandler({'http': "localhost:8580"}))

The ProxyHandler constructor accepts a dictionary parameter, where each key is a protocol name and each value is a host and port. It also supports proxies that require authentication; see the official documentation for details.

The next problem is extracting the URLs from the page, which calls for regular expressions.

Regular Expression

The regular expression functions live in Python's re module; before using them, import re.

The findall function returns all substrings of a string that match a given regular expression:

aelems = re.findall('<a href=".*<\/a>', content)

The first parameter of findall is the regular expression and the second is the string; the return value is a list of all substrings in content that match the expression. The code above returns every substring that starts with <a href=" and ends with </a>, i.e., all <a> tags. Applying this filter to the HTML of a page yields all of its hyperlinks. To filter more precisely, for example keeping only links into the blog itself (assuming the blog is http://myblog.wordpress.com and the URLs are absolute), you can use a stricter expression such as '<a href="http\:\/\/myblog\.wordpress\.com.*<\/a>'.
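As a quick illustration (the HTML snippet below is made up for this example), findall returns each matching anchor; the non-greedy .*? variant, which the complete script at the end of the article also uses, stops each match at the first </a> so adjacent anchors are returned separately:

```python
import re

# Hypothetical page fragment with two anchors
html = ('<p>intro</p>'
        '<a href="http://myblog.wordpress.com/rss">RSS feed</a>'
        '<a href="http://example.com/">elsewhere</a>')

# non-greedy .*? stops at the first </a>, keeping the anchors separate
aelems = re.findall('<a href=".*?</a>', html)
print(aelems[0])
print(aelems[1])
```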

After obtaining an <a> tag, the URL must be extracted from it. The match function is suited to this: it returns the part of a string that matches a given regular expression. For example, suppose aelem holds the following string:

<a href="http://myblog.wordpress.com/rss">RSS feed</a>

To extract the URL (http://myblog.wordpress.com/rss), call match as follows:

matches = re.match('<a href="(.*)"', aelem)

If the match succeeds, match returns a MatchObject; otherwise it returns None. On a MatchObject you can call the groups() method to obtain all captured groups, or the group(index) method for a single one. Note that group numbering starts from 1 (group(0) is the entire match).
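Putting the two pieces together on the sample anchor from above, match plus group(1) pulls out the URL:

```python
import re

# the sample anchor string from the article
aelem = '<a href="http://myblog.wordpress.com/rss">RSS feed</a>'
matches = re.match('<a href="(.*)"', aelem)
if matches is not None:
    # group numbering starts at 1; group(0) would be the whole match
    url = matches.group(1)
    print(url)
```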

The example above only works for <a> elements whose sole attribute is href. If the <a> element has further attributes after href, then by the longest-match (greedy) rule the match call above returns a wrong value:

<a href="http://myblog.wordpress.com/rss" alt="RSS Feed - yet another Wordpress blog">RSS feed</a>

will match: http://myblog.wordpress.com/rss" alt="RSS Feed - yet another Wordpress blog
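This is easy to reproduce with the article's own example string: the greedy .* runs all the way to the last quote, swallowing the alt attribute along the way.

```python
import re

# anchor with an extra attribute after href (the article's example)
aelem = ('<a href="http://myblog.wordpress.com/rss" '
         'alt="RSS Feed - yet another Wordpress blog">RSS feed</a>')

# greedy .* matches up to the LAST quote in the string
bad = re.match('<a href="(.*)"', aelem).group(1)
print(bad)
```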

There is no particularly clean way to handle this. My approach is to split on spaces before matching; href is usually the first attribute in <a>, so it can be processed as follows:

splits = aelem.split(' ')
# element 0 is '<a', element 1 is 'href="http://myblog.wordpress.com/"'
aelem = splits[1]
# the regular expression changes accordingly
matches = re.match('href="(.*)"', aelem)

Of course, this method is not guaranteed to be 100% correct. The best approach would be to use an HTML parser; I was too lazy to do so here.
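For completeness, one alternative worth noting (not the author's approach) is a non-greedy capture group, which stops at the first closing quote and so tolerates extra attributes. It still assumes href is the first attribute and uses double quotes, so it is also not 100% reliable.

```python
import re

aelem = ('<a href="http://myblog.wordpress.com/rss" '
         'alt="RSS Feed - yet another Wordpress blog">RSS feed</a>')

# non-greedy (.*?) stops at the FIRST quote, so extra attributes do no harm
matches = re.match('<a href="(.*?)"', aelem)
print(matches.group(1))
```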

After the URLs are extracted, they must be added to the URL library. To avoid visiting a page twice, duplicates must be removed, which brings us to the dictionary in the next section.

Dictionary

A dictionary is an associative container that stores key-value pairs, corresponding to hash_map in the C++ STL, java.util.HashMap in Java, and Dictionary in C#. Since keys are unique, a dictionary can be used for deduplication. A set would work as well; many set implementations are thin wrappers over maps, such as java.util.HashSet and the STL's hash_set.
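As a small sketch of dictionary-based deduplication (add_url is a hypothetical helper, not part of the script), adding the same key twice leaves only one entry:

```python
totalurl = {}    # every URL ever seen
unusedurl = {}   # URLs not yet visited

def add_url(url):
    # hypothetical helper: keys are unique, so a repeated URL is a no-op
    # (the script uses has_key(), Python 2 only; 'in' works in both 2 and 3)
    if url not in totalurl:
        totalurl[url] = True
        unusedurl[url] = True

add_url('http://myblog.wordpress.com/')
add_url('http://myblog.wordpress.com/rss')
add_url('http://myblog.wordpress.com/')   # duplicate: ignored
print(len(totalurl))
print(len(unusedurl))
```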

To build a URL library using a dictionary, first consider what the URL library needs to do:

  1. URL deduplication: after a URL is extracted from the HTML, it is added to the URL library only if it is not already there.
  2. Getting new URLs: take a not-yet-visited URL from the URL library for the next crawl.

There are two intuitive ways to satisfy both 1 and 2:

  1. Use a URL dictionary, where the URL is used as the key and whether it has been accessed (true/false) as the value;
  2. Use two dictionaries, one for deduplication and the other for storing URLs that have not been accessed.

For simplicity, the second method is used below:

# starting URL
starturl = 'http://myblog.wordpress.com/'
# all known URLs, used for URL deduplication
totalurl[starturl] = True
# unaccessed URLs, used to maintain the to-visit list
unusedurl[starturl] = True

# ... code omitted ...

# get the next unused URL
nexturl = unusedurl.keys()[0]
# delete it from unusedurl
del unusedurl[nexturl]
# fetch the page content
content = get_file_content(nexturl)
# extract URLs
urllist = extract_url(content)
# for each extracted URL
for url in urllist:
    # if the URL is not yet in totalurl
    if not totalurl.has_key(url):
        # it is new: add it to totalurl
        totalurl[url] = True
        # and add it to the to-visit list
        unusedurl[url] = True

Finally, here is the complete code:

import urllib2
import time
import re

totalurl = {}
unusedurl = {}

# generate the ProxyHandler object
def get_proxy():
    return urllib2.ProxyHandler({'http': "localhost:8580"})

# generate a url opener that goes through the proxy
def get_proxy_http_opener(proxy):
    return urllib2.build_opener(proxy)

# fetch the page pointed to by the given URL, using the two functions above
def get_file_content(url):
    opener = get_proxy_http_opener(get_proxy())
    content = opener.open(url).read()
    opener.close()
    # remove line breaks to simplify regular expression matching
    return content.replace('\r', '').replace('\n', '')

# extract the page title
def extract_title(content):
    titleelem = re.findall('<title>.*<\/title>', content)[0]
    return re.match('<title>(.*)<\/title>', titleelem).group(1).strip()

# extract all URLs
def extract_url(content):
    urllist = []
    aelems = re.findall('<a href=".*?<\/a>', content)
    for aelem in aelems:
        splits = aelem.split(' ')
        if len(splits) != 1:
            aelem = splits[1]
        matches = re.match('href="(.*)"', aelem)
        if matches is not None:
            url = matches.group(1)
            if re.match('http:\/\/myblog\.wordpress\.com.*', url) is not None:
                urllist.append(url)
    return urllist

# get the current time as a string
def get_localtime():
    return time.strftime("%H:%M:%S", time.localtime())

# main function
def begin_access():
    starturl = 'http://myblog.wordpress.com/'
    totalurl[starturl] = True
    unusedurl[starturl] = True
    print 'Seq\tTime\tTitle\tURL'

    i = 0
    while i < 150:
        nexturl = unusedurl.keys()[0]
        del unusedurl[nexturl]
        content = get_file_content(nexturl)
        title = extract_title(content)
        urllist = extract_url(content)
        for url in urllist:
            if not totalurl.has_key(url):
                totalurl[url] = True
                unusedurl[url] = True
        print '%d\t%s\t%s\t%s' % (i, get_localtime(), title, nexturl)
        i = i + 1
        time.sleep(2)

# call the main function
begin_access()
 
Reprinted from: http://www.cnblogs.com/mdyang/archive/2012/04/03/first_python_script.html
