Tutorial on using txt2html to implement a web filtering agent under Python

Last Update:2016-06-06 Source: Internet

Author: User

Tags html form gopher python script

While writing this developerWorks series, I have struggled to settle on the best format to write in. Word processor formats are proprietary, and converting between them is often unreliable and cumbersome (each format also ties the document to a particular specialized tool, which runs against the spirit of open source). HTML is fairly neutral, and is probably the format in which you are reading this article right now, but it adds tags that are easy to mistype (or that push people toward HTML-enhanced editors). DocBook is an interesting XML format that can be converted into many target formats and has the right semantics for technical articles (or books), but like HTML it has many tags to worry about while writing. LaTeX is well suited to sophisticated typesetting, but it too has a lot of markup, and these articles do not need sophisticated typesetting.

To concentrate purely on writing, especially in a way that is neutral across platforms and tools, plain unformatted ASCII is the best choice. However, the Internet (especially Usenet) has prompted the development of an informal standard for "smart ASCII" documents that builds on completely unformatted text (see Resources). Smart ASCII adds only a little extra semantic content and context, in ways that look "natural" in a text display. E-mail, newsgroup messages, FAQs, project READMEs, and other electronic documents usually include some typographic/semantic elements, such as asterisks around emphasized words, underlines beneath titles, vertical and horizontal whitespace that shows how pieces of text relate, selective use of ALL CAPS, and a few others. Project Gutenberg (see Resources), a remarkable effort, put a great deal of thought into its own format and concluded that "smart ASCII" is the best choice for the long-term preservation and distribution of good books. Even though these articles will not endure like the literary classics, I decided to write them in "smart ASCII" and to convert them automatically to other formats with a handy Python script.

Introducing txt2html

txt2html started out as a simple file converter, as its name suggests. But the Internet suggested several obvious enhancements to the tool. Because many documents that readers might want to view in "HTML-ized" form sit at the end of http: or ftp: links, the tool really should handle such remote documents directly (without a download/convert/view cycle). And because the result of the conversion is ultimately HTML, what we usually want to do is view the converted document in a Web browser.

Putting these together, txt2html became a "Web-based filtering agent". The term is a fancy one, and may simply be an attempt to "say it all". It captures the idea of a program that reads a Web page (or other resource) on your behalf, processes the content in some way, and then presents the page to you in a form that is better than the original (at least for some particular purpose). A good example of such a tool is the Babelfish translation service (see Resources). After you run a URL through Babelfish, the Web page you see looks much like the original page, but it shows text in a language you can read rather than one you do not understand. To some extent, every search engine that displays summaries on its results page is doing the same thing. But those search engines (by design) take far more liberty with the format and appearance of the target page, and leave out a great deal of content. Of course, txt2html is not as powerful as Babelfish, but conceptually the two do largely the same thing. See Resources for more examples, some of them humorous.

The best thing about txt2html is that it uses a number of programming techniques that are common to many different Web-oriented uses of Python. This article introduces those techniques and explains the coding involved and the capabilities of several Python modules. Note: the actual module in txt2html is called dmTxt2Html, to avoid conflicts with module names used by other authors.

Using the cgi module

The cgi module in the standard Python distribution is a godsend for anyone developing "Common Gateway Interface" applications in Python. You could create CGIs without it, but you would not want to.

Most commonly, you interact with a CGI application through an HTML form. You fill out a form that calls the CGI, and the CGI uses the values you supplied to carry out its operation. For example, the txt2html documentation uses the following HTML form to invoke it (the form generated by txt2html itself is more complex and may change, but this example works well, even from within your own Web pages):
HTML form that calls 'txt2html'

You can include many input fields in an HTML form, and a field can be one of several types (text, checkboxes, pick lists, radio buttons, and so on). Any good book on HTML can help a beginner create custom HTML forms. The main thing to remember here is that every field has a name attribute, and the CGI script later refers to the field by that name. Another detail to know is that a form can use one of two methods: "get" and "post". The basic difference is that "get" includes the query information in the URL, which makes it easy for users to save a specific query for later reuse. If you do not want users to be able to save queries, use the "post" method instead.
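The difference between the two methods can be sketched in a few lines of Python 3. The CGI address and the field names below are hypothetical, chosen only for illustration; the point is that the same encoded name/value pairs end up either in the URL or in the request body.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical CGI location and field names, for illustration only.
CGI_URL = "http://example.com/cgi-bin/txt2html.cgi"
fields = {"source": "http://example.com/notes.txt", "preface": "yes"}

# "get": the encoded fields become part of the URL, so the query can be bookmarked.
get_url = CGI_URL + "?" + urlencode(fields)

# "post": the same encoded fields travel in the request body; the URL stays bare.
post_url = CGI_URL
post_body = urlencode(fields).encode("ascii")

print(get_url)
print(post_url, post_body)

# Either way, the receiving script can recover the same name/value pairs.
recovered = parse_qs(urlsplit(get_url).query)
```

Each field name maps to a list of values after parsing, because a form may legitimately submit the same name more than once.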

The Python script that the form above calls imports the cgi module to make working with its calling form easier. One thing this module does is hide every difference between the "get" and "post" methods from the CGI script. By the time the call is made, this is not a detail the CGI author needs to worry about. The main job of the cgi module is to present all the fields of the calling HTML form in a dictionary-like fashion. What you get is not quite a Python dictionary, but it is used in much the same way:
Using the Python [cgi] module

    import cgi, sys

    cfg_dict = {'target': '<STDOUT>'}
    sys.stderr = sys.stdout           # send tracebacks to the browser
    form = cgi.FieldStorage()         # dictionary-like view of the form fields
    if form.has_key('source'):
        cfg_dict['source'] = form['source'].value

Notice a few details in the lines above. One trick we use is setting sys.stderr = sys.stdout. If the script hits an uncaught error, this sends the traceback to the client browser, which can save a lot of time when debugging a CGI application. But you may not want users to see it (or you may, if they are likely to report the details of the problem back to you). Next, we read the HTML form values into a dictionary-like form instance. The form has a .has_key() method, just as a Python 2 dictionary does. Unlike a dictionary, however, to actually get the value stored under a key we must look at the key's .value attribute.
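The effect of pointing error output at the page can be illustrated with a small self-contained sketch in Python 3. The names run_with_visible_errors and broken_handler are hypothetical, and an in-memory buffer stands in for the output stream that a CGI script would send to the browser.

```python
import io
import sys
import traceback

def run_with_visible_errors(handler, out):
    """Run handler(); if it raises, the traceback is written to `out`."""
    saved = sys.stderr
    sys.stderr = out              # same idea as sys.stderr = sys.stdout
    try:
        handler()
    except Exception:
        traceback.print_exc()     # prints to sys.stderr, which is now `out`
    finally:
        sys.stderr = saved        # restore the real error stream

def broken_handler():
    # Hypothetical request handler that fails.
    raise ValueError("bad form input")

page = io.StringIO()              # stands in for the page sent to the client
run_with_visible_errors(broken_handler, page)
report = page.getvalue()
print(report)
```

Restoring the original stream afterwards is a design choice made here for the demonstration; a short-lived CGI script can simply leave the assignment in place until it exits.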

At this point, everything from the HTML form is held in plain Python variables, and we can handle them as we would in any other Python program.
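For readers on a current interpreter: the cgi module was deprecated in Python 3.11 and removed in Python 3.13, so the listing above runs only on older versions. A rough Python 3 sketch of the same idea, copying submitted fields into a configuration dictionary, can use urllib.parse instead. The helper name read_form and the default values are assumptions made for this example, not part of txt2html.

```python
from urllib.parse import parse_qs

def read_form(query_string, defaults):
    """Merge submitted form fields into a copy of `defaults`."""
    cfg = dict(defaults)
    fields = parse_qs(query_string)      # {'name': ['value', ...], ...}
    for name, values in fields.items():
        cfg[name] = values[0]            # keep the first value for each field
    return cfg

# In a real CGI script the query string for a "get" request would come from
# os.environ["QUERY_STRING"]; a literal is used here so the sketch is runnable.
cfg_dict = read_form("source=http%3A%2F%2Fexample.com%2Fa.txt",
                     {"target": "stdout"})
print(cfg_dict)
```

Fields that the form did not submit simply keep their default values, which mirrors the membership test in the original listing.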

Using the urllib module

Like most Python modules, urllib handles a lot of complicated things in an intuitive and simple way. The urlopen() function in urllib lets you treat any remote resource, whether http:, ftp:, or gopher:, as if it were a local file. Once you have grabbed a remote (pseudo-)file object with urlopen(), you can treat it just like the file object of a local (read-only) file:
Using the Python [urllib] module

    from urllib import urlopen
    import string

    source = cfg_dict['source']
    if source == '<STDIN>':
        fhin = sys.stdin
    else:
        try:
            fhin = urlopen(source)
        except:
            ErrReport(source+' could not be opened!', cfg_dict)
            return
    doc = ''
    for line in fhin.readlines():     # need to normalize line endings!
        doc = doc+string.rstrip(line)+'\n'

I ran into a small problem: when the platform that produced the resource and your own platform use different end-of-line conventions, odd things can happen in the resulting text (this looks like a bug in urllib). The fix is the little .readlines() loop in the code above. Whatever the resource originally looked like, the loop gives you a string that uses the end-of-line convention of the platform you are on (presumably a sensible one).
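The line-ending loop can be sketched in Python 3 as shown here. In Python 3, urlopen() lives in urllib.request and returns bytes, so this version decodes each line first; the function name normalize_lines and the UTF-8 default are assumptions made for this example. An in-memory stream with CR LF endings stands in for the remote resource so that the sketch runs without a network connection.

```python
import io

def normalize_lines(fh, encoding="utf-8"):
    """Return the content of `fh` with each line ending rewritten as '\n'."""
    parts = []
    for raw in fh.readlines():
        line = raw.decode(encoding) if isinstance(raw, bytes) else raw
        # rstrip() removes trailing whitespace, including '\r' and '\n'.
        parts.append(line.rstrip() + "\n")
    return "".join(parts)

# Stand-in for a remote resource served with CR LF line endings.
remote = io.BytesIO(b"Title\r\n=====\r\n\r\nBody text  \r\n")
doc = normalize_lines(remote)
print(repr(doc))

# With a real URL the call would look roughly like this (not executed here):
#   from urllib.request import urlopen
#   with urlopen("http://example.com/notes.txt") as fh:
#       doc = normalize_lines(fh)
```

As in the original loop, trailing spaces and tabs on each line are dropped along with the line ending, which is usually harmless for text that is about to become HTML.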

Using the re module

Given the space limits of this article, only a small part of regular expressions can be discussed here. Several reference books on the subject are listed in the Resources section. The re module is used extensively in txt2html to recognize various textual patterns in the source text. Let's look at a moderately complicated example:
Using the Python [re] module

    import re

    def URLify(txt):
        txt = re.sub('((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)',
                     '<a href="\\1">\\1</a>\\2', txt)
        return txt

URLify() is a small function that does what its name suggests. When something that looks like a URL turns up in a "smart ASCII" file, it is turned into a live hot link to that same URL in the HTML output. Let's look at what re.sub() does. First, in the broadest terms, the function's purpose is "find whatever matches the first pattern and replace it with the second pattern, using the third argument as the string to operate on." So far, so good: at this level it is no different from string.replace().
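A short comparison may help here. The sample text and patterns below are made up for illustration; they show the shared (pattern, replacement, string) shape and the one thing re.sub() adds, namely that the pattern can describe a whole family of strings.

```python
import re

text = "colour and color"

# str.replace() swaps one fixed string for another.
fixed = text.replace("colour", "color")

# re.sub() takes (pattern, replacement, string); here the optional "u"
# lets a single pattern cover both spellings.
flexible = re.sub(r"colou?r", "COLOR", text)

print(fixed)
print(flexible)
```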

The first pattern has several elements. First, notice the parentheses: at the top level the pattern consists of two parenthesized groups, a complicated one followed by (\s). Each pair of parentheses marks a subexpression whose match can become part of the replacement pattern. The second subexpression, (\s), just means "match any single whitespace character, and remember what was matched so we can refer back to it." So let's look at the first subexpression.

Python regular expressions have a few tricks of their own. One of them is the ?: operator at the start of a subexpression. It means "match the sub-pattern, but do not include the match in the back-references." So let's look at this subexpression:

((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))

First, notice that this subexpression is itself made up of two smaller subexpressions, with some characters between them that belong to neither. Each of the two begins with ?:, which means that it takes part in matching but is not available as a back-reference. The first of these "non-referenced" subexpressions just says "match something that looks like http or ftp or one of the other values." Next comes the string ://, which matches a string that looks exactly like itself (simple, isn't it?). Finally, we reach the second inner subexpression, which, apart from the "do not reference" operator, consists of a bracketed expression followed by a plus sign.
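The effect of ?: on back-references is easy to check interactively. This sketch uses a simplified pattern and a made-up sample URL; only the groups without ?: show up in the result of .groups().

```python
import re

# Both patterns match the same text; they differ only in what is remembered.
capturing = re.compile(r"(http|ftp)://(\S+)")
non_capturing = re.compile(r"(?:http|ftp)://(\S+)")

sample = "ftp://example.org/pub"
groups_a = capturing.match(sample).groups()
groups_b = non_capturing.match(sample).groups()

print(groups_a)   # the scheme and the rest are both available as groups
print(groups_b)   # only the rest is available; the scheme group is not kept
```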

In a regular expression, square brackets simply mean "match any one of the characters inside the brackets." However, if the first character is a caret (^), the meaning is reversed: "match any character that is not one of the characters that follow." So we are looking for characters that are not a space, CR, LF, "<", or ")" (note also that you can escape characters that have a special meaning in regular expressions by putting a "\" in front of them). The plus sign at the end means "match one or more of the preceding item" (an asterisk would mean "zero or more", and a question mark "zero or one").
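These rules can be tried out directly. The sketch below applies the negated character class from the article to a made-up string, then compares the three quantifiers on a deliberately tiny pattern.

```python
import re

# [^ \n\r<\)]+ : one or more characters that are NOT space, LF, CR, "<" or ")".
url_tail = re.compile(r"[^ \n\r<\)]+")
tail_match = url_tail.match("example.com/page) and more").group()

# Quantifiers applied to the single character "b", tested against whole strings:
candidates = ("a", "ab", "abb")
one_or_more = [bool(re.fullmatch(r"ab+", s)) for s in candidates]
zero_or_more = [bool(re.fullmatch(r"ab*", s)) for s in candidates]
zero_or_one = [bool(re.fullmatch(r"ab?", s)) for s in candidates]

print(tail_match)
print(one_or_more, zero_or_more, zero_or_one)
```

The match stops just before the closing parenthesis, which is why a URL written inside parentheses in running text does not swallow the punctuation after it.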

There is a lot to take in with this regular expression, but if you read through it a few times you will see that this is what a URL looks like.

Then there is the replacement part, which is simpler. The parts that look like \\1 and \\2 (or \\3, \\4, and so on, if needed) are the "back-references" just mentioned. \\1 (or \\2) stands for whatever matched the first (or second) subexpression of the match pattern. The rest of the replacement has no special meaning: it is just some characters that are easily recognized as HTML code. One slightly subtle point is why we bother to match \\2 at all, since it is only a whitespace character. Someone might ask, "Why go to the trouble? Why not just insert a space directly?" Fair enough; as far as the HTML is concerned, we do not need to do this. But aesthetically, it is nice for the HTML output to keep the look of the source text file as far as possible, apart from the added HTML markup. In particular, let's keep line breaks as line breaks and spaces as spaces (and tabs as tabs).
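Putting the pieces together, a Python 3 sketch of such a function might look like the following. It uses raw strings, so the back-references are written \1 and \2 instead of \\1 and \\2. The anchor markup in the replacement is an assumption based on the description of a "hot link", and the sample text is invented for the demonstration.

```python
import re

# Group 1: the whole URL. Group 2: the whitespace character that follows it.
URL_PATTERN = re.compile(r"((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)")

def urlify(txt):
    """Wrap URL-like strings in anchor tags, keeping the trailing whitespace."""
    return URL_PATTERN.sub(r'<a href="\1">\1</a>\2', txt)

sample = "See http://example.com/a\nand ftp://example.org/b for details.\n"
linked = urlify(sample)
print(linked)
```

Because \2 puts back exactly the whitespace character that was matched, the newline after the first URL and the space after the second one both survive in the output.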



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
