A tutorial on using txt2html to implement web filtering agents under Python

Last Update:2017-01-19 Source: Internet

Author: User

Tags html form gopher stdin python script

While writing this developerWorks series, I have faced the problem of choosing the best format to write in. Word processor formats are proprietary, and converting between them is unreliable and cumbersome (and each format ties the document to a different proprietary tool, which runs against the spirit of open source). HTML is fairly neutral (it is probably the format you are reading right now), but it adds tags that are easy to mistype (or that tie you to an HTML-aware editor). DocBook is an interesting XML format that can be converted into many target formats, and it has the right semantics for technical articles (or books), but like HTML it has many tags to worry about while writing. LaTeX is excellent for complex typesetting, but it too has a lot of tags, and these articles don't require complex typesetting.

For ease of composition, and especially for platform and tool neutrality, plain ASCII is the best choice. Beyond completely unformatted text, however, the Internet (especially Usenet) has fostered an informal standard for "smart ASCII" documents (see Resources). Smart ASCII adds only a little extra semantic content and context, and it looks natural in a plain text display. E-mail, newsgroup messages, FAQs, project README files, and other electronic documents often include a few typographic/semantic elements, such as asterisks around emphasized words, underscores under headings, vertical and horizontal whitespace that shows textual relationships, selective use of uppercase, and so on. Project Gutenberg (see Resources) is a remarkable effort that put a lot of thought into its own format and concluded that "smart ASCII" is the best choice for preserving and distributing good books over the long term. Even if these articles will not endure as long as the literary classics, I decided to write them in "smart ASCII" and to convert them automatically to other formats with handy Python scripts.

Introducing txt2html

txt2html started out as a simple file converter, as its name suggests. But the Internet suggested a few obvious enhancements to the tool. Because many of the documents a reader might want to view in HTML form sit behind http: or ftp: links, the tool really ought to handle such remote documents directly (without a download/convert/view cycle). And because the target of the conversion is ultimately HTML, what we usually want to do with the converted document is view it in a Web browser.

Put these together and txt2html becomes a "Web-based filtering agent." The term is fancy enough to sound like a buzzword, but it captures the idea well: a program reads a Web page (or other resource) on your behalf, processes the content in some way, and then presents the page to you in a form that is better than the original (at least for some particular purpose). A good example of such a tool is the Babelfish translation service (see Resources). After you run a URL through Babelfish, the Web page you see looks much like the original, but it shows text in a language you can read instead of one you don't understand. In a sense, every search engine that displays summaries of its results is doing the same thing. But search engines (by design) take much more liberty with the format and appearance of the target page, and they leave out a lot of content. txt2html is certainly not as powerful as Babelfish, but conceptually the two do largely the same thing. See Resources for more examples, some of them humorous.

The best thing about txt2html is that it uses a number of programming techniques common to many different Web-oriented uses of Python. This article introduces those techniques and explains some coding tricks and the capabilities of several Python modules. Note that the actual module in txt2html is called dmtxt2html, to avoid conflicts with the names of modules written by others.

Using the cgi module

The cgi module in the standard Python distribution is a godsend for anyone developing "Common Gateway Interface" applications in Python. You could write a CGI without it, but you wouldn't want to.

Most often, you interact with a CGI application through an HTML form. You fill out a form, and the form calls the CGI to perform an action based on what you entered. For example, the txt2html documentation uses the following HTML form to call it (the form generated by txt2html itself is more complicated and may change, but this example works fine, even from your own Web pages):
An HTML form that calls 'txt2html'

An HTML form can contain many input fields, and the fields can be of several different types (text, checkboxes, pick lists, radio buttons, and so on). Any good book on HTML can help a beginner create custom HTML forms. The main thing to remember here is that each field has a name attribute, and the CGI script later uses that name to refer to the field. One more detail worth knowing is that a form can use one of two methods: "get" and "post." The basic difference is that "get" includes the query information in the URL, which makes it easy for users to save a specific query for later reuse. If you do not want users to be able to save queries, use "post" instead.
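To make the "get" behaviour concrete, here is a small sketch (in current Python 3) of the URL a browser would request when a form with one field named source is submitted using "get". The script address and the document URL are placeholders invented for this illustration; they are not the real txt2html location.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical address of the CGI script (an assumption for illustration).
CGI_URL = "http://example.com/cgi-bin/txt2html.cgi"

def build_get_url(cgi_url, fields):
    """Return the URL requested when a form is submitted with 'get'."""
    # urlencode percent-escapes each value and joins name=value pairs with '&'.
    return cgi_url + "?" + urlencode(fields)

url = build_get_url(CGI_URL, {"source": "http://example.com/notes.txt"})
print(url)

# The field values can be recovered from the saved URL alone, which is
# why a "get" query is easy to bookmark and reuse.
print(parse_qs(urlsplit(url).query))
```

With "post", the same name=value pairs travel in the request body instead of the URL, so nothing comparable ends up in a bookmark.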

The Python script called by the form above imports cgi to make it easier to parse its calling form. One thing this module does is hide every detail of the difference between the "get" and "post" methods from the CGI script. By the time the call is made, this is not a detail the CGI author needs to worry about. The main job of the cgi module is to present all the fields of the calling HTML form in a dictionary-like fashion. What you get is not quite a Python dictionary, but it is used in much the same way:
Using the Python [cgi] module

  import cgi, sys
  cfg_dict = {'target': '<STDOUT>'}
  sys.stderr = sys.stdout              # send tracebacks to the browser
  form = cgi.FieldStorage()            # dictionary-like view of the form fields
  if form.has_key('source'):
      cfg_dict['source'] = form['source'].value

Notice a few details in the lines above. One trick is setting sys.stderr = sys.stdout. If the script hits an uncaught error, the traceback is then sent back to the client browser, which can save a lot of time when debugging a CGI application. You may not want users to see tracebacks, though (or you may, if they are likely to report the details of the problem to you). Next, we read the HTML form values into the dictionary-like form instance. Like a true Python dictionary, form has a .has_key() method. Unlike a Python dictionary, however, to actually get the value stored under a key we must look at the .value attribute of that entry.

From here on, everything from the HTML form is in ordinary Python variables, and we can handle it just as in any other Python program.
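The listing above is written for the Python of its day. The cgi module was deprecated in Python 3.11 and removed in Python 3.13, and dictionaries no longer have .has_key(). As a rough sketch of the same step on a current interpreter, the fields of a "get" request can be read from the QUERY_STRING environment variable that the Web server sets for a CGI script. The helper names read_form and configure are made up for this example, and a "post" request (whose fields arrive on standard input) is deliberately not handled.

```python
import os
from urllib.parse import parse_qs

def read_form(environ=os.environ):
    """Parse the fields of a 'get' request from the CGI environment."""
    query = environ.get("QUERY_STRING", "")
    # parse_qs maps each field name to a list of its values.
    return parse_qs(query)

def configure(form):
    """Build the configuration dictionary from the parsed form fields."""
    cfg_dict = {"target": "<STDOUT>"}
    if "source" in form:                        # was: form.has_key('source')
        cfg_dict["source"] = form["source"][0]  # was: form['source'].value
    return cfg_dict

# Simulate the environment a Web server would provide to the script.
form = read_form({"QUERY_STRING": "source=http%3A%2F%2Fexample.com%2Fnotes.txt"})
print(configure(form))
```

Passing the environment in as an argument keeps the function easy to try out without a Web server.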

Using the urllib module

Like most Python modules, urllib handles many complicated things in an intuitive and simple way. The urlopen() function in urllib takes a remote resource (whether http:, ftp:, or gopher:) and lets you treat it like a local file. Once you have grabbed a remote (pseudo-)file object with urlopen(), you can treat it just like the file object of a local (read-only) file:
Using the Python [urllib] module

  from urllib import urlopen
  import string
  source = cfg_dict['source']
  if source == '<STDIN>':
      fhin = sys.stdin
  else:
      try:
          fhin = urlopen(source)
      except:
          errreport(source+' could not be opened!', cfg_dict)
          return
  doc = ''
  for line in fhin.readlines():   # Need to normalize line endings!
      doc = doc+string.rstrip(line)+'\n'

I have run into one small problem: when the platform that produced the resource and your own platform use different end-of-line conventions, odd things can happen in the resulting text (this looks like a bug in urllib). The workaround is the small .readlines() loop in the code above. Whatever the resource looked like originally, the loop gives you a string with the right end-of-line convention for the platform you are on (which is presumably the sensible one).
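For readers on Python 3, where urlopen() lives in urllib.request and returns bytes rather than text, the same idea might be sketched as follows. The function names normalize_lines and read_source are invented for this example, and decoding as UTF-8 is an assumption; a real resource may declare a different encoding.

```python
import sys
from urllib.request import urlopen

def normalize_lines(lines):
    # Strip trailing whitespace (including any CR/LF) from every line
    # and terminate each one with a single newline character.
    return "".join(line.rstrip() + "\n" for line in lines)

def read_source(source, encoding="utf-8"):
    """Return the text of standard input or of the resource at a URL."""
    if source == "<STDIN>":
        return normalize_lines(sys.stdin)
    with urlopen(source) as fhin:
        # The response yields bytes, so decode each line before cleaning it.
        return normalize_lines(
            raw.decode(encoding, errors="replace") for raw in fhin
        )

print(repr(normalize_lines(["first\r\n", "second  \n", "third"])))
```

Note that, like the original loop, this also removes trailing spaces and tabs at the end of each line, not only the line terminator.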

Using the re module

Given the scope of this article, only a small part of regular expressions is covered here. Several references on the topic are listed in Resources. The re module is used extensively in txt2html to identify various textual patterns in the source text. Let's look at a moderately complicated example:
Using the Python [re] module

  import re
  def urlify(txt):
      txt = re.sub('((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)',
                   '<a href="\\1">\\1</a>\\2', txt)
      return txt

urlify() is a compact function that does what its name suggests. When something that looks like a URL is found in the smart ASCII file, it is converted into a real hotlink to that same URL in the HTML output. Let's look at what re.sub() does. At the broadest level, the function means "look for whatever matches the first pattern, and replace it with the second pattern, using the third argument as the string to operate on." So far, that is no different from string.replace().

The first pattern has several elements. First, note the parentheses: at the top level there are two parenthesized groups, a complicated one followed by (\s). Parentheses mark "subexpressions" that the replacement pattern can refer to. The second subexpression, (\s), just means "match one whitespace character, and let us refer back to what was matched." So let's look at the first subexpression.

Python regular expressions have a few tricks of their own. One of them is the ?: operator at the start of a subexpression. It means "match the subpattern, but do not include the match in the back-references." With that in mind, let's examine this subexpression:

((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))

First, notice that this subexpression is itself made up of two subexpressions, with some characters between them that belong to neither. Each of the two inner subexpressions begins with ?: which means that it is matched but not counted for back-reference purposes. The first of these "non-referenced" subexpressions just says "match something that looks like http or ftp or one of the other alternatives." Next comes the short string ://, which matches anything that looks exactly like it (simple, right?). Finally there is the second inner subexpression, which, apart from the "non-referenced" operator, consists of some stuff in square brackets followed by a plus sign.
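The effect of ?: is easiest to see by comparing which groups a match reports with and without it. The two patterns below are simplified stand-ins written for this illustration (they use \S+ for the address part and only two schemes); they are not the pattern txt2html uses.

```python
import re

# Every pair of parentheses captures, so three groups are reported.
capturing = re.compile(r"((http|ftp)://(\S+))")

# The inner groups start with ?: and are matched but not captured,
# so only the outer group is reported.
non_capturing = re.compile(r"((?:http|ftp)://(?:\S+))")

text = "ftp://example.com/file.txt"
print(capturing.match(text).groups())
print(non_capturing.match(text).groups())
```

Both patterns match exactly the same text; the only difference is how many numbered groups are available afterwards, which is what keeps \\1 pointing at the whole URL in the replacement.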

In a regular expression, square brackets just mean "match any one of the characters in the brackets." If the first character is a caret (^), however, the meaning is reversed: "match any character that is not one of the following." So we are looking for characters that are not a space, CR, LF, "<", or ")" (note also that characters with special meaning to regular expressions can be escaped by putting a "\" in front of them). The plus sign at the end means "match one or more of the preceding item" (an asterisk means "zero or more," and a question mark means "zero or one").
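A few throwaway checks can confirm how the negated character class and the three quantifiers behave. The sample strings are made up for this sketch.

```python
import re

# The negated class from the article: any run of characters that are
# not a space, LF, CR, "<" or ")".
url_chars = re.compile(r"[^ \n\r<\)]+")

# Matching stops at the first excluded character, here the ")".
print(url_chars.match("example.com/page) and more").group())

def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"ab+", "a"), matches(r"ab+", "abbb"))  # + : one or more "b"
print(matches(r"ab*", "a"), matches(r"ab*", "abbb"))  # * : zero or more "b"
print(matches(r"ab?", "a"), matches(r"ab?", "abb"))   # ? : zero or one "b"
```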

There is a lot to digest in this regular expression, but if you look at it a few more times you will see that it describes what a URL looks like.

Then there is the replacement part, which is much simpler. The things that look like \\1 and \\2 (or \\3, \\4, and so on, if needed) are the "back-references" just mentioned. \\1 (or \\2) stands for whatever was matched by the first (or second) subexpression of the match pattern. The rest of the replacement has no special meaning: it is just characters that are easily recognized as HTML code. One slightly subtle point is why we bother to match \\2 at all, since it is only a whitespace character. One might ask, "Why go to the trouble? Why not just insert a space directly?" Fair question; for the HTML itself we don't need to. But aesthetically, it is nice to keep the HTML output as close as possible to the source text file, apart from the added HTML markup. In particular, line breaks should stay line breaks and spaces should stay spaces (and tabs should stay tabs).
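As a sketch of how the same function might be written today, the version below uses raw strings, so the back-references can be typed with a single backslash, and compiles the pattern once. The constant name URL_PATTERN and the sample text are choices made for this example; the pattern itself is the one discussed above.

```python
import re

# Group 1 is the whole URL; group 2 is the whitespace character after it.
URL_PATTERN = re.compile(r"((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)")

def urlify(txt):
    """Wrap each URL that is followed by whitespace in an HTML anchor."""
    # \1 inserts the URL twice; \2 puts back the original whitespace
    # character, so a newline stays a newline and a space stays a space.
    return URL_PATTERN.sub(r'<a href="\1">\1</a>\2', txt)

sample = "See http://example.com/a and ftp://example.com/b\n"
print(urlify(sample))
```

One consequence of requiring (\s) after the URL is that a URL at the very end of the text, with no whitespace after it, is left unchanged.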

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
