A tutorial on parsing with the SimpleParse module in Python

Like most programmers, I often need to identify the parts and structures that exist in text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. Each of these document types has its own "little language" that specifies what can appear within it.

The programs I write to handle these informal parsing tasks are always a bit of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in these programs is always roughly the same: "read some text, figure out whether anything can be done with it, maybe read a little more text, and keep trying."

Parsers of various kinds distill the description of the parts and structures in a document into concise, clear, declarative rules that specify how the components of a document are identified. Here, the declarative aspect is the most compelling. All of my old ad hoc parsers shared the same style: read some characters, make a decision, accumulate some variables, clear them, repeat. As observed in the functional programming installments of this column, that step-by-step style of program flow is comparatively error-prone and difficult to maintain.

Formal parsers almost always use some variant of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they handle. The tools we examine here do so, as does the popular compiler-development tool YACC (and its variants). Basically, an EBNF grammar gives names to the parts you might find in a document, and larger parts are frequently composed of smaller parts. Operators, usually the same symbols you see in regular expressions, specify how often and in what order the smaller parts may occur within the larger ones. In parser-talk, each named part of a grammar is called a "production."

Perhaps you have never learned EBNF as such, but you have certainly seen EBNF descriptions in passing. For example, the familiar Python Language Reference defines what a floating-point number looks like in Python:
An EBNF-style description of floating-point numbers

floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (nonzerodigit digit* | pointfloat) exponent
intpart       ::= nonzerodigit digit* | "0"
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+

Or you may have seen XML DTD elements defined in an EBNF style. For example, the developerWorks tutorial DTD contains declarations similar to the following:
An EBNF-style description from the developerWorks DTD

For instance, an element declaration along these lines (the element names here are illustrative, not quoted from the actual DTD):
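
<!ELEMENT body ((example-column | image-column)?, text-column) >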

The spelling differs slightly, but the general concepts of quantification, alternation, and sequencing exist in all EBNF-style language grammars.
Generating a tag list with SimpleParse

SimpleParse is an interesting tool. To use the module, you also need the underlying module mxTextTools, which implements a "tagging engine" in C. mxTextTools (see Resources later in this article) is powerful, but rather difficult to use directly. With SimpleParse layered on top of mxTextTools, the job becomes much simpler.

Using SimpleParse really is simple, because most of the complexity of mxTextTools never needs to be thought about. First, create an EBNF-style grammar that describes the language to be processed. The second step is to call mxTextTools to create a tag list that describes every production that succeeded when the grammar was applied to the document. Finally, use the tag list returned by mxTextTools to do the actual work.
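
The whole pipeline fits in a few lines. Here is a minimal sketch with a toy grammar of my own; the buildParser and tag calls are the same ones the full program below uses:

from simpleparse import generator
from mx.TextTools import TextTools

# Step 1: an EBNF-style grammar, here a toy one for comma-separated words
decl = '''wordlist := word, (',', word)*
word     := [a-zA-Z]+
'''

# Step 2: compile the grammar into a parser (an mxTextTools tag table)
# for the root production we care about
parser = generator.buildParser(decl).parserbyname('wordlist')

# Step 3: apply the parser to a document; the result holds a nested tag
# list of (production, start, end, children) offsets into the text
taglist = TextTools.tag('alpha,beta,gamma', parser)
print(taglist)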

For this article, the "language" we want to parse is the set of markup codes that "smart ASCII" uses to indicate things like boldface, module names, and book titles. This is the same language that previous installments identified with mxTextTools directly, and before that with regular expressions and a state machine. The language is far simpler than a complete programming language, but complex enough to be representative.

Here we may need to step back for a moment. What is this "tag list" that mxTextTools gives us? It is basically a nested structure that records nothing but the character offsets at which each production matched in the source text. mxTextTools races through the source text quickly, but it does nothing to the source text itself (at least when using a SimpleParse grammar, which attaches no actions). Let us look at an abridged tag list:
Tag list generated from the SimpleParse grammar

(1,
 [('plain', 0, 15,
   [('word', 0, 4, [('alphanums', 0, 4, [])]),
    ('whitespace', 4, 5, []),
    ('word', 5, 10, [('alphanums', 5, 10, [])]),
    ('whitespace', 10, 11, []),
    ('word', 11, 14, [('alphanums', 11, 14, [])]),
    ('whitespace', 14, 15, [])]),
  ('markup', 15, 27, [...]),
  ...],
 289)

The ellipses stand for a batch of further matches. The part we can see says the following. The root production ("para") succeeds, ending at offset 289 (the length of the source text). The child production "plain" spans offsets 0 through 15, and is itself made up of smaller productions. After the "plain" production, the "markup" production spans offsets 15 through 27. The details are omitted here, but this first "markup" has components of its own, and additional productions succeed later in the source text.
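
Since each tuple records only offsets, recovering the matched text is plain string slicing; a minimal illustration, assuming src holds the source string and taglist the structure above:

name, start, end, children = taglist[1][0]   # first top-level production
print('%s: %r' % (name, src[start:end]))     # -> plain: its first 15 characters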

An EBNF-style grammar for "smart ASCII"

We have glanced at the tag list that SimpleParse + mxTextTools can give us, but we really need to look at the grammar that generated it. The real work happens in the grammar. An EBNF grammar takes hardly any work to read (although designing one does require a little thought and testing):
typographify.def

para        := (plain / markup)+
plain       := (word / whitespace / punctuation)+
whitespace  := [ \t\r\n]+
alphanums   := [a-zA-Z0-9]+
word        := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct   := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup      := emph / strong / module / code / title
emph        := '-', plain, '-'
strong      := '*', plain, '*'
module      := '[', plain, ']'
code        := "'", plain, "'"
title       := '_', plain, '_'
punctuation := (safepunct / mdash)
mdash       := '--'
safepunct   := [!@#$%^&()+=|\{}:;<>,.?/"]

This grammar reads almost exactly the way one would describe "smart ASCII" verbally, which makes it very clear. A paragraph consists of some plain text and some marked-up text. Plain text consists of some collection of words, whitespace, and punctuation. Marked-up text might be emphasized text, strongly emphasized text, a module name, and so on. Strongly emphasized text is surrounded by asterisks, and the other markup productions follow the same pattern. A few features, such as just what a "word" really is, or just what symbols may end a contraction, take a bit of thought, but the EBNF syntax itself never gets in the way.
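
For instance, in a line of smart ASCII such as:

A *very* good way to use the [os] module

the span *very* matches the strong production, [os] matches module, and the stretches of text around them match plain.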

By contrast, a set of regular expressions can describe roughly the same rules more tersely. That is what the first version of the smart ASCII markup program did. But that compactness is much harder to get right, and harder still to adjust later. The following code expresses a largely (but not precisely) equivalent set of rules:
Python regular expressions for smart ASCII

# [module] names
re_mods =   r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;'"?!/-])"""
# *strongly emphasize* words
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""
# -emphasize- words
re_emph =   r"""([\(\s'/"]|^)-(.*?)-([\s\.\),:;'"?!/])"""
# _book title_ citations
re_title =  r"""([\(\s'/"]|^)_(.*?)_([\s\.\),:;'"?!/-])"""
# 'function()' names
re_funcs =  r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])"""
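
Each pattern captures three groups (leading context, the marked span, trailing context), so applying one is a single re.sub with backreferences; a sketch using the re_strong pattern above:

import re
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""
html = re.sub(re_strong, r'\1<strong>\2</strong>\3', "A *very* good way")
print(html)   # -> A <strong>very</strong> good way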

If you find or invent some slightly different variant of the language, accommodating it is much simpler in the EBNF grammar than in those regular expressions. What is more, mxTextTools usually performs the pattern matching even faster.

Generating and using the tag list

For the sample program, we put the actual grammar in a separate file. For most purposes this organization is better and easier to work with. Changing the grammar and changing the application logic are generally different kinds of task, and the files reflect this. But the only thing we do with the grammar is pass it as a string to a SimpleParse function, so we could in principle include it in the main application (or even generate it dynamically in some fashion).

Let's look at the complete (simplified) tagging application:
typographify.py

import os
from sys import stdin, stdout, stderr
from simpleparse import generator
from mx.TextTools import TextTools

input = stdin.read()
decl = open('typographify.def').read()        # the EBNF grammar shown above
from typo_html import codes                   # output markup table (below)
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag(input, parser)
for tag, beg, end, parts in taglist[1]:
    if tag == 'plain':
        stdout.write(input[beg:end])          # pass plain text through
    elif tag == 'markup':
        markup = parts[0]
        mtag, mbeg, mend = markup[:3]
        start, stop = codes.get(mtag, ('<!-- unknown -->', '<!-- / -->'))
        # emit open tag, the text inside the markup characters, close tag
        stdout.write(start + input[mbeg+1:mend-1] + stop)
stderr.write('parsed %s chars of %s\n' % (taglist[-1], len(input)))
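
Run from the shell, the program is an ordinary filter; with illustrative file names:

python typographify.py < article.txt > article.html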

That is all there is to it. First read the grammar, then create an mxTextTools parser from it. Next, apply the tag table/parser to the input source to create a tag list. Finally, loop through the tag list and emit new marked-up text. The loop could, of course, do anything else we liked with each production it encounters.

Because of the particular grammar of smart ASCII, everything in the source text falls into either a "plain" production or a "markup" production. Therefore, it is enough to loop through a single level in the tag list (except when we look exactly one level down into a specific markup production, such as "title"). A grammar with freer form, such as that of most programming languages, would call for recursively descending the tag list and checking production names at every level. For example, a grammar that allowed markup codes to nest within one another would need this recursive style. As an exercise, you might enjoy working out how to adjust the grammar for that (hint: remember that productions are allowed to refer to one another recursively).
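
Such a recursive descent might look like the following minimal sketch (the walk function is my own, not part of the program above):

def walk(tuples, text, depth=0):
    # each entry is (production, start, end, children); visit every level
    for name, start, end, children in tuples:
        print('%s%s: %r' % ('  ' * depth, name, text[start:end]))
        walk(children, text, depth + 1)

# e.g., walk(taglist[1], input) prints every production at every depth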

The particular markup codes that reach the output live in yet another file, for organizational rather than essential reasons. A trick we use here is to treat a dictionary as a switch statement (although the "otherwise" case in the example remains too narrow). The idea is that in the future we might want to create multiple "output format" files for formats such as HTML, DocBook, LaTeX, and others. The markup file used for the example looks like this:
typo_html.py

codes = {
    'emph':   ('<em>', '</em>'),
    'strong': ('<strong>', '</strong>'),
    'module': ('<em><code>', '</code></em>'),
    'code':   ('<code>', '</code>'),
    'title':  ('<cite>', '</cite>'),
}

Extending this scheme to other output formats is easy.
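
For example, a hypothetical typo_latex.py would need nothing more than a different table:

# typo_latex.py (hypothetical): same keys, LaTeX delimiters instead
codes = {
    'emph':   (r'\emph{',   '}'),
    'strong': (r'\textbf{', '}'),
    'module': (r'\texttt{', '}'),
    'code':   (r'\texttt{', '}'),
    'title':  (r'\textit{', '}'),
}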

Conclusion

SimpleParse provides a concise and very readable EBNF-style wrapper around the underlying power and speed of the cryptic mxTextTools C module. Moreover, many programmers are already fairly familiar with EBNF grammars, even if only in passing. I cannot offer proof of what is easier to understand; intuitions differ from person to person; but I can give a quantitative assessment in terms of source-code length. The hand-coded mxTypographify module developed earlier measures as follows:


$ wc mxTypographify.py
    199     776    7041 mxTypographify.py

A fair number of those 199 lines are comments. Eighteen of them are the regular-expression version of the markup function, included for timing comparisons. Still, what the program does is essentially the same as typographify.py listed above. By comparison, our SimpleParse program, including its support files, measures as follows:


$ wc -c typo*.def typo*.py
    645 typographify.def
    721 typographify.py
    205 typo_html.py
   1571 total

In other words, it comes in at about one-quarter the size of the hand-coded version. This version has fewer comments, but that is mostly because the EBNF grammar is strongly self-documenting. I do not want to lean too hard on counts of lines of code; obviously, one can game the length of code in either direction. Still, one of the few practical findings from studies of programmers' work is that "thousands of lines of code per person-month" stays fairly close to constant, with little dependence on language and libraries. Then again, the regular-expression version is in turn one-third the length of the SimpleParse version; but I think its density of expression makes it extremely difficult to maintain, and even harder to write. All in all, I think SimpleParse is the best of the approaches considered.
