Parsing in Python with the SimpleParse module

Source: Internet
Author: User

Like most programmers, I often need to identify the components and structures in text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. Each of these documents has its own "little language" that defines what can appear in it.

The programs I write for these informal parsing tasks have always been something of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in these programs is always roughly: "read some text, figure out whether anything can be done with it, then read some more text and keep trying."

Parsers of various kinds abstract the description of the parts and structures in a document into concise, clear, declarative rules that specify how to identify the components of the document. The declarative aspect is the most striking here. All of my old ad hoc parsers took the same form: read some characters, make decisions, accumulate some variables, rinse and repeat. As discussed in some of the functional programming articles in this column, this program-flow style is relatively error-prone and hard to maintain.

Formal parsers almost always use a variant of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they handle. The tool we look at here does so, as does the popular compiler development tool YACC (and its variants). Basically, an EBNF grammar assigns names to the parts you may find in a document; in addition, larger parts are often composed of smaller parts. Operators, usually the same symbols you see in regular expressions, specify how often and in what order the smaller parts occur within the larger ones. In parser-talk, each named part of a grammar is called a "production".

Readers may not know EBNF by name, but they have probably seen EBNF descriptions in action. For example, the familiar Python Language Reference defines what a floating point number looks like in Python:
EBNF-style description of floating point numbers

floatnumber:   pointfloat | exponentfloat
pointfloat:    [intpart] fraction | intpart "."
exponentfloat: (nonzerodigit digit* | pointfloat) exponent
intpart:       nonzerodigit digit* | "0"
fraction:      "." digit+
exponent:      ("e" | "E") ["+" | "-"] digit+
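
To make the kinship with regular expressions concrete, the float grammar above can be transcribed almost rule for rule into Python's re module. This is only an illustrative sketch: the names FLOAT_RE and is_float_literal are invented here, and digit and nonzerodigit (which the excerpt does not define) are assumed to mean 0-9 and 1-9.

```python
import re

# Assumed terminals; the excerpt above does not spell these out.
digit = r"[0-9]"
nonzerodigit = r"[1-9]"

# One pattern fragment per production, in the same order of dependency.
intpart = rf"(?:{nonzerodigit}{digit}*|0)"
fraction = rf"(?:\.{digit}+)"
exponent = rf"(?:[eE][+-]?{digit}+)"
pointfloat = rf"(?:{intpart}?{fraction}|{intpart}\.)"
exponentfloat = rf"(?:(?:{nonzerodigit}{digit}*|{pointfloat}){exponent})"
floatnumber = rf"(?:{pointfloat}|{exponentfloat})"

FLOAT_RE = re.compile(floatnumber)


def is_float_literal(text):
    """Return True if the whole string matches the floatnumber production."""
    return FLOAT_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["3.14", "10.", ".5", "1e10", "42", "abc"]:
        print(candidate, is_float_literal(candidate))
```

Each named fragment plays the role of one production, which is roughly what an EBNF tool does for you without the manual string splicing.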

Alternatively, you may have seen XML DTD elements defined in an EBNF style. For example, the <body> element of a developerWorks tutorial looks like this:
EBNF-style description in the developerWorks DTD

<!ELEMENT body ((example-column | image-column)?, text-column)>

The spelling differs slightly, but the general notions of quantification, alternation, and sequencing exist in all EBNF-style language grammars.
Building a tag list with SimpleParse

SimpleParse is an interesting tool. To use this module, you need the underlying module mxTextTools, which implements a "tagging engine" in C. mxTextTools (see the references later in this article) is powerful, but rather difficult to use. Once SimpleParse is layered on top of mxTextTools, the work becomes much easier.

Using SimpleParse really is quite simple, because most of the complexity of mxTextTools can be ignored. First, create an EBNF-style grammar that describes the language you want to handle. Second, call mxTextTools to create a tag list that describes every production that succeeded when the grammar was applied to a document. Finally, use the tag list returned by mxTextTools to do whatever you actually want to do.

For this article, the "language" we want to parse is the set of markup codes used by "smart ASCII" to indicate things like boldface, module names, and book titles. This is the same language that earlier installments identified with mxTextTools, and before that with regular expressions and state machines. The language is far simpler than a full programming language, but complicated enough to be representative.

A little review may be in order here. What is the "tag list" that mxTextTools provides? It is basically a nested structure that simply gives the character offsets at which each production matched in the source text. mxTextTools scans a source text quickly, but it does not do anything to the source text itself (at least not when SimpleParse grammars are used). Let's look at an abridged tag list:
Tag list produced from a SimpleParse grammar

(1,
 [('plain',
   0,
   15,
   [('word', 0, 4, [('alphanums', 0, 4, [])]),
    ('whitespace', 4, 5, []),
    ('word', 5, 10, [('alphanums', 5, 10, [])]),
    ('whitespace', 10, 11, []),
    ('word', 11, 14, [('alphanums', 11, 14, [])]),
    ('whitespace', 14, 15, [])]),
  ('markup',
   15,
   27,
 ...
 289)

The ellipsis in the middle stands for a batch of further matches. The part we do see says the following. The root production ("para") succeeded and ended at offset 289 (the length of the source text). The child production "plain" matched offsets 0 through 15. The "plain" child is itself composed of smaller productions. After the "plain" production, the "markup" production matched offsets 15 through 27. The details are omitted here, but this first "markup" is itself made of components, and further productions succeed later in the source text.
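
One way to get a feel for this nested structure is to flatten it into an indented outline. The following is a minimal sketch that needs neither SimpleParse nor mxTextTools: the helper walk and the sample source text are invented for illustration, and the tag list is hand-built to match the "plain" portion shown above.

```python
def walk(taglist, text, depth=0):
    """Flatten a nested tag list into (depth, name, matched_text) rows."""
    rows = []
    for name, beg, end, parts in taglist:
        rows.append((depth, name, text[beg:end]))
        # Written defensively: treat a missing child list (None) as empty.
        rows.extend(walk(parts or [], text, depth + 1))
    return rows


# Hypothetical source text whose offsets line up with the listing above.
text = "Some plain and "
taglist = [
    ('plain', 0, 15, [
        ('word', 0, 4, [('alphanums', 0, 4, [])]),
        ('whitespace', 4, 5, []),
        ('word', 5, 10, [('alphanums', 5, 10, [])]),
        ('whitespace', 10, 11, []),
        ('word', 11, 14, [('alphanums', 11, 14, [])]),
        ('whitespace', 14, 15, []),
    ]),
]

if __name__ == "__main__":
    for depth, name, matched in walk(taglist, text):
        print("  " * depth + name, repr(matched))
```

Note that the tag list holds only names and offsets; the matched text is always recovered by slicing the original string.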

An EBNF-style grammar for "smart ASCII"

We have glanced at the tag list that SimpleParse + mxTextTools can give us. But we really need to look at the grammar that was used to generate it, because the grammar is where the real work happens. EBNF grammars are nearly self-explanatory to read (although designing one does take a bit of thought and testing):
typographify.def

para           := (plain / markup)+
plain          := (word / whitespace / punctuation)+
whitespace     := [ \t\r\n]+
alphanums      := [a-zA-Z0-9]+
word           := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct      := [-_]
contraction    := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup         := emph / strong / module / code / title
emph           := '-', plain, '-'
strong         := '*', plain, '*'
module         := '[', plain, ']'
code           := "'", plain, "'"
title          := '_', plain, '_'
punctuation    := (safepunct / mdash)
mdash          := '--'
safepunct      := [!@#$%^&()+=|\{}:;<>,.?/"]

This grammar is almost exactly how you would describe "smart ASCII" verbally, which is very clear. A paragraph consists of some plain text and some marked-up text. Plain text consists of some collection of words, whitespace, and punctuation. Marked-up text might be emphasized text, strongly emphasized text, a module name, and so on. Strongly emphasized text is surrounded by asterisks, and the other kinds of markup are built the same way. A few details need some thought, such as just what a "word" is or which suffixes can end a contraction, but the EBNF syntax does not get in the way.

By contrast, the same kinds of rules can be described even more tersely with regular expressions. That is what the first version of the "smart ASCII" markup program did. But that terseness is much harder to write, and harder still to adjust later. The following code expresses largely (but not precisely) the same set of rules:
Python regexes for smart ASCII

# [module] names
re_mods =   r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;'"?!/-])"""
# *strongly emphasize* words
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""
# -emphasize- words
re_emph =   r"""([\(\s'/"]|^)-(.*?)-([\s\.\),:;'"?!/])"""
# _Book Title_ citations
re_title =  r"""([\(\s'/"]|^)_(.*?)_([\s\.\),:;'"?!/-])"""
# 'Function()' names
re_funcs =  r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])"""
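
As a rough illustration of how one of these patterns might be applied, the sketch below runs the strong-emphasis pattern through re.sub. The function name mark_strong and the replacement template are assumptions made for this example; the article does not show how the original regex version performed its substitutions.

```python
import re

# Strong-emphasis pattern from the listing: a leading context character
# (or start of string), the starred text, then a trailing context character.
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""


def mark_strong(text):
    """Wrap *starred* spans in <strong> tags, keeping the context characters."""
    return re.sub(re_strong, r"\1<strong>\2</strong>\3", text)


if __name__ == "__main__":
    print(mark_strong("this is *very* important."))
    # A starred span at the very end of the input has no trailing context
    # character, so this pattern leaves it untouched.
    print(mark_strong("*bold*"))
```

The second call hints at why such patterns are awkward to adjust: the context each markup code needs is buried inside the character classes.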

If you discover or invent some slightly new variant of the language, it is much easier to work it into the EBNF grammar than into those regular expressions. Moreover, mxTextTools will usually perform the pattern operations faster as well.

Generate and use a tag list

For our sample program, we put the actual grammar in a separate file. For most purposes this organization is better and easier to work with. Changing the grammar is usually a different sort of task than changing the application logic, and the files reflect this. But all we do with the grammar is pass it as a string to a SimpleParse function, so we could just as well include it in the main application (or even generate it dynamically in some way).

Let's look at the complete (abridged) markup application:
typographify.py

import os
from sys import stdin, stdout, stderr
from simpleparse import generator
from mx.TextTools import TextTools
input = stdin.read()
decl = open('typographify.def').read()    # the EBNF grammar shown earlier
from typo_html import codes
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag(input, parser)
for tag, beg, end, parts in taglist[1]:   # top-level productions only
    if tag == 'plain':
        stdout.write(input[beg:end])
    elif tag == 'markup':
        markup = parts[0]                 # the specific kind of markup
        mtag, mbeg, mend = markup[:3]
        start, stop = codes.get(mtag, ('<!-- unknown -->', '<!-- / -->'))
        stdout.write(start + input[mbeg+1:mend-1] + stop)   # drop the delimiters
stderr.write('parsed %s chars of %s\n' % (taglist[-1], len(input)))

Here is what it does. First, it reads in the grammar and creates an mxTextTools parser from it. Next, it applies the tag table/parser to the input source to create a tag list. Finally, it loops through the tag list and emits some new marked-up text. The loop could, of course, do anything else we wanted with each production it encounters.

Because of the particular grammar used for smart ASCII, everything in the source text falls into either a "plain" production or a "markup" production. It therefore suffices to loop over a single level of the tag list (except that we look exactly one level below each markup production to find its specific kind, such as "title"). But a more free-form grammar, such as occurs in most programming languages, could easily call for descending recursively into the tag list and looking for production names at every level. For example, if the grammar allowed nested markup codes, this recursive style would probably be used. You might enjoy working out how to adjust the grammar (hint: remember that productions are allowed to be mutually recursive).
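
The recursive style mentioned above might look something like the following sketch. It assumes a hypothetical variant of the grammar in which a markup production such as strong may contain both plain and markup children (for instance strong := '*', (plain / markup)+, '*'); that variant, the render function, the CODES table, and the hand-built tag list are all illustrative and not part of the article's program.

```python
# Hypothetical subset of the output codes, kept local so the sketch is
# self-contained.
CODES = {
    'emph': ('<em>', '</em>'),
    'strong': ('<strong>', '</strong>'),
}


def render(taglist, text):
    """Recursively turn a nested tag list into marked-up output."""
    out = []
    for tag, beg, end, parts in taglist:
        if tag == 'plain':
            out.append(text[beg:end])
        elif tag == 'markup':
            # 'markup' is only a wrapper; its child names the specific kind.
            out.append(render(parts, text))
        elif tag in CODES:
            start, stop = CODES[tag]
            # The delimiter characters are not among the children, so
            # rendering the children alone drops them from the output.
            out.append(start + render(parts, text) + stop)
    return ''.join(out)


# Hand-built tag list for nested markup in the hypothetical grammar.
text = "*very -truly-*"
taglist = [
    ('markup', 0, 14, [
        ('strong', 0, 14, [
            ('plain', 1, 6, []),
            ('markup', 6, 13, [
                ('emph', 6, 13, [('plain', 7, 12, [])]),
            ]),
        ]),
    ]),
]

if __name__ == "__main__":
    print(render(taglist, text))
```

Unlike the single-level loop, this version never slices off delimiters by hand; it relies on the children's own offsets at every depth.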

The particular markup codes that go to the output live in yet another file, for organizational rather than essential reasons. A small trick here is to use a dictionary as a switch statement (although the otherwise case in this example is too narrow). The idea is that in the future we might want to create multiple "output format" files for, say, HTML, DocBook, LaTeX, or others. The particular markup file used for the example looks like this:
typo_html.py

codes = {
    'emph'   : ('<em>', '</em>'),
    'strong' : ('<strong>', '</strong>'),
    'module' : ('<em><code>', '</code></em>'),
    'code'   : ('<code>', '</code>'),
    'title'  : ('<cite>', '</cite>'),
}

It is easy to extend this format to other output formats.
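
As a sketch of what such an extension could look like, the dictionary-as-switch idea can be nested one level deeper so that the output format is chosen at run time. Everything named here is an assumption for illustration: the FORMATS table, the wrap helper, the LaTeX command choices, and the decision to fall back to unmarked text for unknown tags.

```python
# Hypothetical per-format code tables; the HTML entries mirror a few of the
# codes shown above, the LaTeX entries are illustrative choices.
FORMATS = {
    'html': {
        'emph': ('<em>', '</em>'),
        'strong': ('<strong>', '</strong>'),
        'title': ('<cite>', '</cite>'),
    },
    'latex': {
        'emph': ('\\emph{', '}'),
        'strong': ('\\textbf{', '}'),
        'title': ('\\textit{', '}'),
    },
}

# The "otherwise" branch of the switch: emit the text with no markup.
FALLBACK = ('', '')


def wrap(mtag, body, fmt='html'):
    """Look up the start/stop codes for a markup tag and wrap the body."""
    start, stop = FORMATS[fmt].get(mtag, FALLBACK)
    return start + body + stop


if __name__ == "__main__":
    print(wrap('emph', 'really'))
    print(wrap('strong', 'really', fmt='latex'))
```

The dict.get call with a default plays the part of the switch statement's otherwise branch, just as codes.get does in the main program.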

Conclusion

SimpleParse provides a concise and very readable EBNF-style wrapper around the underlying power and speed of the mxTextTools C module. Moreover, EBNF grammars are already quite familiar to many programmers, even if only in passing. I cannot prove anything about what is easier to understand, since intuitions differ, but I can give a quantitative assessment based on source length. The size of the mxTypographify module developed earlier is:
wc mxTypographify.py

    199     776    7041 mxTypographify.py

A fair number of those 199 lines are comments. Eighteen of them are a regular expression version of the markup function, included for timing comparisons. But what the program does is essentially the same as what typographify.py, listed above, does. In contrast, our SimpleParse program, including its support files, comes to:
wc typo*.def typo*.py

     19      79     645 typographify.def
     20      79     721 typographify.py
      6      25     205 typo_html.py
     45     183    1571 total

In other words, the line count is only about one quarter of the earlier version. This version has fewer comments, but that is mostly because the EBNF grammar is largely self-documenting. I do not want to emphasize lines of code too strongly; obviously, one can play games with minimizing or maximizing code length. But one of the few real empirical results in the study of programmers' work is that "thousand lines of code per person-month" is fairly close to constant and has little to do with languages and libraries. Of course, the regular expression version is in turn about one third the length of the SimpleParse version, but I think the density of its expressions makes it very hard to maintain and harder still to write. All in all, I think SimpleParse is the best of the approaches considered.
