A tutorial on using the Spark module in Python


In everyday programming, I frequently need to identify the parts and structures that exist in text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little language" governing what can appear within them. The way I write programs for these informal parsing tasks has always been a bit of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in these programs is always roughly: "read some text, figure out whether you can do anything with it, then maybe read some more text and keep trying."

Parsers distill the description of the parts and structures in a document into concise, clear, and declarative rules for what constitutes the document. Most formal parsers use a variant of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they parse. Basically, an EBNF grammar gives names to the parts you might find in a document, and larger parts are frequently composed of smaller parts. The frequency and order in which small parts may appear within larger parts is specified by operators. For example, Listing 1 is the EBNF grammar typographify.def that we saw in the SimpleParse article (other tools run in slightly different ways):

Listing 1. typographify.def

para        := (plain / markup)+
plain       := (word / whitespace / punctuation)+
whitespace  := [ \t\r\n]+
alphanums   := [a-zA-Z0-9]+
word        := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct   := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup      := emph/strong/module/code/title
emph        := '-', plain, '-'
strong      := '*', plain, '*'
module      := '[', plain, ']'
code        := "'", plain, "'"
title       := '_', plain, '_'
punctuation := (safepunct/mdash)
mdash       := '--'
safepunct   := [!@#$%^&()+=|\{}:;<>,.?/"]

Spark Introduction

The Spark parser has something in common with EBNF grammars, but it breaks the parsing/processing into smaller components than a traditional EBNF grammar allows. The advantage of Spark is that it gives you fine-grained control over each step of the process and provides the ability to insert custom code into that process. If you read the SimpleParse article in this series, you will recall that our process was rough: (1) generate a complete list of tokens from the grammar (and from the source file), and (2) use the token list as the data for custom programmed operations.

The disadvantage of Spark compared to standard EBNF-based tools is that it is verbose and lacks direct quantifiers: the "+" meaning "one or more", the "*" meaning "zero or more", and the "?" meaning "optional". Quantifiers can be used in the regular expressions of the Spark tokenizer, and they can be simulated with recursion in the parse-expression grammar, but it would be nicer if Spark allowed quantifiers in its grammar expressions directly. Another drawback is that Spark is much slower than the C-based mxTextTools engine that SimpleParse uses underneath.

In "Compiling Little Languages in Python" (see Resources), Spark's creator, John Aycock, divides a compiler into four stages. The problem discussed in this article concerns only the first two and a half stages, for two reasons: one is the length of the article, and the other is that we will only discuss the same relatively simple "text markup" problem presented in the previous article. Spark can also be taken further and used as a full-cycle code compiler/interpreter, not just for the "parse and process" tasks I describe. Let's look at Aycock's four stages (quoted here in abbreviated form):

    • Scanning, also called lexical analysis. Divides the input stream into a list of tokens.
    • Parsing, also called syntactic analysis. Ensures that the list of tokens is grammatically valid.
    • Semantic analysis. Traverses the abstract syntax tree (AST) one or more times, collecting information and checking that the input program makes sense.
    • Code generation. Traverses the AST again; this phase may directly interpret the program, or output code in C or assembly.

For each stage, Spark provides one or more abstract classes that perform the step, along with an unusual protocol for specializing those classes. Spark's concrete classes do not simply redefine or add specific methods as in most inheritance patterns; instead, they have two characteristics (the general pattern is the same for all stages and the various parent classes). First, most of the work a concrete class does is specified in the docstrings of its methods. Second, by special protocol, the sets of methods that describe patterns are given names that indicate their role. The parent classes, in turn, contain introspective methods that look at an instance's capabilities in order to operate. We will see this more clearly when we look at the examples.
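To make the introspection idea concrete, here is a minimal sketch of the general mechanism. This is not Spark's actual code; the class and method names are invented for illustration, and Spark's real parent classes do much more with the patterns they collect:

class PatternCollector:
    "Toy parent class that discovers patterns by introspection"
    def patterns(self, prefix='t_'):
        # Collect (name, docstring) pairs from specially named methods;
        # in the Spark style, the docstring of each such method holds its pattern.
        pairs = []
        for klass in [self.__class__] + list(self.__class__.__bases__):
            for name, obj in klass.__dict__.items():
                if name.startswith(prefix) and obj.__doc__:
                    pairs.append((name, obj.__doc__))
        return pairs

class ToyScanner(PatternCollector):
    "Toy concrete class in the Spark style"
    def t_number(self, s):
        r" \d+ "    # the docstring doubles as the regular expression

Calling ToyScanner().patterns() returns the prefixed method names paired with their docstring patterns; Spark's real parent classes go on to compile those patterns into scanners, parsers, and so on.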

Identifying text markup

The problem I am solving here is one I have already solved in several other ways. I use a format I call "smart ASCII" for a variety of purposes. The format looks much like the conventions developed for e-mail and newsgroup communication. For various purposes, I automatically convert this format to other formats such as HTML, XML, and LaTeX. That is what I will do again here. To give you an intuitive sense of what I mean, I will use this short sample throughout the article:

Listing 2. Smart ASCII Sample text (p.txt)

Text with *bold*, and -itals phrase-, and [module]--this
should be a good 'practice run'.

Other than what appears in the sample file, there is not much more to the format, although there are some subtleties in how markup and punctuation interact.

Generating tokens

The first thing our Spark "smart ASCII" parser needs to do is divide the input text into its relevant parts. At this tokenization layer, we are not yet concerned with how the tokens fit together; that comes later, when we combine the sequence of tokens into a parse tree.

The grammar shown above in typographify.def provides a design guide for the Spark lexer/scanner. Note that we can only use the names that are "primitive" at the scanner stage; the (compound) patterns that include other named patterns must be deferred to the parsing stage. Other than that, we can really just copy over the old grammar directly.

Listing 3. Truncated wordscanner.py Spark script

class WordScanner(GenericScanner):
    "Tokenize words, punctuation and markup"
    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        return self.rv
    def t_whitespace(self, s):
        r" [ \t\r\n]+ "
        self.rv.append(Token('whitespace', ' '))
    def t_alphanums(self, s):
        r" [a-zA-Z0-9]+ "
        print "{word}",
        self.rv.append(Token('alphanums', s))
    def t_safepunct(self, s): ...
    def t_bracket(self, s): ...
    def t_asterisk(self, s): ...
    def t_underscore(self, s): ...
    def t_apostrophe(self, s): ...
    def t_dash(self, s): ...

class WordPlusScanner(WordScanner):
    "Enhance word/markup tokenization"
    def t_contraction(self, s):
        r" (?<=[a-zA-Z])'(am|clock|d|ll|m|re|s|t|ve) "
        self.rv.append(Token('contraction', s))
    def t_mdash(self, s):
        r' -- '
        self.rv.append(Token('mdash', s))
    def t_wordpunct(self, s): ...

There is an interesting point here. WordScanner is a perfectly good scanner class by itself; but a Spark scanner class can be further specialized by inheritance: child regular-expression patterns are matched before those of the parent, and child methods/regular expressions can override those of the parent if needed. So WordPlusScanner matches its specializations before WordScanner does (possibly grabbing some bytes first as a result). Any regular expression is allowed in a pattern docstring (the .t_contraction() method, for example, contains a "lookbehind assertion" in its pattern).

Unfortunately, Python 2.2 breaks the scanner's inheritance logic to some extent. In Python 2.2, all defined patterns are matched in alphabetical order (by name), regardless of where they are defined in the inheritance chain. To fix this, you can modify one line of code in Spark's _namelist() function:

Listing 4. Corrected spark.py function

def _namelist(instance):
    namelist, namedict, classlist = [], {}, [instance.__class__]
    for c in classlist:
        for b in c.__bases__:
            classlist.append(b)
        # for name in dir(c):   # dir() behavior changed in 2.2
        for name in c.__dict__.keys():  # <-- use this instead
            if not namedict.has_key(name):
                namelist.append(name)
                namedict[name] = 1
    return namelist

I have informed Spark's creator, John Aycock, of this issue, and it will be corrected in future versions. In the meantime, make the change in your own copy of spark.py.

Let's see what happens when WordPlusScanner is applied to the "smart ASCII" sample above. The list it creates is actually a list of Token instances, but they contain a .__repr__ method that lets them display nicely:

Listing 5. Using WordPlusScanner to tokenize "smart ASCII"

>>> from wordscanner import WordPlusScanner
>>> tokens = WordPlusScanner().tokenize(open('p.txt').read())
>>> filter(lambda s: s <> 'whitespace', tokens)
[Text, with, *, bold, *, ,, and, -, itals, phrase, -, ,, and, [,
module, ], --, this, should, be, a, good, ', practice, run, ', .]

It is worth noting that although methods such as .t_alphanums() are recognized by Spark's introspection by their "t_" prefix, they are also regular methods. Whenever the corresponding token is encountered, any additional code inside the method is executed. The .t_alphanums() method contains a small example of this in the form of a print statement.
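As a small illustration of that point, a scanner subclass can do extra bookkeeping inside a t_ method while still emitting the normal tokens. The sketch below assumes that WordPlusScanner and the Token class are both importable from wordscanner; if Token lives elsewhere in your copy, adjust the import:

from wordscanner import WordPlusScanner, Token

class CountingScanner(WordPlusScanner):
    "Count alphanumeric words while tokenizing"
    def tokenize(self, input):
        self.word_count = 0
        return WordPlusScanner.tokenize(self, input)
    def t_alphanums(self, s):
        r" [a-zA-Z0-9]+ "
        self.word_count = self.word_count + 1   # extra code runs on every match
        self.rv.append(Token('alphanums', s))

After scanner = CountingScanner() and scanner.tokenize(open('p.txt').read()), the attribute scanner.word_count holds the number of words that were seen.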

Generating the abstract syntax tree

Finding the tokens is of some interest in itself, but the really interesting part is how a grammar is applied to the list of tokens. The parsing stage creates arbitrary tree structures on the basis of the token list; one simply specifies the expression grammar.

Spark has several ways to create an AST. The "manual" way is to specialize the GenericParser class. In that case, the concrete child parser provides a number of methods with names of the form p_foobar(self, args). The docstring of each such method contains one or more assignments of patterns to names. Each method can contain any code to be executed whenever its grammar expression matches.
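A sketch of that "manual" style appears below. It is not code from this article's downloads; the start symbol, the rule set, and the assumption that GenericParser takes its start symbol as a constructor argument and is driven with a .parse() method should all be checked against your copy of spark.py:

from spark import GenericParser

class PlainParser(GenericParser):
    "Recognize runs of plain text (illustrative only)"
    def __init__(self, start='plain'):
        GenericParser.__init__(self, start)
    def p_plain_atom(self, args):
        ''' plain ::= whitespace
            plain ::= alphanums
            plain ::= safepunct
        '''
        return args[0]
    def p_plain_pair(self, args):
        ''' plain ::= plain plain '''
        print "reduced: plain plain"    # arbitrary code may run when the rule matches
        return args

Such a parser would then be applied to the scanner's token list, e.g. PlainParser().parse(tokens).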

However, Spark also provides an "automatic" way to generate the AST. This style inherits from the GenericASTBuilder class. All the grammar expressions are listed in a single topmost method, while the .terminal() and .nonterminal() methods may be specialized to manipulate subtrees during generation (or to perform any other action, if desired). The result is still an AST, but the parent class does most of the work for you. My grammar class looks something like this:

Listing 6. Truncated markupbuilder.py Spark script

class MarkupBuilder(GenericASTBuilder):
    "Write out HTML markup based on matched markup"
    def p_para(self, args):
        '''
        para   ::= plain
        para   ::= markup
        para   ::= para plain
        para   ::= para emph
        para   ::= para strong
        para   ::= para module
        para   ::= para code
        para   ::= para title
        plain  ::= whitespace
        plain  ::= alphanums
        plain  ::= contraction
        plain  ::= safepunct
        plain  ::= mdash
        plain  ::= wordpunct
        plain  ::= plain plain
        emph   ::= dash plain dash
        strong ::= asterisk plain asterisk
        module ::= bracket plain bracket
        code   ::= apostrophe plain apostrophe
        title  ::= underscore plain underscore
        '''
    def nonterminal(self, type_, args):
        #  Flatten AST a bit by not making nodes if only one child.
        if len(args) == 1:
            return args[0]
        if type_ == 'para':
            return nonterminal(self, type_, args)
        if type_ == 'plain':
            args[0].attr = foldtree(args[0]) + foldtree(args[1])
            args[0].type = type_
            return nonterminal(self, type_, args[:1])
        phrase_node = AST(type_)
        phrase_node.attr = foldtree(args[1])
        return phrase_node

My .p_para() should contain only a set of grammar rules (and no code) in its docstring. I decided to use the .nonterminal() method to flatten the AST slightly. "plain" nodes made up of a series of "plain" subtrees are compressed into a single longer string. Similarly, the markup subtrees (that is, "emph", "strong", "module", "code", and "title") collapse into a single node of the correct type containing a composite string.

As mentioned earlier, something is conspicuously missing from Spark grammars: there are no quantifiers. With a rule such as

plain ::= plain plain

we can aggregate "plain"-type subtrees pairwise. But I would prefer it if Spark allowed grammar expressions closer to EBNF style, such as:

plain ::= plain+

Then we could more simply create an n-ary subtree of "as many plains as possible." It would also be easier to flatten our trees, even without the massaging done in .nonterminal().

Using a tree

The Spark module provides several classes for working with the AST. These have more capabilities than I need for my purposes, but if you want them, GenericASTTraversal and GenericASTMatcher offer ways to walk the tree, using an inheritance protocol similar to the one provided for the scanner and parser.
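For completeness, here is a rough sketch of what specializing GenericASTTraversal might look like. The n_<nodetype> dispatch and the .preorder() call shown here are assumptions on my part, modeled on the scanner/parser protocols; check spark.py before relying on them:

from spark import GenericASTTraversal

class MarkupLister(GenericASTTraversal):
    "Collect (type, text) pairs for the markup nodes of an AST"
    def __init__(self, ast):
        GenericASTTraversal.__init__(self, ast)
        self.found = []
        self.preorder()      # walk the tree, dispatching on each node's type
    def n_strong(self, node):
        self.found.append(('strong', node.attr))
    def n_emph(self, node):
        self.found.append(('emph', node.attr))
    def default(self, node):
        pass                 # ignore node types we are not interested in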

But walking the tree with plain recursive functions is not very difficult either. I have created a few such examples in the article's archive file prettyprint.py (see Resources). One of them is showtree(). This function displays an AST following a few conventions (a rough sketch of such a function appears after this list):

    • Each row shows the descent depth
    • Only nodes that have children (and no content of their own) begin with dashes
    • Node types are enclosed in double angle brackets
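The real showtree() lives in prettyprint.py in the article's archive; purely to illustrate the conventions above, a recursive function along the following lines would do the job (this sketch is mine, not the downloaded code):

import sys

def showtree(node, depth=0):
    if hasattr(node, 'attr'):      # content-bearing node: no dashes, show the text
        sys.stdout.write('%2d %s<<%s>>  %s\n'
                         % (depth, ' '*depth, node.type, node.attr))
    else:                          # structural node: dashes mark the descent depth
        sys.stdout.write('%2d %s<<%s>>\n' % (depth, '-'*depth, node.type))
        for kid in node._kids:
            showtree(kid, depth+1)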

Let's take a look at the AST generated in the example above:

Listing 7. The AST generated from the "smart ASCII" sample

>>> from wordscanner import tokensFromFname
>>> from markupbuilder import treeFromTokens
>>> from prettyprint import showtree
>>> showtree(treeFromTokens(tokensFromFname('p.txt')))
 0 <<para>>
 1 - <<para>>
 2 -- <<para>>
 3 --- <<para>>
 4 ---- <<para>>
 5 ----- <<para>>
 6 ------ <<para>>
 7 ------- <<para>>
 8 -------- <<plain>>
 9          <<plain>>  Text with
 8          <<strong>>  bold
 7 ------- <<plain>>
 8          <<plain>>  , and
 6          <<emph>>  itals phrase
 5 ----- <<plain>>
 6          <<plain>>  , and
 4          <<module>>  module
 3 --- <<plain>>
 4          <<plain>>  --this should be a good
 2          <<code>>  practice run
 1 - <<plain>>
 2          <<plain>>  .

The tree structure is intuitive enough, but what about the actual markup we are really after? Fortunately, it takes only a few lines of code to traverse the tree and generate it:

Listing 8. Output tags from AST (prettyprint.py)

import sys

def emitHTML(node):
    from typo_html import codes
    if hasattr(node, 'attr'):
        beg, end = codes[node.type]
        sys.stdout.write(beg + node.attr + end)
    else:
        map(emitHTML, node._kids)

The typo_html.py file is the same one used in the SimpleParse article; it simply contains a dictionary mapping names to begin-tag/end-tag pairs. Obviously, we could use the same approach for markup other than HTML. In case it is not obvious, Listing 9 shows what our example produces.
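Before looking at that output, here is a hedged sketch of what the codes dictionary might contain. The entries for 'strong', 'emph', 'module', and 'code' are implied by the output in Listing 9; the others are my assumptions:

# typo_html.py (sketch)
codes = {
    'plain':  ('', ''),
    'strong': ('<strong>', '</strong>'),
    'emph':   ('<em>', '</em>'),
    'module': ('<em><code>', '</code></em>'),
    'code':   ('<code>', '</code>'),
    'title':  ('<cite>', '</cite>'),     # assumed mapping for title markup
}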

Listing 9. HTML output for the entire process

Text with <strong>bold</strong>, and <em>itals phrase</em>,
and <em><code>module</code></em>--this should be a good
<code>practice run</code>.

Conclusion

Many Python programmers have recommended Spark to me. Although the unusual protocol Spark uses takes some getting used to, and its documentation is ambiguous in places, Spark's power is still impressive. The programming style Spark implements lets the end programmer insert blocks of code anywhere in the scanning/parsing process, a process that is usually a "black box" to the end user.

Against all its strengths, the real disadvantage I find in Spark is its speed. Spark is the first Python program I have used in which I found the speed penalty of an interpreted language to be the main issue. Spark really is slow; not merely "I wish it were a bit faster", but "take a long lunch and hope it finishes" slow. In my experiments the tokenizer was fairly fast, but the parsing process was slow even with quite small test cases. In fairness, John Aycock has pointed out to me that the Earley parsing algorithm Spark uses is far more comprehensive than the simpler LR algorithm, and that this is the main reason for its slowness. It is also possible that, given my inexperience, I designed inefficient grammars; but even so, most users are likely to end up doing much as I did.
