Spark is a powerful, general-purpose parser/compiler framework written in Python. In some ways, Spark offers more than SimpleParse or other Python parsers; however, because it is written entirely in Python, it is also slower. In this article, David discusses the Spark module, gives some code samples, explains its usefulness, and suggests some applications for it.
Following the previous installment in the "Charming Python" series, which was devoted to SimpleParse, this article continues to introduce basic parsing concepts and discusses the Spark module. Parsing frameworks are a rich topic, and one worth taking the time to know in full; these two articles make a good start, for readers and for myself.
In everyday programming, I often need to identify the parts and structures that occur in text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little language" governing what can appear within them. My programs for these informal parsing tasks have always been something of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in them is always roughly: "read some text, figure out whether anything can be done with it, then maybe read some more text and keep trying."
Parsers distill the descriptions of the parts and structures in a document into concise, clear, and declarative rules for what may constitute the document. Most formal parsers use a variant of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they parse. Basically, an EBNF grammar assigns names to the parts one might find in a document; larger parts are usually composed of smaller parts. The frequency and order in which small parts may occur within larger parts is specified by operators. For example, Listing 1 is the EBNF grammar typographify.def that we saw in the SimpleParse article (other tools spell things slightly differently):
Listing 1. typographify.def

para        := (plain / markup)+
plain       := (word / whitespace / punctuation)+
whitespace  := [ \t\r\n]+
alphanums   := [a-zA-Z0-9]+
word        := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct   := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup      := emph / strong / module / code / title
emph        := '-', plain, '-'
strong      := '*', plain, '*'
module      := '[', plain, ']'
code        := "'", plain, "'"
title       := '_', plain, '_'
punctuation := (safepunct / mdash)
mdash       := '--'
safepunct   := [!@#$%^&()+=|\{}:;<>,.?/"]
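To see the notation at work, consider how a short fragment of "smart ASCII" decomposes under this grammar (an illustrative hand trace, not the output of any tool):

Hello -world- can't

para
  plain    word "Hello", whitespace " "
  markup   emph: '-', plain "world", '-'
  plain    whitespace " ", word "can't" (alphanums "can", contraction "'t")

The "+" in the para production is what lets a paragraph string together any number of plain and markup parts, in any order.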
Introducing Spark
The Spark parser has something in common with EBNF grammars, but it breaks the parsing/processing into smaller components than a traditional EBNF grammar allows. Spark's advantage is fine-grained control over each step of the process, along with the ability to insert custom code into that process. If you read the SimpleParse article in this series, you will recall that our process there was a rough two-step one: (1) generate a complete list of tags from the grammar (and from the source file), and (2) use the tag list as the data for custom programming operations.
Spark's disadvantage compared to standard EBNF-based tools is that it is verbose and lacks direct occurrence quantifiers: there is no "+" meaning "one or more", "*" meaning "zero or more", or "?" meaning "optional". Quantifiers can be used in the regular expressions of the Spark tokenizer, and they can be simulated by recursion in the parse-expression grammars; it would be nicer if Spark simply allowed quantifiers in its grammar expressions. The other drawback is speed: Spark is much slower than the C-based mxTextTools engine that SimpleParse uses.
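For example, the production "plain := (word / whitespace / punctuation)+" from Listing 1 has no one-line Spark equivalent; in a Spark grammar the "one or more" repetition is written as a pair of recursive productions, along these lines (a sketch of the docstring conventions discussed below; "plain_chunk" is an illustrative helper nonterminal):

def p_plain(self, args):
    '''
    plain       ::= plain_chunk
    plain       ::= plain plain_chunk
    plain_chunk ::= word
    plain_chunk ::= whitespace
    plain_chunk ::= punctuation
    '''
    # a real rule method would typically build and return
    # an AST node from args here

The first two productions together do the work of the EBNF "+": a plain is either a single chunk, or a (shorter) plain followed by one more chunk.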
In Compiling Little Languages in Python (see Resources), John Aycock, the creator of Spark, divides the compiler's job into four stages. This article touches only on the first two and a half of them, for two reasons: one is the length of the article, and the other is that we will only discuss the same relatively simple "text markup" problem presented in the previous article. Spark can also be used further along, as a full-cycle code compiler/interpreter, not just for the "parse and process" tasks I describe. Let's look at Aycock's four stages (quoted here in abbreviated form; a sketch of how they map onto Spark's classes follows the list):
Scanning, also known as lexical analysis. Breaking the input stream into a list of tokens.
Parsing, also known as syntax analysis. Ensuring that the list of tokens is syntactically valid.
Semantic analysis. Traversing the abstract syntax tree (AST) one or more times, collecting information and checking that the input program makes sense.
Code generation. Traversing the AST again, this phase may directly interpret the program, or output code in C or assembly.
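Concretely, the classic single-file spark.py distribution pairs each stage with an abstract class, roughly as in the sketch below (my reading of the module; the last two stages share the same traversal machinery):

import spark

# Approximate mapping of Aycock's four stages onto the abstract
# classes the spark module provides; semantic analysis and code
# generation are both expressed as passes of an AST traversal.
stage_classes = {
    'scanning':          spark.GenericScanner,
    'parsing':           spark.GenericParser,
    'semantic analysis': spark.GenericASTTraversal,
    'code generation':   spark.GenericASTTraversal,   # a further pass
}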
For each stage, Spark provides one or more abstract classes that perform the appropriate step, along with a somewhat unusual protocol for specializing those classes. Rather than redefining or adding specific methods, as in most inheritance patterns, Spark's concrete classes have two features (the general pattern is the same across the stages and the various parent classes). First, most of the work a concrete class does is specified in method docstrings. Second, the sets of methods that describe a pattern are given distinctive names indicating their role. The parent classes, in turn, contain introspective methods that search the instance for exactly these features to operate on. This will become clearer when we turn to the examples.
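As a preview of that protocol, here is a minimal scanner sketch (assuming the classic spark.py module is importable; the class name and the tuple-shaped tokens are illustrative simplifications, not the code developed later):

import spark

class WordScanner(spark.GenericScanner):
    "Tokenize runs of whitespace and alphanumerics"
    def tokenize(self, input):
        self.rv = []                       # the token list we build up
        spark.GenericScanner.tokenize(self, input)
        return self.rv
    def t_whitespace(self, s):
        r' [ \t\r\n]+ '
        # the regular expression lives in the docstring; the parent
        # class discovers every t_* method by introspection
        self.rv.append(('whitespace', s))
    def t_alphanums(self, s):
        r' [a-zA-Z0-9]+ '
        self.rv.append(('alphanums', s))

Calling WordScanner().tokenize('two words') would then return alternating ('alphanums', ...) and ('whitespace', ...) pairs.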