This article introduces how to use the Spark parsing module in Python; it originally appeared as an IBM technical article. In daily programming, I frequently need to identify the parts and structures of text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little language" governing what can appear in them. The way I write programs for these informal parsing tasks has always been a bit of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in these programs is always roughly the same: "read some text, figure out whether something can be done with it, then maybe read a bit more text and keep trying."
A parser abstracts the description of the parts and structures in a document into a set of concise, clear, declarative rules that determine what the document is composed of. Most formal parsers use some variant of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they handle. Basically, an EBNF grammar assigns names to the parts you might find in a document; in addition, larger parts are usually composed of smaller parts. The frequency and order in which small parts may appear within larger parts is specified by operators. For example, Listing 1 is the EBNF grammar typographify.def, which we already saw in the SimpleParse article (other tools work in slightly different ways):
Listing 1. typographify.def
para        := (plain / markup)+
plain       := (word / whitespace / punctuation)+
whitespace  := [ \t\r\n]+
alphanums   := [a-zA-Z0-9]+
word        := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct   := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup      := emph / strong / module / code / title
emph        := '-', plain, '-'
strong      := '*', plain, '*'
module      := '[', plain, ']'
code        := "'", plain, "'"
title       := '_', plain, '_'
punctuation := (safepunct / mdash)
mdash       := '--'
safepunct   := [!@#$%^&()+=|\{}:;<>,.?/"]
Spark introduction
The Spark parser has some things in common with EBNF grammars, but it breaks the parsing/processing into smaller components than traditional EBNF grammars allow. Spark's advantage is that it gives you fine-grained control over every step of the process and the ability to insert custom code into that process. If you read the SimpleParse article in this series, you will recall that our process there was fairly coarse: 1) generate a complete list of tokens from the grammar (and from the source file), and 2) use the token list as data for custom-programmed operations.
Compared with standard EBNF-based tools, Spark has the disadvantage of being more verbose and of lacking quantifiers: the "+" for one-or-more, the "*" for zero-or-more, and the "?" for optionality. Quantifiers can be used in the regular expressions of the Spark tokenizer, and they can be simulated by recursion in the parsing-grammar expressions, but it would be nicer if Spark allowed quantifiers in grammar expressions directly. Another drawback worth mentioning is that Spark is much slower than the C-based underlying mxTextTools engine that SimpleParse uses.
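To make the point about recursion concrete, the EBNF rule

para := (plain / markup)+

has to be expressed in a Spark grammar through recursion, roughly like this:

para ::= plain
para ::= markup
para ::= para plain
para ::= para markup

(Listing 6 below spells the markup alternatives out individually, but the recursive pattern is the same.)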
In "Compiling Little programming ages in Python" (see references), Spark founder John Aycock divides the compiler into four phases. The problems discussed in this article only involve the first two half-stages, which are attributed to two reasons: first, due to the length constraints of the article, second, we will only discuss the relatively simple "text Mark" mentioned in the previous article. Spark can also be further used as a full-cycle code compiler/interpreter, rather than just for the "parse and process" task I described. Let's take a look at the four stages mentioned by Aycock (some are deleted when referencing ):
- Scanning, also known as lexical analysis. Divides the input stream into a list of tokens.
- Parsing, also known as syntax analysis. Ensures that the list of tokens is syntactically valid.
- Semantic analysis. Traverses the abstract syntax tree (AST) one or more times, collecting information and checking that the input program makes sense.
- Code generation. Traverses the AST again; this stage may directly interpret the program, or output code in C or assembly.
For each stage, Spark provides one or more abstract classes that perform the corresponding step, together with a rather unusual protocol for specializing them. Spark classes are not refined by simply redefining or adding specific methods, as in most inheritance patterns; instead, they rely on two features (the general pattern is the same for each stage and its various parent classes). First, much of the work a concrete class does is specified in the docstrings of its methods. Second, the methods that describe patterns are given distinctive names indicating their role; the parent classes, in turn, contain introspective methods that search instances for exactly such names. We will see this more clearly when we look at the examples.
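To give a feel for this protocol, here is a toy illustration of the general idea (a simplified sketch of my own, not Spark's actual implementation): a base class collects all methods whose names begin with a given prefix and treats their docstrings as regular-expression patterns.

import re

class MiniScanner:
    "Toy illustration of the docstring-plus-name-prefix protocol (not Spark itself)"
    def tokenize(self, text):
        # Introspect for methods named t_<something>; their docstrings are patterns.
        rules = []
        for name in dir(self):
            if name.startswith('t_'):
                method = getattr(self, name)
                rules.append((re.compile(method.__doc__.strip()), method))
        tokens, pos = [], 0
        while pos < len(text):
            for pattern, action in rules:
                m = pattern.match(text, pos)
                if m:
                    action(m.group())       # run any extra code attached to this token
                    tokens.append(m.group())
                    pos = m.end()
                    break
            else:
                pos = pos + 1               # skip characters no rule matches
        return tokens

class ToyWords(MiniScanner):
    def t_word(self, s):
        r" [a-zA-Z]+ "
        pass
    def t_number(self, s):
        r" [0-9]+ "
        pass

With these hypothetical classes, ToyWords().tokenize('abc 123') would return ['abc', '123']; the space is skipped because no rule matches it. Spark's real classes are considerably more sophisticated, but the docstring-and-name convention is the same.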
Recognize text markup
I have tackled this problem in several other ways in this series. I use a format I call "smart ASCII" for a variety of purposes. The format looks a lot like the conventions that developed for email and newsgroup communication. For various purposes I automatically convert this format to other formats such as HTML, XML, and LaTeX, and I will do that again here. To make clear what I mean, I will use the following short sample throughout this article:
Listing 2. Smart ASCII sample (p.txt)
Text with *bold*, and -itals phrase-, and [module]--this
should be a good 'practice run'.
Apart from what appears in the sample file, there is a little more to the format, but not much (although there are a few subtleties about how markup interacts with punctuation).
Generate tokens
The first thing our Spark "smart ASCII" parser needs to do is split the input text into relevant parts. At the tokenizing stage we do not yet want to discuss how the tokens are structured; we keep them just as they are. Later, we will combine the token sequence into parse trees.
The grammar shown in typographify.def above provides a design guide for the Spark lexer/scanner. Note that we can only use names that are "primitive" at the scanner stage; that is, the (compound) patterns that include other named patterns must be deferred to the parsing stage. Apart from that, we can basically copy the old grammar directly.
Listing 3. Abridged wordscanner.py Spark script
class WordScanner(GenericScanner):
    "Tokenize words, punctuation and markup"
    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        return self.rv
    def t_whitespace(self, s):
        r" [ \t\r\n]+ "
        self.rv.append(Token('whitespace', ' '))
    def t_alphanums(self, s):
        r" [a-zA-Z0-9]+ "
        print "{word}",
        self.rv.append(Token('alphanums', s))
    def t_safepunct(self, s): ...
    def t_bracket(self, s): ...
    def t_asterisk(self, s): ...
    def t_underscore(self, s): ...
    def t_apostrophe(self, s): ...
    def t_dash(self, s): ...

class WordPlusScanner(WordScanner):
    "Enhance word/markup tokenization"
    def t_contraction(self, s):
        r" (?<=[a-zA-Z])'(am|clock|d|ll|m|re|s|t|ve) "
        self.rv.append(Token('contraction', s))
    def t_mdash(self, s):
        r' -- '
        self.rv.append(Token('mdash', s))
    def t_wordpunct(self, s): ...
There is something interesting here. WordScanner is itself a perfectly good scanner class, but a Spark scanner class can be further refined by inheritance: child regular-expression patterns are matched before their parents' patterns, and child methods/regular expressions can override the parents' if necessary. So WordPlusScanner tries its additional patterns before WordScanner does (and may thereby grab some bytes first). Any regular expression is allowed in the pattern docstrings (for example, the .t_contraction() method contains a "lookbehind assertion" in its pattern).
Unfortunately, Python 2.2 breaks the scanner inheritance logic somewhat. In Python 2.2, all the defined patterns are matched in alphabetical order (by name), no matter where they occur in the inheritance chain. To fix the problem, you can modify one line of code in Spark's _namelist() function:
Listing 4. Corrected spark.py _namelist() function
def _namelist(instance):
    namelist, namedict, classlist = [], {}, [instance.__class__]
    for c in classlist:
        for b in c.__bases__:
            classlist.append(b)
        # for name in dir(c):            # dir() behavior changed in 2.2
        for name in c.__dict__.keys():   # <-- USE THIS
            if not namedict.has_key(name):
                namelist.append(name)
                namedict[name] = 1
    return namelist
I have informed Spark creator John Aycock of the issue, and it will be fixed in a future version. In the meantime, make the change in your own copy.
Let's see what happens when WordPlusScanner is applied to the "smart ASCII" sample above. The list it creates is actually a list of Token instances, but they contain a .__repr__ method that lets them display nicely:
Listing 5. Tokenizing "smart ASCII" with WordPlusScanner
>>> from wordscanner import WordPlusScanner
>>> tokens = WordPlusScanner().tokenize(open('p.txt').read())
>>> filter(lambda s: s <> 'whitespace', tokens)
[Text, with, *, bold, *, ,, and, -, itals, phrase, -, ,, and, [,
module, ], --, this, should, be, a, good, ', practice, run, ', .]
It is worth noting that although methods like .t_alphanums() are recognized by Spark's introspection through their "t_" prefix, they are also ordinary methods: whenever the corresponding token is encountered, any additional code in the method is executed. The .t_alphanums() method contains a trivial example of this in the form of a print statement.
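For example, a hypothetical refinement of Listing 3 (my own variation, not part of the article's wordscanner.py) could just as well accumulate statistics or normalize the text it matches:

class CountingScanner(WordScanner):
    "Hypothetical refinement: count word tokens and normalize case as a side effect"
    def tokenize(self, input):
        self.word_count = 0
        return WordScanner.tokenize(self, input)
    def t_alphanums(self, s):
        r" [a-zA-Z0-9]+ "
        self.word_count = self.word_count + 1          # arbitrary extra code
        self.rv.append(Token('alphanums', s.lower()))  # e.g. fold to lower case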
Generate an abstract syntax tree
Finding the tokens is a start, but the really interesting part is applying a grammar to the token list. The parsing stage creates arbitrary tree structures from the token list on the basis of a grammar; it is just a matter of specifying the expression grammar.
Spark has several ways of creating an AST. The "manual" way is to subclass the GenericParser class. In this case, the child parser provides a number of methods of the form p_foobar(self, args). The docstring of each such method contains one or more assignments of patterns to names, and each method can contain arbitrary code to execute whenever its grammar expression is matched.
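A hand-rolled parser for a fragment of our grammar might look roughly like this. This is a sketch following the conventions just described and Aycock's paper, not code taken from this article; in particular, the constructor argument (the grammar's start symbol) and the handling of return values are assumptions on my part:

class ManualParser(GenericParser):
    "Hypothetical hand-written parser in the GenericParser style"
    def __init__(self, start='para'):
        # Assumption: the base class is told the grammar's start symbol
        GenericParser.__init__(self, start)
    def p_para(self, args):
        ''' para ::= para plain
            para ::= plain '''
        return args            # arbitrary code may run whenever a rule matches
    def p_strong(self, args):
        ''' strong ::= asterisk plain asterisk '''
        return args[1]         # e.g. keep only the phrase between the asterisks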
However, Spark also offers an "automatic" way of generating an AST. This style inherits from the GenericASTBuilder class. All the grammar expressions are listed in one top-level method, while the .terminal() and .nonterminal() methods may be specialized to manipulate subtrees during generation (or to perform any other action you want). The result is still an AST, but the parent class does most of the work for you. My grammar class looks like the following:
Listing 6. Abridged markupbuilder.py Spark script
class MarkupBuilder(GenericASTBuilder):
    "Write out HTML markup based on matched markup"
    def p_para(self, args):
        '''
        para   ::= plain
        para   ::= markup
        para   ::= para plain
        para   ::= para emph
        para   ::= para strong
        para   ::= para module
        para   ::= para code
        para   ::= para title
        plain  ::= whitespace
        plain  ::= alphanums
        plain  ::= contraction
        plain  ::= safepunct
        plain  ::= mdash
        plain  ::= wordpunct
        plain  ::= plain plain
        emph   ::= dash plain dash
        strong ::= asterisk plain asterisk
        module ::= bracket plain bracket
        code   ::= apostrophe plain apostrophe
        title  ::= underscore plain underscore
        '''
    def nonterminal(self, type_, args):
        #  Flatten AST a bit by not making nodes if only one child.
        if len(args)==1: return args[0]
        # Delegate node construction to the parent class where appropriate
        if type_=='para':
            return GenericASTBuilder.nonterminal(self, type_, args)
        if type_=='plain':
            args[0].attr = foldtree(args[0])+foldtree(args[1])
            args[0].type = type_
            return GenericASTBuilder.nonterminal(self, type_, args[:1])
        phrase_node = AST(type_)
        phrase_node.attr = foldtree(args[1])
        return phrase_node
My. p_para () should contain only one set of syntax rules (no code) in its document string ). I decided to use the. nonterminal () method to tile the AST slightly. A "plain" node composed of a series of "plain" subtrees compresses the subtree into a longer string. Similarly, the tag subtree (namely, "emph", "strong", "module", "code", and "title") is folded into a correct type of independent node, and contains a composite string.
As already mentioned, one thing is clearly missing from Spark grammars: quantifiers. With the rule
plain ::= plain plain
we can aggregate subtrees of type "plain" pairwise. But I would prefer Spark to allow a more EBNF-like grammar expression, such as:
plain ::= plain+
We could then more simply create n-ary subtrees of "as many plains as possible," and the tree would come out flatter to begin with, even without the manipulation in .nonterminal().
Use the tree
The Spark module provides several classes for working with ASTs. These carry more responsibility than I need for my purposes, but if you want them, GenericASTTraversal and GenericASTMatcher offer ways to traverse the tree, using inheritance protocols similar to those we saw for the scanner and parser.
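For the record, a GenericASTTraversal subclass uses the same naming trick as the scanner and parser: methods named n_<nodetype> are discovered introspectively and called as nodes of that type are visited. The following sketch is based on Aycock's paper; the exact API details (the constructor taking the AST, the .postorder() call, the default() fallback) are assumptions on my part:

class MarkupVisitor(GenericASTTraversal):
    "Report the bold phrases found in a markup AST"
    def __init__(self, ast):
        GenericASTTraversal.__init__(self, ast)
        self.postorder()              # walk the tree, dispatching to n_* methods
    def n_strong(self, node):
        print "bold phrase:", node.attr
    def default(self, node):
        pass                          # node types without an n_ method land here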
But traversing the tree with nothing more than recursive functions is not difficult either. I created a few such examples in the archive file prettyprint.py (see References). One of them is showtree(), which displays an AST according to a few conventions (a minimal sketch of such a function follows the list):
- Each line shows the descent depth.
- Nodes that have only subtrees and no content of their own begin with dashes.
- Node types are enclosed in double angle brackets.
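The real showtree() lives in prettyprint.py; a minimal version following those conventions (my sketch, assuming nodes expose .type, .attr, and ._kids as in Listing 8 below) could look like this:

def showtree(node, depth=0):
    # Nodes with no text of their own get a dash prefix; others just indent
    if hasattr(node, 'attr'):
        line = '%2d %s <<%s>> %s' % (depth, ' '*depth, node.type, node.attr)
    else:
        line = '%2d %s <<%s>>' % (depth, '-'*depth, node.type)
    print line
    for kid in getattr(node, '_kids', []):
        showtree(kid, depth+1)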
Let's look at the AST generated from the sample above:
Listing 7. The AST generated from the "smart ASCII" sample
>>> from wordscanner import tokensFromFname
>>> from markupbuilder import treeFromTokens
>>> from prettyprint import showtree
>>> showtree(treeFromTokens(tokensFromFname('p.txt')))
 0  <<para>>
 1 - <<para>>
 2 -- <<para>>
 3 --- <<para>>
 4 ---- <<para>>
 5 ----- <<para>>
 6 ------ <<para>>
 7 ------- <<para>>
 8 -------- <<plain>>
 9          <<plain>> Text with
 8          <<strong>> bold
 7 ------- <<plain>>
 8          <<plain>> , and
 6         <<emph>> itals phrase
 5 ----- <<plain>>
 6        <<plain>> , and
 4       <<module>> module
 3 --- <<plain>>
 4      <<plain>> --this should be a good
 2     <<code>> practice run
 1 - <<plain>>
 2    <<plain>> .
The tree structure is intuitive enough, but what if we really want to generate output markup from it? Fortunately, it only takes a few lines of code to traverse the tree and produce it:
Listing 8. Emitting markup from the AST (prettyprint.py)
import sys

def emitHTML(node):
    from typo_html import codes
    if hasattr(node, 'attr'):
        beg, end = codes[node.type]
        sys.stdout.write(beg+node.attr+end)
    else:
        map(emitHTML, node._kids)
The typo_html.py file is the same one used in the SimpleParse article; it just contains a dictionary mapping names to beginning/ending tag pairs. Obviously, we could use the same approach for markup other than HTML. In case it is unclear, here is what our example produces:
Listing 9. HTML output of the entire process
Text with <strong>bold</strong>, and <em>itals phrase</em>,
and <em><code>module</code></em>--this should be a good
<code>practice run</code>.
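For reference, the codes dictionary in typo_html.py presumably maps each markup type to a begin/end tag pair along the following lines (an assumption reconstructed from the output above, not the actual file):

# typo_html.py (assumed contents)
codes = {
    'plain':  ('', ''),
    'emph':   ('<em>', '</em>'),
    'strong': ('<strong>', '</strong>'),
    'module': ('<em><code>', '</code></em>'),
    'code':   ('<code>', '</code>'),
    'title':  ('<cite>', '</cite>'),
}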
Conclusion
Many Python programmers have recommended Spark to me. Although the unusual protocols Spark uses take some getting used to, and the documentation can be ambiguous in places, Spark's power is still impressive. The programming style Spark implements lets end programmers insert code blocks anywhere in the scanning/parsing process, a process that is normally a "black box" to end users.
For all its advantages, the real drawback I found in Spark is its speed. Spark is the first Python program I have used in which the speed penalty of an interpreted language turned out to be the main problem. Spark really is slow: not just "I wish it were a bit faster" slow, but "I hope it finishes before I get back from a long lunch" slow. In my experiments the tokenizer was still reasonably fast, but parsing was very slow even with quite small test cases. To be fair, John Aycock has pointed out to me that the Earley parsing algorithm Spark uses is much more general than the simpler LR algorithm, which is the main reason for its slowness. It is also possible that, due to my inexperience, I designed inefficient grammars; if so, though, most users would probably do the same.