Writing a recursive descent parser in 70 lines of Python
Step 1: Tokenization
The first step in processing an expression is to convert it into a list of independent tokens. This step is simple and not the focus of this article, so I have cut many corners here.
First, I define the token names (numbers are not in this map; they are the default token) and a Token type:
token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}
Token = namedtuple('Token', ['name', 'value'])
The following code tokenizes the expression expr:
split_expr = re.findall('[\d.]+|[%s]' % ''.join(token_map), expr)
tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
The first line is a trick that splits the expression into its basic tokens, so:
'1.2 / ( 11+3)' --> ['1.2', '/', '(', '11', '+', '3', ')']
The second line names the tokens, so that the parser can recognize them by category:
['1.2', '/', '(', '11', '+', '3', ')']->[Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]
Any token that is not in token_map is assumed to be a number. Our tokenizer does no validation, so it will happily accept things that are not numbers, but luckily the evaluator will catch them later.
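For illustration only (this is my own sketch, not part of the original article): a variant of the tokenizer with minimal validation bolted on. It reuses token_map and Token from above and rejects any character that the original re.findall() would silently drop (for example, letters).

def tokenize_checked(expr):
    split_expr = re.findall('[\d.]+|[%s]' % ''.join(token_map), expr)
    if ''.join(split_expr) != expr.replace(' ', ''):
        # something in the input was not a digit, dot, or known operator
        raise ValueError('unexpected characters in %r' % expr)
    return [Token(token_map.get(x, 'NUM'), x) for x in split_expr]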
That's it.
Step 2: Grammar definition
The parser I chose to implement is a naive recursive descent parser, which is a simpler version of the LL parser. It is the simplest parser to implement; in fact, the core is only 14 lines of code. It is a top-down parser: it starts from the top-level rule (like: expression) and recursively tries to parse it by its sub-rules, until it reaches the lowest-level rules (like: number). To put it another way, while a bottom-up (LR) parser gradually folds tokens into rules, and rules into other rules, until only one rule is left, a top-down (LL) parser gradually expands the rules into less and less abstract ones, until they exactly match the input tokens.
Before diving into the actual parser implementation, let's discuss the grammar. In my previous article I used an LR parser, and I defined the calculator grammar like this (tokens are in uppercase):
add: add ADD mul | mul;
mul: mul MUL atom | atom;
atom: NUM | '(' add ')' | neg;
neg: '-' atom;
(If you do not understand the above syntax, please read my previous article)
For the LL parser, I define the calculator grammar like this:
rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}
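To read this format: each key is a rule name, each value is a list of alternative expansions, and an expansion is a space-separated mix of rule names (lowercase) and token names (uppercase). As a walkthrough of my own (not from the original article): given the tokens for 1+2, the parser tries add's first expansion 'mul ADD add'. mul matches via atom -> NUM '1' (its 'atom MUL mul' expansion fails, since there is no MUL token), then ADD matches '+', and the trailing add matches recursively via mul -> atom -> NUM '2'.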
As you can see, there is a subtle change here: the recursive definitions of add and mul are reversed. This is a very important detail, and I will explain it in detail.
The LR version of this grammar uses left recursion. When an LL parser encounters a recursive rule, it tries to match that rule first, so left recursion sends it into infinite recursion. Even clever LL parsers such as ANTLR suffer from this problem, although they replace the infinite recursion with a friendly error message, unlike our toy parser.
Left recursion can easily be converted to right recursion, and that is what I did. But since nothing about parsers is simple, this creates another problem: while left recursion correctly parses 3-2-1 as (3-2)-1, right recursion incorrectly parses it as 3-(2-1). I couldn't think of an easy solution, so to keep things simple I decided to let it keep producing the wrong form and fix it in post-processing (see step 4).
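Concretely, the left-associative reading gives (3-2)-1 = 0, while the right-associative reading gives 3-(2-1) = 2.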
Step 3: Parsing into an AST
The algorithm is very simple. We define a recursive function that takes two parameters: the first is the name of the rule to match, the second is the list of tokens left to parse. We start it with add (the top-level rule) and the full list of tokens, and recursion takes care of the rest. The function returns a pair: the current match and the list of tokens still left to match. To keep the code short, we make the same function handle token matching too (rules and tokens are both strings; tokens are uppercase, rules are lowercase).
Here is the parser's code:
RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Does it match a token?
        return RuleMatch(tokens[0], tokens[1:])
    for expansion in rule_map.get(rule_name, ()):   # Can it match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # Bad luck, try the next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # no match found
Lines 4-5 say: if the rule name (rule_name) is actually a token name and it matches the current token, return the match along with the rest of the token list.
Line 6 says: iterate over the expansions of the rule and try to match each one by recursively matching its sub-rules. If rule_name is neither a token nor a rule, get() returns an empty tuple and we fall through to the empty return (line 16).
Lines 9-15 iterate over the sub-rules of the current expansion and try to match them in sequence, each one consuming as many tokens as it can. If any sub-rule fails to match, we discard the whole expansion and try the next one. But if all sub-rules match, we reach the else clause and return the match for rule_name together with the remaining tokens.
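As a small illustration of my own (not from the original article), matching a single rule consumes only the tokens it needs and hands back the rest:

>>> match('atom', [Token('NUM', '7'), Token('ADD', '+')])
(RuleMatch(name='atom', matched=[Token(name='NUM', value='7')]), [Token(name='ADD', value='+')])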
Now let's run it and check the result for 1.2/(11+3):
>>> tokens = [Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]
>>> match('add', tokens)
(RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='1.2')]), Token(name='MUL', value='/'), RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='LPAR', value='('), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='11')])]), Token(name='ADD', value='+'), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='3')])])])]), Token(name='RPAR', value=')')])])])]), [])
The result is a tuple: the match and the remaining tokens (there are none left, of course). The match itself is not easy to read, so let me draw it as a tree:
add
    mul
        atom
            NUM '1.2'
        MUL '/'
        mul
            atom
                LPAR '('
                add
                    mul
                        atom
                            NUM '11'
                    ADD '+'
                    add
                        mul
                            atom
                                NUM '3'
                RPAR ')'
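The article does not show how this picture was produced, so here is a small sketch of a pretty-printer that emits output in this shape. print_tree is my own helper, not part of the 70 lines; it only assumes the RuleMatch and Token types defined above.

def print_tree(node, indent=0):
    if isinstance(node, RuleMatch):     # a rule: print its name, then recurse
        print(' ' * indent + node.name)
        for child in node.matched:
            print_tree(child, indent + 4)
    else:                               # a Token leaf: print its name and value
        print(' ' * indent + '%s %r' % (node.name, node.value))

Calling print_tree(match('add', tokens)[0]) should print something very close to the tree above.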
This is our AST. It is a good exercise to trace how the parser built it, in your head or on paper. I wouldn't say it's a must, unless you really want to get it. Looking at the AST will help us get the calculation algorithm right.
So far, we have a parser that handles binary operations, unary operations, parentheses, and operator precedence.
There is only one bug left to fix, and the next step takes care of it.
Step 4: Post-processing
My parser does not work in every situation. Most importantly, it cannot handle left recursion, which forced me to write the grammar as right-recursive. As a result, parsing the expression 8/4/2 produces the following AST:
add
    mul
        atom
            NUM 8
        MUL '/'
        mul
            atom
                NUM 4
            MUL '/'
            mul
                atom
                    NUM 2
If we evaluate this AST directly, we will compute 4/2 first, which is of course wrong. Some LL parsers choose to fix the associativity in the tree; that takes quite a few lines of code ;). Instead, we will flatten it. The algorithm is simple: for each rule in the AST that 1) needs fixing, 2) is a binary operation (has three sub-matches), and 3) has a right-hand operand of the same rule: flatten the latter into the former. By "flatten" I mean replace a node with its own children, in the context of its parent. Since our traversal is post-order DFS, it starts at the leaves and works up towards the root, so the effect accumulates. Here is the code:
fix_assoc_rules = 'add', 'mul'

def _recurse_tree(tree, func):
    return map(func, tree.matched) if tree.name in rule_map else tree[1]

def flatten_right_associativity(tree):
    new = _recurse_tree(tree, flatten_right_associativity)
    if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
        new[-1:] = new[-1].matched
    return RuleMatch(tree.name, new)
This code turns any nested chain of additions or multiplications into a flat list (without mixing the two). Parentheses break the chain, of course, so they are not affected.
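For example (my own illustration, derived from the code above), after flatten_right_associativity the 8/4/2 tree becomes, schematically:

add
    mul
        atom
            NUM 8
        MUL '/'
        atom
            NUM 4
        MUL '/'
        atom
            NUM 2

The nested mul chain has collapsed into a single mul node with a flat list of operands and operators.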
In the same spirit, I could have rebuilt the tree as left-associative:
def build_left_associativity(tree):
    new_nodes = _recurse_tree(tree, build_left_associativity)
    if tree.name in fix_assoc_rules:
        while len(new_nodes)>3:
            new_nodes[:3] = [RuleMatch(tree.name, new_nodes[:3])]
    return RuleMatch(tree.name, new_nodes)
But I won't. I want as little code as possible, and teaching the calculation code to handle flat lists takes less code than restructuring the whole tree.
Step 5: The calculator
Evaluating the tree is very simple. All we have to do is traverse it the same way as the post-processing code (that is, post-order DFS) and evaluate each rule on the way back up. Thanks to the recursion, by the time a rule is evaluated it contains nothing but numbers and operators. The code:
bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}

def calc_binary(x):
    while len(x) > 1:
        x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
    return x[0]

calc_map = {
    'NUM' : float,
    'atom': lambda x: x[len(x)!=1],
    'neg' : lambda (op,num): (num,-num)[op=='-'],
    'mul' : calc_binary,
    'add' : calc_binary,
}

def evaluate(tree):
    solutions = _recurse_tree(tree, evaluate)
    return calc_map.get(tree.name, lambda x:x)(solutions)
I use calc_binary for both addition and subtraction (and their same-precedence counterparts, multiplication and division). It evaluates the flattened list from left to right, i.e. with left associativity, which is exactly what was awkward to get out of the LL grammar.
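To make that concrete, here is the 8/4/2 case traced by hand (my own example; it assumes the bin_calc_map defined above, and the list is what evaluate() passes to calc_binary for the flattened mul node):

x = [8.0, '/', 4.0, '/', 2.0]
x[:3] = [bin_calc_map[x[1]](x[0], x[2])]   # 8.0/4.0 -> x is now [2.0, '/', 2.0]
x[:3] = [bin_calc_map[x[1]](x[0], x[2])]   # 2.0/2.0 -> x is now [1.0]

So calc_binary returns 1.0, i.e. (8/4)/2, even though the grammar is right-recursive.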
Step 6: REPL
The simplest REPL:
if __name__ == '__main__':
    while True:
        print( calc(raw_input('> ')) )
I hope that one needs no explanation :)
Appendix: putting it all together: the 70-line calculator
'''A Calculator Implemented With A Top-Down, Recursive-Descent Parser'''
# Author: Erez Shinan, Dec 2012

import re, collections
from operator import add, sub, mul, div

Token = collections.namedtuple('Token', ['name', 'value'])
RuleMatch = collections.namedtuple('RuleMatch', ['name', 'matched'])

token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}
rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}
fix_assoc_rules = 'add', 'mul'

bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}

def calc_binary(x):
    while len(x) > 1:
        x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
    return x[0]

calc_map = {
    'NUM' : float,
    'atom': lambda x: x[len(x)!=1],
    'neg' : lambda (op,num): (num,-num)[op=='-'],
    'mul' : calc_binary,
    'add' : calc_binary,
}

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched: break   # no such luck. next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # match not found

def _recurse_tree(tree, func):
    return map(func, tree.matched) if tree.name in rule_map else tree[1]

def flatten_right_associativity(tree):
    new = _recurse_tree(tree, flatten_right_associativity)
    if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
        new[-1:] = new[-1].matched
    return RuleMatch(tree.name, new)

def evaluate(tree):
    solutions = _recurse_tree(tree, evaluate)
    return calc_map.get(tree.name, lambda x:x)(solutions)

def calc(expr):
    split_expr = re.findall('[\d.]+|[%s]' % ''.join(token_map), expr)
    tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
    tree = match('add', tokens)[0]
    tree = flatten_right_associativity( tree )
    return evaluate(tree)

if __name__ == '__main__':
    while True:
        print( calc(raw_input('> ')) )
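A note for readers following along: this listing is Python 2 code (raw_input, operator.div, a tuple-unpacking lambda, and map() returning a list). Below is a minimal sketch of the edits I believe are needed for Python 3; it is my own addition, not part of the original 70 lines.

from operator import add, sub, mul, truediv as div     # operator.div no longer exists

# in calc_map, tuple-unpacking lambdas are gone; the 'neg' entry becomes:
#     'neg' : lambda t: (t[1], -t[1])[t[0]=='-'],

def _recurse_tree(tree, func):
    # map() is lazy in Python 3, and the flattening code needs a real list
    return list(map(func, tree.matched)) if tree.name in rule_map else tree[1]

# and in the REPL, raw_input becomes input:
#     print( calc(input('> ')) )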