A tutorial to implement a recursive descent parser using 70 lines of Python code

Last Update:2017-01-19 Source: Internet

Author: User

Step 1: Tokenizing

The first step in processing an expression is to convert it into a list of independent symbols, called tokens. This step is simple and not the focus of this article, so I cut a lot of corners here.
First, I define the token types (numbers are not listed; they are the default) and a token class:

from collections import namedtuple

token_map = {'+': 'ADD', '-': 'ADD', '*': 'MUL', '/': 'MUL',
             '(': 'LPAR', ')': 'RPAR'}

Token = namedtuple('Token', ['name', 'value'])

Here is the code I use to tokenize the expression `expr`:

split_expr = re.findall(r'[\d.]+|[%s]' % re.escape(''.join(token_map)), expr)
tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]

The first line is a trick that splits the expression into its basic tokens, so

'1.2 / ( 11+3)' --> ['1.2', '/', '(', '11', '+', '3', ')']

The next line names the tokens, so that the parser can recognize them by category:

['1.2', '/', '(', '11', '+', '3', ')']
->
[Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

Any token that is not in token_map is assumed to be a number. Our tokenizer lacks validation, so it will accept things that are not really numbers, but fortunately the evaluator will catch them later.
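
The missing validation could be added along the following lines. This is a minimal Python 3 sketch rather than the article's code: the function name `tokenize`, the stricter number pattern, and the choice to raise `ValueError` are all assumptions made for illustration.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['name', 'value'])

token_map = {'+': 'ADD', '-': 'ADD', '*': 'MUL', '/': 'MUL',
             '(': 'LPAR', ')': 'RPAR'}

# Numbers first, then the single-character operators. re.escape keeps the
# operator characters literal inside the character class.
TOKEN_RE = re.compile(r'\d+(?:\.\d*)?|\.\d+|[%s]' % re.escape(''.join(token_map)))

def tokenize(expr):
    """Hypothetical helper: split expr into Tokens, rejecting unknown characters."""
    tokens = []
    pos = 0
    while pos < len(expr):
        if expr[pos].isspace():          # skip whitespace between tokens
            pos += 1
            continue
        m = TOKEN_RE.match(expr, pos)
        if m is None:
            raise ValueError('unexpected character %r at position %d'
                             % (expr[pos], pos))
        text = m.group()
        tokens.append(Token(token_map.get(text, 'NUM'), text))
        pos = m.end()
    return tokens
```

With this sketch, `tokenize('1.2 / (11+3)')` yields the same seven tokens as above, while `tokenize('2 $ 3')` raises `ValueError` instead of silently dropping the `$`.
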
That's it.

Step 2: Grammar definition

The parser I chose to implement is a naive recursive descent parser, which is a simple kind of LL parser. It is one of the simplest parsers to implement; in fact, mine takes just 14 lines of code. It is a top-down parser, which means that it starts parsing from the top-level rule (like: expression) and then recursively tries to parse its sub-rules until it reaches the lowest-level rules (like: number). Put another way, a bottom-up parser (LR) gradually contracts the tokens into rules, and those rules into other rules, until only one rule is left, while a top-down parser (LL) gradually expands the rules into less abstract rules until they exactly match the input tokens.
Before we dive into the actual parser implementation, let's discuss the grammar. In my previous article I used an LR parser, and I defined the calculator grammar as follows (tokens are written in uppercase):

add: add ADD mul | mul;
mul: mul MUL atom | atom;
atom: NUM | '(' add ')' | neg;
neg: '-' atom;

(If you do not understand the grammar above, please read the article I published earlier.)

Now I'm going to define the calculator grammar for an LL parser, in the following way:

rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}

As you can see, there's a subtle change here. The recursive definitions of add and mul are reversed. This is a very important detail, so let me explain it.

The LR version uses left recursion. When an LL parser encounters a recursive rule, it tries to match that rule again. So when the recursion is on the left, the parser re-enters the same rule without consuming any input and recurses forever. Even smart LL parsers such as ANTLR have historically suffered from this problem (ANTLR 4 later added automatic rewriting of direct left recursion), although they replace the infinite recursion with a friendly error message, unlike our toy parser.
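
The problem can be reproduced in a few lines. The following Python 3 sketch uses a stripped-down matcher in the style of the parser built later in this article; the two tiny grammars and the extra `rule_map` parameter are assumptions made for illustration, not the calculator's actual grammar.

```python
from collections import namedtuple

Token = namedtuple('Token', ['name', 'value'])

def match(rule_map, rule_name, tokens):
    """Naive recursive descent matcher with the grammar passed in explicitly."""
    if tokens and rule_name == tokens[0].name:
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):
        remaining = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining = match(rule_map, subrule, remaining)
            if not matched:
                break
            matched_subrules.append(matched)
        else:
            return (rule_name, matched_subrules), remaining
    return None, None

# Left-recursive: the first thing 'add' tries is 'add' itself,
# before any token has been consumed.
left_recursive = {'add': ['add ADD NUM', 'NUM']}
# Right-recursive equivalent: a NUM token is consumed before recursing.
right_recursive = {'add': ['NUM ADD add', 'NUM']}

tokens = [Token('NUM', '1'), Token('ADD', '+'), Token('NUM', '2')]

try:
    match(left_recursive, 'add', tokens)
    outcome = 'matched'
except RecursionError:
    outcome = 'infinite recursion'

tree, rest = match(right_recursive, 'add', tokens)
print(outcome)   # infinite recursion
print(rest)      # []
```

The left-recursive grammar exhausts Python's recursion limit, while the right-recursive one consumes all three tokens.
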

Left recursion can easily be converted to right recursion, which is what I did. But parsing is never that simple, and the conversion creates another problem: while left recursion correctly parses 3-2-1 as (3-2)-1, right recursion incorrectly parses it as 3-(2-1). I couldn't think of a simple solution, so to keep things simple I decided to let the parser keep producing the wrong structure and to deal with the problem later (see step 4).
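
To see that the two groupings really give different answers, here is a small arithmetic check in Python 3. It only illustrates associativity and is not part of the parser.

```python
from functools import reduce
from operator import sub

operands = [3, 2, 1]

# Left-associative grouping: (3 - 2) - 1
left = reduce(sub, operands)

# Right-associative grouping: 3 - (2 - 1), built by folding from the right
right = reduce(lambda acc, x: x - acc, reversed(operands))

print(left)    # 0
print(right)   # 2
```
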

Step 3: Parse into an AST

The algorithm is actually very simple. We define a recursive function that takes two parameters: the first is the name of the rule we are trying to match, and the second is the list of tokens that are still left. We start with add (the top-level rule) and the complete list of tokens, and the recursive calls take it from there. The function returns a pair: the current match, and the list of tokens that remain to be matched. To keep the code short, the same function also matches tokens (tokens and rules are both named by strings; token names are uppercase and rule names are lowercase).

Here is the parser code:

RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # No such luck. Try the next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # No match found

Lines 4-5: if rule_name is actually a token name and the token list is not empty, check whether it matches the current token. If it does, return the matched token along with the tokens that remain.

Line 6: iterate over the expansions of the rule called rule_name, and try to match each of them recursively. If rule_name is a token name (one that did not match), get() returns an empty tuple and the function falls through to return None, None (see line 16).

Lines 9-15: iterate over the sub-rules of the current expansion and try to match them in order. Each iteration consumes as many tokens as it can. If one sub-rule fails to match, we give up on the entire expansion. However, if all of them match, we reach the else clause and return the match for rule_name, along with the remaining tokens.
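
One consequence of returning the remaining tokens is that the caller can tell when the input was only partly consumed. The following Python 3 sketch shows this; the wrapper name `parse` and its use of `ValueError` are assumptions made for illustration, not part of the article's code.

```python
from collections import namedtuple

Token = namedtuple('Token', ['name', 'value'])
RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

rule_map = {
    'add':  ['mul ADD add', 'mul'],
    'mul':  ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg':  ['ADD atom'],
}

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # match a token
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # match a rule
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None

def parse(tokens):
    """Hypothetical wrapper: require the top-level rule to consume every token."""
    tree, remaining = match('add', tokens)
    if tree is None:
        raise ValueError('no rule matches the input')
    if remaining:
        raise ValueError('unparsed tokens left over: %r' % (remaining,))
    return tree

one, two = Token('NUM', '1'), Token('NUM', '2')
plus, times = Token('ADD', '+'), Token('MUL', '*')

print(parse([one, plus, two]).name)       # add
print(match('add', [times]))              # (None, None)
print(match('add', [one, two])[1])        # the leftover token for '2'
```

In the last call, match succeeds on the first number and simply hands back the second one as unconsumed input, which `parse` would report as an error.
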

Now let's run it and look at the result for 1.2/(11+3).

>>> tokens = [Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

>>> match('add', tokens)

(RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='1.2')]), Token(name='MUL', value='/'), RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='LPAR', value='('), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='11')])]), Token(name='ADD', value='+'), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='3')])])])]), Token(name='RPAR', value=')')])])])]), [])

The result is a tuple, and as expected there are no remaining tokens. The match itself is not easy to read, so let me draw it as a diagram:

add
  mul
    atom
      NUM '1.2'
    MUL '/'
    mul
      atom
        LPAR '('
        add
          mul
            atom
              NUM '11'
          ADD '+'
          add
            mul
              atom
                NUM '3'
        RPAR ')'

This is what the AST looks like conceptually. It's a good exercise to work out how the parser gets there, either by running it in your head or by tracing it on paper. I wouldn't call it strictly necessary, unless you want to understand the algorithm thoroughly. You can use the AST to check that you got the algorithm right.
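
A diagram like the one above can be produced mechanically. The following Python 3 sketch walks a match result and indents each level; the helper name `format_tree` is an assumption made for illustration, and the tree is built by hand in the shape that match() returns.

```python
from collections import namedtuple

Token = namedtuple('Token', ['name', 'value'])
RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

def format_tree(node, depth=0):
    """Hypothetical helper: return the lines of an indented diagram for node."""
    indent = '  ' * depth
    if isinstance(node, Token):               # leaf: show the token and its text
        return ['%s%s %r' % (indent, node.name, node.value)]
    lines = [indent + node.name]              # inner node: rule name, then children
    for child in node.matched:
        lines.extend(format_tree(child, depth + 1))
    return lines

# Hand-built AST for "11+3".
tree = RuleMatch('add', [
    RuleMatch('mul', [RuleMatch('atom', [Token('NUM', '11')])]),
    Token('ADD', '+'),
    RuleMatch('add', [RuleMatch('mul', [RuleMatch('atom', [Token('NUM', '3')])])]),
])

print('\n'.join(format_tree(tree)))
```
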

So far, we have written a parser that can handle binary operations, unary operations, parentheses, and operator precedence.

There is only one error left to fix, and the next step takes care of it.

Step 4: Post-processing

My parser doesn't work in every case. Most importantly, it cannot handle left recursion, which forced me to write the grammar in a right-recursive way. As a result, when parsing the expression 8/4/2, the AST comes out as follows:

add
  mul
    atom
      NUM 8
    MUL '/'
    mul
      atom
        NUM 4
      MUL '/'
      mul
        atom
          NUM 2

If we try to compute the result from this AST, we will evaluate 4/2 first, which is of course wrong. Some LL parsers choose to fix the associativity in the tree itself. That takes too many lines of code, so instead we will flatten the tree. The algorithm is simple. For each rule in the AST that 1) needs fixing, 2) is a binary operation (it has three sub-rules), and 3) has the same rule as its right-hand operand: flatten the latter into the former. By "flatten", I mean replace a node with its children, in the context of its parent. Since our traversal is a post-order DFS, meaning that it starts at the leaves of the tree and works its way up to the root, the effect accumulates. Here is the code:

fix_assoc_rules = 'add', 'mul'

def _recurse_tree(tree, func):
    # Apply func to every child of a rule; for a token, return its value.
    return list(map(func, tree.matched)) if tree.name in rule_map else tree[1]

def flatten_right_associativity(tree):
    new = _recurse_tree(tree, flatten_right_associativity)
    if tree.name in fix_assoc_rules and len(new) == 3 and new[2].name == tree.name:
        new[-1:] = new[-1].matched
    return RuleMatch(tree.name, new)

This code turns any chain of additions or of multiplications into a flat list (without mixing the two). Parentheses break the chain, of course, so they are not affected.
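
The effect is easiest to see on the 8/4/2 tree from the beginning of this step. The following Python 3 sketch builds that tree by hand and flattens it; the set `rule_names` stands in for the keys of rule_map, the helpers `atom` and `div` are assumptions made for illustration, and a list comprehension replaces map().

```python
from collections import namedtuple

Token = namedtuple('Token', ['name', 'value'])
RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

rule_names = {'add', 'mul', 'atom', 'neg'}   # stands in for the keys of rule_map
fix_assoc_rules = 'add', 'mul'

def _recurse_tree(tree, func):
    # Rules: process every child. Tokens: return the token's value.
    if tree.name in rule_names:
        return [func(child) for child in tree.matched]
    return tree[1]

def flatten_right_associativity(tree):
    new = _recurse_tree(tree, flatten_right_associativity)
    if tree.name in fix_assoc_rules and len(new) == 3 and new[2].name == tree.name:
        new[-1:] = new[-1].matched           # splice the right child's children in
    return RuleMatch(tree.name, new)

def atom(n):
    """Hypothetical shorthand for an atom holding one number."""
    return RuleMatch('atom', [Token('NUM', n)])

div = Token('MUL', '/')

# Right-recursive parse of 8/4/2: mul(8 / mul(4 / mul(2)))
nested = RuleMatch('mul', [atom('8'), div,
                           RuleMatch('mul', [atom('4'), div,
                                             RuleMatch('mul', [atom('2')])])])

flat = flatten_right_associativity(nested)
print([child.name for child in flat.matched])
# ['atom', 'MUL', 'atom', 'MUL', 'atom']
```

The three nested mul nodes collapse into a single mul with five children, which can then be evaluated from left to right.
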

From this point, I could rebuild the structure in a left-associative way:

def build_left_associativity(tree):
    new_nodes = _recurse_tree(tree, build_left_associativity)
    if tree.name in fix_assoc_rules:
        while len(new_nodes) > 3:
            new_nodes[:3] = [RuleMatch(tree.name, new_nodes[:3])]
    return RuleMatch(tree.name, new_nodes)
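
The regrouping loop can be seen in isolation on a plain list. This Python 3 sketch uses tuples in place of RuleMatch nodes; the function name `nest_left` is an assumption made for illustration.

```python
def nest_left(items):
    """Fold a flat [operand, op, operand, op, ...] list into left-nested triples."""
    items = list(items)                      # work on a copy
    while len(items) > 3:
        items[:3] = [tuple(items[:3])]       # group the leftmost operation
    return tuple(items)

print(nest_left([8, '/', 4, '/', 2]))
# ((8, '/', 4), '/', 2)
```

Each pass wraps the leftmost operation into one node, so the first operation ends up deepest in the tree and is therefore evaluated first.
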

However, I won't do that. I want as little code as possible, and making the evaluation code work on a flat list takes less code than rebuilding the whole tree.

Step 5: The evaluator

Evaluating the tree is very simple. We just traverse the tree the same way the post-processing code does (that is, a post-order DFS) and evaluate each rule in it. Because the traversal is recursive, by the time a rule is evaluated its children have already been evaluated, so each rule contains only numbers and operators. The code is as follows:

from operator import add, sub, mul, truediv

bin_calc_map = {'*': mul, '/': truediv, '+': add, '-': sub}

def calc_binary(x):
    # Reduce [number, op, number, op, ...] from the left until one value is left.
    while len(x) > 1:
        x[:3] = [bin_calc_map[x[1]](x[0], x[2])]
    return x[0]

calc_map = {
    'NUM' : float,
    'atom': lambda x: x[len(x) != 1],                 # [n] -> n; ['(', n, ')'] -> n
    'neg' : lambda x: (x[1], -x[1])[x[0] == '-'],     # x is [op, num]
    'mul' : calc_binary,
    'add' : calc_binary,
}

def evaluate(tree):
    solutions = _recurse_tree(tree, evaluate)
    return calc_map.get(tree.name, lambda x: x)(solutions)

I use the calc_binary function to evaluate both addition and subtraction (and likewise multiplication and division, which share a precedence level). It evaluates the operations in the list in a left-associative way, which is exactly what our LL grammar could not easily give us.
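
The difference that left-to-right evaluation makes can be checked directly on a flat list. In this Python 3 sketch, `calc_binary` follows the function above except that it works on a copy of its argument, and `calc_binary_right` is a hypothetical counterpart added only for contrast.

```python
from operator import add, sub, mul, truediv

bin_calc_map = {'*': mul, '/': truediv, '+': add, '-': sub}

def calc_binary(x):
    x = list(x)                              # copy, so the caller's list is untouched
    while len(x) > 1:
        x[:3] = [bin_calc_map[x[1]](x[0], x[2])]     # leftmost operation first
    return x[0]

def calc_binary_right(x):
    """For contrast only: evaluate the same flat list from the right."""
    x = list(x)
    while len(x) > 1:
        x[-3:] = [bin_calc_map[x[-2]](x[-3], x[-1])]  # rightmost operation first
    return x[0]

flat = [8.0, '/', 4.0, '/', 2.0]
print(calc_binary(flat))         # 1.0, i.e. (8/4)/2
print(calc_binary_right(flat))   # 4.0, i.e. 8/(4/2)
```
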

Step 6: The REPL

The simplest possible REPL:

if __name__ == '__main__':
    while True:
        print(calc(input('> ')))

Don't make me explain it :)
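
A slightly more forgiving loop might look like the following Python 3 sketch. It is not the article's REPL: the function name `run_repl` and its `read_line` and `write` parameters are assumptions made for illustration, and `calc` can be any function that maps an expression string to a result.

```python
def run_repl(calc, read_line=input, write=print):
    """Hypothetical REPL loop that survives bad input and stops at end of input."""
    while True:
        try:
            line = read_line('> ')
        except EOFError:              # Ctrl-D, or the end of piped input
            break
        if not line.strip():
            continue                  # ignore empty lines
        try:
            write(calc(line))
        except Exception as exc:      # report the problem and keep going
            write('error: %s' % exc)
```

Passing the reader and writer as parameters also makes the loop easy to exercise without a terminal.
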

Appendix: Putting it all together: a calculator in 70 lines

'''A Calculator implemented with a top-down, recursive-descent parser'''
# Author: Erez Shinan, Dec
import re, collections
from operator import add, sub, mul, truediv

Token = collections.namedtuple('Token', ['name', 'value'])
RuleMatch = collections.namedtuple('RuleMatch', ['name', 'matched'])

token_map = {'+': 'ADD', '-': 'ADD', '*': 'MUL', '/': 'MUL', '(': 'LPAR', ')': 'RPAR'}
rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}
fix_assoc_rules = 'add', 'mul'

bin_calc_map = {'*': mul, '/': truediv, '+': add, '-': sub}
def calc_binary(x):
    while len(x) > 1:
        x[:3] = [bin_calc_map[x[1]](x[0], x[2])]
    return x[0]

calc_map = {
    'NUM' : float,
    'atom': lambda x: x[len(x) != 1],
    'neg' : lambda x: (x[1], -x[1])[x[0] == '-'],
    'mul' : calc_binary,
    'add' : calc_binary,
}

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # No such luck. Next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # Match not found

def _recurse_tree(tree, func):
    return list(map(func, tree.matched)) if tree.name in rule_map else tree[1]

def flatten_right_associativity(tree):
    new = _recurse_tree(tree, flatten_right_associativity)
    if tree.name in fix_assoc_rules and len(new) == 3 and new[2].name == tree.name:
        new[-1:] = new[-1].matched
    return RuleMatch(tree.name, new)

def evaluate(tree):
    solutions = _recurse_tree(tree, evaluate)
    return calc_map.get(tree.name, lambda x: x)(solutions)

def calc(expr):
    split_expr = re.findall(r'[\d.]+|[%s]' % re.escape(''.join(token_map)), expr)
    tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
    tree = match('add', tokens)[0]
    tree = flatten_right_associativity(tree)
    return evaluate(tree)

if __name__ == '__main__':
    while True:
        print(calc(input('> ')))
