Key points for building a basic code interpreter with Python

Source: Internet
Author: User
Tags: lexer, python, list
In an interpreter, code written in Python, Ruby, and similar languages is translated and executed statement by statement rather than compiled to machine code in advance. I have always been interested in compilers and parsers: the concepts and the overall framework of a compiler are clear enough to me, but I am less familiar with the details. The source code of a program is really just a character sequence, and it can look amazing that a compiler or interpreter is able to understand and execute that character sequence directly. This article uses Python to implement a simple interpreter for a small list-operation language (similar to Python's list type). In fact, compilers and interpreters are not mysterious: as long as you understand the basic theory, the implementation is relatively simple (a product-level compiler or interpreter is, of course, still very complicated).
Operations supported by this list language:

veca = [1, 2, 3]                           # list declaration
vecb = [4, 5, 6]
print 'veca:', veca                        # print string, list, print expr
print 'veca * 2:', veca * 2                # list and integer multiplication
print 'veca + 2:', veca + 2                # list and integer addition
print 'veca + vecb:', veca + vecb          # list addition
print 'veca + [11, 12]:', veca + [11, 12]
print 'veca * vecb:', veca * vecb          # list multiplication
print 'veca:', veca
print 'vecb:', vecb

Corresponding output:

veca: [1, 2, 3]
veca * 2: [2, 4, 6]
veca + 2: [1, 2, 3, 2]
veca + vecb: [1, 2, 3, 2, 4, 5, 6]
veca + [11, 12]: [1, 2, 3, 2, 11, 12]
veca * vecb: [4, 5, 6, 8, 10, 12, 12, 15, 18, 8, 10, 12]
veca: [1, 2, 3, 2]
vecb: [4, 5, 6]
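The operator semantics shown above can be sketched in plain Python (the helper names vec_add and vec_mul are my own, not part of the interpreter, and this sketch is written in Python 3 while the article's code below is Python 2):

```python
def vec_add(v, x):
    # list + int appends the integer; list + list concatenates.
    return v + [x] if isinstance(x, int) else v + x

def vec_mul(v, x):
    # list * int scales every element; list * list multiplies all pairs.
    return [e * x for e in v] if isinstance(x, int) else [a * b for a in v for b in x]

print(vec_mul([1, 2, 3], 2))        # [2, 4, 6]
print(vec_add([1, 2, 3], 2))        # [1, 2, 3, 2]
print(vec_add([1, 2, 3], [4, 5]))   # [1, 2, 3, 4, 5]
```

Note one quirk visible in the sample output: the real interpreter implements list + int with list.append, which mutates the left operand, so after `print 'veca + 2:', veca + 2` the variable veca itself becomes [1, 2, 3, 2] and stays that way in the later lines.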

A compiler or interpreter processes its input stream in much the same way that people understand sentences. For example:

I love you. 

If you are a beginner in English, you must first understand the meaning of each word, then analyze the part of speech of each word and recognize that the sentence fits the subject-verb-object structure; only then can you understand its meaning. The sentence is a character sequence, and splitting it according to lexical rules produces a stream of lexical units. This is lexical analysis, which converts the character stream into a stream of lexical units (tokens). Analyzing the parts of speech and determining the subject-verb-object structure according to English grammar is syntax analysis, which recognizes the parse tree from the input token stream. Finally, the meaning of the sentence is derived from the meanings of the words and the syntax structure; this is semantic analysis. The compiler's process is similar to the interpreter's, only slightly more complicated; here we focus only on the interpreter.
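The three stages can be illustrated with a toy sketch (the token pattern and the tiny part-of-speech table below are my own invention for this analogy, written in Python 3):

```python
import re

sentence = "I love you."

# Lexical analysis: character stream -> stream of lexical units.
tokens = re.findall(r"[A-Za-z]+|\.", sentence)   # ['I', 'love', 'you', '.']

# Syntax analysis: label each unit and check the subject-verb-object structure.
pos = {'I': 'SUBJECT', 'love': 'VERB', 'you': 'OBJECT', '.': 'END'}
structure = [pos[t] for t in tokens]

# Semantic analysis would now combine the word meanings with this structure.
print(structure)   # ['SUBJECT', 'VERB', 'OBJECT', 'END']
```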

We are only implementing a very simple small language, so syntax tree generation and complex semantic analysis are not involved. Next, let's look at lexical analysis and syntax analysis.
Lexical analysis and syntax analysis are performed by the lexer and the parser respectively. These two parsers have similar structures and functions: both scan a single input sequence and recognize specific structures in it. The lexer reads tokens (lexical units) out of the source character stream, while the parser recognizes substructures and lexical units in the token stream and then performs some processing. Both can be implemented as LL(1) recursive descent parsers. The steps such a parser performs are: predict the type of the clause, call the parsing function that matches the substructure, match the lexical unit, and insert code as needed to perform custom actions.
Here is a brief introduction to LL(1). The structure of a statement is usually represented by a tree, called a parse tree, and LL(1) syntax parsing relies on the parse tree. For example, consider x = x + 2;
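The parse tree for `x = x + 2;` can be sketched as nested tuples (a text representation chosen here purely for illustration; the token-type names follow the grammar defined later in this article):

```python
# Parse tree for "x = x + 2;": non-terminals are tuples, terminals are leaves.
parse_tree = ('stat',
              ('ID', 'x'),
              ('EQUAL', '='),
              ('expr',
               ('ID', 'x'),
               ('ADD', '+'),
               ('INT', 2)))

# A left-to-right walk visits the leaves in source order: x = x + 2
leaves = [t[1] for t in (parse_tree[1], parse_tree[2],
                         parse_tree[3][1], parse_tree[3][2], parse_tree[3][3])]
print(leaves)   # ['x', '=', 'x', '+', 2]
```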


In this tree, leaf nodes such as x, =, and 2 are called terminal nodes, and the others are non-terminal nodes. LL(1) parsing does not need to build a concrete tree data structure: you write a parsing function for each non-terminal node and call it when the corresponding node is encountered. In this way, the information in the parse tree is captured by the sequence of parsing-function calls (equivalent to a traversal of the tree). LL(1) parsing proceeds from the root node down toward the leaf nodes, so it is a "descent" process, and the parsing functions can call themselves, so it is "recursive"; hence an LL(1) parser is also called a recursive descent parser.
In LL(1), both L's stand for left-to-right: the first L means the parser consumes the input from left to right, the second L means that during the descent the child nodes are traversed from left to right (i.e., a leftmost derivation), and the 1 means predictions are made with one lookahead unit.
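That single-lookahead prediction can be sketched as follows (a toy fragment of my own, not the article's parser, written in Python 3; the tokens are (type, value) tuples matching the lexer defined later):

```python
def predict_stat(tokens, i):
    # LL(1): peek at exactly one token to choose the production for 'stat'.
    kind, _ = tokens[i]
    if kind == 'ID':
        return "stat -> ID '=' expr"               # an assignment follows
    if kind == 'print':
        return "stat -> 'print' expr (',' expr)*"  # a print statement follows
    raise SyntaxError('unexpected token %r' % kind)

print(predict_stat([('ID', 'veca'), ('EQUAL', '=')], 0))
```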
Next, let's look at the implementation of the small list language. First comes the grammar, which describes the language.

statlist : stat+
stat     : ID '=' expr
         | 'print' expr (',' expr)*
expr     : multipart ('+' multipart)*
         | STR
multipart: primary ('*' primary)*
primary  : INT
         | ID
         | '[' expr (',' expr)* ']'
INT      : (1..9)(0..9)*
ID       : (a..z | A..Z)*
STR      : (".*") | ('.*')

This grammar is described in a DSL whose notation is mostly similar to regular expressions: "a | b" means a or b, strings enclosed in single quotes are keywords such as print and =, and uppercase words are lexical units. As you can see, the grammar of this small language is quite simple. There are many parser generators that can automatically produce a parser from such a grammar, such as ANTLR, flex, and yacc; here the parser is written by hand mainly to show how a parser works. The following sections describe how the interpreter for this small language is implemented.
The first step is the lexer, which converts the character stream into a token stream. It is LL(1), so it uses one lookahead character to predict matches. For lexical rules consisting of multiple characters, such as INT and ID, the lexer has a corresponding method. Because the parser does not care about whitespace, the lexer skips whitespace characters. Each token has two attributes, a type and a value; for example, for an integer token the type is INT and the value is the integer itself. The parser predicts based on the token type, so the lexer must return type information. The parser obtains tokens in iterator style: the lexer implements a next_token method that returns the next token as a (type, value) tuple, and returns EOF when no tokens remain.

'''
A simple lexer of a small vector language.

statlist: stat+
stat: ID '=' expr
    | 'print' expr (',' expr)*
expr: multipart ('+' multipart)*
    | STR
multipart: primary ('*' primary)*
primary: INT
    | ID
    | '[' expr (',' expr)* ']'
INT: (1..9)(0..9)*
ID: (a..z | A..Z)*
STR: (".*") | ('.*')

Created on 2012-9-26

@author: bjzllou
'''

EOF = -1

# token types
COMMA = 'COMMA'
EQUAL = 'EQUAL'
LBRACK = 'LBRACK'
RBRACK = 'RBRACK'
TIMES = 'TIMES'
ADD = 'ADD'
PRINT = 'print'
ID = 'ID'
INT = 'INT'
STR = 'STR'

class Veclexer:
    '''
    LL(1) lexer.
    It uses only one lookahead char to determine the next token.
    For each non-terminal token, it has a rule to handle it.
    LL(1) is a quite weak parser; it isn't appropriate for a grammar which is
    left-recursive or ambiguous. For example, the rule 'T: T r' is left-recursive.
    However, it's rather simple, has high performance, and fits simple grammars.
    '''

    def __init__(self, input):
        self.input = input

        # current index of the input stream.
        self.idx = 1

        # lookahead char.
        self.cur_c = input[0]

    def next_token(self):
        while self.cur_c != EOF:
            c = self.cur_c

            if c.isspace():
                self.consume()
            elif c == '[':
                self.consume()
                return (LBRACK, c)
            elif c == ']':
                self.consume()
                return (RBRACK, c)
            elif c == ',':
                self.consume()
                return (COMMA, c)
            elif c == '=':
                self.consume()
                return (EQUAL, c)
            elif c == '*':
                self.consume()
                return (TIMES, c)
            elif c == '+':
                self.consume()
                return (ADD, c)
            elif c == '\'' or c == '"':
                return self._string()
            elif c.isdigit():
                return self._int()
            elif c.isalpha():
                t = self._print()
                return t if t else self._id()
            else:
                raise Exception('not support token')

        return (EOF, 'EOF')

    def has_next(self):
        return self.cur_c != EOF

    def _id(self):
        n = self.cur_c
        self.consume()
        while self.cur_c.isalpha():
            n += self.cur_c
            self.consume()

        return (ID, n)

    def _int(self):
        n = self.cur_c
        self.consume()
        while self.cur_c.isdigit():
            n += self.cur_c
            self.consume()

        return (INT, int(n))

    def _print(self):
        n = self.input[self.idx - 1 : self.idx + 4]
        if n == 'print':
            self.idx += 4
            self.cur_c = self.input[self.idx]

            return (PRINT, n)

        return None

    def _string(self):
        quotes_type = self.cur_c
        self.consume()
        s = ''
        while self.cur_c != '\n' and self.cur_c != quotes_type:
            s += self.cur_c
            self.consume()
        if self.cur_c != quotes_type:
            raise Exception('string quotes are not matched. expected %s' % quotes_type)

        self.consume()

        return (STR, s)

    def consume(self):
        if self.idx >= len(self.input):
            self.cur_c = EOF
            return
        self.cur_c = self.input[self.idx]
        self.idx += 1


if __name__ == '__main__':
    exp = '''
        veca = [1, 2, 3]
        print 'veca:', veca
        print 'veca * 2:', veca * 2
        print 'veca + 2:', veca + 2
    '''
    lex = Veclexer(exp)
    t = lex.next_token()

    while t[0] != EOF:
        print t
        t = lex.next_token()

Running this lexer on the following source code:

veca = [1, 2, 3]
print 'veca:', veca
print 'veca * 2:', veca * 2
print 'veca + 2:', veca + 2

Corresponding token sequence:

('ID', 'veca')
('EQUAL', '=')
('LBRACK', '[')
('INT', 1)
('COMMA', ',')
('INT', 2)
('COMMA', ',')
('INT', 3)
('RBRACK', ']')
('print', 'print')
('STR', 'veca:')
('COMMA', ',')
('ID', 'veca')
('print', 'print')
('STR', 'veca * 2:')
('COMMA', ',')
('ID', 'veca')
('TIMES', '*')
('INT', 2)
('print', 'print')
('STR', 'veca + 2:')
('COMMA', ',')
('ID', 'veca')
('ADD', '+')
('INT', 2)

Next, let's look at the implementation of the parser. The parser's input is the token stream, and it predicts the matching rule based on one lookahead token. For each non-terminal encountered, the corresponding parsing function is called, while terminals (tokens) are matched directly; a failed match indicates a syntax error. Since it is also LL(1), its structure is similar to the lexer's and will not be described in detail here.

'''
A simple parser of a small vector language.

statlist: stat+
stat: ID '=' expr
    | 'print' expr (',' expr)*
expr: multipart ('+' multipart)*
    | STR
multipart: primary ('*' primary)*
primary: INT
    | ID
    | '[' expr (',' expr)* ']'
INT: (1..9)(0..9)*
ID: (a..z | A..Z)*
STR: (".*") | ('.*')

example:
veca = [1, 2, 3]
vecb = veca + 4  # vecb: [1, 2, 3, 4]
vecc = veca * 3  # vecc:

Created on 2012-9-26

@author: bjzllou
'''
import veclexer

class Vecparser:
    '''
    LL(1) parser.
    '''

    def __init__(self, lexer):
        self.lexer = lexer

        # lookahead token. Based on the lookahead token to choose the parse option.
        self.cur_token = lexer.next_token()

        # similar to a symbol table; here it's only used to store variables' values
        self.symtab = {}

    def statlist(self):
        while self.lexer.has_next():
            self.stat()

    def stat(self):
        token_type, token_val = self.cur_token

        # Assignment
        if token_type == veclexer.ID:
            self.consume()

            # For a terminal token, it only needs to match and consume.
            # If it's not matched, it means that there is a syntax error.
            self.match(veclexer.EQUAL)

            # Store the value in the symbol table.
            self.symtab[token_val] = self.expr()

        # print statement
        elif token_type == veclexer.PRINT:
            self.consume()
            v = str(self.expr())
            while self.cur_token[0] == veclexer.COMMA:
                self.match(veclexer.COMMA)
                v += ' ' + str(self.expr())
            print v
        else:
            raise Exception('not support token %s' % token_type)

    def expr(self):
        token_type, token_val = self.cur_token
        if token_type == veclexer.STR:
            self.consume()
            return token_val
        else:
            v = self.multipart()
            while self.cur_token[0] == veclexer.ADD:
                self.consume()
                v1 = self.multipart()
                if type(v1) == int:
                    v.append(v1)
                elif type(v1) == list:
                    v = v + v1

            return v

    def multipart(self):
        v = self.primary()
        while self.cur_token[0] == veclexer.TIMES:
            self.consume()
            v1 = self.primary()
            if type(v1) == int:
                v = [x * v1 for x in v]
            elif type(v1) == list:
                v = [x * y for x in v for y in v1]

        return v

    def primary(self):
        token_type = self.cur_token[0]
        token_val = self.cur_token[1]

        # int
        if token_type == veclexer.INT:
            self.consume()
            return token_val

        # variable reference
        elif token_type == veclexer.ID:
            self.consume()
            if token_val in self.symtab:
                return self.symtab[token_val]
            else:
                raise Exception('undefined variable %s' % token_val)

        # parse list
        elif token_type == veclexer.LBRACK:
            self.match(veclexer.LBRACK)
            v = [self.expr()]
            while self.cur_token[0] == veclexer.COMMA:
                self.match(veclexer.COMMA)
                v.append(self.expr())
            self.match(veclexer.RBRACK)

            return v

    def consume(self):
        self.cur_token = self.lexer.next_token()

    def match(self, token_type):
        if self.cur_token[0] == token_type:
            self.consume()
            return True
        raise Exception('expecting %s; found %s' % (token_type, self.cur_token[0]))


if __name__ == '__main__':
    prog = '''
        veca = [1, 2, 3]
        vecb = [4, 5, 6]
        print 'veca:', veca
        print 'veca * 2:', veca * 2
        print 'veca + 2:', veca + 2
        print 'veca + vecb:', veca + vecb
        print 'veca + [11, 12]:', veca + [11, 12]
        print 'veca * vecb:', veca * vecb
        print 'veca:', veca
        print 'vecb:', vecb
    '''
    lex = veclexer.Veclexer(prog)
    parser = Vecparser(lex)
    parser.statlist()

Running this code produces the output shown earlier. This interpreter is very simple and implements only basic expression operations, so no syntax tree needs to be built. To add control structures to the list language, you would need to build a syntax tree and interpret execution on top of it.
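As a sketch of that direction, here is one possible (hypothetical) AST node shape and tree-walking evaluator; none of these names come from the article's code, and it is written in Python 3:

```python
class Node:
    # A minimal AST node: a kind tag, child nodes, and an optional leaf value.
    def __init__(self, kind, children=None, value=None):
        self.kind = kind            # e.g. 'int', 'id', 'add'
        self.children = children or []
        self.value = value          # leaf payload: an integer or a variable name

def evaluate(node, env):
    # Walking a tree instead of acting during parsing makes it possible to
    # re-run subtrees (loops) or skip them (conditionals).
    if node.kind == 'int':
        return node.value
    if node.kind == 'id':
        return env[node.value]
    if node.kind == 'add':
        left = evaluate(node.children[0], env)
        right = evaluate(node.children[1], env)
        # Same semantics as the parser above: list + int appends, list + list concatenates.
        return left + [right] if isinstance(right, int) else left + right
    raise ValueError('unknown node kind: %s' % node.kind)

tree = Node('add', [Node('id', value='veca'), Node('int', value=4)])
print(evaluate(tree, {'veca': [1, 2, 3]}))   # [1, 2, 3, 4]
```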
