Using Python to write a simple interpreter: key points explained

Compilers and parsers have always attracted a great deal of interest: the overall framework of a compiler is familiar to many, but the details are less well understood. The program source code we write is just a sequence of characters, and it can seem almost magical that a compiler or interpreter can understand and execute that character sequence directly. This article uses Python to implement a simple interpreter for a small list-manipulation language (its lists are similar to Python's). In fact, compilers and interpreters are not mysterious: once the basic theory is understood, implementing one is relatively straightforward (of course, a production-grade compiler or interpreter is still very complex).
Operations supported by this list language:

veca = [1, 2, 3]      # list declaration
vecb = [4, 5, 6]
print 'veca: ', veca                        # print statement: string and list
print 'veca * 2: ', veca * 2                # list multiplied by an integer
print 'veca + 2: ', veca + 2                # list plus an integer
print 'veca + vecb: ', veca + vecb          # list plus a list
print 'veca + [11, 12]: ', veca + [11, 12]
print 'veca * vecb: ', veca * vecb          # list multiplied by a list

Corresponding output:
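(Derived by running the interpreter implemented below on this program; note a quirk of this toy implementation: adding an integer appends to the stored list in place, so veca is mutated to [1, 2, 3, 2] from the third statement on.)

veca:  [1, 2, 3]
veca * 2:  [2, 4, 6]
veca + 2:  [1, 2, 3, 2]
veca + vecb:  [1, 2, 3, 2, 4, 5, 6]
veca + [11, 12]:  [1, 2, 3, 2, 11, 12]
veca * vecb:  [4, 5, 6, 8, 10, 12, 12, 15, 18, 8, 10, 12]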

When working on an input character stream, a compiler or interpreter operates in essentially the same way that people understand sentences.

To understand an English sentence, you first need to know the meaning of each word, then analyze each word's part of speech and match the sentence against the subject-verb-object structure; only then do you grasp its meaning. The sentence itself is a sequence of characters, and splitting it into words yields a stream of lexical units: this is lexical analysis, which transforms a character stream into a token stream. Analyzing parts of speech and determining the subject-verb-object structure means recognizing structure according to English grammar: this is syntax analysis, which recognizes the parse tree from the input token stream. Finally, combining the meanings of the words with the grammatical structure yields the meaning of the sentence: this is semantic analysis. A compiler or interpreter goes through a similar, if slightly more complex, process; here we focus only on the interpreter.
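Concretely, here is how a single statement of the list language passes through these stages (a sketch; the token names are the ones defined later in this article):

# character stream:    veca * 2
# token stream:        ('ID', 'veca')  ('TIMES', '*')  ('INT', 2)
# syntactic structure: expr -> multipart -> primary('veca') '*' primary(2)
# semantic evaluation: look up veca = [1, 2, 3], multiply element-wise -> [2, 4, 6]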

Since the little language implemented here is very simple, it does not involve building a syntax tree or any complex semantic analysis afterwards; we will look only at lexical analysis and syntax analysis.

Lexical analysis and syntax analysis are performed by the lexer and the parser, respectively. The two have similar structure and function: each takes a sequence as input and recognizes particular structures in it. The lexer reads tokens (lexical units) out of the source code's character stream, while the parser recognizes substructures and tokens in the token stream and then performs some processing. Both can be implemented as LL(1) recursive descent parsers: predict the production from the lookahead unit, call the parse function that matches each substructure, match terminal tokens, and insert code to perform custom actions as needed.
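The skeleton below is a minimal, self-contained sketch of this predict/match/consume pattern. It is purely illustrative: the toy grammar (stat: ID '=' INT | 'print' INT) and all names in it are made up for this example, not taken from the article's code.

# token types for the toy grammar: stat: ID '=' INT | 'print' INT
ID, EQUAL, INT, PRINT = 'ID', 'EQUAL', 'INT', 'PRINT'

class MiniParser:
    def __init__(self, tokens):
        self.tokens = tokens   # a list of (type, value) tuples
        self.pos = 0

    @property
    def lookahead(self):
        # LL(1): exactly one token of lookahead drives every decision
        return self.tokens[self.pos][0]

    def consume(self):
        self.pos += 1

    def match(self, token_type):
        # terminal: must match the expected type, otherwise it's a syntax error
        if self.lookahead != token_type:
            raise SyntaxError('expecting %s; found %s' % (token_type, self.lookahead))
        self.consume()

    def stat(self):
        # nonterminal: predict the production from the single lookahead token
        if self.lookahead == ID:
            name = self.tokens[self.pos][1]
            self.match(ID)
            self.match(EQUAL)
            value = self.tokens[self.pos][1]
            self.match(INT)
            return ('assign', name, value)
        elif self.lookahead == PRINT:
            self.match(PRINT)
            value = self.tokens[self.pos][1]
            self.match(INT)
            return ('print', value)
        raise SyntaxError('unexpected token %s' % self.lookahead)

print(MiniParser([(ID, 'x'), (EQUAL, '='), (INT, 2)]).stat())   # ('assign', 'x', 2)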
Here is a brief introduction to LL(1). The structure of a statement is usually represented as a tree, called the parse tree, and LL(1) parsing relies on the parse tree. For example, for x = x + 2:
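Sketched as text, the parse tree for this statement has roughly the following shape (following the assignment rule of the grammar introduced below):

stat
├── ID(x)
├── '='
└── expr
    ├── ID(x)
    ├── '+'
    └── INT(2)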


In this tree, leaf nodes such as x, =, and 2 are called terminal nodes; the others are called nonterminal nodes. LL(1) parsing does not need to build a concrete tree data structure. Instead, a parsing function is written for each nonterminal node and called when the corresponding node is reached, so the sequence of parse-function calls (equivalent to a traversal of the tree) carries the same information as the parse tree. LL(1) parsing proceeds from the root node down toward the leaf nodes, so it is a "descent" process, and since a parsing function may call itself, it is "recursive"; hence LL(1) parsers are also called recursive descent parsers.

The two L's in LL(1) both mean left-to-right: the first L indicates that the parser reads the input from left to right, and the second that a node's children are traversed in left-to-right order. The 1 indicates that predictions are based on a single unit of lookahead.
Now let's look at the implementation of the little list language, starting with its grammar. A grammar describes a language and serves as the parser's design specification.

statlist : stat+
stat     : ID '=' expr
         | 'print' expr (',' expr)*
expr     : multipart ('+' multipart)*
         | STR
multipart: primary ('*' primary)*
primary  : INT
         | ID
         | '[' expr (',' expr)* ']'

This grammar is written in a DSL, most of which resembles regular expressions. "a|b" means a or b, anything enclosed in single quotes is a keyword (such as 'print' and '='), and uppercase words are lexical units (tokens). From this grammar, the structure of the little language is easy to see. Many parser generators can produce a parser automatically from a grammar, for example ANTLR, Flex, and Yacc; the parser here is written by hand mainly to understand how parsers work. Let's see how the interpreter for this little language is implemented.
First comes the lexer, which converts the character stream into a token stream. It is implemented in LL(1) style, using one lookahead character to predict which token to match. For lexical rules made up of multiple characters, such as INT and ID, the lexer has a corresponding method. Because the parser does not care about whitespace, the lexer skips whitespace characters directly. Each token has two attributes, type and value; for example, an integer token has type INT and its value is the integer itself. The parser makes predictions based on the token's type, so the lexer must return type information. The parser obtains tokens in iterator fashion, so the lexer implements a next_token method that returns the next token as a tuple (type, value), and returns EOF when there are no tokens left.

"A simple lexer of a small vector language. statlist:stat+ stat:id ' = ' expr | ' Print ' expr (, expr) * Expr:multipart (' + ' multipart) * | STR multipart:primary (' * ' primary) * primary:int | ID | ' [' Expr (', ', expr) * '] ' INT: (1..9) (0..9) * ID: (a). Z | A.. Z) * STR: (\ ". *\") |  (\'.*\') Created on 2012-9-26 @author: Bjzllou "EOF =-1 # token type COMMA = ' COMMA ' EQUAL = ' EQUAL ' lbrack = ' lbrack ' Rbrac K = ' rbrack ' times = ' times ' add = ' add ' Print = ' print ' id = ' id ' int = ' int ' str = ' str ' class Veclexer: ' LL (1   ) Lexer.   It uses only one lookahead char to determine which is next token.   For each non-terminal token, it had a rule to handle it. LL (1) is a quit weak parser, it isn ' t appropriate for the grammar which is left-recursive or ambiguity.   For example, the rule ' T:t R ' was left recursive.   However, it ' s rather simple and have high performance, and fits simple grammar. "Def __init__ (self, input): Self.input = input # current INdex of the input stream.     SELF.IDX = 1 # lookahead char. Self.cur_c = input[0] def next_token (self): while self.cur_c! = Eof:c = Self.cur_c if C.iss          Pace (): Self.consume () elif c = = ' [': Self.consume () return (Lbrack, c) elif c = = '] ':       Self.consume () return (Rbrack, c) elif c = = ', ': Self.consume () return (COMMA, c) elif c = = ' = ': Self.consume () return (EQUAL, c) elif c = = ' * ': Self.consume () r         Eturn (Times, c) elif c = = ' + ': Self.consume () return (ADD, c) elif c = = ' \ ' or c = = ' "': Return self._string () elif c.isdigit (): Return Self._int () elif C.isalpha (): t = SELF._PR  int () return T if T else self._id () else:raise Exception (' not support token ') return (EOF,       ' EOF ') def has_next (self): return self.cur_c! = EOF  def _id (self): n = self.cur_c self.consume () while Self.cur_c.isalpha (): n + = Self.cur_c self.co Nsume () return (ID, n) def _int (self): n = self.cur_c self.consume () while Self.cur_c.isdigit ( ): n + = Self.cur_c Self.consume () return (int, int (n)) def _print (self): n = self.input[se Lf.idx-1: Self.idx + 4] if n = = ' print ': Self.idx + = 4 Self.cur_c = Self.input[self.idx] R     Eturn (PRINT, N) return None def _string (self): Quotes_type = Self.cur_c Self.consume () s = "  While self.cur_c! = ' \ n ' and self.cur_c! = quotes_type:s + self.cur_c self.consume () if self.cur_c! =          Quotes_type:raise Exception (' string quotes is not matched. Excepted%s '% Quotes_type) Self.consume () Return (STR, s) def consume (self): if Self.idx >= len (self.input): Self.cur_c = EOF Retu RN Self.cur_c = Self. input[self.idx] Self.idx + = 1 if __name__ = = ' __main__ ': exp = ' "' Veca = [1, 2, 3] print ' ve CA: ', Veca print ' Veca * 2: ', Veca * 2 print ' Veca + 2: ', Veca + 2 ' ' Lex = veclexer (exp) T = Lex.next_toke  N () while t[0]! = eof:print T t = Lex.next_token ()

Running this program lexes the sample source embedded in the __main__ block and prints one token per line.

The corresponding token sequence:
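(As printed by the __main__ loop above; each line is one (type, value) tuple.)

('ID', 'veca')
('EQUAL', '=')
('LBRACK', '[')
('INT', 1)
('COMMA', ',')
('INT', 2)
('COMMA', ',')
('INT', 3)
('RBRACK', ']')
('PRINT', 'print')
('STR', 'veca: ')
('COMMA', ',')
('ID', 'veca')
('PRINT', 'print')
('STR', 'veca * 2: ')
('COMMA', ',')
('ID', 'veca')
('TIMES', '*')
('INT', 2)
('PRINT', 'print')
('STR', 'veca + 2: ')
('COMMA', ',')
('ID', 'veca')
('ADD', '+')
('INT', 2)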

Next, look at the parser implementation. The parser's input is the token stream, and it predicts which rule to match based on one lookahead token. For each nonterminal it encounters, it calls the corresponding parse function; each terminal (token) is matched, and if a match fails, there is a syntax error. Since it is also LL(1) and therefore similar to the lexer, the details are not repeated here.

"A simple parser of a small vector language. statlist:stat+ stat:id ' = ' expr | ' Print ' expr (, expr) * Expr:multipart (' + ' multipart) * | STR multipart:primary (' * ' primary) * primary:int | ID | ' [' Expr (', ', expr) * '] ' INT: (1..9) (0..9) * ID: (a). Z | A.. Z) * STR: (\ ". *\") |  (\ '. *\ ') Example:veca = [1, 2, 3] VECB = Veca + 4 # VECB: [1, 2, 3, 4] VECC = Veca * 3 # vecc:created on 2012-9-26   @author: Bjzllou ' Import veclexer class Vecparser: ' LL (1) parser. "Def __init__ (Self, lexer): Self.lexer = lexer # lookahead token.     Based on the lookahead token to choose the parse option.     Self.cur_token = Lexer.next_token () # Similar to symbol table, here it's only used to store variables ' value Self.symtab = {} def statlist (self): When Self.lexer.has_next (): Self.stat () def stat (self): to  Ken_type, token_val = self.cur_token # asignment If Token_type = = Veclexer.ID:self.consume ()            # for the terminal token, it is only needs to match and consume.       # If It ' s not matched, it means, it's a syntax error. Self.match (veclexer.       EQUAL) # Store the value to symbol table. Self.symtab[token_val] = self.expr () # print statement elif Token_type = = Veclexer. PRINT:self.consume () v = str (self.expr ()) while self.cur_token[0] = = Veclexer. COMMA:self.match (veclexer. COMMA) v + = "+ str (self.expr ()) Print v else:raise Exception (' Not support token%s ', Token_typ e) def expr (self): token_type, token_val = Self.cur_token if Token_type = = Veclexer. STR:self.consume () return token_val else:v = Self.multipart () while self.cur_token[0] = = VEC Lexer. ADD:self.consume () V1 = Self.multipart () if type (v1) = = Int:v.append (v1) elif Type (v1) = = List:v = v + v1 return v def mUltipart (self): v = self.primary () while self.cur_token[0] = = Veclexer. TIMES:self.consume () V1 = self.primary () if type (v1) = = Int:v = [x*v1 for x in V] elif t ype (v1) = = List:v = [x*y for x on V for y ' v1] return v def primary (self): token_ty PE = self.cur_token[0] Token_val = self.cur_token[1] # int if token_type = = Veclexer.INT:self.cons       Ume () return Token_val # variables reference elif Token_type = Veclexer.ID:self.consume () If token_val in Self.symtab:return Self.symtab[token_val] else:raise Exception (' undefined variabl E%s '% token_val) # parse list elif Token_type = = Veclexer. LBRACK:self.match (veclexer. Lbrack) v = [self.expr ()] while self.cur_token[0] = = Veclexer. COMMA:self.match (veclexer. COMMA) V.append (self.expr ()) Self.match (veclexer.   Rbrack) Return V        def consume (self): Self.cur_token = Self.lexer.next_token () def match (self, token_type): If Self.cur_  Token[0] = = Token_type:self.consume () return True raise Exception (' expecting%s; found%s '% (Token_type, Self.cur_token[0]) If __name__ = = ' __main__ ': prog = ' "' Veca = [1, 2, 3] VECB = [4, 5, 6] print ' V ECA: ', Veca print ' Veca * 2: ', Veca * 2 print ' Veca + 2: ', Veca + 2 print ' Veca + VECB: ', Veca + VECB Prin T ' Veca + [one, one]: ', Veca + [one, '] print ' Veca * VECB: ', Veca * VECB print ' Veca: ', Veca print ' VECB: ', VEC B "' Lex = veclexer.  Veclexer (Prog) parser = Vecparser (Lex) parser.statlist ()

Running this code produces the output shown in the introduction. This interpreter is extremely primitive, implementing only basic expression operations, so constructing a syntax tree was unnecessary. To add control structures to the list language, however, you would need to build a syntax tree and interpret execution over that tree.
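As a taste of what tree-based interpretation looks like, here is a minimal sketch. It is entirely illustrative: the node classes and names below are assumptions for this example, not part of the article's code. The key point is that with a control structure such as if, which subtree gets evaluated depends on a runtime value, so the program must exist as a tree rather than a flat stream of statements.

# a minimal sketch of interpreting over a syntax tree (illustrative only)
class Num:
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value

class Var:
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        return env[self.name]

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)

class If:
    # control flow is why a tree is needed: which subtree is evaluated
    # depends on the condition's runtime value
    def __init__(self, cond, then, other):
        self.cond, self.then, self.other = cond, then, other
    def eval(self, env):
        return self.then.eval(env) if self.cond.eval(env) else self.other.eval(env)

# if x: x + 2 else: 0   with x = 3
env = {'x': 3}
tree = If(Var('x'), Add(Var('x'), Num(2)), Num(0))
print(tree.eval(env))   # 5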
