Key points for writing a basic code interpreter in Python

Source: Internet
Author: User
Tags: lexer, python, list

I have long been interested in compilers and parsers; I understood the concept of a compiler and its overall framework, but not the details of its parts. The program source code we write is really just a sequence of characters, and a compiler or interpreter can understand and execute that sequence directly, which is fascinating. This article uses Python to implement a simple interpreter for a small list-manipulation language (its lists are similar to Python's). Compilers and interpreters are actually not mysterious: once the basic theory is understood, the implementation is fairly straightforward (of course, a production-grade compiler or interpreter is still very complex).
The operations supported by this list language:

veca = [1, 2, 3]      # list declaration
vecb = [4, 5, 6]
print 'veca:', veca              # print takes strings and lists: print expr (',' expr)*
print 'veca * 2:', veca * 2      # multiply a list by an integer
print 'veca + 2:', veca + 2      # add an integer to a list
print 'veca + vecb:', veca + vecb    # list addition
print 'veca + [1]:', veca + [1]
print 'veca * vecb:', veca * vecb    # list multiplication
print 'veca:', veca
print 'vecb:', vecb

Corresponding output:

veca: [1, 2, 3]
veca * 2: [2, 4, 6]
veca + 2: [1, 2, 3, 2]
veca + vecb: [1, 2, 3, 2, 4, 5, 6]
veca + [1]: [1, 2, 3, 2, 1]
veca * vecb: [4, 5, 6, 8, 10, 12, 12, 15, 18, 8, 10, 12]
veca: [1, 2, 3, 2]
vecb: [4, 5, 6]

(Note that adding an integer to a list appends it in place, which is why veca itself ends up as [1, 2, 3, 2].)

When a compiler or interpreter processes an input character stream, it works in essentially the same way people understand sentences. For example:

I love you.

A beginner in English first needs to know the meaning of each word, then determine each word's part of speech and fit the words into a subject-verb-object structure to understand the sentence. The sentence is a sequence of characters; dividing it by lexical rules yields a stream of lexical units. This is lexical analysis, which transforms a character stream into a token stream. Determining the parts of speech and the subject-verb-object structure according to English grammar is syntax analysis, which recognizes a parse tree from the input token stream. Finally, combining the meanings of the words with the grammatical structure to arrive at the meaning of the sentence is semantic analysis. A compiler or interpreter goes through a similar, though somewhat more complex, process; here we are concerned only with the interpreter.
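To make these stages concrete for our list language, here is roughly what each stage produces for the statement veca = [1, 2, 3] (the token names match the lexer defined later in this article; the exact shapes are only illustrative):

character stream:  'v' 'e' 'c' 'a' ' ' '=' ' ' '[' '1' ',' ' ' '2' ',' ' ' '3' ']'
token stream:      ('ID', 'veca') ('EQUAL', '=') ('LBRACK', '[') ('INT', 1) ('COMMA', ',')
                   ('INT', 2) ('COMMA', ',') ('INT', 3) ('RBRACK', ']')
semantics:         bind the name veca to the list value [1, 2, 3]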

Since we are implementing only a very simple small language, we will not build a syntax tree, nor do the more complex semantic analysis that would follow. Below I'll walk through lexical analysis and syntax analysis.
Lexical analysis and syntax analysis are performed by the lexer and the parser respectively. The two have similar structure and function: each takes an input sequence and recognizes particular structures in it. The lexer extracts one token (lexical unit) at a time from the source-code character stream, while the parser recognizes substructures in the token stream and then does some processing. Both can be implemented as LL(1) recursive-descent parsers, which predict the type of the upcoming clause, call the parse function that matches that substructure, match the terminal tokens, and insert code for custom actions as needed (a small sketch of this pattern follows below).
Here is a brief introduction to LL(1). The structure of a sentence is usually represented as a tree structure called the parse tree, and LL(1) parsing relies on the parse tree. For example: x = x + 2;
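The original figure is not reproduced here, but in terms of this article's grammar the parse tree for this statement looks roughly like:

            stat
          /   |   \
        ID   '='   expr
        |         /  |  \
        x  multipart '+' multipart
               |           |
            primary     primary
               |           |
               x           2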


In this tree, leaf nodes such as x, = and 2 are called terminal nodes; the others are called non-terminal nodes. LL(1) parsing does not need to build an actual tree data structure: you write a parse function for each non-terminal node and call it when the corresponding node should be recognized, so the sequence of parse-function calls (equivalent to a tree traversal) carries the information of the parse tree. LL(1) parsing proceeds from the root node down toward the leaf nodes, which makes it a "descent" process, and the parse functions may call themselves, which makes it "recursive"; hence an LL(1) parser is also called a recursive-descent parser.
The two Ls in LL(1) both mean left-to-right: the first L says the parser consumes its input from left to right, and the second L says that during the descent the child nodes are also traversed in left-to-right order. The 1 means the prediction is based on a single lookahead unit.
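As a minimal sketch of the idea (the class and token names here are invented for illustration; this is not the parser built later in this article), an LL(1) recursive-descent parser for a rule like stat: ID '=' expr could look like this:

# A toy LL(1) recursive-descent parser: one function per non-terminal,
# one lookahead token to choose between alternatives.
class ToyParser(object):
    def __init__(self, tokens):
        self.tokens = tokens           # e.g. [('ID', 'x'), ('EQUAL', '='), ...]
        self.pos = 0

    def lookahead(self):
        return self.tokens[self.pos][0]    # the single lookahead unit: a token type

    def match(self, token_type):
        if self.lookahead() != token_type:
            raise SyntaxError('expected %s, found %s' % (token_type, self.lookahead()))
        self.pos += 1

    # stat: ID '=' expr
    def stat(self):
        self.match('ID')
        self.match('EQUAL')
        self.expr()

    # expr: primary ('+' primary)*
    def expr(self):
        self.primary()
        while self.lookahead() == 'ADD':   # predict the branch from one token
            self.match('ADD')
            self.primary()

    # primary: INT | ID
    def primary(self):
        if self.lookahead() == 'INT':
            self.match('INT')
        else:
            self.match('ID')

Calling ToyParser([('ID','x'), ('EQUAL','='), ('ID','x'), ('ADD','+'), ('INT',2), ('EOF','EOF')]).stat() succeeds, and the call sequence stat -> expr -> primary -> primary is exactly a left-to-right, top-down traversal of the parse tree above.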
Now let's look at the implementation of the small list language. First comes the grammar: a grammar describes a language, and it serves as the design document for the parser.

statlist: stat+
stat: ID '=' expr
    | 'print' expr (',' expr)*
expr: multipart ('+' multipart)*
    | STR
multipart: primary ('*' primary)*
primary: INT
    | ID
    | '[' expr (',' expr)* ']'
INT: (1..9)(0..9)*
ID: (a..z | A..Z)*
STR: (\".*\") | (\'.*\')

This grammar is described in a DSL whose concepts are mostly similar to regular expressions: "a|b" means a or b, strings in single quotes are keywords (such as 'print' and '='), and uppercase words are lexical units (tokens). From this grammar the shape of the small language is easy to see. Many parser generators can produce a parser automatically from a grammar, such as ANTLR, flex, and yacc; here the parser is written by hand, mainly to understand how parsers work. Let's look at how the interpreter for this small language is implemented.
First is the lexer, which converts the character stream into a token stream. It is LL(1) as well, using one lookahead character to predict the token to match. For lexical rules that span multiple characters, such as INT and ID, the lexer has a corresponding method. Because the parser does not care about whitespace, the lexer skips whitespace characters. Every token has two attributes, a type and a value; for an integer token, for example, the value is that integer. The parser predicts using the token's type, so the lexer must return type information. The parser consumes tokens in iterator fashion, so the lexer implements a next_token method that returns the next token as a (type, value) tuple, and returns EOF when no tokens remain.
 

'''
A simple lexer of a small vector language.

statlist: stat+
stat: ID '=' expr
    | 'print' expr (',' expr)*
expr: multipart ('+' multipart)*
    | STR
multipart: primary ('*' primary)*
primary: INT
    | ID
    | '[' expr (',' expr)* ']'
INT: (1..9)(0..9)*
ID: (a..z | A..Z)*
STR: (\".*\") | (\'.*\')

Created on 2012-9-26
@author: bjzllou
'''

EOF = -1

# token types
COMMA  = 'COMMA'
EQUAL  = 'EQUAL'
LBRACK = 'LBRACK'
RBRACK = 'RBRACK'
TIMES  = 'TIMES'
ADD    = 'ADD'
PRINT  = 'print'
ID     = 'ID'
INT    = 'INT'
STR    = 'STR'

class Veclexer:
    '''
    LL(1) lexer.
    It uses only one lookahead char to determine the next token.
    Each multi-character token type has a rule (method) to handle it.
    LL(1) is a quite weak parser: it isn't appropriate for grammars that
    are left-recursive or ambiguous. For example, the rule 'T: T r' is
    left-recursive. However, it's rather simple, fast, and fits simple
    grammars.
    '''

    def __init__(self, input):
        self.input = input

        # current index into the input stream
        self.idx = 1

        # lookahead char
        self.cur_c = input[0]

    def next_token(self):
        while self.cur_c != EOF:
            c = self.cur_c

            if c.isspace():
                self.consume()          # whitespace is skipped, not returned
            elif c == '[':
                self.consume()
                return (LBRACK, c)
            elif c == ']':
                self.consume()
                return (RBRACK, c)
            elif c == ',':
                self.consume()
                return (COMMA, c)
            elif c == '=':
                self.consume()
                return (EQUAL, c)
            elif c == '*':
                self.consume()
                return (TIMES, c)
            elif c == '+':
                self.consume()
                return (ADD, c)
            elif c == '\'' or c == '"':
                return self._string()
            elif c.isdigit():
                return self._int()
            elif c.isalpha():
                t = self._print()
                return t if t else self._id()
            else:
                raise Exception('not support token %s' % c)

        return (EOF, 'EOF')

    def has_next(self):
        return self.cur_c != EOF

    def _id(self):
        n = self.cur_c
        self.consume()
        while self.cur_c != EOF and self.cur_c.isalpha():
            n += self.cur_c
            self.consume()
        return (ID, n)

    def _int(self):
        n = self.cur_c
        self.consume()
        while self.cur_c != EOF and self.cur_c.isdigit():
            n += self.cur_c
            self.consume()
        return (INT, int(n))

    def _print(self):
        # look ahead five chars to check for the 'print' keyword
        n = self.input[self.idx - 1 : self.idx + 4]
        if n == 'print':
            # skip past the keyword, restoring the lookahead invariant
            self.idx += 5
            self.cur_c = self.input[self.idx - 1]
            return (PRINT, n)
        return None

    def _string(self):
        quotes_type = self.cur_c
        self.consume()
        s = ''
        while self.cur_c != EOF and self.cur_c != '\n' and self.cur_c != quotes_type:
            s += self.cur_c
            self.consume()
        if self.cur_c != quotes_type:
            raise Exception('string quotes not matched, expected %s' % quotes_type)
        self.consume()
        return (STR, s)

    def consume(self):
        if self.idx >= len(self.input):
            self.cur_c = EOF
            return
        self.cur_c = self.input[self.idx]
        self.idx += 1

if __name__ == '__main__':
    exp = '''
        veca = [1, 2, 3]
        print 'veca:', veca
        print 'veca * 2:', veca * 2
        print 'veca + 2:', veca + 2
    '''
    lex = Veclexer(exp)
    t = lex.next_token()
    while t[0] != EOF:
        print t
        t = lex.next_token()

Running this module on the following source code:

veca = [1, 2, 3]
print 'veca:', veca
print 'veca * 2:', veca * 2
print 'veca + 2:', veca + 2

The corresponding token sequence:

('ID', 'veca')
('EQUAL', '=')
('LBRACK', '[')
('INT', 1)
('COMMA', ',')
('INT', 2)
('COMMA', ',')
('INT', 3)
('RBRACK', ']')
('print', 'print')
('STR', 'veca:')
('COMMA', ',')
('ID', 'veca')
('print', 'print')
('STR', 'veca * 2:')
('COMMA', ',')
('ID', 'veca')
('TIMES', '*')
('INT', 2)
('print', 'print')
('STR', 'veca + 2:')
('COMMA', ',')
('ID', 'veca')
('ADD', '+')
('INT', 2)

Next, let's look at the parser implementation. The parser's input is the token stream, and it predicts the matching rule based on one lookahead token. Each non-terminal has a corresponding parse function that is called to recognize it, while terminals (tokens) are simply matched; a failed match indicates a syntax error. Since it is LL(1) and similar in structure to the lexer, the details are not repeated here.

'''
A simple parser of a small vector language.

statlist: stat+
stat: ID '=' expr
    | 'print' expr (',' expr)*
expr: multipart ('+' multipart)*
    | STR
multipart: primary ('*' primary)*
primary: INT
    | ID
    | '[' expr (',' expr)* ']'
INT: (1..9)(0..9)*
ID: (a..z | A..Z)*
STR: (\".*\") | (\'.*\')

Example:
veca = [1, 2, 3]
vecb = veca + 4    # vecb: [1, 2, 3, 4]
vecc = veca * 3    # vecc: [3, 6, 9, 12]

Created on 2012-9-26
@author: bjzllou
'''

import veclexer

class Vecparser:
    '''
    LL(1) parser.
    '''

    def __init__(self, lexer):
        self.lexer = lexer

        # lookahead token; the parse option is chosen based on it
        self.cur_token = lexer.next_token()

        # similar to a symbol table; here it only stores the variables' values
        self.symtab = {}

    def statlist(self):
        while self.lexer.has_next():
            self.stat()

    def stat(self):
        token_type, token_val = self.cur_token

        # assignment
        if token_type == veclexer.ID:
            self.consume()

            # a terminal token only needs to be matched and consumed;
            # if it doesn't match, there is a syntax error
            self.match(veclexer.EQUAL)

            # store the value into the symbol table
            self.symtab[token_val] = self.expr()

        # print statement
        elif token_type == veclexer.PRINT:
            self.consume()
            v = str(self.expr())
            while self.cur_token[0] == veclexer.COMMA:
                self.match(veclexer.COMMA)
                v += ' ' + str(self.expr())
            print v
        else:
            raise Exception('not support token %s' % token_type)

    def expr(self):
        token_type, token_val = self.cur_token
        if token_type == veclexer.STR:
            self.consume()
            return token_val
        else:
            v = self.multipart()
            while self.cur_token[0] == veclexer.ADD:
                self.consume()
                v1 = self.multipart()
                if type(v1) == int:
                    # appending mutates the stored list; this is why
                    # veca itself changes after 'veca + 2'
                    v.append(v1)
                elif type(v1) == list:
                    v = v + v1
            return v

    def multipart(self):
        v = self.primary()
        while self.cur_token[0] == veclexer.TIMES:
            self.consume()
            v1 = self.primary()
            if type(v1) == int:
                v = [x * v1 for x in v]
            elif type(v1) == list:
                v = [x * y for x in v for y in v1]
        return v

    def primary(self):
        token_type = self.cur_token[0]
        token_val = self.cur_token[1]

        # int
        if token_type == veclexer.INT:
            self.consume()
            return token_val

        # variable reference
        elif token_type == veclexer.ID:
            self.consume()
            if token_val in self.symtab:
                return self.symtab[token_val]
            else:
                raise Exception('undefined variable %s' % token_val)

        # parse list
        elif token_type == veclexer.LBRACK:
            self.match(veclexer.LBRACK)
            v = [self.expr()]
            while self.cur_token[0] == veclexer.COMMA:
                self.match(veclexer.COMMA)
                v.append(self.expr())
            self.match(veclexer.RBRACK)
            return v

    def consume(self):
        self.cur_token = self.lexer.next_token()

    def match(self, token_type):
        if self.cur_token[0] == token_type:
            self.consume()
            return True
        raise Exception('expecting %s; found %s' % (token_type, self.cur_token[0]))

if __name__ == '__main__':
    prog = '''
        veca = [1, 2, 3]
        vecb = [4, 5, 6]
        print 'veca:', veca
        print 'veca * 2:', veca * 2
        print 'veca + 2:', veca + 2
        print 'veca + vecb:', veca + vecb
        print 'veca + [1]:', veca + [1]
        print 'veca * vecb:', veca * vecb
        print 'veca:', veca
        print 'vecb:', vecb
    '''
    lex = veclexer.Veclexer(prog)
    parser = Vecparser(lex)
    parser.statlist()

Running this code produces the output shown at the beginning of the article. This interpreter is extremely primitive, implementing only basic expression operations, so it can get away without building a syntax tree. To add control structures to the language, you would have to build a syntax tree and interpret execution on top of it.
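As a rough illustration of what that might look like (these node classes are hypothetical and not part of this article's code), the parser would build tree nodes instead of evaluating immediately, and a separate pass would walk the tree:

# Hypothetical AST node sketch for adding control flow (illustrative only).
class Assign(object):
    def __init__(self, name, expr):
        self.name, self.expr = name, expr
    def eval(self, env):
        env[self.name] = self.expr.eval(env)

class Const(object):
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value

class Var(object):
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        return env[self.name]

class If(object):
    def __init__(self, cond, body):
        self.cond, self.body = cond, body
    def eval(self, env):
        # the body subtree can be evaluated zero or more times,
        # which is exactly why control flow needs a tree to re-walk
        if self.cond.eval(env):
            for stat in self.body:
                stat.eval(env)

With such nodes, veca = [1, 2, 3] would parse to Assign('veca', Const([1, 2, 3])), and an if statement would simply hold its body subtree for later evaluation.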
