Introduction
Recently I wrote a simple compiler in Python for parsing protobuf files, and I was deeply impressed by how simple and convenient ply makes lexical and syntax analysis. While the experience is still fresh in my mind, I am writing down a summary so that other Python developers can use it for reference.
Using ply
Brief introduction
If you do not work on compilers or parsers, you may never have heard of ply. Ply is a pure-Python implementation of Lex and YACC, and its author also wrote the well-known Python Cookbook, 3rd edition. Some readers may wonder why an ordinary business developer would ever need to write a compiler; as experienced programmers like to say, it never hurts to try new things. Understanding a bit about parsing is also helpful later on when you need to parse complex logs or mathematical formulas.
For readers without a compiler background, it is highly recommended to learn some basic grammar-related concepts first. The widely recommended Parsing Techniques and the classic Dragon, Tiger and Whale compiler books are, in my opinion, not well suited to beginners; instead I recommend Hu Lunjun's Compiler Principles (Electronic Industry Press), which explains the concepts with plenty of examples and is suitable for beginners. Of course there is no need for especially deep study: knowing the basic concepts and methods of lexical analysis and syntax analysis is enough to use ply happily. Reference link: http://www.pchou.info/open-source/2014/01/18/52da47204d4cb.html
To make it easy to get started, I will use solving a system of linear equations in several variables as an example to explain how to use ply.
Example description
The input is a set of linear equations such as x + 4y - 3.2z = 7. To keep the example as simple as possible, the following restrictions apply:
In each equation, the variable terms appear on the left side of the equals sign and the constant on the right
An equation may contain any number of variables in any order, but each variable may appear only once
The naming rule for variables: a variable is a string of lowercase letters (x, y, xx, yy and abc are all valid variable names)
Coefficients are limited to integers and floating-point numbers; scientific notation such as 1.4e8 is not allowed; a coefficient is written immediately before its variable; and a coefficient cannot be 0
Equations are separated from one another by ',' or ';' (see the sample input below)
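For example, the following input (it is also the test data used in the code later in this article) satisfies all of these restrictions:

    -x + 2.4y + z = 0; // this is a comment
    9y - z + 7.2x = -1;
    y - z + x = 8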
Anyone who has studied linear algebra knows that once the equations are abstracted into a matrix, the system can be solved with linear-algebra methods. So the parser only needs to turn the input equations into a coefficient matrix plus a list of variables; the rest of the solution can be handled by any linear-algebra tool.
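For example, with the variable order [x, y, z], the equation x + 4y - 3.2z = 7 becomes the matrix row [1, 4, -3.2, -7] in the representation produced below (the constant is moved to the left-hand side, so it appears negated in the last column).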
Parsing
Lexical analysis
Lex in ply is used for lexical analysis. There is a great deal of theory behind lexical analysis, but Lex itself is very intuitive to use: you write regular expressions that split the input text into tokens. The following code implements the lexical analysis with Lex.
from ply import lex

# spaces, tabs and carriage returns are invisible characters and are ignored
t_ignore = ' \t\r'

# raise an exception directly when a lexing error occurs
def t_error(t):
    raise Exception('error {} at line {}'.format(t.value[0], t.lineno))

# track line numbers to make error locating easier
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# support C++-style // comments
def t_ignore_COMMENT(t):
    r'\/\/[^\n]*'

# naming rule for variables
def t_VARIABLE(t):
    r'[a-z]+'
    return t

# rule for constants
def t_CONSTANT(t):
    r'\d+(\.\d+)?'
    t.value = float(t.value)
    return t

# the single-character symbols in the input are handled through literals;
# the plus sign could of course also be defined as a token with t_PLUS = r'\+'
literals = '+-,;='
tokens = ('VARIABLE', 'CONSTANT')

if __name__ == '__main__':
    data = '''
    -x + 2.4y + z = 0; // this is a comment
    9y - z + 7.2x = -1;
    y - z + x = 8'''

    lexer = lex.lex()
    lexer.input(data)
    while True:
        tok = lexer.token()
        if not tok:
            break
        print tok
Running the file directly prints the token stream produced by the lexer, as shown below; each LexToken records the token type, value, line number and position in the input. See the ply documentation for details.
LexToken(-,'-',2,5)
LexToken(VARIABLE,'x',2,6)
LexToken(+,'+',2,8)
LexToken(CONSTANT,2.4,2,10)
LexToken(VARIABLE,'y',2,13)
LexToken(+,'+',2,15)
LexToken(VARIABLE,'z',2,17)
LexToken(=,'=',2,19)
LexToken(CONSTANT,0.0,2,21)
LexToken(;,';',2,22)
Syntax analysis
Yacc in ply is used for syntax analysis. Although an elaborate lexer can stand in for a simple parser, lexical analysis alone is not up to parsing something like a programming language. Before using yacc you need to understand context-free grammars. The topic is too broad to cover here, and I only understand a few simple concepts myself; if you are interested, take a look at a compiler-principles textbook for a deeper understanding.
There are two main approaches to syntax analysis: top-down and bottom-up. Top-down analysis starts from the start symbol of the grammar and applies grammar rules until the given sentence is derived; equivalently, it builds the syntax tree from the root downward until every leaf is constructed. Its representative algorithm is LL(1), whose parsing power is limited and which places fairly strict requirements on the grammar, so mainstream compilers do not use it. Bottom-up analysis starts from the given input string and repeatedly reduces it according to the grammar rules until the start symbol is reached; equivalently, it builds the syntax tree from the leaves upward until the root is constructed. Representative algorithms include SLR and LALR; ply uses LALR.
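To make the code below easier to follow, here is the grammar it defines, written informally in BNF (VARIABLE and CONSTANT are the tokens produced by the lexer, quoted symbols are literal characters):

    start     : equations
    equations : equation ',' equations | equation ';' equations | equation
    equation  : eq_left '=' eq_right
    eq_left   : var_unit eq_left | (empty)
    var_unit  : VARIABLE | CONSTANT VARIABLE | '+' VARIABLE | '-' VARIABLE
              | '+' CONSTANT VARIABLE | '-' CONSTANT VARIABLE
    eq_right  : CONSTANT | '+' CONSTANT | '-' CONSTANT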
So all we need to do is define the grammar and its reduction actions. Here is the complete code.
# -*- coding: utf8 -*-
from ply import (lex, yacc)

# spaces, tabs and carriage returns are invisible characters and are ignored
t_ignore = ' \t\r'

# raise an exception directly when a lexing error occurs
def t_error(t):
    raise Exception('error {} at line {}'.format(t.value[0], t.lineno))

# track line numbers to make error locating easier
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# support C++-style // comments
def t_ignore_COMMENT(t):
    r'\/\/[^\n]*'

# naming rule for variables
def t_VARIABLE(t):
    r'[a-z]+'
    return t

# rule for constants
def t_CONSTANT(t):
    r'\d+(\.\d+)?'
    t.value = float(t.value)
    return t

# the single-character symbols in the input are handled through literals;
# the plus sign could of course also be defined as a token with t_PLUS = r'\+'
literals = '+-,;='
tokens = ('VARIABLE', 'CONSTANT')

# top-level rule: when the list of equations is reduced, p[1] is a list of
# (left-hand side, right-hand constant) pairs; build the coefficient matrix
# and the variable list from it
def p_start(p):
    """start : equations"""
    var_count, var_list = 0, []
    for left, _ in p[1]:
        for con, var_name in left:
            if var_name in var_list:
                continue
            var_list.append(var_name)
            var_count += 1

    matrix = [[0] * (var_count + 1) for _ in xrange(len(p[1]))]
    for counter, eq in enumerate(p[1]):
        left, right = eq
        for con, var_name in left:
            matrix[counter][var_list.index(var_name)] = con
        matrix[counter][-1] = -right

    var_list.append(1)
    p[0] = matrix, var_list

# rule for the list of equations; equations are delimited by ',' or ';'
def p_equations(p):
    """equations : equation ',' equations
                 | equation ';' equations
                 | equation"""
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = [p[1]] + p[3]

# rule for a single equation
def p_equation(p):
    """equation : eq_left '=' eq_right"""
    p[0] = (p[1], p[3])

# rule for the left-hand side of an equation
def p_eq_left(p):
    """eq_left : var_unit eq_left
               |"""
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = [p[1]] + p[2]

# six forms of a single term, for example: x, 5x, +x, -x, +4x, -4y
# the result is a tuple such as (5, 'x')
def p_var_unit(p):
    """var_unit : VARIABLE
                | CONSTANT VARIABLE
                | '+' VARIABLE
                | '-' VARIABLE
                | '+' CONSTANT VARIABLE
                | '-' CONSTANT VARIABLE"""
    len_p = len(p)
    if len_p == 2:
        p[0] = (1.0, p[1])
    elif len_p == 3:
        if p[1] == '+':
            p[0] = (1.0, p[2])
        elif p[1] == '-':
            p[0] = (-1.0, p[2])
        else:
            p[0] = (p[1], p[2])
    else:
        if p[1] == '+':
            p[0] = (p[2], p[3])
        else:
            p[0] = (-p[2], p[3])

# right-hand side of an equation: a constant such as 1.2, +1.2 or -1.2
def p_eq_right(p):
    """eq_right : CONSTANT
                | '+' CONSTANT
                | '-' CONSTANT"""
    if len(p) == 3:
        if p[1] == '-':
            p[0] = -p[2]
        else:
            p[0] = p[2]
    else:
        p[0] = p[1]

if __name__ == '__main__':
    data = '''
    -x + 2.4y + z = 0; // this is a comment
    9y - z + 7.2x = -1;
    y - z + x = 8'''

    lexer = lex.lex()
    parser = yacc.yacc(debug=True)
    lexer.lineno = 1
    s = parser.parse(data)
    print s
Run the file directly and you get the output below; from there the value of each variable can be solved with standard linear algebra.
([[-1.0, 2.4, 1.0, -0.0], [7.2, 9.0, -1.0, 1.0], [1.0, 1.0, -1.0, -8.0]], ['x', 'y', 'z', 1])
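That final solving step is not part of the code above, but as a minimal sketch (assuming numpy is installed), the returned matrix and variable list could be fed to numpy.linalg.solve like this:

# minimal sketch of the solving step; `matrix` and `var_list` are the values
# returned by parser.parse(data) in the code above
import numpy as np

matrix = [[-1.0, 2.4, 1.0, -0.0],
          [7.2, 9.0, -1.0, 1.0],
          [1.0, 1.0, -1.0, -8.0]]
var_list = ['x', 'y', 'z', 1]

a = np.array([row[:-1] for row in matrix])   # coefficient part of each row
b = np.array([-row[-1] for row in matrix])   # constants moved back to the right-hand side
solution = np.linalg.solve(a, b)

for name, value in zip(var_list[:-1], solution):
    print name, '=', value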
Summary
Thanks to Python's concise syntax, ply gives us a powerful parsing tool. For a more complex example, see https://github.com/LiuRoy/proto_parser, a simple protobuf parser that I implemented with ply to avoid generating intermediate files. With a tool like this, parsing tasks become much easier!