Use PHP-Parser to generate an AST (abstract syntax tree)
0. Preface
Recently the project's workflow has gradually become clear, but I have not yet mastered many of the key technologies and can only explore them step by step.
Because the project requires static code analysis based on data-flow analysis, front-end work such as lexical analysis and syntax analysis is essential. I ruled out Yacc and Lex. After a day of research I found two tools that looked more suitable: one is ANTLR, which runs on Java, and the other is PHP-Parser, which generates an AST for PHP.
ANTLR is a well-known tool in the compiler field and is more practical than Yacc and Lex. However, I found only one PHP grammar file for it, and after half a day spent generating a parser and calling it, it turned out to be unsuitable. For "$a = 1" it produced the tokens [$, a, =, 1] and could not identify the assignment. That is far too coarse, and very disappointing.
In contrast, PHP-Parser is more professional. After all, it focuses on the lexical and syntax analysis of PHP.
1. Introduction
The PHP-Parser project homepage is https://github.com/nikic/php-parser. It can parse multiple versions of PHP and generate an abstract syntax tree.
For lexical analysis, PHP has a built-in function token_get_all() that returns the tokens used as input for syntax analysis. This open-source project also uses the token stream generated by token_get_all().
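To see what that token stream looks like, here is a minimal sketch that uses only PHP's built-in tokenizer extension (no PHP-Parser required). The sample input and the variable names are illustrative choices, not taken from the project.

```php
<?php
// Tokenize a small PHP snippet with the built-in tokenizer extension.
$code = '<?php $a = 1;';

$names = [];
foreach (token_get_all($code) as $token) {
    if (is_array($token)) {
        // Array tokens have the form [id, text, line]; token_name() maps the id to a label.
        $names[] = token_name($token[0]);
    } else {
        // Single-character tokens such as "=" and ";" are returned as plain strings.
        $names[] = $token;
    }
}

// prints: T_OPEN_TAG T_VARIABLE T_WHITESPACE = T_WHITESPACE T_LNUMBER ;
echo implode(' ', $names), PHP_EOL;
```

Note that "$a" arrives as a single T_VARIABLE token rather than as separate "$" and "a" tokens.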
2. Installation
Installation is also very simple. I added it with Composer, the PHP package manager. Run the following command in the project directory:
php composer.phar require nikic/php-parser
If you have not downloaded Composer yet, run the following command first:
curl -s http://getcomposer.org/installer | php
3. Generating the AST
After you use composer to add php-parser, you can easily use it.
First, we will introduce some node types defined in PHP-Parser:
(1) PhpParser\Node\Stmt is a statement node: a language construct that does not return a value and cannot appear inside an expression, such as a class definition or an echo statement. (An assignment such as "$a = $b" is an expression node, PhpParser\Node\Expr\Assign, not a statement node.)
(2) PhpParser\Node\Expr is an expression node: a language construct that returns a value, such as $var and func().
(3) PhpParser\Node\Scalar is a scalar node and represents a scalar value, for example 'string', 0, or a magic constant such as __FILE__.
(4) Some other nodes fall outside these groups, such as the argument node (PhpParser\Node\Arg).
The names of some node classes end with an underscore (for example Echo_) to avoid conflicts with PHP keywords.
The HelloWorld program for PHP-Parser is as follows. This snippet generates the AST:
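The listing itself does not appear in this copy of the article. The sketch below is a reconstruction under stated assumptions: it targets PHP-Parser 5.x installed through Composer, where ParserFactory::createForNewestSupportedVersion() creates the parser (the 1.x releases built it with new PhpParser\Parser(new PhpParser\Lexer) instead). The input string is a guess chosen so that the tree contains one echo with the two strings seen in the output.

```php
<?php
// Assumes PHP-Parser 5.x was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\Error;
use PhpParser\ParserFactory;

// Hypothetical input: a single echo statement with two string literals.
$code = '<?php echo "1+2", "chongrui";';

$parser = (new ParserFactory())->createForNewestSupportedVersion();

try {
    // parse() returns an array of statement nodes.
    $stmts = $parser->parse($code);
} catch (Error $e) {
    echo 'Parse error: ', $e->getMessage(), PHP_EOL;
    exit(1);
}

print_r($stmts);
```

On current releases the dump is laid out differently from the older output shown next. For example, the string class is named PhpParser\Node\Scalar\String_.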
Output result:
Array
(
    [0] => PhpParser\Node\Stmt\Echo_ Object
        (
            [subNodes:protected] => Array
                (
                    [exprs] => Array
                        (
                            [0] => PhpParser\Node\Scalar\String Object
                                (
                                    [subNodes:protected] => Array
                                        (
                                            [value] => 1+2
                                        )
                                    [attributes:protected] => Array
                                        (
                                            [startLine] => 1
                                            [endLine] => 1
                                        )
                                )
                            [1] => PhpParser\Node\Scalar\String Object
                                (
                                    [subNodes:protected] => Array
                                        (
                                            [value] => chongrui
                                        )
                                    [attributes:protected] => Array
                                        (
                                            [startLine] => 1
                                            [endLine] => 1
                                        )
                                )
                        )
                )
            [attributes:protected] => Array
                (
                    [startLine] => 1
                    [endLine] => 1
                )
        )
)
As you can see, the AST has only one Echo_ node, and this node has a subnode exprs, which can be accessed with $stmts[0]->exprs.
The attributes of a node store startLine, endLine, and comments. You can access them with the getAttributes(), getAttribute('startLine'), setAttribute(), and hasAttribute() methods.
You can use the getLine()/setLine() methods to access startLine (or use getAttribute('startLine')). You can use getDocComment() to obtain the doc comment.
Accessing a value on a node: for example, to read the value "chongrui", use $stmts[0]->exprs[1]->value.
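The accessors mentioned above can be tried out as follows. This is a sketch that assumes PHP-Parser 5.x and a Composer autoloader in the current directory. The variable names $echo and $second are illustrative, and the sample input is the same hypothetical echo statement.

```php
<?php
// Assumes PHP-Parser 5.x was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\ParserFactory;

$stmts = (new ParserFactory())
    ->createForNewestSupportedVersion()
    ->parse('<?php echo "1+2", "chongrui";');

$echo = $stmts[0];          // the single Echo_ statement
$second = $echo->exprs[1];  // the second echoed expression

echo get_class($echo), PHP_EOL;                  // class name of the statement node
echo $second->value, PHP_EOL;                    // value stored on the string node
echo $echo->getAttribute('startLine'), PHP_EOL;  // line on which the statement starts
var_dump($echo->hasAttribute('startLine'));      // whether that attribute is set
var_dump($echo->getDocComment());                // NULL here: no doc comment precedes the echo
```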
4. Node traversal
Traversing the abstract syntax tree is very convenient with the PhpParser\NodeTraverser class, and you can plug in your own visitor objects. In practice, when analyzing PHP source code you usually do not know the specific structure of the AST in advance, so you need to determine the type of each node dynamically during traversal.
These checks all go into a MyNodeVisitor class, which extends the parent class NodeVisitorAbstract. That class provides the following methods:
(1) The beforeTraverse() method is called once before traversal starts and can be used to reset state.
(2) The afterTraverse() method is like (1), except that it is called once after traversal has finished.
(3) The enterNode() and leaveNode() methods are called for each node that is visited.
enterNode() is called when the traverser enters a node, that is, before the node's children are visited. It can return NodeTraverser::DONT_TRAVERSE_CHILDREN to skip the node's children.
leaveNode() is called after the node's children have been traversed. It can return NodeTraverser::REMOVE_NODE, in which case the current node is deleted. If it returns an array of nodes, these nodes are merged into the parent's array. For example, if the parent array is array(A, B, C) and node B is replaced with array(X, Y, Z), the result is array(A, X, Y, Z, C).
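The array-merging rule can be illustrated in plain PHP. This sketch only mimics the effect on the parent's child array with array_splice(); it is not the traverser's own code, and the letters are placeholders rather than real nodes.

```php
<?php
// Stand-in for a parent's child array; the letters are placeholders, not real nodes.
$children = ['A', 'B', 'C'];
$replacement = ['X', 'Y', 'Z'];

// Replace the single element at B's position with all elements of the replacement.
$index = array_search('B', $children, true);
array_splice($children, $index, 1, $replacement);

echo implode(', ', $children), PHP_EOL; // prints: A, X, Y, Z, C
```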
The following snippet parses $code, generates an AST, and prints each String node found while traversing the tree.
The result is 1 or 2.
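A visitor of this kind might look like the sketch below. It assumes PHP-Parser 5.x with a Composer autoloader, where the string node class is PhpParser\Node\Scalar\String_. The $found property and the sample input are illustrative choices, not taken from the original snippet.

```php
<?php
// Assumes PHP-Parser 5.x was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\Node;
use PhpParser\Node\Scalar\String_;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

// Hypothetical visitor: prints and collects the value of every string literal.
class MyNodeVisitor extends NodeVisitorAbstract
{
    public array $found = [];

    public function beforeTraverse(array $nodes)
    {
        // Reset state so the visitor can be reused for several traversals.
        $this->found = [];
        return null;
    }

    public function enterNode(Node $node)
    {
        // The node type is checked dynamically, since the tree shape is unknown.
        if ($node instanceof String_) {
            $this->found[] = $node->value;
            echo 'String: ', $node->value, PHP_EOL;
        }
        return null; // null leaves the node unchanged
    }
}

$code = '<?php echo "1+2", "chongrui";';
$stmts = (new ParserFactory())->createForNewestSupportedVersion()->parse($code);

$visitor = new MyNodeVisitor();
$traverser = new NodeTraverser();
$traverser->addVisitor($visitor);
$traverser->traverse($stmts);
```

With this sample input the visitor reports the two string literals, "1+2" and "chongrui", in source order.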
5. Other AST representations
Sometimes the AST needs to be kept in persistent storage. PHP-Parser supports this as well.
(1) Simple serialization
You can use serialize() and unserialize() to serialize and deserialize the AST and so store it persistently.
(2) Human-readable storage
These are pretty-printed dumps and XML storage, which are not described in detail here. Refer to the project documentation when necessary:
https://github.com/nikic/PHP-Parser/blob/master/doc/3_Other_node_tree_representations.markdown
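The simple serialization in (1) can be sketched as a round trip through PHP's native serialize() and unserialize(). The sketch assumes PHP-Parser 5.x with a Composer autoloader. It uses PhpParser\NodeDumper to compare the two trees and to print a human-readable form. The sample input is hypothetical.

```php
<?php
// Assumes PHP-Parser 5.x was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\NodeDumper;
use PhpParser\ParserFactory;

$stmts = (new ParserFactory())
    ->createForNewestSupportedVersion()
    ->parse('<?php echo "1+2", "chongrui";');

// Round-trip the node array through PHP's native serialization.
$blob = serialize($stmts);
$restored = unserialize($blob);

// Compare textual dumps of both trees to confirm that nothing was lost.
$dumper = new NodeDumper();
var_dump($dumper->dump($stmts) === $dumper->dump($restored));

// Human-readable dump of the restored tree.
echo $dumper->dump($restored), PHP_EOL;
```

The string in $blob could equally be written to a file or a cache and restored later. Only call unserialize() on data you trust.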
6. Summary
At least for PHP static analysis, PHP-Parser is far more capable than ANTLR. PHP-Parser will certainly play a major role in building an automated PHP audit system :)