0. Preface
The workflow for my recent project is becoming clearer, but I have not yet mastered many of the key technologies and can only explore them one step at a time.
To do static code analysis based on data-flow analysis, front-end work such as lexical analysis and syntax analysis is necessary. I ruled out Yacc and Lex, and after a day of reading found two more suitable candidates: ANTLR, which is Java-based, and PHP-Parser, which is dedicated to generating ASTs for PHP.
ANTLR is a relatively well-known tool in the compiler field and is more practical than Yacc and Lex. However, I found only one PHP grammar file for it. After half a day of generating and tuning, I found it unsuitable: for "$a = 1" the generated tokens were [$, a, =, 1], so the assignment could not be recognized. The result was too crude and quite disappointing.
In contrast, PHP-Parser is more specialized; after all, it focuses solely on lexical and syntax analysis of PHP.
1. Introduction
PHP-Parser's project homepage is https://github.com/nikic/PHP-Parser. It can parse multiple versions of PHP and build an abstract syntax tree (AST) from the source.
For lexical analysis, PHP has a built-in function, token_get_all(), that returns the tokens used as input to parsing. This open-source project also works from the token stream generated by token_get_all().
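To see what that token stream looks like, here is a minimal sketch that uses only PHP's built-in tokenizer functions. The input string is a hypothetical example, chosen to match the "$a = 1" case from the preface.

```php
<?php
// Hypothetical input: the assignment from the preface.
$code = '<?php $a = 1;';

$names = [];
foreach (token_get_all($code) as $token) {
    if (is_array($token)) {
        // Array tokens have the form [token id, source text, line number].
        $names[] = token_name($token[0]);
    } else {
        // Single-character tokens such as "=" and ";" are plain strings.
        $names[] = $token;
    }
}

echo implode(' ', $names), "\n";
// prints: T_OPEN_TAG T_VARIABLE T_WHITESPACE = T_WHITESPACE T_LNUMBER ;
```

Note that "$a" arrives as a single T_VARIABLE token rather than as separate "$" and "a" tokens.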
2. Installation
Installation is simple. I added it with Composer, the PHP package manager, by running the following command in the project directory:
php composer.phar require nikic/php-parser
If you have not downloaded Composer yet, run the following command first:
curl -s http://getcomposer.org/installer | php
3. Generate AST
Once PHP-Parser has been added with Composer, it is easy to use.
Let's start with some of the node types defined in PHP-Parser:
(1) PhpParser\Node\Stmt is a statement node, a language construct that does not return a value, such as a class definition or an echo statement. (An assignment such as "$a = $b" returns a value, so it is an expression rather than a statement.)
(2) PhpParser\Node\Expr is an expression node, a language construct that returns a value, such as $var and func().
(3) PhpParser\Node\Scalar is a scalar node that represents a constant value, such as 'string', 0, and magic constants like __FILE__.
(4) Some nodes belong to none of these groups, such as argument nodes (PhpParser\Node\Arg).
The names of some node classes end with an underscore to avoid conflicts with PHP keywords.
PHP-Parser's "hello world" program is as follows; the snippet parses a short piece of PHP code and generates an AST:
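A minimal sketch of such a program follows. It assumes a recent PHP-Parser release (5.x) in which ParserFactory::createForNewestSupportedVersion() is available, and that Composer's autoloader sits at vendor/autoload.php. The input string is a hypothetical example built around the two string values discussed in this section. Recent releases name the string class Scalar\String_ and print a differently shaped dump than older releases do.

```php
<?php
require 'vendor/autoload.php'; // assumption: installed via Composer

use PhpParser\Error;
use PhpParser\ParserFactory;

// Hypothetical input: one echo statement with two string arguments.
$code = "<?php echo '1+2', 'Chongrui';";

$parser = (new ParserFactory())->createForNewestSupportedVersion();

try {
    // parse() returns an array of statement nodes (the AST).
    $stmts = $parser->parse($code);
} catch (Error $e) {
    echo 'Parse error: ', $e->getMessage(), "\n";
    exit(1);
}

// Dump the tree in PHP's generic array/object notation.
print_r($stmts);
```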
The output is:
Array
(
    [0] => PhpParser\Node\Stmt\Echo_ Object
        (
            [subNodes:protected] => Array
                (
                    [exprs] => Array
                        (
                            [0] => PhpParser\Node\Scalar\String Object
                                (
                                    [subNodes:protected] => Array
                                        (
                                            [value] => 1+2
                                        )
                                    [attributes:protected] => Array
                                        (
                                            [startLine] => 1
                                            [endLine] => 1
                                        )
                                )
                            [1] => PhpParser\Node\Scalar\String Object
                                (
                                    [subNodes:protected] => Array
                                        (
                                            [value] => Chongrui
                                        )
                                    [attributes:protected] => Array
                                        (
                                            [startLine] => 1
                                            [endLine] => 1
                                        )
                                )
                        )
                )
            [attributes:protected] => Array
                (
                    [startLine] => 1
                    [endLine] => 1
                )
        )
)
As you can see, this AST has only one node, Echo_. The node has a subnode, exprs, which can be accessed with $stmts[0]->exprs.
The attributes of a node store startLine and endLine as well as comments. They can be accessed with the getAttributes(), getAttribute('startLine'), setAttribute(), and hasAttribute() methods.
The start line number can also be read and written through the getLine()/setLine() methods (equivalent to getAttribute('startLine')). Doc comments can be obtained with getDocComment().
To access a value stored on a node, for example the value "Chongrui", use $stmts[0]->exprs[1]->value.
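The accessors above can be sketched as follows, again assuming a recent PHP-Parser release (5.x) and Composer's autoloader at vendor/autoload.php. The greet() function and the 'reviewed' attribute are hypothetical names used only for illustration.

```php
<?php
require 'vendor/autoload.php'; // assumption: installed via Composer

use PhpParser\ParserFactory;

// Hypothetical input: a documented function containing one echo.
$code = <<<'PHP'
<?php
/** Greets the reader. */
function greet() {
    echo 'Chongrui';
}
PHP;

$parser = (new ParserFactory())->createForNewestSupportedVersion();
$stmts = $parser->parse($code);

$func = $stmts[0]; // a PhpParser\Node\Stmt\Function_ node

// Read position attributes stored on the node.
$hasStart  = $func->hasAttribute('startLine');
$startLine = $func->getAttribute('startLine');
$endLine   = $func->getAttribute('endLine');

// Attach a custom attribute ('reviewed' is a made-up key).
$func->setAttribute('reviewed', true);

// getDocComment() returns a comment object, or null if there is none.
$doc = $func->getDocComment();
$docText = $doc !== null ? $doc->getText() : null;

// Walk down to the string literal inside the echo statement.
$value = $func->stmts[0]->exprs[0]->value;

echo "lines $startLine-$endLine, doc: $docText, value: $value\n";
```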
4. Node traversal
Traversing the abstract syntax tree is convenient with the PhpParser\NodeTraverser class, which also supports custom visitor objects. In real applications that analyze PHP source code, the concrete structure of the AST is usually not known in advance, so the type of each node has to be determined dynamically during traversal.
These checks are collected in a class, MyNodeVisitor, which extends the parent class NodeVisitorAbstract. The parent class provides several methods:
(1) The beforeTraverse() method is called once before the traversal starts and is typically used to reset values.
(2) The afterTraverse() method is the counterpart of (1); the only difference is that it is called once after the traversal ends.
(3) The enterNode() and leaveNode() methods are called for every node that is visited.
The enterNode() method is called when a node is entered, that is, before its child nodes are visited. It can return NodeTraverser::DONT_TRAVERSE_CHILDREN to skip the children of that node.
The leaveNode() method is called after the traversal of a node is complete. It can return NodeTraverser::REMOVE_NODE, in which case the current node is deleted. If an array of nodes is returned, those nodes are merged into the parent node's array. For example, if node b in array(a, b, c) is replaced by array(x, y, z), the result is array(a, x, y, z, c).
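The array-merging behaviour can be sketched with a small visitor. This assumes a recent PHP-Parser release (5.x) and Composer's autoloader at vendor/autoload.php. The SplitEcho class and the input code are hypothetical, chosen to mirror the array(a, x, y, z, c) example.

```php
<?php
require 'vendor/autoload.php'; // assumption: installed via Composer

use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

// Hypothetical visitor: splits "echo x, y, z;" into one echo per argument.
class SplitEcho extends NodeVisitorAbstract
{
    public function leaveNode(Node $node)
    {
        if ($node instanceof Node\Stmt\Echo_ && count($node->exprs) > 1) {
            $parts = [];
            foreach ($node->exprs as $expr) {
                $parts[] = new Node\Stmt\Echo_([$expr]);
            }
            // Returning an array merges these nodes into the parent array.
            return $parts;
        }
        return null; // leave every other node unchanged
    }
}

$code = "<?php echo 'a'; echo 'x', 'y', 'z'; echo 'c';";
$parser = (new ParserFactory())->createForNewestSupportedVersion();
$stmts = $parser->parse($code);

$traverser = new NodeTraverser();
$traverser->addVisitor(new SplitEcho());
$result = $traverser->traverse($stmts); // traverse() returns the modified tree

// Collect the string value of each resulting echo statement.
$values = [];
foreach ($result as $stmt) {
    $values[] = $stmt->exprs[0]->value;
}
echo implode(',', $values), "\n";
```

Three statements go in and five come out, with the middle statement replaced in place by its three parts.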
The following code fragment parses $code, generates an AST, and prints the value of every string-type node it encounters during traversal.
The output is 1, 2.
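A minimal sketch of such a fragment, assuming a recent PHP-Parser release (5.x) and Composer's autoloader at vendor/autoload.php. The input in $code is a hypothetical example containing two string literals, and the $found property is an illustrative addition that keeps the values for later inspection.

```php
<?php
require 'vendor/autoload.php'; // assumption: installed via Composer

use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

class MyNodeVisitor extends NodeVisitorAbstract
{
    /** Values of all string nodes seen so far (illustrative helper). */
    public $found = [];

    public function enterNode(Node $node)
    {
        // Decide what to do based on the runtime type of the node.
        if ($node instanceof Node\Scalar\String_) {
            $this->found[] = $node->value;
            echo $node->value, "\n";
        }
        return null; // keep traversing normally
    }
}

// Hypothetical input with two string literals.
$code = "<?php echo '1', '2';";

$parser = (new ParserFactory())->createForNewestSupportedVersion();
$stmts = $parser->parse($code);

$visitor = new MyNodeVisitor();
$traverser = new NodeTraverser();
$traverser->addVisitor($visitor);
$traverser->traverse($stmts);
```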
5. Other AST representations
Sometimes the AST needs to be persisted in a textual format, which PHP-Parser also supports.
(1) Simple serialization
You can persist the AST by serializing it with serialize() and restoring it with unserialize().
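A minimal round-trip sketch, assuming a recent PHP-Parser release (5.x) and Composer's autoloader at vendor/autoload.php. The input code is a hypothetical example, and in practice the serialized string would be written to a file or cache instead of being kept in a variable.

```php
<?php
require 'vendor/autoload.php'; // assumption: installed via Composer

use PhpParser\ParserFactory;

$parser = (new ParserFactory())->createForNewestSupportedVersion();
$stmts = $parser->parse("<?php echo 'Chongrui';");

// Turn the node tree into a string that can be stored anywhere.
$blob = serialize($stmts);

// Later: rebuild the same node objects from the stored string.
// The node classes must be loadable at this point.
$restored = unserialize($blob);

echo $restored[0]->exprs[0]->value, "\n";
```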
(2) Human-readable forms
The AST can also be dumped in a readable form or persisted as XML. I will not go into detail here; consult the project's documentation when you need it:
https://github.com/nikic/PHP-Parser/blob/master/doc/3_Other_node_tree_representations.markdown
6. Summary
At least for static analysis of PHP, PHP-Parser is far more capable than ANTLR. When it comes to building an automated PHP audit system, PHP-Parser will certainly play a role :)