PHP automated code auditing technology
0x00
With nothing new to post on the blog, I will summarize some recent work, focusing on the techniques used in my project. Many automated PHP auditing tools exist today, including the open-source RIPS and Pixy and the commercial Fortify. RIPS exists only in its first version, and because it does not support object-oriented PHP, its results are not ideal. Pixy is a data-flow-based analysis tool, but it supports only PHP 4. Fortify is commercial, which makes it impossible to study. Domestic research on automated PHP auditing is mostly done by companies. Most current tools rely on simple token-stream analysis or on brute-force regular-expression matching, and their results are mediocre.
0x01
Today I want to describe an approach to automated PHP auditing based on static analysis, which is also the approach my project takes. Regular expressions are clearly inadequate here. To perform variable backtracking and taint analysis effectively, and to cope with the many flexible syntactic forms found in PHP scripts, the approach I describe combines static code analysis with data flow analysis.
First, I think an effective audit tool should include at least the following modules:
1. Compiler Front-End Module
The compiler front-end module uses abstract syntax tree (AST) construction and control flow graph (CFG) construction from compiler technology to convert source files into a form suitable for back-end static analysis.
2. Global Information Collection Module
This module gathers project-wide information about the source files under analysis, such as how many classes the audited project defines and, for each class, its method names, their parameters, and the start and end line numbers of each method body. Collecting this up front speeds up later static analysis.
3. Data Flow Analysis Module
This module differs from the data flow analysis algorithms of compiler technology in that it concentrates on handling PHP-specific features. When a sensitive function call is found during interprocedural or intraprocedural analysis, data flow analysis is performed on the function's sensitive parameters to track how each variable changes, in preparation for the subsequent taint analysis.
4. Vulnerability Code Analysis Module
This module performs taint analysis using the global variables and assignment statements collected by the data flow analysis module. Starting from the dangerous parameter of a sensitive sink, such as the first parameter of mysql_query, it obtains the corresponding data flow information and backtracks. If backtracking shows that the parameter is under user control, that is recorded. If the dangerous parameter has passed through an encoding or sanitization operation, that is recorded as well. Tracing the dangerous parameter's data in this way completes the taint analysis.
0x02
With these modules in place, I implemented automated auditing using the following process. The general flow of the analysis system is:
1. Framework Initialization
First, initialize the analysis framework. This mainly means collecting information about every user-defined class in the project under analysis: class name, class properties, class method names, and the path of the file that defines the class.
These records are stored in a global Context class, which is implemented as a singleton and stays resident in memory for use by later analysis.
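The global context described above could look roughly like the following. This is a minimal sketch, assuming PHP 8.0 or later; the `ClassInfo` record, its field names, and the `addClass`/`findClass` methods are hypothetical and are not taken from the original project.

```php
<?php
// Hypothetical record for one user-defined class; field names are assumptions.
final class ClassInfo
{
    public function __construct(
        public string $name,
        public array $properties, // property names
        public array $methods,    // method name => ['params' => [...], 'start' => int, 'end' => int]
        public string $filePath
    ) {}
}

// Global context implemented as a singleton, so every analysis pass
// sees the same collected class information.
final class Context
{
    private static ?Context $instance = null;

    /** @var array<string, ClassInfo> keyed by lower-cased class name */
    private array $classes = [];

    private function __construct() {}

    public static function getInstance(): Context
    {
        return self::$instance ??= new Context();
    }

    public function addClass(ClassInfo $info): void
    {
        // PHP class names are case-insensitive, so normalise the key.
        $this->classes[strtolower($info->name)] = $info;
    }

    public function findClass(string $name): ?ClassInfo
    {
        return $this->classes[strtolower($name)] ?? null;
    }
}

$ctx = Context::getInstance();
$ctx->addClass(new ClassInfo(
    'UserModel',
    ['db'],
    ['find' => ['params' => ['id'], 'start' => 10, 'end' => 20]],
    'app/models/UserModel.php'
));
```

Keying only by class name is a simplification: two files that define a class with the same name would overwrite each other here, so a real tool would need a namespace or file path in the key.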
2. Identifying Main Files
Next, decide whether each PHP file is a Main File. PHP has no main function. Most PHP files in a web application fall into two types: definition and call. Definition-type files define business logic and utility functions; they are not accessed directly by users but are called by other files. Call-type files handle user requests, for example a global index.php. Static analysis mainly targets the files that handle user requests, that is, the Main Files. The decision is made as follows:
Using the parsed AST, check whether the number of lines occupied by class and method definitions exceeds a threshold proportion of all lines of code in the file. If it does, the file is treated as a definition-type file; otherwise it is a Main File and is added to the list of file names to be analyzed.
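The line-ratio test can be sketched as below. The function name `isMainFile`, the input shape, and the default threshold of 0.5 are all assumptions for illustration; the original text does not state the actual cut-off. The ranges are assumed to be top-level, non-overlapping definitions so that methods inside a class are not counted twice.

```php
<?php
// $definitionRanges: [startLine, endLine] pairs for each top-level class or
// function definition, as an AST would report them.
// $threshold is a hypothetical cut-off; the original text does not give one.
function isMainFile(array $definitionRanges, int $totalLines, float $threshold = 0.5): bool
{
    if ($totalLines <= 0) {
        return false; // an empty file handles no requests
    }
    $definitionLines = 0;
    foreach ($definitionRanges as [$start, $end]) {
        $definitionLines += $end - $start + 1;
    }
    // Mostly definitions => definition-type file; otherwise treat it as a Main File.
    return ($definitionLines / $totalLines) <= $threshold;
}

$isIndexMain = isMainFile([[5, 9]], 100);  // 5 of 100 lines are definitions
$isModelMain = isMainFile([[3, 98]], 100); // 96 of 100 lines are definitions
```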
3. Building the Abstract Syntax Tree (AST)
This project is written in PHP. For AST construction it builds on PHP Parser, currently an excellent implementation of PHP AST construction.
This open-source project is also written in PHP. It can parse most PHP constructs, such as if, while, switch, array declarations, method calls, and global variables. It covers part of the project's compiler front-end processing.
4. Building the Control Flow Graph (CFG)
The CFG is built with the builder method of the generator class. The method is defined as follows:
The idea is to build the CFG recursively. The input is the set of nodes obtained by traversing the AST. During traversal, each node in the set is classified by type, for example as a branch, jump, or terminating statement, and the CFG is built according to that type.
The conditions of branch and loop statements must be stored on the edges of the CFG so that data flow analysis can use them.
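As an illustration of recursive construction with conditions stored on edges, here is a minimal sketch, assuming PHP 8.0 or later. It works on a toy node format (plain arrays with a `type` key) rather than real PHP Parser nodes, handles only plain statements, `if`, and `return`, and leaves loops out. The `BasicBlock` and `CfgBuilder` names are hypothetical and are not the project's own generator class.

```php
<?php
final class BasicBlock
{
    /** @var string[] statements in this block */
    public array $statements = [];

    /** @var array<int, array{0: BasicBlock, 1: ?string}> outgoing edges: [target, condition] */
    public array $edges = [];

    public function __construct(public int $id) {}

    public function connect(BasicBlock $target, ?string $condition = null): void
    {
        $this->edges[] = [$target, $condition];
    }
}

final class CfgBuilder
{
    private int $nextId = 0;

    public function newBlock(): BasicBlock
    {
        return new BasicBlock($this->nextId++);
    }

    // Appends $nodes to the graph starting at $current and returns the block
    // where control continues afterwards (null if every path ended in a return).
    public function build(array $nodes, ?BasicBlock $current): ?BasicBlock
    {
        foreach ($nodes as $node) {
            if ($current === null) {
                break; // unreachable code after a return
            }
            switch ($node['type']) {
                case 'if':
                    $then = $this->newBlock();
                    $else = $this->newBlock();
                    // The branch condition lives on the edge, not in the block.
                    $current->connect($then, $node['cond']);
                    $current->connect($else, '!(' . $node['cond'] . ')');
                    $thenEnd = $this->build($node['then'], $then);
                    $elseEnd = $this->build($node['else'] ?? [], $else);
                    $join = $this->newBlock();
                    $thenEnd?->connect($join);
                    $elseEnd?->connect($join);
                    $current = ($thenEnd || $elseEnd) ? $join : null;
                    break;
                case 'return':
                    $current->statements[] = $node['code'];
                    $current = null;
                    break;
                default: // plain statement
                    $current->statements[] = $node['code'];
            }
        }
        return $current;
    }
}

$nodes = [
    ['type' => 'stmt', 'code' => '$id = $_GET["id"];'],
    ['type' => 'if', 'cond' => 'is_numeric($id)',
        'then' => [['type' => 'stmt', 'code' => '$id = intval($id);']],
        'else' => [['type' => 'return', 'code' => 'return;']]],
    ['type' => 'stmt', 'code' => 'mysql_query($sql);'],
];

$builder = new CfgBuilder();
$entry = $builder->newBlock();
$exit = $builder->build($nodes, $entry);
```

Keeping the condition on the edge lets a later pass know, for each block it visits, which predicate held on the path that reached it.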
5. Data Flow Information Collection
For a given code block, the information most worth collecting is assignment statements, function calls, constants (const, define), and registered variables (extract, parse_str).
Assignment statements are used for variable tracking. In the implementation, I use a structure that records the value assigned and the location of the assignment. The other information is identified and extracted from the AST. For a function call, for example, determine whether the variable has been escaped, encoded, or otherwise transformed, and whether the called function is a sink (such as mysql_query).
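One possible shape for the collected information is sketched below, assuming PHP 8.0 or later. The `Assignment` and `BlockSummary` classes, the `collect` function, and the toy input format are hypothetical; a real implementation would derive these records from AST nodes.

```php
<?php
// One assignment: which variable was written, from what, and where.
final class Assignment
{
    public function __construct(
        public string $target,   // variable being assigned, e.g. '$sql'
        public array $sources,   // variables read on the right-hand side
        public ?string $viaCall, // function wrapping the value, if any
        public int $line
    ) {}
}

final class BlockSummary
{
    /** @var Assignment[] */
    public array $assignments = [];
    /** @var array<int, array{name: string, args: string[], line: int}> */
    public array $calls = [];
    /** @var array<string, string> constant name => value */
    public array $constants = [];
    /** @var int[] lines where extract() or parse_str() may register variables */
    public array $registrations = [];
}

function collect(array $nodes): BlockSummary
{
    $summary = new BlockSummary();
    foreach ($nodes as $n) {
        if ($n['kind'] === 'assign') {
            $summary->assignments[] = new Assignment(
                $n['target'], $n['sources'], $n['call'] ?? null, $n['line']
            );
            continue;
        }
        // Everything else in this toy input is a function call.
        $summary->calls[] = ['name' => $n['name'], 'args' => $n['args'], 'line' => $n['line']];
        if ($n['name'] === 'define') {
            $summary->constants[$n['args'][0]] = $n['args'][1];
        } elseif (in_array($n['name'], ['extract', 'parse_str'], true)) {
            $summary->registrations[] = $n['line'];
        }
    }
    return $summary;
}

$summary = collect([
    ['kind' => 'call', 'name' => 'define', 'args' => ['TABLE', 'users'], 'line' => 1],
    ['kind' => 'call', 'name' => 'extract', 'args' => ['$_POST'], 'line' => 2],
    ['kind' => 'assign', 'target' => '$sql', 'sources' => ['$id'], 'line' => 3],
    ['kind' => 'call', 'name' => 'mysql_query', 'args' => ['$sql'], 'line' => 4],
]);
```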
6. Handling Variable Sanitization and Encoding Information
$clearsql = addslashes($sql);
This is an assignment statement. When the right-hand side is a filter function (user-defined or built-in), the return value of the call is sanitized, so addslashes is added to the sanitization tags of $clearsql.
For function calls, check whether the function name is one of the security functions listed in the configuration file.
If it is, add the sanitization tag to the symbol at that location.
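The tagging step might be sketched as follows, assuming PHP 8.0 or later. The `SANITIZERS` table stands in for the configuration file and is purely illustrative: which function is actually safe for which vulnerability class depends on the context in which the value is used. The `Symbol` class and `applyAssignment` function are hypothetical names.

```php
<?php
// Hypothetical configuration: sanitizer name => vulnerability classes it is
// assumed to neutralise. Illustrative only; real safety depends on context.
const SANITIZERS = [
    'addslashes'       => ['sqli'],
    'intval'           => ['sqli', 'xss'],
    'htmlspecialchars' => ['xss'],
];

final class Symbol
{
    /** @var string[] sanitizer functions already applied to this value */
    public array $sanitizers = [];

    public function __construct(public string $name) {}

    public function isSanitizedFor(string $vulnClass): bool
    {
        foreach ($this->sanitizers as $fn) {
            if (in_array($vulnClass, SANITIZERS[$fn] ?? [], true)) {
                return true;
            }
        }
        return false;
    }
}

// Handles `$target = fn(...)`: tag the target when fn is a configured sanitizer.
function applyAssignment(Symbol $target, ?string $calledFunction): void
{
    if ($calledFunction === null) {
        return;
    }
    $fn = strtolower($calledFunction); // PHP function names are case-insensitive
    if (array_key_exists($fn, SANITIZERS)) {
        $target->sanitizers[] = $fn;
    }
}

$clearsql = new Symbol('$clearsql');
applyAssignment($clearsql, 'addslashes'); // models: $clearsql = addslashes($sql);
```

Recording the sanitizer name rather than a single boolean flag lets the later taint check decide per sink class whether the value is acceptable.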
7. Interprocedural Analysis
When a call to a user-defined function is found during the audit, interprocedural analysis is required. Locate the code block of the called method in the project and analyze it with the actual argument variables substituted in.
The difficulties are: how to backtrack variables, how to handle methods with the same name in different files, how to support calls to and analysis of class methods, how to record user-defined sinks (for example, if myexec calls exec without effective sanitization, myexec should itself be treated as a dangerous function), and how to classify user-defined sinks (for example as SQLi, XSS, or XPath).
The process is as follows:
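The original flow is not reproduced in this text. As a rough sketch of just one part of it, the code below shows how user-defined sinks such as myexec could be derived from per-function summaries by iterating to a fixed point. The summary format, the `propagateSinks` name, and the class labels are assumptions; the sketch records at most one sink per function and does not address same-named methods in different files.

```php
<?php
// Built-in sinks: function name => vulnerability class and dangerous argument index.
$sinks = [
    'exec'        => ['class' => 'cmdi', 'arg' => 0],
    'mysql_query' => ['class' => 'sqli', 'arg' => 0],
];

// Hypothetical per-function summaries produced by earlier passes. Each call
// lists, per argument, the parameter that reaches it unsanitized (or null).
$functions = [
    'myexec' => [
        'params' => ['$cmd'],
        'calls'  => [['name' => 'exec', 'args' => ['$cmd']]],
    ],
    'runJob' => [
        'params' => ['$name', '$job'],
        'calls'  => [['name' => 'myexec', 'args' => ['$job']]],
    ],
    'safeExec' => [
        'params' => ['$cmd'],
        'calls'  => [['name' => 'exec', 'args' => [null]]], // sanitized before the call
    ],
];

// Repeats until no new user-defined sink is discovered.
function propagateSinks(array $functions, array $sinks): array
{
    do {
        $changed = false;
        foreach ($functions as $fname => $f) {
            if (isset($sinks[$fname])) {
                continue; // already classified
            }
            foreach ($f['calls'] as $call) {
                $sink = $sinks[$call['name']] ?? null;
                if ($sink === null) {
                    continue; // callee is not a known sink
                }
                $arg = $call['args'][$sink['arg']] ?? null;
                $index = $arg === null ? false : array_search($arg, $f['params'], true);
                if ($index !== false) {
                    // A parameter reaches a dangerous argument unsanitized,
                    // so this function inherits the callee's class.
                    $sinks[$fname] = ['class' => $sink['class'], 'arg' => $index];
                    $changed = true;
                    break;
                }
            }
        }
    } while ($changed);
    return $sinks;
}

$allSinks = propagateSinks($functions, $sinks);
```

Carrying the vulnerability class along with the sink is one way to classify user-defined sinks: runJob ends up in the same class as exec because its second parameter reaches exec through myexec.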
8. Taint Analysis
With the steps above in place, the last step is taint analysis. It mainly targets the built-in dangerous functions, such as echo, which may lead to XSS. The dangerous parameters of these functions must be analyzed carefully, including determining whether they have been effectively sanitized (for example by escaping or regular-expression matching), and an algorithm is needed to trace back the variable's earlier assignments and other transformations. This is a real test of a security researcher's engineering ability, and it is the most important stage of automated auditing.
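A backtracking check of this kind can be sketched as follows. This is a deliberately simplified version that follows a single straight-line sequence of assignments, whereas a real tool would walk the CFG and merge results across branches. The `isTainted` name, the record format, and the two constant lists are assumptions for illustration.

```php
<?php
const USER_INPUT = ['$_GET', '$_POST', '$_COOKIE', '$_REQUEST'];
// Illustrative list of functions treated as sanitizers for SQL sinks.
const SQL_SANITIZERS = ['addslashes', 'intval', 'mysql_real_escape_string'];

// Walks assignments backwards from $line and reports whether $var can carry
// unsanitized user input. Each assignment has: target, sources, optional call, line.
function isTainted(string $var, int $line, array $assignments, array $sanitizers): bool
{
    if (in_array($var, USER_INPUT, true)) {
        return true;
    }
    // Find the most recent assignment to $var before $line.
    $latest = null;
    foreach ($assignments as $a) {
        if ($a['target'] === $var && $a['line'] < $line
            && ($latest === null || $a['line'] > $latest['line'])) {
            $latest = $a;
        }
    }
    if ($latest === null) {
        return false; // never assigned: treated as clean in this sketch
    }
    if (in_array($latest['call'] ?? null, $sanitizers, true)) {
        return false; // value passed through a sanitizer for this sink class
    }
    // Otherwise the value is tainted if any of its sources is tainted.
    foreach ($latest['sources'] as $source) {
        if (isTainted($source, $latest['line'], $assignments, $sanitizers)) {
            return true;
        }
    }
    return false;
}

$assignments = [
    ['target' => '$id',   'sources' => ['$_GET'], 'line' => 1],
    ['target' => '$safe', 'sources' => ['$id'],   'call' => 'intval', 'line' => 2],
    ['target' => '$sql1', 'sources' => ['$id'],   'line' => 3],
    ['target' => '$sql2', 'sources' => ['$safe'], 'line' => 4],
];

$vulnerable = isTainted('$sql1', 5, $assignments, SQL_SANITIZERS);
$clean = isTainted('$sql2', 6, $assignments, SQL_SANITIZERS);
```

Treating an unassigned variable as clean is a simplification that would miss variables registered through extract or parse_str, which is one reason those calls are collected separately.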
0x03
As the description above shows, implementing an automated audit tool involves many pitfalls. I ran into a great many difficulties in my own attempt. Static analysis also has real limitations: the result of a string transformation, for instance, is easy to obtain with dynamic analysis but hard to obtain statically. This is not something a technical breakthrough will fix; it is inherent in static analysis. So to reduce the false positives and false negatives of static analysis, some dynamic ideas eventually have to be brought in, such as simulating the code passed to eval and handling string transformation functions and regular expressions. In addition, in applications built on MVC frameworks such as the CI framework, the code is quite scattered; the data sanitization code, for example, may live in an extension of the input class. For PHP applications like these, I think a general-purpose audit framework is hard to achieve, and they should be handled separately.