1. Overview
PHP (this article is based on PHP version 7.1.3) is a dynamic scripting language. Its Zend virtual machine executes a script as follows: the script string is read in, the lexer converts it into tokens, the parser recognizes the grammatical structure and builds an abstract syntax tree, the static compiler then generates opcodes from that tree, and the interpreter executes each opcode by simulating machine instructions.
Throughout this process, the generated opcodes can be simplified by optimizations such as dead-code elimination, conditional constant propagation, and function inlining, improving the performance of the code.
The PHP extension Opcache caches the generated opcodes in shared memory, and on top of that adds static compiler optimization of the opcodes. The optimizations described here are typically managed by an optimizer (Optimizer); in compiler terminology, each individual optimization is usually called an optimization pass (opt pass).
Overall, passes fall into two kinds:
Analysis passes, which provide data-flow and control-flow analysis information as auxiliary input for the transformation passes;
Transformation passes, which change the generated code by adding and removing instructions, replacing instructions, and reordering instructions. The generated code can usually be dumped before and after each pass to observe the changes (a deliberately simplified sketch of such a pass follows).
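As an illustration of a transformation pass, the sketch below compacts an op_array by dropping ZEND_NOP instructions. It assumes the Zend engine headers (zend_compile.h) are available and uses the op_array fields opcodes and last as they exist in PHP 7.1; a real pass, such as Opcache's NOP-removal pass, must additionally fix up jump targets, try/catch ranges and live ranges, all of which this toy omits.

    /* Toy transformation pass: compact an op_array by dropping ZEND_NOP
     * instructions. Assumes Zend headers (zend_compile.h) are included.
     * A real pass must also repair jump targets and try/catch/live ranges. */
    static void toy_remove_nops(zend_op_array *op_array)
    {
        uint32_t dst = 0;
        for (uint32_t src = 0; src < op_array->last; src++) {
            if (op_array->opcodes[src].opcode != ZEND_NOP) {
                op_array->opcodes[dst++] = op_array->opcodes[src];
            }
        }
        op_array->last = dst;  /* opcode count after compaction */
    }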
Starting from compiler theory, and taking as its starting point the optimizer provided by the opcache extension, PHP's basic compilation unit op_array, and PHP's smallest execution unit, the opcode, this article describes how compiler optimization techniques are applied in the Zend virtual machine and walks through how each step optimizes the opcodes to improve the performance of the executed code. Finally, it offers some outlook on the implementation of the PHP virtual machine.
2. A few concepts
1) Static compilation / interpreted execution / just-in-time compilation
Static compilation, also known as ahead-of-time compilation (AOT), compiles the source code into target code, which is then executed on a platform that supports running that target code.
Dynamic compilation, i.e. "compiling at run time", is the counterpart of static compilation. Typically an interpreter is used, which interprets and executes the source language piece by piece.
JIT compilation (just-in-time compilation) in the narrow sense means compiling a piece of code the first time it is about to be executed, and afterwards executing the compiled result directly without compiling again; it is a special case of dynamic compilation.
Roughly speaking, the three execution processes differ in when compilation to target code happens: AOT compiles everything before the program runs, an interpreter translates and executes as it goes, and JIT compiles code at run time and then runs the compiled result.
2) Data flow/control flow
Compiler optimization requires obtaining enough information from the program; this is the foundation of all compiler optimizations.
The output of the compiler front end may be a syntax tree or some kind of low-level intermediate code, but whatever form it takes, it does not by itself reveal much about what the program does or how it does it. The compiler leaves the task of discovering the hierarchical flow of control within each procedure to control-flow analysis, and the task of determining global information about how data is manipulated to data-flow analysis.
Control-flow analysis is a formal method for analyzing control-structure information and is the basis of data-flow analysis and dependence analysis. Its basic model is the control flow graph (CFG). For a single procedure there are two approaches: finding loops via dominator nodes, and interval analysis.
Data-flow analysis collects semantic information from the program code and determines, at compile time and by algebraic methods, where variables are defined and used. Its basic model is the data flow graph (DFG). Data-flow analysis is usually control-tree based (control-tree-based data-flow analysis), and the algorithms fall into two kinds: interval analysis and structural analysis.
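For illustration only, the following is a generic basic-block node of the kind a control flow graph is built from. It is not the Zend engine's own zend_basic_block definition, just a minimal sketch of the data a CFG node carries.

    /* Minimal, engine-agnostic sketch of a CFG node: a basic block with its
     * instruction range and its predecessor/successor edges. */
    typedef struct basic_block {
        int start;                         /* index of the first instruction in the block */
        int len;                           /* number of instructions in the block */
        struct basic_block **successors;   /* blocks control may flow to */
        int successors_count;
        struct basic_block **predecessors; /* blocks control may arrive from */
        int predecessors_count;
    } basic_block;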
3) Op_array
This is a concept similar to a stack frame in C: the basic unit of a running program (one frame), typically the basic unit of a function call. In PHP, a function or method, an entire PHP script file, and a string of PHP code passed to eval are each compiled into an op_array.
op_array is implemented as a structure that contains all the information the program needs to run within this basic unit. The opcode array is of course its most important field, but the structure also holds variable types, comment information, exception-handling (try/catch) information, jump information, and so on.
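An abridged sketch of the structure is shown below. The field names follow struct _zend_op_array in PHP 7.1's Zend/zend_compile.h, but they are reproduced from memory, so treat the exact layout as approximate and consult the header for the authoritative definition.

    /* Abridged sketch of struct _zend_op_array (Zend/zend_compile.h, PHP 7.1);
     * field order and completeness are approximate. */
    struct _zend_op_array {
        zend_uchar        type;             /* function / method / top-level / eval code */
        zend_string      *function_name;
        zend_class_entry *scope;            /* owning class for methods */
        uint32_t          num_args;
        zend_arg_info    *arg_info;         /* argument (type) information */

        uint32_t          last;             /* number of opcodes */
        zend_op          *opcodes;          /* the opcode array itself */

        int               last_var;         /* number of CV variables */
        uint32_t          T;                /* number of TMP/VAR slots */
        zend_string     **vars;             /* CV names */

        zend_try_catch_element *try_catch_array; /* exception-handling ranges */
        HashTable        *static_variables;
        zend_string      *filename;
        zend_string      *doc_comment;      /* comment information */
        zval             *literals;         /* constant (literal) table */
        int               last_literal;
        void            **run_time_cache;
        /* ... further fields omitted ... */
    };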
4) opcode
The interpreter (Zend VM) executes the opcodes of a basic unit op_array sequentially, executing the current opcode and fetching the next one, until the special final RETURN opcode exits the unit.
The opcode here is somewhat similar to the intermediate representation of a static compiler (e.g. LLVM IR). It usually takes the form of three-address code, consisting of an operator, two operands, and the result of the operation. The two operands carry type information, of which there are five kinds:
Compiled variable (CV): a variable defined in the PHP script, known at compile time.
Reusable internal variable (VAR): a temporary variable used by the Zend VM that can be shared with other opcodes.
Non-reusable internal variable (TMP_VAR): a temporary variable used by the Zend VM that cannot be shared with other opcodes.
Constant (CONST): a read-only constant whose value cannot be changed.
Unused operand (UNUSED): since opcodes use three-address code, not every opcode uses both operand fields; this type marks an operand slot that is not used.
The operand type information, together with the operator, is used by the executor to select a specific precompiled C handler template, which simulates machine instructions to execute the opcode (the corresponding type flags are sketched below).
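For reference, the five operand kinds correspond to flag values defined in Zend/zend_compile.h; the values below are the usual PHP 7 ones, reproduced from memory rather than copied from the source tree.

    /* Operand type flags (Zend/zend_compile.h, PHP 7.x; values from memory). */
    #define IS_CONST    (1<<0)   /* read-only constant from the literal table */
    #define IS_TMP_VAR  (1<<1)   /* non-reusable temporary */
    #define IS_VAR      (1<<2)   /* reusable internal variable */
    #define IS_UNUSED   (1<<3)   /* operand slot not used by this opcode */
    #define IS_CV       (1<<4)   /* compiled variable defined in the script */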
An opcode is represented in the Zend VM by the zend_op structure, whose main fields are as follows:
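The definition below is reproduced from memory from PHP 7.1's Zend/zend_compile.h and lightly commented; fields that depend on the jump-address encoding are simplified, so treat it as a sketch rather than the authoritative source.

    struct _zend_op {
        const void *handler;        /* C handler selected from opcode + operand types */
        znode_op    op1;            /* first operand */
        znode_op    op2;            /* second operand */
        znode_op    result;         /* result operand */
        uint32_t    extended_value; /* extra per-opcode data */
        uint32_t    lineno;         /* source line, for errors/backtraces */
        zend_uchar  opcode;         /* the operation, e.g. ZEND_ADD, ZEND_DO_FCALL */
        zend_uchar  op1_type;       /* IS_CONST / IS_TMP_VAR / IS_VAR / IS_UNUSED / IS_CV */
        zend_uchar  op2_type;
        zend_uchar  result_type;
    };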
3. The Opcache optimizer (Optimizer)
After lexical analysis and parsing, a PHP script becomes an abstract syntax tree, from which static compilation generates opcodes. Opcode is a common instruction platform whose execution depends on the particular virtual machine implementation (for PHP this essentially means the Zend VM).
Before the virtual machine executes the opcodes, optimizing them makes the code execute more efficiently. That is the role of a pass: it operates on the opcodes, processes and analyzes them, finds optimization opportunities, and rewrites the opcodes into code with higher execution efficiency.
1) Introduction to the Zend VM optimizer
In the Zend virtual machine (Zend VM), the static code optimizer of Opcache is the Zend opcode optimizer.
It also provides optimization and debugging options, so that which optimizations run can be controlled and their effect observed:
opcache.optimization_level (default 0xffffffff): the optimization level. By default most optimizations are enabled; the user can also turn them off by passing a different value, e.g. on the command line.
opcache.opt_debug_level: the debug level, off by default; when enabled it dumps the opcode before and after optimization so the transformation process can be observed.
The script context information needed for static optimization is encapsulated in the zend_script structure, as follows:
    typedef struct _zend_script {
        zend_string   *filename;       /* file name */
        zend_op_array  main_op_array;  /* main stack frame (the script's top-level op_array) */
        HashTable      function_table; /* symbol table of function units */
        HashTable      class_table;    /* symbol table of class units */
    } zend_script;
These three units of information (the main op_array, the function table, and the class table) are passed to the optimizer as its input for analysis and optimization. The optimizer, like an ordinary PHP extension component, lives inside the opcache extension alongside the opcode cache module (zend_accel). It exposes three internal APIs to the cache accelerator (rough prototypes are sketched after the list):
zend_optimizer_startup: optimizer startup
zend_optimize_script: the main optimization logic
zend_optimizer_shutdown: cleanup of resources created by the optimizer
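Rough shapes of these three entry points are sketched below. The zend_optimize_script signature matches the one shown later in this article; the startup/shutdown prototypes are written from memory and should be treated as approximate.

    /* Internal optimizer entry points used by the opcode cache
     * (ext/opcache/Optimizer/zend_optimizer.h; approximate). */
    int zend_optimizer_startup(void);
    int zend_optimize_script(zend_script *script,
                             zend_long optimization_level,
                             zend_long debug_level);
    int zend_optimizer_shutdown(void);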
The opcode cache itself is also a very important optimization for opcodes. Its basic principle is as follows:
Although PHP is a dynamic scripting language, it neither invokes a complete compiler toolchain such as GCC/LLVM nor a pure front-end compiler such as javac. Yet every time a PHP script is requested and executed, it goes through the full lifecycle of lexical analysis, parsing, compilation to opcode, and VM execution.
Apart from execution, the first three steps are essentially a complete front-end compiler, and that compilation is not fast. If the same script is executed repeatedly, the compilation time of those three steps severely constrains performance, while the opcodes produced by each compilation do not change. The opcodes can therefore be cached somewhere after the first compilation; the opcache extension caches them in shared memory (Java, by comparison, saves compiled bytecode to files), and the next time the same script is executed the opcodes are fetched directly from shared memory, eliminating the compilation time.
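The idea can be sketched as follows. Here shared_memory_lookup, shared_memory_store and compile_file_to_op_array are hypothetical helpers standing in for opcache's real shared-memory and compilation machinery, so this is an illustration of the principle rather than opcache's actual code path.

    /* Illustrative compile-once, reuse-afterwards flow (hypothetical helpers). */
    zend_op_array *get_cached_op_array(const char *filename)
    {
        zend_op_array *cached = shared_memory_lookup(filename);       /* hypothetical */
        if (cached != NULL) {
            return cached;               /* cache hit: skip lexing/parsing/compiling */
        }
        zend_op_array *compiled = compile_file_to_op_array(filename); /* hypothetical */
        shared_memory_store(filename, compiled);                      /* hypothetical */
        return compiled;                 /* first request pays the compilation cost */
    }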
That, in outline, is the opcode caching flow of the opcache extension. Because this article focuses on static optimization, the concrete implementation of the cache is not expanded on here.
2) How the Zend VM optimizer works
According to "Whale book" ("Advanced compiler Design and implementation"), an optimization compiler is more reasonable to optimize the sequence as follows:
The optimizations involved range from simple constant and dead-code optimizations to loops and branch jumps, from function calls to interprocedural optimizations, and from prefetching and caching to software pipelining and register allocation, supported, of course, by data-flow and control-flow analysis.
Of course, the current opcode optimizer does not implement all of the above, and it has no need to implement machine-dependent optimizations on a low-level intermediate representation, such as register allocation.
After the Opcache optimizer receives the script information described above, it locates the minimal compilation units. On this basis, the registration of each pass can be controlled through the pass's macro and its corresponding optimization-level macro.
The registered optimizations are organized in a fixed sequence, including transformation passes such as constant optimization, redundant NOP removal, and function-call optimization, and analysis passes such as data-flow analysis, control-flow analysis, and call-graph analysis.
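The gating pattern looks roughly like the sketch below, modeled on ext/opcache/Optimizer/zend_optimizer.c in PHP 7.1; the macro and function names are reproduced from memory, so regard them as approximate. Each pass runs only if its bit is set in the configured optimization level.

    /* Simplified sketch of how zend_optimize() gates passes on the configured
     * optimization level (names approximate). */
    static void run_selected_passes(zend_op_array *op_array, zend_optimizer_ctx *ctx)
    {
        if (ZEND_OPTIMIZER_PASS_1 & ctx->optimization_level) {
            zend_optimizer_pass1(op_array, ctx);  /* constant folding, simple substitutions */
        }
        if (ZEND_OPTIMIZER_PASS_2 & ctx->optimization_level) {
            zend_optimizer_pass2(op_array);       /* constant operations, jump optimization */
        }
        /* ... passes 3 to 15 follow the same pattern ... */
    }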
The flow of zend_optimize_script, and of zend_optimize where the actual optimizations are registered, is as follows:
    zend_optimize_script(zend_script *script, zend_long optimization_level, zend_long debug_level)
    |- zend_optimize_op_array(&script->main_op_array, &ctx)
    |    iterate the constant operands of two-operand instructions and convert them from
    |    run-time to compile-time form (reverse pass 2); run the actual optimization passes
    |    (zend_optimize); then convert the constant operands back from compile-time to
    |    run-time form (pass 2)
    |- iterate the op_arrays in the function table: zend_optimize_op_array(op_array, &ctx)
    |- iterate the methods in the class table: zend_optimize_op_array(op_array, &ctx)
    |    (set static_variables for user functions)
    |- if the DFA pass and the call-graph pass are enabled and the call graph is built successfully:
    |    iterate the constant operands of two-operand instructions, converting them from
    |      run-time to compile-time form (reverse pass 2)
    |    set function return-value information for use by the SSA data-flow analysis
    |    iterate the op_arrays of the call graph, running DFA analysis (zend_dfa_analyze_op_array)
    |    iterate the op_arrays of the call graph, running DFA optimization (zend_dfa_optimize_op_array)
    |    if debugging, dump each op_array of the call graph (after the optimization transform)
    |    if stack-size correction is enabled, adjust the stack sizes (adjust_fcall_stack_size_graph)
    |    iterate all op_arrays of the call graph again, running constant optimization (pass 2)
    |      for the constants newly introduced by the DFA pass
    |    clean up the call-graph op_array resources
    |- otherwise, if stack-size correction is enabled, correct the stack size of main_op_array
    |    and then of each op_array
    |- clean up resources
This part mainly invokes the SSA/DFA/CFG analyses on the opcodes; the analysis passes involved include basic-block (BB) partitioning, the CFG, and DFA (CFG, dominators, liveness, phi nodes, SSA).
The transformation passes for opcodes are concentrated in the function zend_optimize, as follows:
    zend_optimize
    |- if the op_array type is ZEND_EVAL_CODE, do not optimize
    |- if debugging is enabled, dump the contents before optimization
    |- pass 1: constant substitution, compile-time arithmetic, simple operation conversions
    |- pass 2: constant operation conversions, conditional-jump instruction optimization
    |- pass 3: jump instruction optimization, increment conversion
    |- pass 4: function-call optimization (mainly optimizing how functions are called)
    |- pass 5: control-flow-graph (CFG) optimization
    |    build the flow graph
    |    compute data dependencies
    |    partition into basic blocks (BB, the basic unit of data-flow analysis)
    |    optimization inside BB blocks
    |    jump optimization between BB blocks, based on data-flow analysis
    |    removal of unreachable BB blocks
    |    merging of BB blocks
    |    check of variables used across BB blocks
    |    rebuild the optimized op_array from the CFG
    |    destroy the CFG
    |- pass 6/7: data-flow analysis optimization
    |    data-flow analysis (based on static single assignment, SSA)
    |      build SSA
    |        build the CFG: find the corresponding BB block numbers, manage the BB block array,
    |          compute each BB's successors, mark reachable BBs, compute each BB's predecessors
    |        compute the dominator tree
    |        determine whether loops are reducible (mainly by looking at loop back edges)
    |        build the SSA def sets and phi-node locations, then perform SSA renaming based on the phi nodes
    |      compute use-def chains
    |      work out dependencies and successors, and infer types and value ranges
    |    data-flow optimization based on the SSA information: a series of opcode optimizations inside BB blocks
    |    destroy the SSA
    |- pass 9: temporary-variable optimization
    |- pass 10: removal of redundant NOP instructions
    |- pass 11: compaction of the literal (constant) table
There are a few other optimizations as well:
    pass 12: stack-size correction
    pass 15: collection of constant information
    pass 16: function-call optimization, mainly function inlining
In addition, the pass IDs 8, 13 and 14 appear to be reserved. In total there are currently 13 opcode transformation passes that the user can control through options, not counting the data-flow/control-flow analysis passes they rely on.
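For instance, assuming the usual encoding in which pass n is enabled by bit n-1 of opcache.optimization_level (an assumption of this sketch, not something the optimizer is guaranteed to follow for every pass), disabling only the CFG pass (pass 5) while keeping everything else would look like:

    /* Assumption: pass n <-> bit (n-1). Clearing bit 4 switches off pass 5. */
    zend_long optimization_level = 0xffffffff & ~(1 << 4);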
3) Implementation of the function inlining pass
During a function call, overhead normally arises from switching between stack frames: allocating stack space, saving the return address, jumping, returning to the caller, passing back the return value, and reclaiming the stack space. For a callee of suitable size, embedding the entire function body into the caller and avoiding the actual call is therefore a powerful way to improve performance.
Because function calls are strongly tied to the application binary interface (ABI) of the target machine, function inlining in static compilers such as GCC/LLVM is essentially done before instruction generation.
The Zend VM's inlining is a replacement optimization of the FCALL instructions that happens after opcode generation; its pass ID is 16, and its rough flow is:
    |- traverse the opcodes in the op_array, looking for one of the four DO_xCALL opcodes
    |    opcode ZEND_INIT_FCALL
    |    opcode ZEND_INIT_FCALL_BY_NAME: create a new opcode with its operation set to
    |      ZEND_INIT_FCALL, recompute the stack size, update the cache slot, free the
    |      constant-pool literal, and replace the current opline's opcode
    |    opcode ZEND_INIT_NS_FCALL_BY_NAME: create a new opcode with its operation set to
    |      ZEND_INIT_FCALL, recompute the stack size, update the cache slot, free the
    |      constant-pool literal, and replace the current opline's opcode
    |- try function inlining
    |    filter by optimization conditions (each optimization pass usually carries many
    |      restrictions; some scenarios are excluded because there is not enough information
    |      to optimize, or for cost reasons)
    |      a method call (ZEND_INIT_METHOD_CALL): return directly, do not inline
    |      reference parameters: return directly, do not inline
    |      a default parameter that is a named constant: return directly, do not inline
    |    if the called function has a return value, add a ZEND_QM_ASSIGN assignment opcode
    |    if the called function has no return value, insert a ZEND_NOP opcode
    |    delete the call opcode of the inlined function (i.e. the opline preceding the current one)
In the example code below, the call $fname() invokes the function foo dynamically through the string variable $fname rather than calling it directly. The generated opcodes can be viewed with the VLD extension, or by enabling the Opcache debug option (opcache.opt_debug_level=0xffffffff).
    function foo() {}
    $fname = 'foo';
    $fname();
With the debug dump enabled, the opcode sequence before function-call optimization (only a fragment is shown) is:
    ASSIGN CV0($fname) string("foo")
    INIT_FCALL_BY_NAME 0 CV0($fname)
    DO_FCALL_BY_NAME
The execution logic of INIT_FCALL_BY_NAME is relatively complex. When aggressive inline optimization is enabled, the instruction sequence above can be merged directly into a single DO_FCALL string("foo") instruction, eliminating the overhead of the indirect call. This is exactly the same opcode that a direct call to foo() would generate.
4) How to add an optimization pass to the Opcache optimizer
Given the above, adding a pass to the current optimizer is not too difficult; the steps are roughly as follows (a minimal skeleton is sketched after the steps):
Register a pass macro with the zend_optimize optimizer (for example, add a PASS 17) and decide which optimization level enables it.
Have the optimization manager call the new pass before or after an existing pass. For a tail-recursion optimization pass, for example, it is recommended to place it after the DFA/SSA analysis passes, since more analysis information is available at that point.
Implement the new pass to perform the custom code transformation (compare zend_optimize_func_calls; a tail-recursion optimization would be implemented analogously). For a new pass the main work is the transformation itself, and the SSA/DFA information can also be used here. Unlike static-compiler optimizations, which mostly operate close to the machine on a low-level intermediate representation, the transformations here are opcode/operand rewrites at the opcode level.
Before implementing a pass, as with function inlining, you typically first collect the information the optimization needs, then rule out the scenarios it does not apply to (such as calls that are not genuinely tail-recursive, or parameter situations that cannot be optimized). Once the optimization is implemented, dump the opcode structure before and after the pass to verify that the change is correct and matches expectations (for example, the final effect of a tail-recursion optimization is to transform the function call into a loop form).
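To make the steps concrete, here is a minimal skeleton of what such a pass could look like, modeled on the existing ones. The macro value, the function name zend_optimizer_pass17 and the wiring comment are invented for this illustration and are not part of the PHP source tree.

    /* Hypothetical pass 17 skeleton (names invented for illustration). */
    #define ZEND_OPTIMIZER_PASS_17 (1<<16)

    static void zend_optimizer_pass17(zend_op_array *op_array, zend_optimizer_ctx *ctx)
    {
        for (uint32_t i = 0; i < op_array->last; i++) {
            zend_op *opline = &op_array->opcodes[i];
            /* Inspect or rewrite opline here, e.g. recognize a tail-recursive
             * call to the enclosing function and rewrite it as a jump. */
            (void) opline;
        }
        (void) ctx;
    }

    /* Wired into zend_optimize() next to the existing passes:
     *   if (ZEND_OPTIMIZER_PASS_17 & ctx->optimization_level) {
     *       zend_optimizer_pass17(op_array, ctx);
     *   }
     */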
4. Some further thoughts
Here are some thoughts on the execution of PHP scripts as a dynamic language, for reference only.
LLVM provides a complete toolchain framework from front end to back end and from static compilation to JIT, and many language virtual machines have tried to integrate it. The Zend VM of the current PHP 7 era has not adopted it, one reason being that the virtual machine's opcodes carry rather complex analysis work. Compared with statically compiled machine code, where each instruction usually does only one thing (often in one CPU clock cycle), an opcode's operands have no fixed type, so a great deal of type checking and conversion has to be done at run time before the operation can be performed, which greatly hurts execution efficiency. Even if the bytecode were JIT-compiled at run time, the compiled code would still resemble the existing interpreter's opcode handling: types would still need to be processed, and zval values could not live directly in registers.
Taking a function call as an example, one can compare existing opcode execution with statically compiled machine-code execution.
Type inference
Without changing the existing opcode design, strengthening type inference so that more type information is available when opcodes execute is one possible way to improve execution performance.
Multi-layer opcode
Since the opcode carries such complex analysis work, could it be decomposed into several layers of normalized intermediate representation (IR)? Each optimization could then choose the layer at which it applies. Traditional compilers divide the intermediate representation, by the amount of information it carries from the abstract high-level language down to machine code, into a high-level IR (HIR), a medium-level IR (MIR), and a low-level IR (LIR).
Pass Management
Regarding the optimizer's management of opcode passes, as described in the whale book mentioned earlier, there should be room for improvement. Although the current optimizations rely on data-flow/control-flow analysis, interprocedural analysis optimizations are still missing, and in terms of pass management (run order, run count, registration management, dumping the information of complex analysis passes) there is still a large gap compared with mature frameworks such as LLVM.
JIT
Much of the Zend VM's handling of zval values, type conversions and so on could be compiled into machine code at run time with LLVM, though at the cost of rather slow compilation; libjit is another possible choice.