Some time ago I tried a project to automatically generate a .so extension from our PHP code and compile it into PHP. I called it phptoc.
However, the project was suspended for various reasons.
I wrote this article because there is too little material on the subject, and also to summarize what I learned for future reference. If you can understand PHP syntax analysis, your study of the PHP source code will go a step further ^.^
I will try to keep it easy to understand.
The idea for this project came from HipHop, Facebook's open-source project.
In fact, I am skeptical about the 50%-60% performance improvement claimed for it. If PHP uses the APC cache, is its performance really lower than HipHop's? I have not run a test yet, so I dare not assert anything.
With phptoc, I just want to free up C programmers. The hope is that PHP developers can write an extension in PHP code whose performance is close to that of a native PHP extension.
The process is as follows: read the PHP file, parse the PHP code, run it through a syntax analyzer, generate the corresponding Zend API calls, and compile the result into an extension.
Now to the main topic.
The most difficult part here is the syntax analyzer. As everyone probably knows, PHP has its own syntax analyzer; the current version uses re2c and Bison.
So I naturally use the same combination.
Using PHP's own syntax analyzer directly is not realistic, because you would have to modify zend_language_parser.y and zend_language_scanner.l and recompile them. That is difficult and might affect PHP itself.
So I decided to write my own set of syntax analysis rules, which amounts to rewriting the PHP syntax analyzer. Of course, some of the less commonly used constructs are left out.
re2c and yacc/bison each read their own rule file and turn it into a *.c file, which gcc then compiles into our own program. So they are not syntax analysis programs themselves; they only generate a standalone C file from our rules.
That C file is the real syntax analysis program we need, so I prefer to call these tools syntax generators. For example:
Note: in the figure, a.c is the final code generated by the scanner.
re2c handles the scanner. If the scan rule file we write is called rule.l, the scanner reads the content of our PHP file and, according to the rules we wrote, generates different tokens and passes them to the parser.
The yacc/bison grammar rules we write go in a file that we will call parse.y.
yacc/bison compiles it into parse.tab.h and parse.tab.c. The parser performs different operations based on different tokens.
For example, our PHP code is "echo 1";
The scanner has one rule:
"echo" {
    return T_ECHO;
}
The scanner function scan receives the string "echo 1" and loops through it. When it finds the string echo, it returns the token T_ECHO for that keyword.
parser.l and parse.y generate two C files, compiler.c and parse.tab.c respectively, which are compiled together with gcc.
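To make the scanning step concrete, here is a minimal hand-written sketch of what a scanner for these rules does. It is not the code re2c generates; the token values and the pointer-to-pointer interface are assumptions chosen for illustration (bison would normally supply the token codes in parse.tab.h).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; bison would normally generate these. */
enum { T_OPEN_TAG = 258, T_ECHO, T_LNUMBER };

/*
 * Return the next token starting at *cursor and advance *cursor past it.
 * Returns 0 at end of input and -1 on a character that no rule matches.
 */
int scan(const char **cursor)
{
    const char *p = *cursor;
    assert(p != NULL);

    while (*p == ' ' || *p == '\t' || *p == '\n')   /* skip whitespace */
        p++;

    if (*p == '\0') {
        *cursor = p;
        return 0;
    }
    if (strncmp(p, "echo", 4) == 0 && !isalnum((unsigned char)p[4])) {
        *cursor = p + 4;                 /* rule: "echo" { return T_ECHO; } */
        return T_ECHO;
    }
    if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))          /* rule: [0-9]+ */
            p++;
        *cursor = p;
        return T_LNUMBER;
    }
    *cursor = p + 1;                     /* unknown character: skip it */
    return -1;
}
```

Calling scan repeatedly on the string "echo 1" yields T_ECHO, then T_LNUMBER, then 0. The caller owns the cursor, so each call resumes where the previous one stopped.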
The following is a detailed description.
If you are interested, you can read it. I have also translated a Chinese version, which I will release later.
re2c provides some macro interfaces for us to use. I have translated them roughly, but my English is poor and the translation may contain mistakes. If you need the original text, you can find it at the address above.
Interface code:
Unlike other scanner generators, re2c does not generate a complete scanner: you must supply some interface code yourself. You must define the following macros or use the corresponding in-place configurations.
YYCONDTYPE
In -c mode you can use the -to parameter to generate a file containing the enumeration type used for conditions. Each of its values refers to a condition of a rule set.
YYCTYPE
The type used to hold an input symbol. It is usually char or unsigned char.
YYCTXMARKER
An l-expression of type YYCTYPE *. The generated code saves trailing-context backtracking information in YYCTXMARKER. You only need to define this macro if the scanner rules use trailing context in one or more regular expressions.
YYCURSOR
An l-expression of type YYCTYPE * that points to the current input symbol. The generated code advances YYCURSOR as symbols are matched. On entry, YYCURSOR is assumed to point to the first character of the current token. On exit, YYCURSOR points to the first character of the next token.
YYDEBUG(state, current)
Needed only when the -d flag is specified. It makes the generated code easy to debug by calling a user-defined function for every state.
The function should have the signature void YYDEBUG(int state, char current). The first parameter receives the state, or -1; the second receives the input at the current cursor position.
YYFILL(n)
When the buffer needs filling, the generated code calls YYFILL(n), which should provide at least n more characters. YYFILL(n) should adjust YYCURSOR, YYLIMIT, YYMARKER, and YYCTXMARKER as needed. Note that for typical programming languages, n is the length of the longest keyword plus one. You can place a /*!max:re2c */ comment once to insert a definition of YYMAXFILL set to that maximum length. If the -1 flag is used, YYMAXFILL can be triggered only once, after the last /*!re2c */ block.
YYGETCONDITION()
In -c mode, this macro is used to get the condition before entering the scanner code. The value must be initialized with a value of the YYCONDTYPE enumeration type.
YYGETSTATE()
You only need to define this macro if the -f flag is specified. In that case, the generated code calls YYGETSTATE() at the very start of the scanner to obtain the saved state. YYGETSTATE() must return a signed integer. The value must be either -1, which tells the scanner that this is its first run, or a value previously saved by YYSETSTATE(s). In the latter case, the scanner resumes right after the point where the last YYFILL(n) was called.
YYLIMIT
An expression of type YYCTYPE * that marks the end of the buffer (YYLIMIT[-1] is the last character in the buffer). The generated code repeatedly compares YYCURSOR with YYLIMIT to decide when the buffer needs filling.
YYSETCONDITION(c)
This macro is used to set the condition in transition rules. It is only used in -c mode when transition rules are present.
YYSETSTATE(s)
You only need to define this macro if the -f flag is specified. In that case, the generated code calls YYSETSTATE(s) just before calling YYFILL(n). The YYSETSTATE parameter is a signed integer that uniquely identifies a specific YYFILL(n) instance.
YYMARKER
An l-expression of type YYCTYPE *. The generated code saves backtracking information in YYMARKER. Some simple scanners may not use it.
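The relationship between YYCURSOR, YYLIMIT, and YYFILL(n) is easier to see in running code. The sketch below is hand-written in the style of generated code, not actual re2c output. The Input struct, the fill function, and the 4-byte buffer are assumptions made for illustration; only the macro names come from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define YYCTYPE char

typedef struct {
    const char *src;   /* source text not yet copied into buf */
    YYCTYPE buf[4];    /* deliberately tiny so that refills happen */
    YYCTYPE *cursor;   /* plays the role of YYCURSOR */
    YYCTYPE *limit;    /* plays the role of YYLIMIT */
    int fills;         /* how many times the buffer was refilled */
} Input;

void input_init(Input *in, const char *text)
{
    assert(in != NULL && text != NULL);
    in->src = text;
    in->cursor = in->limit = in->buf;   /* empty buffer: first match refills */
    in->fills = 0;
}

/* Refill: shift unread bytes to the front, then append more source text. */
static void fill(Input *in)
{
    size_t unread = (size_t)(in->limit - in->cursor);
    size_t room, take;

    memmove(in->buf, in->cursor, unread);
    room = sizeof in->buf - unread;
    take = strlen(in->src);
    if (take > room)
        take = room;
    memcpy(in->buf + unread, in->src, take);
    in->src += take;
    in->cursor = in->buf;               /* YYCURSOR adjusted by the refill */
    in->limit = in->buf + unread + take; /* YYLIMIT adjusted by the refill */
    in->fills++;
}

/* These macros assume a local variable named `in` is in scope. */
#define YYCURSOR  (in->cursor)
#define YYLIMIT   (in->limit)
#define YYFILL(n) fill(in)

/* Match [0-9]+ at the cursor; return its length (0 if no digit is there). */
size_t match_number(Input *in)
{
    size_t len = 0;
    for (;;) {
        if (YYLIMIT - YYCURSOR < 1)
            YYFILL(1);                  /* ask for at least one more char */
        if (YYCURSOR == YYLIMIT)        /* refill found no more input */
            break;
        if (*YYCURSOR < '0' || *YYCURSOR > '9')
            break;
        ++YYCURSOR;
        ++len;
    }
    return len;
}
```

On exit the cursor sits on the first character after the number, which matches the YYCURSOR contract described above. A real scanner would also keep a token-start pointer and adjust it inside the refill; that is omitted here to keep the sketch short.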
A scanner, as its name implies, scans the file and picks out the tokens.
Scanner file structure:
/* #include files */
/* macro definitions */
// the scan function
int scan(char *p) {
    /* scanner rules section */
}
// Run the scan function and return the token to yacc/bison.
int yylex() {
    int token;
    char *p = YYCURSOR; // YYCURSOR is a pointer into the PHP source text
    while ((token = scan(p))) { // scan moves along the text, checking it against the rules defined above
        return token;
    }
    return 0; // 0 tells yacc/bison that the input has ended
}
int main(int argc, char **argv) {
    BEGIN(INITIAL);
    YYCURSOR = argv[1]; // YYCURSOR is a pointer into the PHP source text
    yyparse();
}
BEGIN is a macro defined as follows:
#define YYCTYPE char // type of the input symbol
#define STATE(name) yyc##name
#define BEGIN(n) YYSETCONDITION(STATE(n))
#define LANG_SCNG(v) (SC_globals.v)
#define SCNG LANG_SCNG
#define YYGETCONDITION() SCNG(yy_state)
#define YYSETCONDITION(s) SCNG(yy_state) = s
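The way these macros expand can be checked with a small self-contained sketch. The condition enum and the SC_globals struct are assumptions made so that the macros compile on their own; in a real scanner, re2c in -c mode can generate the enum.

```c
#include <assert.h>

/* Condition enum; the two values here are assumed for illustration. */
enum YYCONDTYPE { yycINITIAL, yycST_IN_SCRIPTING };

/* Hypothetical globals struct standing in for the scanner's state. */
struct scanner_globals { enum YYCONDTYPE yy_state; };
static struct scanner_globals SC_globals;

#define STATE(name)        yyc##name          /* INITIAL -> yycINITIAL */
#define BEGIN(n)           YYSETCONDITION(STATE(n))
#define SCNG(v)            (SC_globals.v)
#define YYGETCONDITION()   SCNG(yy_state)
#define YYSETCONDITION(s)  (SCNG(yy_state) = (s))

/* Switch to scripting mode once the open tag has been seen. */
enum YYCONDTYPE enter_scripting(void)
{
    assert(YYGETCONDITION() == yycINITIAL);
    BEGIN(ST_IN_SCRIPTING);  /* sets SC_globals.yy_state to yycST_IN_SCRIPTING */
    return YYGETCONDITION();
}
```

The ## operator pastes the yyc prefix onto the condition name, so rules can say BEGIN(ST_IN_SCRIPTING) without spelling out the enum constant.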
The yyparse function is defined by yacc.
It contains a key macro, YYLEX:
#define YYLEX yylex()
This macro calls the scanner's yylex.
This may be a bit confusing:
main in parser.l calls yyparse, the parser function generated from parse.y. yyparse calls yylex in parser.l to produce a token, and yylex returns the token from the scanner to parse.y. The parser then executes different code depending on the token.
Example:
rule.l
#include "example.h"
#include "parse.tab.h"
int scan(char *p) {
/*!re2c
<INITIAL>"<?php"([ \t]|{NEWLINE})? {
    BEGIN(ST_IN_SCRIPTING);
    return T_OPEN_TAG;
}
"echo" {
    return T_ECHO;
}
[0-9]+ {
    return T_LNUMBER;
}
*/
}
int yylex() {
    int c;
    // return T_STRING;
    int token;
    char *p = YYCURSOR;
    while ((token = scan(p))) {
        return token;
    }
    return 0; // end of input
}
int main(int argc, char **argv) {
    BEGIN(INITIAL); // initialization
    YYCURSOR = argv[1]; // put the string you entered into YYCURSOR
    yyparse(); // yyparse() -> yylex() -> yyparse()
    return 0;
}
That completes a simple scanner.
What about the parser?
For the parser I use bison...
Structure of the grammar file (the same three-part layout that flex uses):
%{
/*
 The C code in this section is copied into the C source file that is generated.
 Global variables, arrays, and function prototypes can be defined here...
*/
#include <stdio.h> /* needed for printf */
#include "example.h"
extern int yylex(); // defined in rule.l
void yyerror(char *);
#define YYPARSE_PARAM tsrm_ls
#define YYLEX_PARAM tsrm_ls
%}
{definition section, where the tokens are declared}
// These are the tokens; the generated parser switches on them.
%token T_OPEN_TAG
%token T_ECHO
%token T_LNUMBER
%%
{rules section}
start:
    T_OPEN_TAG { printf("start\n"); }
  | start statement
;
statement:
    T_ECHO expr { printf("echo: %s\n", $2); }
;
expr:
    T_LNUMBER { $$ = $1; }
;
%%
{user code section}
void yyerror(char *msg) {
    printf("error: %s\n", msg);
}
In the rules section, start is the start symbol. If scan recognizes the PHP open tag, it returns T_OPEN_TAG; the parser then runs the code in the braces and prints start.
In consumer.l, scan runs inside a while loop, so it checks all the way to the end of the PHP code.
yyparse switches on the token returned by scan and jumps to the corresponding code. For example, when yyparse finds that the current token is T_OPEN_TAG,
it uses the #line directive to map to line 21 of parse.y, where T_OPEN_TAG appears, and then runs that action.
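The flow from tokens to actions can be sketched by hand for this small grammar. This is not what bison generates (bison emits a table-driven parser); it is a hand-written loop that accepts the same input. The Token and Parser types and the pre-scanned token array are assumptions made for illustration, with next_token standing in for yylex and the value field standing in for yylval.

```c
#include <assert.h>
#include <stddef.h>

enum { T_OPEN_TAG = 258, T_ECHO, T_LNUMBER };

/* One scanned token: its code plus a semantic value. */
typedef struct { int code; long value; } Token;

typedef struct {
    const Token *tokens;  /* pre-scanned input, terminated by code 0 */
    size_t pos;
    long echoed[8];       /* values passed to "echo", in order */
    size_t n_echoed;
} Parser;

static Token next_token(Parser *ps)        /* stand-in for yylex() */
{
    Token t = ps->tokens[ps->pos];
    if (t.code != 0)
        ps->pos++;
    return t;
}

/*
 * Hand-written equivalent of:
 *   start: T_OPEN_TAG | start statement ;
 *   statement: T_ECHO expr ;
 *   expr: T_LNUMBER ;
 * Returns 0 on success, -1 on a syntax error.
 */
int parse(Parser *ps)
{
    Token t;
    assert(ps != NULL && ps->tokens != NULL);
    ps->pos = 0;
    ps->n_echoed = 0;

    if (next_token(ps).code != T_OPEN_TAG)   /* start: T_OPEN_TAG */
        return -1;

    while ((t = next_token(ps)).code != 0) { /* start: start statement */
        Token expr;
        if (t.code != T_ECHO)
            return -1;
        expr = next_token(ps);               /* expr: T_LNUMBER { $$ = $1; } */
        if (expr.code != T_LNUMBER)
            return -1;
        if (ps->n_echoed == sizeof ps->echoed / sizeof ps->echoed[0])
            return -1;                       /* demo buffer is full */
        ps->echoed[ps->n_echoed++] = expr.value; /* the action sees expr as $2 */
    }
    return 0;
}
```

Note that the value recorded for each echo comes from the token's semantic value. In a bison parser the scanner has to store that value in yylval before returning T_LNUMBER, otherwise $1 in the expr rule has nothing meaningful to copy.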
So what happens to the token after it is returned to yyparse?
To make this more concrete, I traced it with gdb:
At this point yychar is 258. What is 258?
258 is an enumeration value automatically generated by bison.
Continuing:
The YYTRANSLATE macro takes yychar and returns the corresponding value.
#define YYTRANSLATE(YYX) \
    ((unsigned int) (YYX) <= YYMAXUTOK ? yytranslate[YYX] : YYUNDEFTOK)
/* YYTRANSLATE[YYLEX] -- Bison symbol number corresponding to YYLEX. */
static const yytype_uint8 yytranslate[] =
{
0, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 27, 2,
22, 23, 2, 2, 28, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 21,
2, 26, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 24, 2, 25, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20
};
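The lookup can be reproduced in a few lines. The sketch below builds a reduced table at run time instead of copying the generated one. It assumes ten entries per row in the table above, so that index 59 (';') holds 21, index 256 holds 1, and indexes 258 to 275 hold 3 to 20. The names table and build_table are hypothetical.

```c
#include <assert.h>

#define YYMAXUTOK   275   /* largest token number, matching the table above */
#define YYUNDEFTOK  2     /* internal number for an unknown token */

static unsigned char table[YYMAXUTOK + 1];

/* Fill the table the way the generated yytranslate array is laid out. */
void build_table(void)
{
    int i;
    for (i = 0; i <= YYMAXUTOK; i++)
        table[i] = YYUNDEFTOK;          /* default: undefined token */
    table[0] = 0;                       /* end of input */
    table[';'] = 21;                    /* a literal character used by the grammar */
    table[256] = 1;                     /* the error token */
    for (i = 258; i <= YYMAXUTOK; i++)  /* named tokens start at 258 */
        table[i] = (unsigned char)(i - 255);
}

#define YYTRANSLATE(YYX) \
    ((unsigned int) (YYX) <= YYMAXUTOK ? table[YYX] : YYUNDEFTOK)
```

With this table, YYTRANSLATE(258) gives 3: the external token number returned by yylex is mapped to the compact internal symbol number that the parser tables are indexed by. Only a few literal characters are filled in here; the generated table covers every character the grammar uses.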
yyparse takes this value and keeps translating it.
Bison generates many arrays for mapping and stores the final result of the translation in yyn.
In this way, bison can find the code that corresponds to the token.
switch (yyn)
{
case 2:
/* Line 1455 of yacc.c */
#line 30 "parse.y"
{ printf("start\n"); ;}
break;
In this way, tokens are generated and executed one after another, then parsed into the corresponding Zend functions, and the corresponding opcodes are generated and saved in a hash table. These are not the focus of this article,