Foreword
This article looks at the source code of the Zend engine, the virtual machine that executes PHP scripts.
Above the engine sits the SAPI layer, which abstracts each way of running PHP: as an Apache module, through FastCGI, or from the command line. Below it is the Zend virtual machine, which parses files written in PHP syntax and executes them. The upper layer can register functions and variables with the virtual machine for scripts to use. For example, an HTTP request dispatched by Apache passes through PHP's Apache SAPI, which registers global variables such as $_COOKIE and $_GET; in command-line mode no such HTTP-related globals exist.
Like other compilers and interpreters, the Zend engine performs lexical analysis followed by syntax analysis. Syntax analysis produces opcodes, PHP's intermediate code, which the Zend virtual machine then executes. This first article on the Zend engine analyzes the source code of the lexical analysis stage.
PS: The code analyzed is the PHP-5.5.5 source package, which can be downloaded from http://windows.php.net/downloads/releases/php-5.5.5-src.zip.
Lexical analysis
The lexical analysis stage scans the input stream character by character, identifies the lexemes, converts the source file into a sequence of tokens, and hands that sequence to the parser.
The lexical analyzer can also detect some errors in the source code at this stage. For example, the Zend engine's lexer contains this code:
zend_error(E_COMPILE_WARNING, "Unterminated comment starting line %d", CG(zend_lineno));
When the start of a comment /* is detected but no closing */ is found, the Zend engine raises a warning, but subsequent lexical analysis is not affected. The lexical analysis stage generally does not produce serious parse errors, because its job is only to identify the token sequence; it does not need to know how one token relates to another (that is the responsibility of the syntax analysis stage). Even so, the Zend engine's lexer can raise a fatal error that terminates lexical analysis, as follows:
zend_error_noreturn(E_COMPILE_ERROR, "Could not convert the script from the detected "
"encoding \"%s\" to a compatible encoding", zend_multibyte_get_encoding_name(LANG_SCNG(script_encoding)));
This error is raised when the encoding detected from the input stream cannot be converted to a compatible encoding, so the whole parsing process has to stop here.
The Zend engine's lexical analyzer is generated by re2c. Lexical analysis involves a number of state variables, whose names all begin with yy (described below).
Source highlighting
To analyze how execution reaches the lexical analysis stage, I picked one path that is easy to follow.
We use the PHP command line as the entry point. Taking a HelloWorld script as the example, we run php -s HelloWorld.php on the command line; the result is as follows:
php -s highlights source code, which really means coloring each lexeme. Looking at the entry file, the command-line arguments are received in the do_cli function in $PHPSRC/sapi/cli/php_cli.c.
The -s option corresponds to source highlighting.
Right after that, the Zend engine's code highlighting function zend_highlight is called.
The definition of zend_highlight is in $PHPSRC/Zend/zend_highlight.c. zend_highlight() calls the lexer function lex_scan to get each token and then adds the corresponding color.
This is where the flow of lexical analysis really begins.
Lex lexer
The lex file of the Zend engine is $PHPSRC/Zend/zend_language_scanner.l. If you have re2c installed, you can generate the C file with the following command:
re2c -F -c -o zend_language_scanner.c zend_language_scanner.l
We mainly analyze zend_language_scanner.l. As I see it, the lexer generated by re2c contains state machines in two dimensions. The first dimension tracks state at the level of strings, and the second tracks state at the level of individual characters. The second-dimension state machine is the transition from character to character, which we ignore here.
For example, when the Zend engine scans "<?php", it sets the first-dimension state to ST_IN_SCRIPTING, meaning that we are now parsing PHP script. This state can conveniently be used as a precondition in the lex file, which contains many declarations of this form:
<ST_IN_SCRIPTING>"exit" { return T_EXIT; }
This rule means: when the lexer is in the ST_IN_SCRIPTING state and encounters the string "exit", it returns a T_EXIT token (in the Zend engine the token macros begin with T_, and each one corresponds to a number). Names beginning with T_ often show up in syntax error messages. For example, echo "Hello" World!\n"; contains an extra double quote in the string, so compilation fails with an error that mentions a T_STRING token:
Parse error: syntax error, unexpected 'World' (T_STRING), expecting ',' or ';' in /home/raphealguo/tmp/HelloWorld.php on line 2
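To make the idea of a state acting as a precondition more concrete, here is a minimal sketch in plain C. It is not the re2c-generated code: the names next_token, scan_state and the TOK_ values are made up for illustration, and the matching is deliberately cruder than the real rules (the real "<?php" rule also requires trailing whitespace, and a real keyword match would not fire on a longer identifier such as exitfoo).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified names: these are not the real Zend definitions. */
enum scan_state { ST_INITIAL_DEMO, ST_IN_SCRIPTING_DEMO };
enum token { TOK_INLINE_HTML = 1, TOK_OPEN_TAG, TOK_EXIT, TOK_OTHER };

/*
 * Return the token for the text at `s`. The current state decides which
 * rules are eligible; `*state` may change and `*len` receives the lexeme length.
 */
static enum token next_token(const char *s, enum scan_state *state, size_t *len)
{
    assert(s != NULL && state != NULL && len != NULL);

    if (*state == ST_INITIAL_DEMO) {
        /* Outside PHP tags only the opening tag is special. */
        if (strncmp(s, "<?php", 5) == 0) {
            *state = ST_IN_SCRIPTING_DEMO;
            *len = 5;
            return TOK_OPEN_TAG;
        }
        *len = (*s != '\0');
        return TOK_INLINE_HTML;
    }

    /* Rules guarded by the "in scripting" state. */
    if (strncmp(s, "exit", 4) == 0) {
        *len = 4;
        return TOK_EXIT;
    }
    *len = (*s != '\0');
    return TOK_OTHER;
}
```

With this sketch, the text exit is treated as inline content while the state is still the initial one, and only becomes TOK_EXIT after an opening tag has switched the state.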
While scanning characters, the lexer has to record various parameters of the scan and its current state. These variables are all prefixed with yy; the commonly used ones are yy_state, yy_text, yyleng, yy_cursor and yy_limit.
Schematic diagram of changes in the status of each variable before and after scanning.
Before scanning echo:
After scanning echo:
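The before and after states can also be sketched in code. The struct below is a hypothetical miniature of the bookkeeping, with field names borrowed from the variables listed above; scan_word is my own helper, not a Zend function, and it simply consumes a run of ASCII letters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical miniature of the scanner's bookkeeping fields. */
struct mini_scanner {
    const char *yy_text;   /* start of the lexeme just matched */
    size_t      yy_leng;   /* length of that lexeme */
    const char *yy_cursor; /* next character to be scanned */
    const char *yy_limit;  /* one past the end of the buffer */
};

/* Scan one run of letters starting at the cursor. */
static void scan_word(struct mini_scanner *s)
{
    assert(s != NULL && s->yy_cursor <= s->yy_limit);
    s->yy_text = s->yy_cursor;                  /* the lexeme starts here */
    while (s->yy_cursor < s->yy_limit &&
           isalpha((unsigned char)*s->yy_cursor)) {
        s->yy_cursor++;                         /* consume one character */
    }
    s->yy_leng = (size_t)(s->yy_cursor - s->yy_text);
}
```

Before the call, yy_cursor points at the e of echo; afterwards yy_text still points at that e, yy_leng is 4, and yy_cursor points at the character right after echo.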
After character-by-character scanning, the result is a token sequence, which is handed to the parser. Next, we look at how the Zend engine's lex rules are written.
Anatomy of a lex file
Zend lexical parsing state
During lexical analysis the Zend engine maintains the state of the scan itself: variables such as yy_text are wrapped in a structure. The lex file contains many uses of the SCNG macro, for example: SCNG(yy_start) = YYCURSOR;
Searching for #define SCNG leads to this macro definition on line 91 of the lex file:
/* Globals Macros */
#define SCNG LANG_SCNG
This leads to #define LANG_SCNG on line 56 of $PHPSRC/Zend/zend_globals_macros.h (we ignore the ZTS check on line 52, which selects the thread-safe definition):
# define LANG_SCNG(v) (language_scanner_globals.v) // So scanning reads and writes the fields of one global scanner state; for example, SCNG(yy_start) is equivalent to language_scanner_globals.yy_start
extern ZEND_API zend_php_scanner_globals language_scanner_globals;
#endif
So the Zend engine maintains a zend_php_scanner_globals structure (the name is a typedef on line 27 for struct _zend_php_scanner_globals). The _zend_php_scanner_globals structure is defined in $PHPSRC/Zend/zend_globals.h. Its fields partly match the variables of a conventional lex scanner, but it also bundles some stacks and the input and output streams (PHP source does not always come from a file stream, it may also be typed at a terminal, so wrapping the streams here makes sense).
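The accessor-macro pattern is easy to reproduce in isolation. The following sketch uses made-up names (demo_scanner_globals, DEMO_SCNG, demo_set_state) and only three fields; it mirrors the non-thread-safe form of the macro and says nothing about the ZTS variant.

```c
#include <assert.h>

/* Hypothetical cut-down version of the scanner globals structure. */
typedef struct demo_scanner_globals {
    const unsigned char *yy_cursor;
    const unsigned char *yy_text;
    int yy_state;
} demo_scanner_globals;

static demo_scanner_globals demo_globals;

/* Non-ZTS style accessor: the argument is substituted as a field name. */
#define DEMO_SCNG(v) (demo_globals.v)

static void demo_set_state(int s)
{
    assert(s >= 0);
    DEMO_SCNG(yy_state) = s;   /* expands to (demo_globals.yy_state) = s */
}
```

Note that there must be no space between the macro name and its opening parenthesis in the #define, otherwise the preprocessor treats it as an object-like macro.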
Keyword Token
Back in the lex description file, the entry point of the lexical scan mentioned earlier is int lex_scan(zval *zendlval TSRMLS_DC), on line 999 of zend_language_scanner.l.
First come some regular-expression definitions used by the later rules:
For keywords that need no complex processing, the scanner matches the keyword and directly returns the corresponding token, for example:
The lex file contains many rule declarations like this. <ST_IN_SCRIPTING> means that the precondition for scanning this keyword is that the lexer is in the ST_IN_SCRIPTING state. The lex file provides several ways to set the lexer's current state:
#define YYGETCONDITION() SCNG(yy_state)
#define YYSETCONDITION(s) SCNG(yy_state) = s
#define BEGIN(state) YYSETCONDITION(STATE(state))
static void _yy_push_state(int new_state TSRMLS_DC)
{
    // Push the current state on the stack, then set the current state to the new state
    zend_stack_push(&SCNG(state_stack), (void *) &YYGETCONDITION(), sizeof(int));
    YYSETCONDITION(new_state);
}
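The push operation is easier to follow next to its counterpart. Below is a small self-contained sketch of a state stack using a fixed array instead of zend_stack; every name in it (demo_state, demo_push_state, demo_pop_state, DEMO_STACK_MAX) is invented for illustration, and the pop function is my own assumption about how the saved state is restored.

```c
#include <assert.h>

#define DEMO_STACK_MAX 16

static int demo_state = 0;               /* current start condition */
static int demo_stack[DEMO_STACK_MAX];   /* saved start conditions */
static int demo_top = 0;                 /* number of saved states */

/* Save the current state, then switch to new_state. */
static void demo_push_state(int new_state)
{
    assert(demo_top < DEMO_STACK_MAX);
    demo_stack[demo_top++] = demo_state;
    demo_state = new_state;
}

/* Restore the most recently saved state. */
static void demo_pop_state(void)
{
    assert(demo_top > 0);
    demo_state = demo_stack[--demo_top];
}
```

A stack is needed rather than a single variable because states nest: a rule that enters a temporary state must be able to return to whichever state was active before it.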
Enter PHP parsing state
PHP is an embedded language: only code inside <?php ?> or <? ?> tags is parsed. Lines 1732-1805 of the lex file hold the rules that scan opening tags such as <?php. The source code is as follows:
When <?php is scanned, line 1790 sets the lexer's current state to ST_IN_SCRIPTING. HANDLE_NEWLINE increments zend_lineno, the variable that records which line is currently being parsed. Finally the rule returns T_OPEN_TAG.
When the short tag <? is encountered, the rule first checks whether short_tags is enabled in the global settings. If it is not, it jumps (goto) to inline_char_handler, which handles characters that are outside PHP tags.
Line 1732 defines another PHP opening tag: <script language="php"> echo 2; </script>
This rule shows that adding any other attribute to the script tag makes the rule fail to match. For example, <script type="text/x-php" language="php"> echo 2; </script> is not parsed as PHP.
PHP comments
Next, let's look at how PHP comments are scanned. First, find the rule for single-line comments on line 1919:
PHP supports both # and // for single-line comments. In the ST_IN_SCRIPTING state, "#"|"//" triggers single-line comment scanning, which runs from the current character toward the end of the buffer (that is, while (YYCURSOR < YYLIMIT)).
When \r\n or \n is encountered, the current line counter is incremented (zend_lineno++). For better fault tolerance, PHP also accepts syntax such as // ?>, where a line comment runs up against the closing tag ?>. Zend's handling is in the case '?' branch: it steps YYCURSOR back (YYCURSOR--) so that it points at the ? of ?> and then breaks out of the loop. This way the comment does not swallow "?>", which would prevent PHP's closing tag from being recognized later.
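The loop described above can be sketched as a standalone function. This is a simplified reconstruction written for this article, not a copy of the rule: skip_line_comment is a made-up name, the line counter is passed in as a pointer instead of living in the compiler globals, and any handling of other tag styles is left out.

```c
#include <assert.h>
#include <string.h>

/*
 * `cur` points just past "//" or "#". Returns the position where the comment
 * ends. *lineno is incremented if a line terminator is consumed.
 */
static const char *skip_line_comment(const char *cur, const char *limit, int *lineno)
{
    assert(cur != NULL && limit != NULL && lineno != NULL);
    while (cur < limit) {
        char c = *cur++;
        if (c == '\r') {
            if (cur < limit && *cur == '\n') {
                cur++;               /* treat \r\n as one terminator */
            }
            (*lineno)++;
            break;
        }
        if (c == '\n') {
            (*lineno)++;
            break;
        }
        if (c == '?' && cur < limit && *cur == '>') {
            cur--;                   /* step back so "?>" is left for the closing-tag rule */
            break;
        }
    }
    return cur;
}
```

For the input // hi ?> <b> the returned pointer lands on the ? of ?>, so the comment ends there and the closing tag is still available to the next rule.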
The rule for multi-line comments is slightly more complicated:
First, /** corresponds to a PHP doc comment (PHP variables can be written inside the doc block; this comes up again under variable parsing). A while loop then scans forward to the closing */. If the end of the file is reached without finding */, zend_error raises a warning, but later parsing is not affected.
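Here is a minimal sketch of that scan-until-closed loop, assuming a plain character buffer. The function name skip_block_comment and the terminated flag are my own; in the real rule the unterminated case is where the "Unterminated comment" warning shown earlier is raised.

```c
#include <assert.h>
#include <string.h>

/*
 * `cur` points just past the opening delimiter. Returns the position after
 * the closing delimiter, or `limit` if the buffer ends first.
 * *terminated reports which of the two cases occurred.
 */
static const char *skip_block_comment(const char *cur, const char *limit, int *terminated)
{
    assert(cur != NULL && limit != NULL && terminated != NULL);
    while (cur < limit) {
        /* Look for a star immediately followed by a slash. */
        if (*cur++ == '*' && cur < limit && *cur == '/') {
            *terminated = 1;
            return cur + 1;
        }
    }
    *terminated = 0;   /* a caller would report the warning here */
    return limit;
}
```

Because an unterminated comment simply consumes the rest of the buffer, the scanner can keep going and return a comment token instead of aborting.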
PHP Number Type
From the regular-expression definitions at the start of the rules, we can see that PHP supports five forms of numeric literal (LNUM, DNUM, EXPONENT_DNUM, HNUM and BNUM):
In source code a number is really just a run of characters. When the lexer matches one of these five rules, it has to convert the text into a number, store it in zendlval, and return a numeric token. Here is the simplest case, the LNUM rule:
First the rule checks whether the length of the current string could exceed the range of the C long type. If it cannot, strtol is called directly to convert the string to a long.
If it might be out of range for long, Zend still attempts the conversion. If an overflow occurs (errno == ERANGE), the number is converted to a double instead.
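The overflow fallback can be tried outside the engine with the standard library alone. In this sketch the result type demo_number and the function parse_lnum are hypothetical, the length pre-check is skipped, and plain strtod stands in for Zend's own conversion helpers, so an overflowing octal literal would not be handled the way the engine handles it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical result type; the real scanner stores into a zval instead. */
struct demo_number {
    int    is_double;   /* 1 if the value did not fit in a long */
    long   lval;
    double dval;
};

static struct demo_number parse_lnum(const char *text)
{
    struct demo_number n = { 0, 0L, 0.0 };
    assert(text != NULL);

    errno = 0;                          /* strtol only sets errno on failure */
    n.lval = strtol(text, NULL, 0);     /* base 0: a leading 0 selects octal */
    if (errno == ERANGE) {
        n.is_double = 1;                /* overflow: fall back to a double */
        n.dval = strtod(text, NULL);
    }
    return n;
}
```

Resetting errno before the call matters, because strtol leaves it untouched on success and a stale ERANGE from earlier code would otherwise be misread as an overflow.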
The rules for DNUM, BNUM and the others are not covered here, to save space.
PHP variable types
PHP variables start with the dollar sign $, as the lexical rules show:
There are three ways to declare and use a variable: $var, $var->prop and $var["key"].
Note the call to yyless; the yyless macro is defined on line 69:
Because the rule has already consumed "$var->" together with the first character of the property name, and we only want the variable name "var", the YYCURSOR pointer has to be moved back three characters to the "-" of "var->", so yyless(yyleng - 3) is called.
Right after that, zend_copy_value copies the variable name into zendlval, where it is kept for later insertion into the symbol table during the syntax analysis stage.
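The effect of yyless can be illustrated with a few lines of C. The struct demo_lexeme and the function demo_yyless below are stand-ins written for this article, not the macro from the lex file, and the example assumes that the matched text is "$var->p" (seven characters), which is my reading of why the rule backs up by three.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical view of the current match. */
struct demo_lexeme {
    const char *text;    /* start of the match (plays the role of yytext) */
    size_t      leng;    /* length of the match (plays the role of yyleng) */
    const char *cursor;  /* where scanning resumes (plays the role of YYCURSOR) */
};

/* Keep only the first n characters of the match and rewind the cursor. */
static void demo_yyless(struct demo_lexeme *m, size_t n)
{
    assert(m != NULL && n <= m->leng);
    m->cursor = m->text + n;
    m->leng = n;
}
```

After demo_yyless(&m, m.leng - 3) on that match, the lexeme is "$var", the name to copy starts at text + 1 with length leng - 1, and the cursor points at "->" so the next rule can pick up the property access.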
Here is another rule related to $var->prop:
Notice the odd-looking rule on line 1193: why can -> still appear in the ST_LOOKING_FOR_PROPERTY state? After some digging, it turns out that this is to match the second and later -> in chains such as $var->prop1->prop2.
PHP string type
Strings are probably the most complicated part of PHP's lexical analysis stage. A PHP string can be enclosed in single or double quotes. Single-quoted strings are cheaper to scan than double-quoted ones, and the rules show why.
First, the rule for single quotes:
Notice the b?['] at the start: a b can be written in front of the string. In the code that follows, though, I could not see it having any effect on the string. http://php.net/manual/zh/language.types.string.php has this description:
So b is used to declare a binary string.
Now look at line 2022: why YYCURSOR++ when "\\" is encountered? Because a backslash in a string is followed by an escaped character, the YYCURSOR++ here skips that next character. Take '\'' as an example: if the quote after the backslash were not skipped, the scanner would treat that second quote as the end of the string.
The rest is relatively simple: take the string's content from the input stream and return a T_CONSTANT_ENCAPSED_STRING token.
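The backslash-skipping step is small enough to show on its own. This is a sketch under the assumption of a simple character buffer; scan_single_quoted is a made-up name, and unlike the real rule it only finds the end of the string and does not unescape the content or build a token.

```c
#include <assert.h>
#include <string.h>

/*
 * `cur` points just past the opening quote. Returns a pointer just past the
 * closing quote, or NULL if the buffer ends before the string is closed.
 */
static const char *scan_single_quoted(const char *cur, const char *limit)
{
    assert(cur != NULL && limit != NULL);
    while (cur < limit) {
        if (*cur == '\'') {
            return cur + 1;          /* unescaped quote: end of string */
        }
        if (*cur++ == '\\' && cur < limit) {
            cur++;                   /* skip whatever follows the backslash */
        }
    }
    return NULL;
}
```

For the PHP source 'it\'s' the quote after the backslash is skipped, so the scan ends at the final quote rather than in the middle of the string.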
Double-quoted string handling is a bit more complicated:
Variables are supported inside double quotes: $hello = "Hello"; $str = "${hello} World";
Note line 2085: if the double-quoted string contains no variable, a string token is returned directly. So a double-quoted string without $ costs about the same to scan as a single-quoted string.
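The fast path for double-quoted strings amounts to a look-ahead for anything that starts an interpolation. The sketch below is a simplified version of that check with invented names (double_quoted_is_constant, is_label_start); it assumes that "$" followed by a label start or "{", and "{" followed by "$", are the sequences that force the slower path.

```c
#include <assert.h>
#include <string.h>

/* First character of a PHP label: a letter, an underscore, or a byte >= 0x7f. */
static int is_label_start(char c)
{
    unsigned char u = (unsigned char)c;
    return (u >= 'a' && u <= 'z') || (u >= 'A' && u <= 'Z') || u == '_' || u >= 0x7f;
}

/*
 * `cur` points just past the opening double quote. Returns 1 if the string
 * closes without any interpolation, 0 otherwise.
 */
static int double_quoted_is_constant(const char *cur, const char *limit)
{
    assert(cur != NULL && limit != NULL);
    while (cur < limit) {
        char c = *cur++;
        if (c == '"') {
            return 1;                           /* closed, nothing to interpolate */
        }
        if (c == '\\') {
            if (cur < limit) {
                cur++;                          /* skip the escaped character */
            }
        } else if (c == '$' && cur < limit &&
                   (is_label_start(*cur) || *cur == '{')) {
            return 0;                           /* "$name" or "${" */
        } else if (c == '{' && cur < limit && *cur == '$') {
            return 0;                           /* "{$" */
        }
    }
    return 0;                                   /* unterminated: not handled as constant */
}
```

A dollar sign that is escaped, or that is followed by a digit as in "$5 off", does not start a variable, so such strings still take the fast path.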
If a variable is encountered, the lexer has to switch to the ST_DOUBLE_QUOTES state:
Now back to the rules that find variables; the other rules are skipped to save space. To discuss one detail, we return to line 1871:
Note that when "$var[" is scanned, a new state ST_VAR_OFFSET is pushed, and the rule on line 1889 lists ST_VAR_OFFSET as a precondition so that a variable key inside the brackets, as in $var[$key], is recognized. Also note that the key of an array variable inside a string may not use ->. For example, $str = "$var[$a->s]"; is not allowed and fails with: Parse error: syntax error, unexpected '-', expecting ']' in xxx.php
PHP magic variables
PHP magic variables (officially called magic constants) are divided into those replaced at compile time and those replaced at run time. Lines 1593-1722 of the lexical rules file define the following magic variables:
__CLASS__, __TRAIT__, __FUNCTION__, __METHOD__, __LINE__, __FILE__, __DIR__, __NAMESPACE__
The analysis of magic variables is left for a later article. Note that __construct does not appear among the lexical rules.
PHP's fault tolerance mechanism
The single-line comment handling described above is one example of fault tolerance. Lines 1490 and 2432 of the rules file contain further fault-tolerance mechanisms at the lexical analysis stage.
Conclusion
This article also skips single-character lexemes (the rule is on line 1454) and the type-cast rules (for example (int) $str; the rule is on line 1230). Encoding issues and file stream handling are left out as well, and I plan to study those two topics in a later article. Finally, I can't help remarking that although I am not very familiar with compiler theory, the rules written for re2c are really easy to understand.