84-lexical parsing of virtual machines
Language, in a broad sense, is a means of expression that people use to communicate. Every language has its own symbols, expressions and rules, and programming languages are no exception: they too are built from specific symbols, expressions and rules. The purpose of any language is communication. The difference is that a natural language is a tool for communication between people, while a programming language is a channel of communication between people and machines. Compared with natural languages, programming languages have a very short history; although they were built on the shoulders of giants, they are still young, like a child. A programming language can only translate the instructions the programmer gives it into a language the machine can recognize. In that sense it is a conversion tool that turns people's knowledge or business logic into machine code (the language of the machine) so that the machine performs the corresponding operations. These instructions are the rules and conventions that the programming language handles.
PHP is likewise a set of instructions that conform to certain rules. Once programmers have expressed their ideas in PHP, the PHP virtual machine translates those PHP instructions into C-language instructions (which can be understood as a lower-level instruction set); the C code is converted to assembly language, and the assembly is finally turned into machine code according to the rules of the processor and executed. This is a process in which a high-level abstraction is made steadily more concrete and more refined.
Here we discuss how the PHP virtual machine translates the PHP language into the C language. Converting one language into another is called compilation, and the two languages are called the source language and the target language, respectively. Compilation usually goes from a source language to a target language at a lower level. The conversion is carried out by a compiler, whose work is usually divided into a series of phases: lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, target code generation, and so on. The first phases (lexical, syntax and semantic analysis) analyze the source program and are called the compiler's front end. The later phases (intermediate code generation, code optimization and target code generation) construct the target program and are called the compiler's back end. A language is called a compiled language because a translation step takes place before the program runs, and the key point is that this step produces an equivalent program in a completely different form. PHP is called an interpreted language because no such program is produced: what it generates is intermediate code, which is just an internal data structure of PHP.
Here we introduce two stages of the PHP compiler's front end, lexical analysis and syntax analysis, and one stage of its back end, intermediate code generation.
As mentioned above, the compilation process is divided into six stages: lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization and target code generation. Whether a language is compiled or interpreted, scanning (lexical analysis) is always the first step in translating a program into the target language. Lexical analysis breaks the whole source program into individual words (tokens), which reduces the number of units the later analysis has to handle and prepares the input for syntax analysis. Besides splitting, it also cleans up the source program, for example by removing whitespace and comments. Because lexical analysis is the first step of compilation, a variety of mature tools already exist for it: PHP started out with flex and later switched to re2c, MySQL's lexical analysis uses flex, and UNIX systems ship the standard lexical analyzer generator lex. These tools read an input stream that describes the lexical rules and output the C source code of a lexical analyzer that does the actual work. Here we introduce only re2c, the tool used by current versions of PHP.
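To make the two jobs described above concrete (splitting the input into tokens and discarding whitespace and comments), here is a minimal hand-written sketch in C. It is not how PHP or re2c does it: the token kinds, the `Token` struct, the `#` line-comment rule and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch (not PHP's real token set). */
typedef enum { TOK_EOF, TOK_NUMBER, TOK_WORD, TOK_SYMBOL } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* first character of the token */
    size_t length;      /* number of characters in the token */
} Token;

/* Skip whitespace and '#' line comments: the "cleaning" part of lexing. */
static const char *skip_noise(const char *p)
{
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (*p != '#') return p;
        while (*p != '\0' && *p != '\n') p++;
    }
}

/* Read one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = skip_noise(*cursor);
    Token tok = { TOK_EOF, p, 0 };

    if (*p == '\0') {
        /* end of input: tok stays TOK_EOF with length 0 */
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p)) p++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_WORD;
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else {
        tok.kind = TOK_SYMBOL;  /* any other single character */
        p++;
    }
    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}

/* Count the tokens in a string, excluding the final TOK_EOF. */
size_t count_tokens(const char *source)
{
    size_t n = 0;
    while (next_token(&source).kind != TOK_EOF) n++;
    return n;
}
```

For the input `x = 42 # set x` followed by a newline and `y`, this sketch yields four tokens (`x`, `=`, `42`, `y`); the comment and all whitespace never reach the later stages. A generator such as re2c produces code with the same effect from a rule file instead of hand-written branches.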
re2c is a scanner generator that can create very fast and flexible scanners. It produces efficient C code and also supports C++. Unlike other scanner generators, it concentrates on generating efficient code for regular expressions (as its name suggests). For this reason its range of application is somewhat wider than that of traditional lexical analyzer generators. The source code is available from SourceForge.net.
PHP's lexical analyzer was originally built with flex and later switched to re2c. The Zend/zend_language_scanner.l file in the source tree is the re2c rule file; if you modify it, you need re2c installed to recompile it and regenerate the scanner code.
re2c is invoked as follows:
re2c [-bdefFghisuvVw1] [-o output] [-c [-t header]] file
Let's look at re2c through a simple example. The following scanner determines whether a given string consists of digits, lowercase letters or uppercase letters. It does no input validation or other error handling. The example is as follows:
#include <stdio.h>

/* Classify the string p with rules that re2c expands into C code. */
char *scan(char *p)
{
#define YYCTYPE char
#define YYCURSOR p
#define YYLIMIT p
#define YYMARKER q
#define YYFILL(n)
    /*!re2c
      [0-9]+ {return "number";}
      [a-z]+ {return "lower";}
      [A-Z]+ {return "upper";}
      [^]    {return "unknown";}
     */
}

int main(int argc, char* argv[])
{
    /* argv[1] is the string to classify; no error checking is done */
    printf("%s\n", scan(argv[1]));
    return 0;
}
If you are in an Ubuntu environment, you can execute the following command to generate the executable file.
re2c -o a.c a.l
gcc a.c -o a
chmod +x a
./a 1000
The program then prints number.
Let's go through the re2c macros used in the example.
- YYCTYPE: the type used to hold input symbols, typically char or unsigned char.
- YYCURSOR: points to the current input symbol. When scanning starts it points to the first character of the current token; when scanning ends it points to the first character of the next token.
- YYFILL(n): called by the generated code when it needs to refill the buffer.
- YYLIMIT: marks the end of the buffer. The generated code repeatedly compares YYCURSOR with YYLIMIT to decide whether the buffer needs to be refilled.
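The way the cursor and the limit move can be sketched by hand. The code below is not the output of re2c; it is a minimal illustration, under the assumption that the limit points one past the last buffered character, of a scanner that leaves its cursor on the first character of the next token. `SketchScanner` and `sketch_scan` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner state mirroring the roles of YYCURSOR and YYLIMIT. */
typedef struct {
    const char *cursor;  /* plays the role of YYCURSOR */
    const char *limit;   /* plays the role of YYLIMIT (assumed: one past the last character) */
} SketchScanner;

static int in_range(char c, char lo, char hi) { return c >= lo && c <= hi; }

/*
 * Match one run of digits, lowercase letters or uppercase letters.
 * On entry the cursor points at the first character of the current token;
 * on return it points at the first character of the next token.
 */
const char *sketch_scan(SketchScanner *s)
{
    assert(s != NULL && s->cursor != NULL && s->limit != NULL);
    if (s->cursor >= s->limit) return "eof";  /* a real scanner would refill the buffer here */

    char lo, hi;
    const char *name;
    char c = *s->cursor;
    if (in_range(c, '0', '9'))      { lo = '0'; hi = '9'; name = "number"; }
    else if (in_range(c, 'a', 'z')) { lo = 'a'; hi = 'z'; name = "lower"; }
    else if (in_range(c, 'A', 'Z')) { lo = 'A'; hi = 'Z'; name = "upper"; }
    else { s->cursor++; return "unknown"; }  /* consume one unmatched character */

    /* Consume the rest of the run, never reading past the limit. */
    while (s->cursor < s->limit && in_range(*s->cursor, lo, hi)) s->cursor++;
    return name;
}
```

Calling `sketch_scan` repeatedly on the four-character buffer `12ab` returns "number" with the cursor two characters in, then "lower" with the cursor equal to the limit, then "eof". The comparison of the cursor against the limit is the point at which generated code would invoke YYFILL(n).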
With these macro descriptions in mind, the generated a.c file is easy to follow. re2c offers more than the macros shown in the code above, which is only a simple example; for more macros and help, see the re2c manual: http://re2c.org/manual.html.
Now let's return to PHP's lexical rule file, zend_language_scanner.l. The biggest difference from the simple example above is that every rule is preceded by a condition.
NOTE: The re2c macros related to conditions are YYSETCONDITION and YYGETCONDITION, which set and get the current condition, respectively. PHP's lexical rules use 10 conditions, all listed in the zend_language_scanner_defs.h file. This file is not written by hand; re2c generates it automatically. To generate code that uses conditions, pass the -c and -t options to re2c when generating the C code.
PHP's lexical analysis uses a global variable, language_scanner_globals. It is a struct that records the current re2c scanning state, file information, information about the scanning process, and so on. It is defined directly in the zend_language_scanner.l file as follows:
#ifdef ZTS
ZEND_API ts_rsrc_id language_scanner_globals_id;
#else
ZEND_API zend_php_scanner_globals language_scanner_globals;
#endif
C code written in the zend_language_scanner.l file is copied verbatim into the C file that re2c generates. This variable is used throughout PHP's lexical analysis, and parts of the re2c integration depend on it: storing and retrieving the condition mentioned earlier, for example, goes through this variable. Here are the definitions of the two macros in PHP's lexer:
// defined in zend_language_scanner.l
#define YYGETCONDITION() SCNG(yy_state)
#define YYSETCONDITION(s) SCNG(yy_state) = s
#define SCNG LANG_SCNG

// defined in zend_globals_macros.h
# define LANG_SCNG(v) (language_scanner_globals.v)
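To see how this macro layering resolves to a plain field access on a global struct, here is a small self-contained sketch. It copies the shape of the four macros but points them at its own global; `SketchGlobals`, `sketch_globals`, the `SKETCH_*` conditions and `sketch_classify` are hypothetical and are not part of PHP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the generated condition enum and the globals struct. */
enum SketchCondition { SKETCH_INITIAL, SKETCH_IN_SCRIPTING };

typedef struct {
    int yy_state;  /* current condition, read and written through the macros */
} SketchGlobals;

SketchGlobals sketch_globals;

/* Same layering as the macros above, pointed at the sketch's own global. */
#define LANG_SCNG(v)       (sketch_globals.v)
#define SCNG               LANG_SCNG
#define YYGETCONDITION()   SCNG(yy_state)
#define YYSETCONDITION(s)  SCNG(yy_state) = s

/* Classify a word differently depending on the current condition. */
const char *sketch_classify(const char *word)
{
    assert(word != NULL);
    switch (YYGETCONDITION()) {
    case SKETCH_INITIAL:
        /* Outside a script only the opening tag changes the condition. */
        if (word[0] == '<' && word[1] == '?') {
            YYSETCONDITION(SKETCH_IN_SCRIPTING);
            return "open tag";
        }
        return "inline text";
    case SKETCH_IN_SCRIPTING:
        if (word[0] == '?' && word[1] == '>') {
            YYSETCONDITION(SKETCH_INITIAL);
            return "close tag";
        }
        return "php token";
    }
    return "invalid condition";
}
```

Here `YYGETCONDITION()` expands to `(sketch_globals.yy_state)`, so the same word can be classified differently before and after an opening tag has been seen. Because the state lives in one global, every call in a scanning pass sees the condition left behind by the previous call.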
Putting the global variable together with the definitions of the condition macros, we can see that PHP's lexer keeps its state in a global variable throughout a single scanning pass. So what exactly are conditions used for? Let's look at the following example. It is a simple string recognizer that identifies the start tag <?php, the end tag ?>, letters, numbers, and so on. It uses re2c conditions; the code is as follows:
#include <stdio.h>
#include "demo_def.h"
#include "demo.h"

Scanner scanner_globals;

#define YYCTYPE char
#define YYFILL(n)
#define STATE(name) yyc##name
#define BEGIN(state) YYSETCONDITION(STATE(state))
#define LANG_SCNG(v) (scanner_globals.v)
#define SCNG LANG_SCNG
#define YYGETCONDITION() SCNG(yy_state)
#define YYSETCONDITION(s) SCNG(yy_state) = s
#define YYCURSOR SCNG(yy_cursor)
#define YYLIMIT SCNG(yy_limit)
#define YYMARKER SCNG(yy_marker)

int scan()
{
    /*!re2c
    <INITIAL>"<?php" { BEGIN(ST_IN_SCRIPTING); return T_BEGIN; }

    <ST_IN_SCRIPTING>[0-9]+ { return T_NUMBER; }
    <ST_IN_SCRIPTING>[ \n\t\r]+ { return T_WHITESPACE; }
    <ST_IN_SCRIPTING>"exit" { return T_EXIT; }
    <ST_IN_SCRIPTING>[a-z]+ { return T_LOWER_CHAR; }
    <ST_IN_SCRIPTING>[A-Z]+ { return T_UPPER_CHAR; }
    <ST_IN_SCRIPTING>"?>" { return T_END; }

    <ST_IN_SCRIPTING>[^] { return T_UNKNOWN; }
    <*>[^] { return T_INPUT_ERROR; }
    */
}

void print_token(int token)
{
    switch (token) {
        case T_BEGIN: printf("%s\n", "BEGIN"); break;
        case T_NUMBER: printf("%s\n", "number"); break;
        case T_LOWER_CHAR: printf("%s\n", "LOWER CHAR"); break;
        case T_UPPER_CHAR: printf("%s\n", "UPPER CHAR"); break;
        case T_EXIT: printf("%s\n", "EXIT"); break;
        case T_UNKNOWN: printf("%s\n", "UNKNOWN"); break;
        case T_INPUT_ERROR: printf("%s\n", "INPUT ERROR"); break;
        case T_END: printf("%s\n", "END"); break;
    }
}

int main(int argc, char* argv[])
{
    int token;
    BEGIN(INITIAL); // global initialization; must come before the first call to scan
    scanner_globals.yy_cursor = argv[1]; // use the first argument as the string to scan

    while ((token = scan())) {
        if (token == T_INPUT_ERROR) {
            printf("%s\n", "INPUT ERROR");
            break;
        }
        if (token == T_END) {
            printf("%s\n", "END");
            break;
        }
        print_token(token);
    }
    return 0;
}
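The listing above includes demo.h, which is not shown. A minimal sketch of what such a header might contain follows. Everything in it is an assumption inferred from how the listing uses `Scanner`, the `yy_*` fields and the `T_*` constants; the `scanner_set_input` helper in particular is hypothetical and does not appear in the original example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical contents for demo.h; the original header is not shown. */

/* Scanner state reached through the SCNG() macro in the listing above. */
typedef struct {
    int   yy_state;   /* current condition, e.g. yycINITIAL from demo_def.h */
    char *yy_cursor;  /* YYCURSOR */
    char *yy_limit;   /* YYLIMIT */
    char *yy_marker;  /* YYMARKER */
} Scanner;

/* Token codes are nonzero because while ((token = scan())) stops on 0. */
enum DemoToken {
    T_BEGIN = 1,
    T_NUMBER,
    T_LOWER_CHAR,
    T_UPPER_CHAR,
    T_EXIT,
    T_WHITESPACE,
    T_UNKNOWN,
    T_INPUT_ERROR,
    T_END
};

/* Hypothetical helper: point a scanner at a NUL-terminated input string. */
static inline void scanner_set_input(Scanner *s, char *input)
{
    assert(s != NULL && input != NULL);
    s->yy_cursor = input;
    s->yy_marker = input;
    s->yy_limit  = input + strlen(input);
}
```

The listing sets only `yy_cursor` in main; a helper like the one sketched here would also give `yy_limit` and `yy_marker` defined values, which matters as soon as YYFILL(n) is more than an empty macro.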
As with the previous simple example, if you are in a Linux environment, you can use the following commands to generate the executable file:
re2c -o demo.c -c -t demo_def.h demo.l
gcc demo.c -o demo -g
chmod +x demo
When we generated the C code with re2c, we passed -c -t demo_def.h, which tells re2c to generate code in condition mode and to write the condition definitions to a header file. In the main function, we must initialize the condition before calling the scan function, setting it to the initial state. The initial state is then identified directly during the scan and then matched