PHP lexical analysis source code walkthrough: PHP tags, keywords, classes, and numbers

Source: Internet
Author: User

I had never worked on web applications before. Recently I wanted to study webshells and found PHP syntax rather strange, so I decided to look at the lexical analysis code in the PHP kernel.

PHP lexical analysis starts with lex_scan in the zend_language_scanner.l file. The function begins as follows:

int lex_scan(zval *zendlval TSRMLS_DC)
{
restart:
    /* set the start position of the current token */
    SCNG(yy_text) = YYCURSOR;

yymore_restart:

/* The following comment block defines the regular expressions for the various
 * token types. re2c uses it when it converts this file into C code. */
/*!re2c
re2c:yyfill:check = 0;
LNUM    [0-9]+
DNUM    ([0-9]*"."[0-9]+)|([0-9]+"."[0-9]*)
EXPONENT_DNUM   (({LNUM}|{DNUM})[eE][+-]?{LNUM})
HNUM    "0x"[0-9a-fA-F]+
BNUM    "0b"[01]+
LABEL   [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
WHITESPACE [ \n\r\t]+
TABS_AND_SPACES [ \t]*
TOKENS [;:,.\[\]()|^&+-/*=%!~$<>?@]
ANY_CHAR [^]
NEWLINE ("\r"|"\n"|"\r\n")

/* compute yyleng before each rule */
<!*> := yyleng = YYCURSOR - SCNG(yy_text);
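
To make one of these definitions concrete, here is a minimal hand-written sketch of what the LABEL pattern accepts. It is only an illustration: re2c generates its own matching code, and the function name match_label is an assumption of this sketch, not something found in the PHP source.

```c
#include <assert.h>
#include <stddef.h>

/* First byte of LABEL: [a-zA-Z_\x7f-\xff] */
static int is_label_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || c == '_' || c >= 0x7f;
}

/* Following bytes of LABEL: [a-zA-Z0-9_\x7f-\xff] */
static int is_label_char(unsigned char c)
{
    return is_label_start(c) || (c >= '0' && c <= '9');
}

/* Hypothetical helper: length of the LABEL at the start of s, or 0 if
 * s does not start with a LABEL. */
size_t match_label(const char *s)
{
    size_t n;

    assert(s != NULL);
    if (!is_label_start((unsigned char)s[0])) {
        return 0;
    }
    n = 1;
    while (is_label_char((unsigned char)s[n])) {
        n++;
    }
    return n;
}
```

For example, match_label("foo_bar1->x") stops in front of "->", and a name may not start with a digit. Bytes from 0x7f upward are accepted, which is why multi-byte UTF-8 names pass as labels.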

Next, let's look at how lex_scan handles PHP tags, keywords, object property access, and numbers.

1. Matching the PHP tag

zend_language_scanner.l matches PHP opening tags with more than one rule. Through the asp_tags switch it even accepts ASP-style tags, which is rather odd.

1.1 <script language=php>

The first rule matches the <script language=php> tag. The source is shown below. Whitespace is allowed after <script, around the = sign, and before the closing >, and php may be wrapped in single or double quotation marks:

<INITIAL>"<script"{WHITESPACE}+"language"{WHITESPACE}*"="{WHITESPACE}*("php"|"\"php\""|"'php'"){WHITESPACE}*">" {
    YYCTYPE *bracket = (YYCTYPE*)zend_memrchr(yytext, '<', yyleng - (sizeof("script language=php>") - 1));

    if (bracket != SCNG(yy_text)) {
        /* Handle previously scanned HTML, as possible <script> tags found are assumed to not be PHP's */
        YYCURSOR = bracket;
        goto inline_html;
    }

    HANDLE_NEWLINES(yytext, yyleng);
    ZVAL_STRINGL(zendlval, yytext, yyleng, 0); /* no copying - intentional */
    BEGIN(ST_IN_SCRIPTING);
    return T_OPEN_TAG;
}

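Because <script> is itself an HTML tag, the rule first checks whether HTML text was scanned ahead of it in the current token. zend_memrchr locates the '<' that starts the tag; if it is not at the beginning of the token, the scanner moves YYCURSOR back to it and jumps to inline_html so the preceding HTML is emitted first. Otherwise it switches the state to ST_IN_SCRIPTING and returns T_OPEN_TAG, indicating a PHP opening tag.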
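
The backward search can be sketched in plain C as follows. This is an illustration under assumptions: last_byte is a portable stand-in for zend_memrchr, and html_before_script_tag is a hypothetical helper name. It expects a token that ends with a matched <script language=php> tag, as the rule guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for zend_memrchr: last occurrence of c in s[0..n). */
static const char *last_byte(const char *s, int c, size_t n)
{
    while (n > 0) {
        n--;
        if ((unsigned char)s[n] == (unsigned char)c) {
            return s + n;
        }
    }
    return NULL;
}

/* Hypothetical helper: number of HTML bytes in front of the <script> tag
 * that ends the token text[0..len). 0 means the tag starts the token. */
size_t html_before_script_tag(const char *text, size_t len)
{
    /* The shortest possible tag tail after '<', as in the scanner rule. */
    const size_t min_tail = sizeof("script language=php>") - 1;
    const char *bracket;

    assert(text != NULL && len > min_tail);
    bracket = last_byte(text, '<', len - min_tail);
    assert(bracket != NULL);
    return (size_t)(bracket - text);
}
```

Cutting min_tail bytes off the end keeps the search away from the tag body, and since the tag itself contains no further '<', the last '<' in the remaining prefix is the one that opens the tag. A non-zero result corresponds to the bracket != SCNG(yy_text) branch.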

1.2 <%= and <%

Next, <%= and <% are matched. The rule checks whether asp_tags in php.ini is On. If it is, the scanner enters the scripting state and returns T_OPEN_TAG; otherwise it jumps to inline_char_handler. The source is as follows:

// <INITIAL>"<%=" {
<INITIAL>"<%" {
    if (CG(asp_tags)) {
        ZVAL_STRINGL(zendlval, yytext, yyleng, 0); /* no copying - intentional */
        BEGIN(ST_IN_SCRIPTING);
        return T_OPEN_TAG;
    } else {
        goto inline_char_handler;
    }
}

1.3 <?= and <?

There are also the short tags <?= and <?. In PHP 5.3.3, both check whether short_open_tag is On. The code is as follows:

<INITIAL>"<?" {
    if (CG(short_tags)) {
        ZVAL_STRINGL(zendlval, yytext, yyleng, 0); /* no copying - intentional */
        BEGIN(ST_IN_SCRIPTING);
        return T_OPEN_TAG;
    } else {
        goto inline_char_handler;
    }
}

In the PHP 5.6.3 source, however, the <?= tag no longer depends on the short_open_tag setting:

<INITIAL>"<?=" {
    ZVAL_STRINGL(zendlval, yytext, yyleng, 0); /* no copying - intentional */
    BEGIN(ST_IN_SCRIPTING);
    return T_OPEN_TAG_WITH_ECHO;
}

1.4 <?php

Last comes the most common form, <?php:

<INITIAL>"<?php"([ \t]|{NEWLINE}) {
    ZVAL_STRINGL(zendlval, yytext, yyleng, 0); /* no copying - intentional */
    HANDLE_NEWLINE(yytext[yyleng-1]);
    BEGIN(ST_IN_SCRIPTING);
    return T_OPEN_TAG;
}
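
The tag rules quoted in sections 1.2 to 1.4 can be summarized in one small decision function. This is a simplified sketch, not the scanner's code: the enum values and the name classify_open_tag are invented for illustration, short_tags and asp_tags stand in for CG(short_tags) and CG(asp_tags), <%= is folded into <% as in the snippet quoted in 1.2, the comparison is case-sensitive, and the <script> form is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result codes; the real token numbers come from
 * zend_language_parser.h. */
enum open_tag { NOT_A_TAG = 0, OPEN_TAG, OPEN_TAG_WITH_ECHO };

/* Classify the text at s following the PHP 5.6 rules quoted above. */
enum open_tag classify_open_tag(const char *s, int short_tags, int asp_tags)
{
    assert(s != NULL);

    /* "<?php" must be followed by a space, tab, or newline. */
    if (strncmp(s, "<?php", 5) == 0 && s[5] != '\0'
            && strchr(" \t\r\n", s[5]) != NULL) {
        return OPEN_TAG;
    }
    /* "<?=" no longer consults short_open_tag in 5.6. */
    if (strncmp(s, "<?=", 3) == 0) {
        return OPEN_TAG_WITH_ECHO;
    }
    if (strncmp(s, "<?", 2) == 0) {
        return short_tags ? OPEN_TAG : NOT_A_TAG;
    }
    if (strncmp(s, "<%", 2) == 0) {
        return asp_tags ? OPEN_TAG : NOT_A_TAG;
    }
    return NOT_A_TAG;
}
```

With both switches off, only "<?php " and "<?=" open a script, and everything else is treated as inline HTML.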

If none of these rules match, ANY_CHAR matches. It first checks whether the scan has reached the end of the input, in which case 0 is returned directly. Otherwise execution falls through into the inline_char_handler and inline_html code:

<INITIAL>{ANY_CHAR} {
    if (YYCURSOR > YYLIMIT) {
        return 0;
    }

1.5 inline_char_handler

inline_char_handler and inline_html scan the text that lies outside PHP tags, that is, the HTML or other content in which the PHP code is embedded.

inline_char_handler scans the rest of the input. memchr searches for the '<' character in the YYLIMIT - YYCURSOR bytes starting at YYCURSOR. If one is found, the character after it is compared with '?', '%', and 's'. When the conditions for '?' or '%' are met, the loop ends. On 's' or 'S', YYCURSOR is moved back by one and yymore() restarts matching so that the PHP tag rules are tried again.

inline_char_handler:
    while (1) {
        YYCTYPE *ptr = memchr(YYCURSOR, '<', YYLIMIT - YYCURSOR);

        YYCURSOR = ptr ? ptr + 1 : YYLIMIT;

        if (YYCURSOR < YYLIMIT) {
            switch (*YYCURSOR) {
                case '?':
                    if (CG(short_tags) || !strncasecmp((char*)YYCURSOR + 1, "php", 3) || (*(YYCURSOR + 1) == '=')) { /* Assume [ \t\n\r] follows "php" */
                        break;
                    }
                    continue;
                case '%':
                    if (CG(asp_tags)) {
                        break;
                    }
                    continue;
                case 's':
                case 'S':
                    /* Probably NOT an opening PHP <script> tag, so don't end the HTML chunk yet
                     * If it is, the PHP <script> tag rule checks for any HTML scanned before it */
                    YYCURSOR--;
                    yymore();
                default:
                    continue;
            }

            YYCURSOR--;
        }

        break;
    }
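
The same search can be written as a standalone function over a buffer. The sketch below is an approximation under assumptions: inline_html_length is a hypothetical name, the 's'/'S' case that re-enters the <script> rule is left out, bounds are checked explicitly instead of relying on the scanner's buffer layout, and a hand-written comparison replaces strncasecmp to stay within standard C.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive check that s[0..3) spells "php". */
static int is_php(const char *s)
{
    return tolower((unsigned char)s[0]) == 'p'
        && tolower((unsigned char)s[1]) == 'h'
        && tolower((unsigned char)s[2]) == 'p';
}

/* Hypothetical helper: number of leading bytes of buf[0..len) that are
 * plain HTML, i.e. the offset of the first '<' that may open a PHP tag,
 * or len if there is none. */
size_t inline_html_length(const char *buf, size_t len,
                          int short_tags, int asp_tags)
{
    size_t pos = 0;

    assert(buf != NULL);
    while (pos < len) {
        const char *lt = memchr(buf + pos, '<', len - pos);
        size_t next;

        if (lt == NULL) {
            break;                        /* no more '<': all HTML */
        }
        next = (size_t)(lt - buf) + 1;    /* index of the byte after '<' */
        if (next >= len) {
            break;                        /* '<' is the last byte */
        }
        if (buf[next] == '?') {
            size_t rest = len - next - 1; /* bytes after "<?" */

            if (short_tags
                    || (rest >= 1 && buf[next + 1] == '=')
                    || (rest >= 3 && is_php(buf + next + 1))) {
                return next - 1;          /* stop in front of '<' */
            }
        } else if (buf[next] == '%' && asp_tags) {
            return next - 1;
        }
        pos = next;                       /* not a tag: keep scanning */
    }
    return len;
}
```

Jumping from '<' to '<' with memchr means ordinary HTML is skipped in bulk, and only the byte after each '<' has to be inspected.
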
1.6 inline_html

The inline_html code copies the scanned text into zendlval (passing it through the output filter if one is set) and returns T_INLINE_HTML:

inline_html:
    yyleng = YYCURSOR - SCNG(yy_text);

    if (SCNG(output_filter)) {
        int readsize;
        size_t sz = 0;
        readsize = SCNG(output_filter)((unsigned char **)&Z_STRVAL_P(zendlval), &sz, (unsigned char *)yytext, (size_t)yyleng TSRMLS_CC);
        Z_STRLEN_P(zendlval) = sz;
        if (readsize < yyleng) {
            yyless(readsize);
        }
    } else {
        Z_STRVAL_P(zendlval) = (char *) estrndup(yytext, yyleng);
        Z_STRLEN_P(zendlval) = yyleng;
    }
    zendlval->type = IS_STRING;
    HANDLE_NEWLINES(yytext, yyleng);
    return T_INLINE_HTML;
}

2. Simple keyword matching

Simple PHP keywords are matched literally and their token is returned directly. Note the <ST_IN_SCRIPTING> condition at the start of each rule, which restricts it to the state inside a PHP script. A few examples:

<ST_IN_SCRIPTING>"exit" {
    return T_EXIT;
}

<ST_IN_SCRIPTING>"die" {
    return T_EXIT;
}

<ST_IN_SCRIPTING>"function" {
    return T_FUNCTION;
}

<ST_IN_SCRIPTING>"const" {
    return T_CONST;
}

<ST_IN_SCRIPTING>"return" {
    return T_RETURN;
}
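
The effect of these rules is a fixed mapping from keyword text to token. The sketch below shows that mapping as a lookup table. It is not how the scanner works internally (re2c compiles the rules into a state machine), the TOK_ values are made up for the example, and the comparison is case-sensitive for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; the real numbers are defined in
 * zend_language_parser.h. */
enum { TOK_NONE = 0, TOK_EXIT, TOK_FUNCTION, TOK_CONST, TOK_RETURN };

static const struct {
    const char *text;
    int token;
} keywords[] = {
    { "exit",     TOK_EXIT },
    { "die",      TOK_EXIT },     /* "die" shares the token of "exit" */
    { "function", TOK_FUNCTION },
    { "const",    TOK_CONST },
    { "return",   TOK_RETURN },
};

/* Hypothetical helper: token for a keyword, or TOK_NONE if word is not
 * one of the keywords in the table. */
int keyword_token(const char *word)
{
    size_t i;

    assert(word != NULL);
    for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++) {
        if (strcmp(word, keywords[i].text) == 0) {
            return keywords[i].token;
        }
    }
    return TOK_NONE;
}
```

Because "exit" and "die" return the same token, the parser cannot tell them apart, which is why they behave identically in PHP.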

Token macros such as T_EXIT, T_FUNCTION, and T_CONST are defined in the zend_language_parser.h file. For example:

#define T_INCLUDE_ONCE 261
#define T_INCLUDE 262
#define T_LOGICAL_OR 263
#define T_LOGICAL_XOR 264
#define T_LOGICAL_AND 265

3. Processing of classes and objects

First, when "->" is seen in the ST_IN_SCRIPTING state, the scanner pushes the current state onto the state stack and switches to the ST_LOOKING_FOR_PROPERTY state:

<ST_IN_SCRIPTING>"->" {
    yy_push_state(ST_LOOKING_FOR_PROPERTY TSRMLS_CC);
    return T_OBJECT_OPERATOR;
}

The yy_push_state macro wraps the following function, which saves the current state on the stack and then sets the new one:

static void _yy_push_state(int new_state TSRMLS_DC)
{
    zend_stack_push(&SCNG(state_stack), (void *) &YYGETCONDITION(), sizeof(int));
    YYSETCONDITION(new_state);
}
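
The push operation, together with the matching pop that the ST_LOOKING_FOR_PROPERTY rules use through yy_pop_state, can be modelled with a small fixed-size stack. This is a sketch under assumptions: the struct, the enum, and MAX_DEPTH are invented for the example, while the real scanner stores states in a zend_stack.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the scanner conditions. */
enum state { INITIAL_STATE, IN_SCRIPTING, LOOKING_FOR_PROPERTY };

#define MAX_DEPTH 16   /* arbitrary limit for this sketch */

struct scanner {
    enum state current;           /* what YYGETCONDITION() would return */
    enum state stack[MAX_DEPTH];  /* saved states, oldest first */
    int depth;                    /* number of saved states */
};

/* Save the current state, then switch to next. */
void push_state(struct scanner *sc, enum state next)
{
    assert(sc != NULL && sc->depth < MAX_DEPTH);
    sc->stack[sc->depth++] = sc->current;
    sc->current = next;
}

/* Return to the most recently saved state. */
void pop_state(struct scanner *sc)
{
    assert(sc != NULL && sc->depth > 0);
    sc->current = sc->stack[--sc->depth];
}
```

Seeing "->" corresponds to push_state(&sc, LOOKING_FOR_PROPERTY), and pop_state(&sc) then brings the scanner back to whatever state it was in before, without the rule having to know which state that was.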

In the ST_LOOKING_FOR_PROPERTY state, "->" simply returns T_OBJECT_OPERATOR:

<ST_LOOKING_FOR_PROPERTY>"->" {
    return T_OBJECT_OPERATOR;
}

Whitespace that follows is returned as T_WHITESPACE and leaves the state unchanged:

<ST_IN_SCRIPTING,ST_LOOKING_FOR_PROPERTY>{WHITESPACE}+ {
    ZVAL_STRINGL(zendlval, yytext, yyleng, 0); /* no copying - intentional */
    HANDLE_NEWLINES(yytext, yyleng);
    return T_WHITESPACE;
}

If a LABEL is found, the previous state (ST_IN_SCRIPTING) is restored, the scanned string is copied into zendlval, and T_STRING is returned:

<ST_LOOKING_FOR_PROPERTY>{LABEL} {
    yy_pop_state(TSRMLS_C);
    zend_copy_value(zendlval, yytext, yyleng);
    zendlval->type = IS_STRING;
    return T_STRING;
}

For any other character, the character is put back with yyless(0), the previous state is restored, and scanning starts again:

<ST_LOOKING_FOR_PROPERTY>{ANY_CHAR} {
    yyless(0);
    yy_pop_state(TSRMLS_C);
    goto restart;
}

4. Processing numbers

4.1 Binary numbers

First the "0b" prefix and any leading zeros are skipped. The remaining length is then compared with the number of bits in a long. If it is shorter, the digits are converted to a long with strtol and T_LNUMBER is returned; otherwise the value is converted to a double and T_DNUMBER is returned:

<ST_IN_SCRIPTING>{BNUM} {
    char *bin = yytext + 2; /* Skip "0b" */
    int len = yyleng - 2;

    /* Skip any leading 0s */
    while (*bin == '0') {
        ++bin;
        --len;
    }

    if (len < SIZEOF_LONG * 8) {
        if (len == 0) {
            Z_LVAL_P(zendlval) = 0;
        } else {
            Z_LVAL_P(zendlval) = strtol(bin, NULL, 2);
        }
        zendlval->type = IS_LONG;
        return T_LNUMBER;
    } else {
        ZVAL_DOUBLE(zendlval, zend_bin_strtod(bin, NULL));
        return T_DNUMBER;
    }
}
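
Outside the scanner, the same decision can be tried with standard C only. In this sketch the struct number type and the name parse_binary are assumptions, and a simple multiply-and-add loop stands in for zend_bin_strtod.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Result of the sketch: either a long or a double. */
struct number {
    int is_double;
    long l;
    double d;
};

/* digits is the text after "0b" and contains only '0' and '1'. */
struct number parse_binary(const char *digits)
{
    struct number out = { 0, 0, 0.0 };
    size_t len;

    assert(digits != NULL);
    while (*digits == '0') {              /* skip leading zeros */
        digits++;
    }
    len = strlen(digits);

    if (len < sizeof(long) * CHAR_BIT) {  /* sign bit stays clear */
        out.l = (len == 0) ? 0 : strtol(digits, NULL, 2);
    } else {
        const char *p;

        out.is_double = 1;                /* stand-in for zend_bin_strtod */
        for (p = digits; *p == '0' || *p == '1'; p++) {
            out.d = out.d * 2.0 + (*p - '0');
        }
    }
    return out;
}
```

Counting digits after the leading zeros is enough here because every remaining binary string starts with a 1, so its length is exactly the number of bits the value needs.
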
4.2 Plain digits (octal or decimal)

Plain digits may be decimal or octal. If the text is short enough that it cannot overflow a long, it is converted directly (with base 0, strtol treats a number starting with 0 as octal). Otherwise the conversion is attempted and errno is checked: on overflow, the text is converted to a double as an octal number if its first character is 0, and as a decimal number otherwise:

<ST_IN_SCRIPTING>{LNUM} {
    if (yyleng < MAX_LENGTH_OF_LONG - 1) { /* Won't overflow */
        Z_LVAL_P(zendlval) = strtol(yytext, NULL, 0);
    } else {
        errno = 0;
        Z_LVAL_P(zendlval) = strtol(yytext, NULL, 0);
        if (errno == ERANGE) { /* Overflow */
            if (yytext[0] == '0') { /* octal overflow */
                Z_DVAL_P(zendlval) = zend_oct_strtod(yytext, NULL);
            } else {
                Z_DVAL_P(zendlval) = zend_strtod(yytext, NULL);
            }
            zendlval->type = IS_DOUBLE;
            return T_DNUMBER;
        }
    }

    zendlval->type = IS_LONG;
    return T_LNUMBER;
}
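
The overflow fallback relies on standard strtol behaviour and can be reproduced without any Zend code. In this sketch the struct number type and the name parse_plain_digits are assumptions, strtod stands in for zend_strtod, and a short loop stands in for zend_oct_strtod. The length shortcut of the original is omitted, so errno is always checked.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Result of the sketch: either a long or a double. */
struct number {
    int is_double;
    long l;
    double d;
};

/* text contains only the digits 0-9, as matched by LNUM. */
struct number parse_plain_digits(const char *text)
{
    struct number out = { 0, 0, 0.0 };

    assert(text != NULL);
    errno = 0;
    out.l = strtol(text, NULL, 0);    /* base 0: a leading "0" means octal */

    if (errno == ERANGE) {            /* value does not fit in a long */
        out.is_double = 1;
        out.l = 0;
        if (text[0] == '0') {         /* octal overflow */
            const char *p;

            for (p = text; *p >= '0' && *p <= '7'; p++) {
                out.d = out.d * 8.0 + (*p - '0');
            }
        } else {                      /* decimal overflow */
            out.d = strtod(text, NULL);
        }
    }
    return out;
}
```

So "017" comes back as the long 15, while a literal too large for a long silently becomes a double, which matches how oversized integer literals behave in PHP code.
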
4.3 Hexadecimal numbers

As with binary numbers, the "0x" prefix is skipped first and all leading zeros are removed. If the remaining digits fit in a signed long (on a 32-bit build, a value of at most 0x7FFFFFFF), the text is converted to a long; otherwise it is converted to a double:

<ST_IN_SCRIPTING>{HNUM} {
    char *hex = yytext + 2; /* Skip "0x" */
    int len = yyleng - 2;

    /* Skip any leading 0s */
    while (*hex == '0') {
        hex++;
        len--;
    }

    if (len < SIZEOF_LONG * 2 || (len == SIZEOF_LONG * 2 && *hex <= '7')) {
        if (len == 0) {
            Z_LVAL_P(zendlval) = 0;
        } else {
            Z_LVAL_P(zendlval) = strtol(hex, NULL, 16);
        }
        zendlval->type = IS_LONG;
        return T_LNUMBER;
    } else {
        ZVAL_DOUBLE(zendlval, zend_hex_strtod(hex, NULL));
        return T_DNUMBER;
    }
}
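
The length test is the interesting part of this rule, so here it is in isolation. The function name hex_fits_long is an assumption of this sketch, and sizeof(long) plays the role of SIZEOF_LONG.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* digits is the text after "0x" and contains only hex digits.
 * Returns 1 if the value fits in a signed long, 0 otherwise. */
int hex_fits_long(const char *digits)
{
    size_t len;

    assert(digits != NULL);
    while (*digits == '0') {     /* skip leading zeros */
        digits++;
    }
    len = strlen(digits);

    /* Two hex digits per byte: fewer digits always fit, and a full-width
     * number fits only if its top digit leaves the sign bit clear. */
    return len < sizeof(long) * 2
        || (len == sizeof(long) * 2 && *digits <= '7');
}
```

A top digit of '7' or lower has its highest bit clear, so the value stays within the positive range of a long. Digits '8' to '9' and all letters compare greater than '7', which sends the literal down the double path.
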
4.4 Decimals and scientific notation

Decimal fractions and numbers in scientific notation are converted directly to the double type:

<ST_IN_SCRIPTING>{DNUM}|{EXPONENT_DNUM} {
    ZVAL_DOUBLE(zendlval, zend_strtod(yytext, NULL));
    return T_DNUMBER;
}

This article covered the relatively simple parts of the scanner. The next article will analyze the more complicated matching of quotation marks and the parsing of the content inside them. I do wonder why PHP needs to support so many operations inside quoted strings.

