MySQL source code study: lexical analysis with MYSQLlex

Last Update:2013-11-25 Source: Internet

Author: User


Lexical Analysis MYSQLlex

After the client sends an SQL statement to the server, the server first performs lexical analysis, then syntax analysis and semantic analysis, which build an execution tree and produce an execution plan. Lexical analysis is the first stage. It is not central to understanding how MySQL is implemented, but it is worth learning as groundwork.

Lexical analysis splits the input statement into tokens and determines what kind of token each one is. At its core, tokenizing is a process of matching regular expressions. The best-known tokenizing tool is probably lex, which generates a tokenizer from a simple set of rules and is generally used together with yacc. For more information about lex and yacc, see Yacc and Lex Quick Start-IBM, or see LEX and YACC.

However, MySQL does not use lex for lexical analysis. Its syntax analysis does use yacc, and yacc needs a lexical analysis function named yylex. That is why the following macro definitions appear at the beginning of the sql_yacc.cc file:

/* Substitute the variable and function names. */
#define yyparse MYSQLparse
#define yylex MYSQLlex

MYSQLlex is the focus of this article: it is MySQL's own lexical analysis routine. The source code version is 5.1.48. The source is too long to paste here; it is in sql_lex.cc.
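The renaming trick can be tried in isolation. The sketch below is not MySQL code: MYSQLlex_demo and parser_step are hypothetical names, and the "lexer" just maps one character to a made-up token code. It only shows how a #define lets parser code that calls yylex end up calling a hand-written function.

```cpp
// Sketch only: hypothetical names, not taken from the MySQL sources.
// The macro points the generic name yylex at our own function.
#define yylex MYSQLlex_demo

// A hand-written stand-in for a lexer: returns a made-up token code
// for a single input character.
int MYSQLlex_demo(char c)
{
  return c == ';' ? 1 : 2;
}

// Parser-style code is written against the generic name yylex;
// the preprocessor rewrites the call to MYSQLlex_demo.
int parser_step(char c)
{
  return yylex(c);
}
```

Calling parser_step(';') goes through MYSQLlex_demo even though the function body only mentions yylex.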

On the first entry into lexical analysis, the state defaults to MY_LEX_START, the starting state. The meaning of each state macro can usually be guessed from its name; for example, MY_LEX_IDENT is an identifier. The pseudocode for handling the START state is as follows:

case MY_LEX_START:
{
  skip whitespace
  read the first valid character c
  state = state_map[c];
  break;
}

Wait, where does state_map come from? The assignment is at the beginning of the function:

uchar *state_map= cs->state_map;

cs?! No, not Counter-Strike! A quick look in the debugger shows that cs is my_charset_latin1, the latin1 character set, so cs is short for "character set". So state_map alone can determine the state? Let's find where it is filled in. That happens in the init_state_maps function, and the code is as follows:

/* Fill state_map with states to get a faster parser */
for (i=0; i < 256 ; i++)
{
  if (my_isalpha(cs,i))
    state_map[i]=(uchar) MY_LEX_IDENT;
  else if (my_isdigit(cs,i))
    state_map[i]=(uchar) MY_LEX_NUMBER_IDENT;
#if defined(USE_MB) && defined(USE_MB_IDENT)
  else if (my_mbcharlen(cs, i)>1)
    state_map[i]=(uchar) MY_LEX_IDENT;
#endif
  else if (my_isspace(cs,i))
    state_map[i]=(uchar) MY_LEX_SKIP;
  else
    state_map[i]=(uchar) MY_LEX_CHAR;
}
state_map[(uchar)'_']=state_map[(uchar)'$']=(uchar) MY_LEX_IDENT;
state_map[(uchar)'\'']=(uchar) MY_LEX_STRING;
state_map[(uchar)'.']=(uchar) MY_LEX_REAL_OR_POINT;
state_map[(uchar)'>']=state_map[(uchar)'=']=state_map[(uchar)'!']= (uchar) MY_LEX_CMP_OP;
state_map[(uchar)'<']= (uchar) MY_LEX_LONG_CMP_OP;
state_map[(uchar)'&']=state_map[(uchar)'|']=(uchar) MY_LEX_BOOL;
state_map[(uchar)'#']=(uchar) MY_LEX_COMMENT;
state_map[(uchar)';']=(uchar) MY_LEX_SEMICOLON;
state_map[(uchar)':']=(uchar) MY_LEX_SET_VAR;
state_map[0]=(uchar) MY_LEX_EOL;
state_map[(uchar)'\\']= (uchar) MY_LEX_ESCAPE;
state_map[(uchar)'/']= (uchar) MY_LEX_LONG_COMMENT;
state_map[(uchar)'*']= (uchar) MY_LEX_END_LONG_COMMENT;
state_map[(uchar)'@']= (uchar) MY_LEX_USER_END;
state_map[(uchar)'`']= (uchar) MY_LEX_USER_VARIABLE_DELIMITER;
state_map[(uchar)'"']= (uchar) MY_LEX_STRING_OR_DELIMITER;

First, look at the for loop. The 256 iterations cover every possible byte value. Each character is classified by these rules: a letter gets state MY_LEX_IDENT; a digit gets MY_LEX_NUMBER_IDENT; whitespace gets MY_LEX_SKIP; everything else gets MY_LEX_CHAR.

After the for loop, some special characters are overridden. Our statement "select @@version_comment limit 1" contains the special character @, whose state is set explicitly to MY_LEX_USER_END.
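The same two-step idea (classify every byte in a loop, then override a few special characters) can be sketched in a few lines. This is a simplified illustration, not the MySQL routine: it assumes the standard C classification functions in the default "C" locale instead of my_isalpha and friends, and the LEX_* names and build_state_map are hypothetical.

```cpp
#include <array>
#include <cctype>

// Hypothetical, reduced set of states; the real lexer has many more.
enum LexState : unsigned char
{
  LEX_CHAR, LEX_IDENT, LEX_NUMBER_IDENT, LEX_SKIP, LEX_USER_END, LEX_EOL
};

// Build a 256-entry lookup table: one precomputed state per byte value.
std::array<unsigned char, 256> build_state_map()
{
  std::array<unsigned char, 256> map{};
  for (int i = 0; i < 256; i++)
  {
    if (std::isalpha(i))
      map[i] = LEX_IDENT;
    else if (std::isdigit(i))
      map[i] = LEX_NUMBER_IDENT;
    else if (std::isspace(i))
      map[i] = LEX_SKIP;
    else
      map[i] = LEX_CHAR;
  }
  // Overrides applied after the loop, mirroring the special cases above.
  map['_'] = map['$'] = LEX_IDENT;
  map['@'] = LEX_USER_END;
  map[0] = LEX_EOL;
  return map;
}
```

With the table built once, finding the state for a character is a single array lookup, which is presumably what the "faster parser" comment in the original code refers to.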

How do functions such as my_isalpha determine the class of a character? Follow the call and you find a macro definition:

#define my_isalpha(s, c) (((s)->ctype+1)[(uchar) (c)] & (_MY_U | _MY_L))

Now a ctype shows up, with c used as the index into it. _MY_U and _MY_L are defined as follows:

#define _MY_U 01 /* Upper case */
#define _MY_L 02 /* Lower case */

What is stored in ctype? In the ctype-latin1.c source file, we find the initializer of the my_charset_latin1 character set:

CHARSET_INFO my_charset_latin1=
{
    8,0,0,                           /* number */
    MY_CS_COMPILED | MY_CS_PRIMARY,  /* state */
    "latin1",                        /* cs name */
    "latin1_swedish_ci",             /* name */
    "",                              /* comment */
    NULL,                            /* tailoring */
    ctype_latin1,
    to_lower_latin1,
    to_upper_latin1,
    sort_order_latin1,
    NULL,                            /* contractions */
    NULL,                            /* sort_order_big */
    cs_to_uni,                       /* tab_to_uni */
    NULL,                            /* tab_from_uni */
    my_unicase_default,              /* caseinfo */
    NULL,                            /* state_map */
    NULL,                            /* ident_map */
    1,                               /* strxfrm_multiply */
    1,                               /* caseup_multiply */
    1,                               /* casedn_multiply */
    1,                               /* mbminlen */
    1,                               /* mbmaxlen */
    0,                               /* min_sort_char */
    255,                             /* max_sort_char */
    ' ',                             /* pad char */
    0,                               /* escape_with_backslash_is_dangerous */
    &my_charset_handler,
    &my_collation_8bit_simple_ci_handler
};

We can see that ctype = ctype_latin1; and ctype_latin1 is:

static uchar ctype_latin1[] = {
    0,
   32, 32, 32, 32, 32, 32, 32, 32, 32, 40, 40, 40, 40, 40, 32, 32,
   32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
   72, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
  132,132,132,132,132,132,132,132,132,132, 16, 16, 16, 16, 16, 16,
   16,129,129,129,129,129,129,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 16, 16, 16, 16, 16,
   16,130,130,130,130,130,130,  2,  2,  2,  2,  2,  2,  2,  2,  2,
    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 16, 16, 16, 16, 32,
   16,  0, 16,  2, 16, 16, 16, 16, 16, 16,  1, 16,  1,  0,  1,  0,
    0, 16, 16, 16, 16, 16, 16, 16, 16, 16,  2, 16,  2,  0,  2,  1,
   72, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
   16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1, 16,  1,  1,  1,  1,  1,  1,  1,  2,
    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
    2,  2,  2,  2,  2,  2,  2, 16,  2,  2,  2,  2,  2,  2,  2,  2
};

Again, these values are precomputed. The leading 0 is a placeholder rather than the entry for any character, which is why the my_isalpha(s, c) definition indexes from ctype + 1. From the definitions of _MY_U and _MY_L, we can tell that each value is set according to the class of the character with the corresponding code. For example, the ASCII code of 'A' is 65 and it is an uppercase letter, so its entry must contain _MY_U; that is, bit 0 must be 1. Skipping the meaningless leading 0, the 66th element of ctype is 129 = 10000001 in binary. Bit 0, counting from the right, is indeed 1, so the character is an uppercase letter. Whoever wrote this code was really clever; I would never have come up with it myself, and I admire it. That settles the state question.
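The lookup can be checked with a small experiment. The sketch below is an illustration under stated assumptions: DemoCharset, make_demo_charset, demo_isalpha and demo_isupper are hypothetical names, the table is left at zero except for four entries copied from the array above ('A' = 129, 'a' = 130, '0' = 132, '@' = 16), and only the upper and lower case flags are modelled.

```cpp
typedef unsigned char uchar;

// Flag bits with the same values as _MY_U and _MY_L shown earlier.
enum { MY_U = 01, MY_L = 02 };

// Hypothetical stand-in for a character set: 1 leading slot + 256 entries.
struct DemoCharset
{
  uchar ctype[257];
};

// Fill in only a handful of entries, using values from the table above.
DemoCharset make_demo_charset()
{
  DemoCharset cs = {};          // every entry starts at 0
  cs.ctype[1 + 'A'] = 129;      // upper case bit (1) plus bit 7 (128)
  cs.ctype[1 + 'a'] = 130;      // lower case bit (2) plus bit 7 (128)
  cs.ctype[1 + '0'] = 132;      // neither case bit is set
  cs.ctype[1 + '@'] = 16;       // neither case bit is set
  return cs;
}

// Same shape as the my_isalpha macro: skip the leading slot, then mask.
bool demo_isalpha(const DemoCharset &cs, int c)
{
  return ((cs.ctype + 1)[(uchar) c] & (MY_U | MY_L)) != 0;
}

bool demo_isupper(const DemoCharset &cs, int c)
{
  return ((cs.ctype + 1)[(uchar) c] & MY_U) != 0;
}
```

For 'A' the stored value 129 has bit 0 set, so both checks succeed; for '0' the value 132 has neither of the two low bits set, so demo_isalpha fails.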

Back to the lexical analysis. The first letter is 's', whose state is MY_LEX_IDENT (IDENT: identifier). We break out, continue the loop, and enter the MY_LEX_IDENT branch of the case statement:

case MY_LEX_IDENT:
{
  read from 's' until a space
  if (the word read is a keyword)
  {
    next_state = MY_LEX_START;
    return tokval;               // unique token value of the keyword
  }
  else
  {
    return IDENT_QUOTED or IDENT;  // a general identifier
  }
}

Here, SELECT is certainly a keyword. Why that matters is explained in the next article, on syntax analysis.
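The keyword-or-identifier decision can be sketched as follows. This is a simplified illustration, not how MySQL looks up keywords: the three-word keyword set, the case-insensitive comparison via toupper, and the names scan_ident, ScanResult, TOK_KEYWORD and TOK_IDENT are all assumptions made for the example.

```cpp
#include <cctype>
#include <set>
#include <string>

enum Token { TOK_KEYWORD, TOK_IDENT };

struct ScanResult
{
  Token token;        // keyword or general identifier
  std::string text;   // the word as it appears in the input
  size_t next_pos;    // position just past the word
};

// Read one word starting at pos and decide whether it is a keyword.
ScanResult scan_ident(const std::string &sql, size_t pos)
{
  size_t end = pos;
  while (end < sql.size() &&
         (std::isalnum((unsigned char) sql[end]) ||
          sql[end] == '_' || sql[end] == '$'))
    ++end;

  std::string word = sql.substr(pos, end - pos);

  // Compare case-insensitively by upper-casing a copy of the word.
  std::string upper;
  for (char ch : word)
    upper += (char) std::toupper((unsigned char) ch);

  // Tiny illustrative keyword list; the real one is far larger.
  static const std::set<std::string> keywords = {"SELECT", "LIMIT", "FROM"};
  Token token = keywords.count(upper) ? TOK_KEYWORD : TOK_IDENT;
  return {token, word, end};
}
```

Scanning "select version_comment" from position 0 yields a keyword, while scanning from position 7 yields a plain identifier.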

After SELECT is scanned, @@version_comment comes next. Its first character is @, so the START branch sets state = MY_LEX_USER_END.

Go to the MY_LEX_USER_END branch, as shown below:

case MY_LEX_USER_END:            // end '@' of user@hostname
  switch (state_map[lip->yyPeek()]) {
  case MY_LEX_STRING:
  case MY_LEX_USER_VARIABLE_DELIMITER:
  case MY_LEX_STRING_OR_DELIMITER:
    break;
  case MY_LEX_USER_END:
    lip->next_state= MY_LEX_SYSTEM_VAR;
    break;
  default:
    lip->next_state= MY_LEX_HOSTNAME;
    break;
  }

Now it makes sense: two @ symbols in a row mean a system variable. Next, go to the MY_LEX_SYSTEM_VAR branch:

case MY_LEX_SYSTEM_VAR:
  yylval->lex_str.str= (char*) lip->get_ptr();
  yylval->lex_str.length= 1;
  lip->yySkip();                 // Skip '@'
  lip->next_state= (state_map[lip->yyPeek()] ==
                    MY_LEX_USER_VARIABLE_DELIMITER ?
                    MY_LEX_OPERATOR_OR_IDENT :
                    MY_LEX_IDENT_OR_KEYWORD);
  return((int) '@');

This branch skips the @, sets next_state to MY_LEX_IDENT_OR_KEYWORD, and then the MY_LEX_IDENT_OR_KEYWORD branch scans the next word, version_comment. That path should match the one taken for SELECT, except that the result is not a keyword. The rest is left to interested readers (to borrow a line performers often say: everybody join in, haha).
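The one-character lookahead used after an @ can be sketched on its own. This is a simplified model of the MY_LEX_USER_END switch, not the real code: classify_after_at and the AT_* names are hypothetical, and the function inspects raw characters instead of going through state_map.

```cpp
#include <string>

// Hypothetical outcomes of peeking at the character after an '@'.
enum AtKind
{
  AT_QUOTED_NAME,   // a quote or backtick follows: next state left unchanged
  AT_SYSTEM_VAR,    // a second '@' follows, as in @@version_comment
  AT_HOSTNAME       // anything else, as in user@hostname
};

// Peek at the character following the '@' at at_pos without consuming it.
AtKind classify_after_at(const std::string &sql, size_t at_pos)
{
  char next = (at_pos + 1 < sql.size()) ? sql[at_pos + 1] : '\0';
  switch (next)
  {
  case '\'':
  case '`':
  case '"':
    return AT_QUOTED_NAME;
  case '@':
    return AT_SYSTEM_VAR;
  default:
    return AT_HOSTNAME;
  }
}
```

For "select @@version_comment limit 1" the first @ is at position 7 and is followed by another @, so the sketch reports a system variable.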

MySQL has many more lexical states. Investigating all of them would take a while, and since this is not the core of MySQL, I will stop at this first taste. Next, we will cover syntax analysis for the SQL statement above.
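As a recap, the walkthrough can be condensed into a toy tokenizer. This is a deliberately tiny sketch, not a model of MYSQLlex: toy_tokenize is a hypothetical function that knows only three behaviors (skip whitespace, read a run of identifier characters, emit any other character as a one-character token) and returns token text instead of token codes.

```cpp
#include <cctype>
#include <string>
#include <vector>

// True for characters this sketch treats as part of a word.
static bool is_word_char(unsigned char c)
{
  return std::isalnum(c) || c == '_' || c == '$';
}

// Split the input into words and single-character tokens.
std::vector<std::string> toy_tokenize(const std::string &sql)
{
  std::vector<std::string> tokens;
  size_t i = 0;
  while (i < sql.size())
  {
    unsigned char c = (unsigned char) sql[i];
    if (std::isspace(c))            // comparable to MY_LEX_SKIP
    {
      ++i;
      continue;
    }
    if (is_word_char(c))            // comparable to the identifier states
    {
      size_t start = i;
      while (i < sql.size() && is_word_char((unsigned char) sql[i]))
        ++i;
      tokens.push_back(sql.substr(start, i - start));
      continue;
    }
    tokens.push_back(std::string(1, (char) c));  // e.g. each '@'
    ++i;
  }
  return tokens;
}
```

Feeding it "select @@version_comment limit 1" produces six tokens: select, @, @, version_comment, limit and 1, which matches the order in which the branches above were visited.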

PS: I have always wanted to study MySQL properly, but it keeps getting delayed by one thing or another. Of course, that is my own fault. I hope I can stick with it this time.

PS again: this article only reflects what I have learned. If you spot any mistakes, please correct me.

Excerpted from "No Code in Mind"

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
