Using Lex and YACC (1): A First Look at Lex
Papaya 20070301
Lex (Lexical Analyzer generator) and YACC (Yet Another Compiler Compiler) are important
UNIX tools for lexical analysis and syntax analysis. They are often used for language
analysis, formula compilation, and similar fields. Unfortunately, the Chinese documentation
available on the Internet is either too brief or too voluminous to be of much use as an
introduction. This article uses step-by-step examples to learn Lex and YACC from scratch.
<The address of this series of articles: http://blog.csdn.net/liwei_cmg/category/207528.aspx>
I. A first Lex (lexical analyzer) example
Let's start with a simple example (note: all examples in this article were run on Red Hat Linux):
The contents of a simple Lex file, exfirst.l:
%{
#include "stdio.h"
%}
%%
[\n]                  ;
[0-9]+                printf("Int: %s\n", yytext);
[0-9]*\.[0-9]+        printf("Float: %s\n", yytext);
[a-zA-Z][a-zA-Z0-9]*  printf("Var: %s\n", yytext);
[\+\-\*\/\%]          printf("Op: %s\n", yytext);
.                     printf("Unknown: %c\n", yytext[0]);
%%
Run flex on the command line to generate the file lex.yy.c automatically:
[root@localhost liweitest]# flex exfirst.l
Compile it to produce the executable program parser:
[root@localhost liweitest]# cc -o parser lex.yy.c -ll
[Note: if the -ll link option is omitted, cc fails with the errors below. The reason is explained later.]
/usr/lib/gcc-lib/i386-redhat-linux/3.2.2/.../crt1.o(.text+0x18): In function `_start':
../sysdeps/i386/elf/start.S:77: undefined reference to `main'
/tmp/cciackbx.o(.text+0x37b): In function `yylex':
: undefined reference to `yywrap'
/tmp/cciackbx.o(.text+0xabd): In function `input':
: undefined reference to `yywrap'
collect2: ld returned 1 exit status
Create the file to be parsed, file.txt:
title
i=1+3.9;
a3=909/6
bcd=4%9-333
Run the generated executable on the file:
[root@localhost liweitest]# ./parser < file.txt
Var: title
Var: i
Unknown: =
Int: 1
Op: +
Float: 3.9
Unknown: ;
Var: a3
Unknown: =
Int: 909
Op: /
Int: 6
Var: bcd
Unknown: =
Int: 4
Op: %
Int: 9
Op: -
Int: 333
At this point the basic Lex workflow should be clear:
1. Write the Lex description file.
2. Run lex or flex on it to generate lex.yy.c.
3. Compile lex.yy.c with cc to produce the executable program.
Now let's look at a more complete Lex description file, exsec.l:
%{
#include "stdio.h"
int linenum;
void showtitle(void);   /* defined in the user subroutine section */
%}
%%
title                 showtitle();
[\n]                  linenum++;
[0-9]+                printf("Int: %s\n", yytext);
[0-9]*\.[0-9]+        printf("Float: %s\n", yytext);
[a-zA-Z][a-zA-Z0-9]*  printf("Var: %s\n", yytext);
[\+\-\*\/\%]          printf("Op: %s\n", yytext);
.                     printf("Unknown: %c\n", yytext[0]);
%%
void showtitle(void)
{
    printf("----- Lex example -----\n");
}
int main()
{
    linenum = 0;
    yylex();   /* perform the analysis */
    printf("\nLine count: %d\n", linenum);
    return 0;
}
int yywrap()
{
    return 1;
}
Generate and compile:
[root@localhost liweitest]# flex exsec.l
[root@localhost liweitest]# cc -o parser lex.yy.c
[root@localhost liweitest]# ./parser < file.txt
----- Lex example -----
Var: i
Unknown: =
Int: 1
Op: +
Float: 3.9
Unknown: ;
Var: a3
Unknown: =
Int: 909
Op: /
Int: 6
Var: bcd
Unknown: =
Int: 4
Op: %
Int: 9
Op: -
Int: 333
Line count: 4
The -ll option was not used this time, yet the program compiles and links, because the
file defines main and yywrap itself. The rest of this article describes the Lex
description file (.l) in detail.
II. Structure of the Lex (lexical analyzer) description file
Lex is a generator of lexical analysis programs: from a specification of lexical rules it
generates a word-recognition program, which identifies each word in the input text. A
description file is generally divided into three parts: <definition section>, <rule section>,
and <user subroutine section>. The rule section is required; the definition section and the
user subroutine section are optional.
(1) Definition section
The definition section starts with the %{ symbol and ends with the %} symbol. It can contain
include directives, declarations, and other C statements. This part is no different from the
beginning of an ordinary C program.
%{
#include "stdio.h"
int linenum;
void showtitle(void);   /* defined in the user subroutine section */
%}
(2) Rule section
The rule section starts with the %% symbol and ends with the %% symbol, and contains the
lexical rules. Each lexical rule consists of a pattern and an action. The pattern can be
any regular expression; the action is a group of C statements. Note that Lex stores the
recognized word in the character array yytext[], so the contents of this array are the
recognized word. Predefined variables and functions such as yytext are described in a
later section. If an action consists of several statements, they can be enclosed in { }.
%%
title                 showtitle();
[\n]                  linenum++;
[0-9]+                printf("Int: %s\n", yytext);
[0-9]*\.[0-9]+        printf("Float: %s\n", yytext);
[a-zA-Z][a-zA-Z0-9]*  printf("Var: %s\n", yytext);
[\+\-\*\/\%]          printf("Op: %s\n", yytext);
.                     printf("Unknown: %c\n", yytext[0]);
%%
A. Regular expressions in the rule section
The rule section is the most complex part of the Lex description file. The following lists
the meaning of the regular expression symbols used in patterns:
A-Z, 0-9, a-z   The letters and digits that make up a pattern.
-               Specifies a range. For example, a-z means all characters from a to z.
\               Escape character. It overrides the special meaning a character has in
                the expression, so that the character is taken literally.
[]              A character set. Matches any one character inside the brackets. If the
                first character is ^, the set is negated. For example, [abc] matches
                any one of a, b, or c.
^               Negation (as the first character inside []).
*               Matches zero or more occurrences of the preceding pattern.
+               Matches one or more occurrences of the preceding pattern.
?               Matches zero or one occurrence of the preceding pattern.
$               Matches the end of a line when it is the last character of the pattern.
{}              The number of times a pattern may occur. For example, a{1,3} means that a
                may occur from one to three times, and [a-z]{5} means a string of length 5
                made up of the characters a-z. Braces are also used to reference
                predefined names.
.               Matches any character except \n.
()              Groups a series of regular expressions. For example:
                {letter}({letter}|{digit})*
|               Logical OR between expressions.
"..."           The literal meaning of the enclosed characters, including metacharacters.
                For example, "*" is equivalent to [\*].
/               Lookahead (trailing context). If "/" in a pattern is followed by another
                expression, only the part before "/" is matched. For example, with the
                pattern ABC/D and the input ABCD, ABC matches ABC/D and D is then matched
                against the other patterns. With the input ABCE, ABCE does not match ABC/D.
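Several of these symbols can be tried outside Lex. The following is a minimal sketch, not part of Lex itself, that assumes a POSIX system with <regex.h> and uses POSIX extended regular expressions, whose syntax overlaps with Lex patterns for the common cases (character sets, *, +, ?, {m,n}, \ escapes). The helper name matches_whole is hypothetical. Lex-only features such as "..." quoting, the / lookahead, and {name} references are not covered by this sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* matches_whole is a hypothetical helper, not a Lex function.
   Returns 1 if the whole of text matches pattern, 0 if it does not,
   and -1 if the pattern is too long or cannot be compiled. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);

    /* Anchor the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    /* REG_EXTENDED selects POSIX extended syntax; REG_NOSUB skips capture groups. */
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For instance, matches_whole("[0-9]*\\.[0-9]+", "3.9") accepts the float pattern used in the examples, while the same pattern rejects "39" because the escaped dot must appear literally.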
B. Priority in the rule section
Rules have a notion of priority. Here is a simple example:
%{
#include "stdio.h"
%}
%%
[\n]   ;
a      {printf("one\n");}
aa     {printf("two\n");}
aaaa   {printf("three\n");}
%%
Now suppose the input is as follows:
[root@localhost liweitest]# cat file1.txt
aaaaaaa
[root@localhost liweitest]# ./parser < file1.txt
three
two
one
When Lex performs lexical analysis, it reads characters one at a time and matches the rules
from top to bottom. After reading the first character a, all three rules are still
candidates. Lex keeps reading; by the fifth character only the rule aaaa can still apply,
so its action is executed on the first four characters, and the process repeats on the rest
of the input. As you can see, Lex selects the rule with the longest match.
If the rule
aaaa   {printf("three\n");}
is changed to
aaaaa  {printf("three\n");}
then ./parser < file1.txt outputs:
three
two
Here is a special case:
%%
title                 showtitle();
[a-zA-Z][a-zA-Z0-9]*  printf("Var: %s\n", yytext);
%%
With the input title, Lex finds that two rules match the same text. In that case Lex
selects only the first rule, and the second one is ignored. This shows that Lex also gives
priority by order. Change the example slightly:
%%
[a-zA-Z][a-zA-Z0-9]*  printf("Var: %s\n", yytext);
title                 showtitle();
%%
When Lex processes this file, it reports: warning, rule cannot be matched. The string title
is always handled by the first rule, so the second rule can never take effect.
Modify the first example once more to reinforce the point:
%{
#include "stdio.h"
%}
%%
[\n]   ;
a      {printf("one\n");}
aa     {printf("two\n");}
aaaa   {printf("three\n");}
aaaa   {printf("Cannot be executed!");}
./parser < file1.txt produces the same output as before. The last rule is ignored.
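The two priority rules (longest match first, then the earlier rule on a tie) can be sketched in plain C. This is only an illustration under simplifying assumptions: the rules are literal strings rather than regular expressions, the helper names pick_rule and scan are hypothetical, and unmatched characters are simply skipped, whereas a real Lex scanner would apply its default action to them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* pick_rule is a hypothetical helper. It returns the index of the rule a
   Lex-style scanner would choose at the start of input, or -1 if no rule
   matches. Rules are plain literal strings in this sketch. */
int pick_rule(const char *input, const char *const rules[], size_t nrules)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(input != NULL && rules != NULL);
    for (i = 0; i < nrules; i++) {
        size_t len = strlen(rules[i]);
        /* A rule is a candidate if it is a prefix of the remaining input. */
        if (len > 0 && strncmp(input, rules[i], len) == 0) {
            /* A strictly longer match wins; on equal length the earlier
               rule is kept, which models the order priority. */
            if (len > best_len) {
                best = (int)i;
                best_len = len;
            }
        }
    }
    return best;
}

/* Splits input into tokens and stores the chosen rule indices in out.
   Returns the number of tokens found (at most max_out). */
size_t scan(const char *input, const char *const rules[], size_t nrules,
            int out[], size_t max_out)
{
    size_t n = 0;

    assert(input != NULL && out != NULL);
    while (*input != '\0' && n < max_out) {
        int r = pick_rule(input, rules, nrules);
        if (r < 0) {
            input++;   /* no rule applies: skip one character */
            continue;
        }
        out[n++] = r;
        input += strlen(rules[r]);
    }
    return n;
}
```

With the rules a, aa, aaaa and a duplicate aaaa, scanning aaaaaaa selects rule indices 2, 1, 0, which corresponds to the three, two, one output; the duplicate rule at index 3 is never chosen.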
C. Named definitions used in the rule section
See the following example:
%{
#include "stdio.h"
int linenum;
%}
int    [0-9]+
float  [0-9]*\.[0-9]+
%%
{int}    printf("Int: %s\n", yytext);
{float}  printf("Float: %s\n", yytext);
.        printf("Unknown: %c\n", yytext[0]);
%%
Some named definitions have been added between %} and %%. Note that they do not end with
a semicolon. Here int and float are only names and carry no special meaning. Between the
two %% markers they can be referenced directly as {int} and {float}, which simplifies the
pattern definitions.
(3) User subroutine section
The user subroutine section follows the last %% and can contain subroutines written in C.
These subroutines can be called from the actions above to simplify programming. Note that
when the -ll option is not used at link time, the functions main and yywrap must be defined
here (yywrap is described later). For example:
...
%%
void showtitle(void)
{
    printf("----- Lex example -----\n");
}
int main()
{
    linenum = 0;
    yylex();   /* perform the Lex analysis */
    printf("\nLine count: %d\n", linenum);
    return 0;
}
int yywrap()
{
    return 1;
}
III. Some internal variables and functions of Lex (lexical analyzer)
Predefined internal variables:
yytext    char *   The currently matched string.
yyleng    int      The length of the currently matched string.
yyin      FILE *   The file Lex is currently reading. The default is standard input.
yyout     FILE *   The file Lex writes its output to. The default is standard output.
yylineno  int      The current line number.
Predefined internal macros:
ECHO      #define ECHO fwrite(yytext, yyleng, 1, yyout)
                   Also the default action for unmatched characters.
Predefined internal functions:
int yylex(void)    Calls Lex to perform the lexical analysis.
int yywrap(void)   Called at the end of a file (or of the input). If the function returns
                   1, parsing stops. It can therefore be used to parse multiple files:
                   write the code in the third section (user subroutines), point the yyin
                   file pointer at each file in turn until all files have been parsed,
                   and finally have yywrap() return 1 to end the parsing.
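The multi-file use of yywrap can be sketched as follows. This is a minimal sketch under stated assumptions, not code from flex: in a real .l file yyin is declared by the generated scanner, so the yyin definition here is only a stand-in that lets the sketch compile on its own. The names input_files, next_file, and open_first_input are hypothetical, and files that cannot be opened are simply skipped.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the yyin that flex declares in lex.yy.c. Remove this line
   when placing the code in the user subroutine section of a real .l file. */
FILE *yyin;

/* Hypothetical bookkeeping: a NULL-terminated list of file names and the
   index of the next file to open. */
static const char *const *input_files;
static int next_file;

/* Called by the scanner at end of input. Returns 0 after pointing yyin at
   the next readable file, or 1 when no files are left. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin)
        fclose(yyin);
    yyin = NULL;

    while (input_files != NULL && input_files[next_file] != NULL) {
        yyin = fopen(input_files[next_file++], "r");
        if (yyin != NULL)
            return 0;   /* more input: scanning continues with this file */
    }
    return 1;           /* nothing left: scanning stops */
}

/* Registers the file list and opens the first readable file.
   Returns 1 on success, 0 if no file could be opened. */
int open_first_input(const char *const *files)
{
    assert(files != NULL);
    input_files = files;
    next_file = 0;
    yyin = NULL;
    return yywrap() == 0;
}
```

In a scanner, main would call open_first_input with the list of file names and then call yylex() once; each time the scanner reaches the end of a file it calls yywrap, which switches yyin to the next file until the list is exhausted.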
Both lex and flex are tools that process Lex files, and their usage is similar. flex stands
for fast lexical analyzer generator and can be seen as an upgraded version of lex.
For more information, see the flex man page.
IV. Summary of Lex
Lex is in essence a lexical analyzer. Driven by the description file *.l, it parses a file
sequentially, character by character, according to regular expressions, and updates the
parsing state in memory as it goes. However, Lex only has states and state transitions.
Because it has no stack, it is not suited to parsing nested structures. YACC adds a stack
and can handle such structures easily. Lex is good at pattern matching; when more complex
processing is needed, YACC is required.