Lex automatically translates the regular expressions that describe the lexical structure of the input, together with their corresponding actions, into a program in a host language. This program, the lexical analyzer, has the fixed name yylex and is written in C.
yylex recognizes the words in the input string and performs the specified action whenever a word is recognized.
Consider a simple example: a Lex source program that converts the lowercase letters in the input string to the corresponding uppercase letters.
The program is as follows:
%%
[a-z] printf("%c", yytext[0]+'A'-'a');
The first line of the program is a delimiter that marks the beginning of the recognition rules. The second line is a recognition rule. On the left is the regular expression that matches a lowercase letter. On the right is the action taken when a lowercase letter is recognized: the letter is converted to the corresponding uppercase letter.
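The effect of this two-line program can be imitated by hand in plain C. The sketch below is only an illustration of the behavior, not the code Lex generates: the function name to_upper_like_lex and the string-buffer interface are assumptions made for the example, and ASCII letter ordering is assumed.

```c
#include <stddef.h>

/* Hand-written stand-in for the scanner Lex would build from
 *     %%
 *     [a-z] printf("%c", yytext[0]+'A'-'a');
 * Hypothetical helper: a real yylex reads an input stream and writes
 * an output stream instead of using string buffers.
 * `out` must have room for strlen(in) + 1 characters. */
void to_upper_like_lex(const char *in, char *out)
{
    size_t i;
    for (i = 0; in[i] != '\0'; i++) {
        char c = in[i];
        if (c >= 'a' && c <= 'z')
            out[i] = (char)(c + 'A' - 'a');  /* the rule's action */
        else
            out[i] = c;                      /* unmatched input is copied as is */
    }
    out[i] = '\0';
}
```

Calling to_upper_like_lex("abc 12 Xy", buf) leaves "ABC 12 XY" in buf: characters that no rule matches pass through unchanged.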
Lex works by converting the regular expressions in the source program into a corresponding deterministic finite automaton and inserting the corresponding actions at the appropriate places in yylex. Control flow is handled by the interpreter of the deterministic finite automaton, which is the same for every source program. For details on how Lex works, see [3]; they are not described here.
1.2 Format of the Lex source program
The general format of a Lex source program is:
{Auxiliary definitions section}
%%
{Recognition rules section}
%%
{User subroutines section}
The parts enclosed in curly braces are optional. When there is no user subroutines section, the second %% can also be omitted. The first %% is required because it marks the beginning of the recognition rules section. The shortest legal Lex source program is therefore:
%%
Its function is to copy the input string to the output file unchanged.
The recognition rules are the core of a Lex source program. They form a table: the left column holds regular expressions and the right column holds the corresponding actions. The following is a typical recognition rule:
integer printf("found keyword INT");
This rule looks for the word "integer" in the input string and prints "found keyword INT" whenever a match succeeds.
Note that in a recognition rule the regular expression and the action must be separated by whitespace. If the action is a single C statement, it can be written on the same line to the right of the expression; if the action occupies more than one line, it must be enclosed in curly braces or an error results. The rule above can also be written as:
integer {printf("found keyword INT");}
The following sections describe the recognition rules section first, and then the remaining sections.
1.3 Regular expressions used by Lex
A regular expression denotes a set of strings. It is built from text characters and operators. Text characters form the basic regular expressions, each of which denotes a single string;
operators combine basic regular expressions into complex ones that denote sets of strings.
For example:
ab
denotes only the string ab, while
(ab)+
denotes the set of strings:
{ab, abab, ababab, ...}.
There are 16 operators in Lex regular expressions:
" \ [] ^ - ? . * + | () / $ {} % <>
When one of these operators must appear in a regular expression as a text character, it must be enclosed in double quotes " or preceded by a backslash \. For example,
xyz"++" or xyz\+\+
denotes the string xyz++.
To avoid having to memorize the operators above, it is advisable to quote or escape every character that is not a letter or a digit.
To denote the double quote itself, write \"; to denote a backslash, write "\" or \\.
As mentioned earlier, whitespace in a recognition rule marks the end of the regular expression, so a space inside a regular expression must be quoted or escaped with a backslash; a space inside square brackets [] is the exception.
Several special symbols:
\n is newline
\t is tab
\b is backspace
The operators are described below, grouped by function.
1. Character sets
A set of characters is denoted with square brackets. The regular expression
[abc]
matches a single character a, b, or c.
Most operators lose their special meaning inside square brackets; the exceptions are \, -, and ^.
The operator - denotes a range of characters, for example
[a-z0-9<>_]
denotes the character set consisting of all lowercase letters, all digits, the angle brackets, and the underscore.
If a character set is to include - itself, it must be written in the first or last position, as in
[-+0-9]
which matches all digits and the plus and minus signs.
In a character set, the operator ^ must be written in the first position, immediately after the left square bracket. It denotes the complement, relative to the computer's character set, of the characters that follow it in the brackets. For example, [^abc] matches any character except a, b, and c.
Inside square brackets the operator \ also serves to escape operators.
2. Matching any character
The operator . matches any character except newline.
In Lex regular expressions, a character can also be written as \ followed by an octal number, for example
[\40-\176]
matches all printable characters in the ASCII character set, from octal 40 (space) to octal 176 (~).
3. Optional expressions
The operator ? marks a sub-expression of a regular expression as optional, for example
ab?c
matches ac or abc, that is, b is optional.
4. Closure operations
The operators * and + are the closure operators of Lex regular expressions. They denote repetition of a sub-expression: "a*" denotes the set of strings consisting of zero or more a's, while "a+" denotes the set of strings consisting of one or more a's. The following two regular expressions are commonly used:
[a-z]+
[A-Za-z][A-Za-z0-9]*
The first denotes the set of all strings of lowercase letters, and the second the set of all alphanumeric strings that begin with a letter.
5. Alternation and grouping
The operator | denotes alternation:
(ab|cd)
matches ab or cd.
The operator () denotes grouping. Note the difference between () and []: (ab) denotes the string ab, while [ab] denotes a single character a or b.
Parentheses () are used to build complex regular expressions, for example:
(ab|cd+)?(ef)*
matches abefef, efef, cdef, and cddd, but not abc, abcd, or abcdef.
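The claims in this example can be checked outside Lex. The following is a minimal sketch that uses the POSIX regcomp and regexec functions; Lex and POSIX extended regular expressions differ in general, but this particular expression is written the same way in both. The helper name whole_match is an assumption for the example, and the caller adds ^ and $ so that the whole string must match.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole string s matches the POSIX extended regular
 * expression `pattern`, and 0 otherwise (also 0 if the pattern does
 * not compile). whole_match is a hypothetical helper, not part of Lex. */
int whole_match(const char *pattern, const char *s)
{
    regex_t re;
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With the pattern "^(ab|cd+)?(ef)*$", whole_match returns 1 for abefef, efef, cdef, and cddd, and 0 for abc, abcd, and abcdef.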
6. Context sensitivity
Lex can recognize a small amount of surrounding context, so it can express context dependence to some extent.
If the first character of a regular expression is ^, the expression is matched only when it appears at the beginning of a line, that is, at the beginning of the whole input string or immediately after a newline. Note that ^ has a second use, complementing a character set; the two uses cannot conflict, because the complement use occurs only inside square brackets.
If the last character of a regular expression is $, the expression is matched only when it appears at the end of a line, that is, when it is immediately followed by a newline.
The operator / indicates that whether a regular expression is matched depends on what follows it. For example, ab/cd matches ab only when ab is immediately followed by cd. $ is in fact a special case of /; for example, the following two regular expressions are equivalent: ab$ and ab/\n
Whether an expression is matched, or what is done after a match, may also depend on the text that precedes it. Handling such left context uses the operator <> and is discussed later.
7. Repetition and auxiliary definitions
The operator {} denotes repetition when it encloses numbers, and the expansion of an auxiliary definition when it encloses a name. For example, a{1,5} denotes the set {a, aa, aaa, aaaa, aaaaa}. {digit} matches the predefined expression named digit, whose definition is inserted at the position where {digit} appears in the regular expression. Auxiliary definitions are discussed later.
Finally, the symbol % serves as a separator between the sections of a Lex source program.
1.4 Actions in the Lex source program
As noted earlier, when Lex recognizes a word it performs the corresponding action. This section describes the facilities Lex provides for writing actions.
First, note that any string in the input that matches no regular expression in any recognition rule is copied to the output file. Therefore, if the user wants anything other than a plain copy, a recognition rule with the corresponding action must be provided for every possible word. This is especially important when Lex is used to write the lexical analyzer of a programming language. The simplest kind of action discards part of the input; it is implemented with the C null statement ";".
Example: to discard all blanks, tabs, and newlines in the input, the recognition rule is:
[ \t\n] ;
If several adjacent rules have the same action, the action part can be written as |, which means that the action of the rule is the same as the action of the next rule. For example, the rule above can also be written as:
" " |
"\t" |
"\n" ;
Note that the double quotes around \t and \n can be omitted.
The external character array yytext holds the string currently matched by a rule. For example, the regular expression [a-z]+ matches every string of lowercase letters; to find out which particular string was matched, the following rule can be used:
[a-z]+ printf("%s", yytext);
The action printf("%s", yytext) prints the contents of the character array yytext. It is used so often that Lex provides the macro ECHO for it, so the rule above can be written as:
[a-z]+ ECHO;
Since the default action is already to copy the input string to the output file, what is the point of the rule above? This is explained in the section on the ambiguity of rules.
It is sometimes necessary to know the number of characters in the matched string; the external variable yyleng holds the number of characters currently in yytext. For example, to count the words and the characters in the input string (a word is assumed to be a string of uppercase or lowercase letters), the following rule can be used:
[a-zA-Z]+ {words++;
           chars += yyleng;}
Note that the first and the last character of the matched string are, respectively,
yytext[0] and yytext[yyleng-1]
The following describes three C functions provided by Lex that may be used when writing actions.
1. yymore()
When the next matched string should be appended to the string just recognized, that is, when the next input should be added to the end of yytext rather than replace its contents, the current action must call yymore().
Example: suppose a language specifies that its strings are enclosed in double quotes and that a double quote inside a string must be preceded by a backslash \. Such strings are awkward to describe with a single regular expression; it is simpler to use the following concise regular expression together with yymore():
\"[^"]* {
    if (yytext[yyleng-1] == '\\')
        yymore();
    else
        ... normal user processing
}
When the input string is "abc\"def", the rule first matches the first five characters "abc\ and then calls yymore(), so that the remainder "def is appended to the previous part. Note that the double quote that ends the string must be handled by the "normal user processing" part.
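The accumulation performed by yymore() can be traced with a small C sketch. It is not Lex code: scan_quoted is a hypothetical function that repeats the match of a quote followed by non-quote characters and keeps extending the same token whenever the piece just matched ends in a backslash, which is what the rule above does to yytext.

```c
#include <stddef.h>

/* Returns the number of characters that would be in yytext when the
 * "normal user processing" branch is reached, for input that starts
 * with a double quote. Returns 0 if the input does not start with one.
 * Hypothetical helper that imitates the rule plus yymore(). */
size_t scan_quoted(const char *in)
{
    size_t len = 0;

    if (in[0] != '"')
        return 0;
    for (;;) {
        len++;                                  /* the quote that starts this piece */
        while (in[len] != '\0' && in[len] != '"')
            len++;                              /* the [^"]* part */
        /* Piece ends in a backslash and a quote follows: yymore(), keep going. */
        if (in[len - 1] == '\\' && in[len] == '"')
            continue;
        return len;                             /* closing quote is not included */
    }
}
```

For the input "abc\"def" the first piece has five characters, the second piece adds four more, and the function returns 9; the closing quote is left in the input, as the text above points out.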
2. yyless(n)
If the tail of the currently matched string needs to be processed again, yyless(n) can be called to return it to the input string for the next match. The n in yyless(n) is the number of characters that are kept, so the number of characters returned is yyleng-n.
In early C the string "=-a" is ambiguous. Suppose it is to be interpreted as "=- a" and a message is to be printed at the same time; the following recognition rule can be used:
=-[a-zA-Z] {
    printf("Operator (=-) ambiguous\n");
    yyless(yyleng-1);
    ... action for =- ...
}
The rule first prints a message stating that there is an ambiguity, then returns the letter after the operator to the input string, and finally handles the operator "=-". Alternatively, to interpret "=-a" as "= -a", return the minus sign together with the letter to the input string for later processing, using the following rule:
=-[a-zA-Z] {
    printf("Operator (=-) ambiguous\n");
    yyless(yyleng-2);
    ... action for = ...
}
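The bookkeeping behind yyless(n) can be sketched in C. The struct and function names below are assumptions for illustration and do not reproduce Lex internals: a match records its length, keep_only(n) shortens it to n characters, and whatever follows the shortened match is what the next rule will see.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical model of the scanner state relevant to yyless(). */
struct scan {
    const char *input;  /* the whole input string */
    size_t start;       /* where the current match begins */
    size_t yyleng;      /* length of the current match */
};

/* Imitates the pattern =-[a-zA-Z]; returns 1 and sets yyleng on a match. */
int match_eq_minus_letter(struct scan *s)
{
    const char *p = s->input + s->start;

    if (p[0] == '=' && p[1] == '-' && isalpha((unsigned char)p[2])) {
        s->yyleng = 3;
        return 1;
    }
    s->yyleng = 0;
    return 0;
}

/* Imitates yyless(n): keep n characters, give the rest back to the input. */
void keep_only(struct scan *s, size_t n)
{
    if (n < s->yyleng)
        s->yyleng = n;
}

/* The text that the next match will start from. */
const char *rest_of_input(const struct scan *s)
{
    return s->input + s->start + s->yyleng;
}
```

After a match on "=-a", keep_only(&s, s.yyleng - 1) leaves the letter a as the next input, while keep_only(&s, s.yyleng - 2) leaves -a, which corresponds to the two interpretations discussed above.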
3. yywrap()
When Lex reaches the end of file in the input, it automatically calls yywrap(). If yywrap() returns 1, Lex considers the processing of the input complete; if yywrap() returns 0, Lex assumes that new input is waiting to be processed.
Lex automatically provides a yywrap() that always returns 1. If the user wants a yywrap() that returns 0, one can be written in the user subroutines section, and it replaces the yywrap() that Lex provides. In a user-written yywrap() the user can also perform other actions wanted at the end of the input file, such as printing a table or outputting statistics. An example that uses yywrap() is given later.
1.5 Ambiguity of recognition rules
Sometimes more than one rule in a Lex program matches the same string; this is the ambiguity of rules. In this case Lex applies two principles:
1) the rule that matches the most characters takes precedence
2) among rules that match the same number of characters, the rule given first takes precedence
Example: two rules are given in the following order:
integer keyword action ...;
[a-z]+ identifier action ...;
If the input is integers, it is treated as an identifier, because the rule integer can match only 7 characters while [a-z]+ can match 8. If the input is integer, it is treated as a keyword, because both rules match it with the same length but the rule integer is given first.
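The two principles can be written down as a short selection loop. The following C sketch is an illustration only; the names classify, match_integer, match_lower, and the token enumeration are assumptions, and a generated scanner reaches the same decision through its automaton rather than by trying each rule in turn.

```c
#include <stddef.h>
#include <string.h>

enum token { TOK_NONE, TOK_KEYWORD, TOK_IDENT };

/* Length matched by the rule "integer" at the start of s (0 if none). */
static size_t match_integer(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched by the rule [a-z]+ at the start of s (0 if none). */
static size_t match_lower(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Picks the rule with the longest match; on a tie the earlier rule wins
 * because a later rule must be strictly longer to replace it. */
enum token classify(const char *s, size_t *len)
{
    size_t (*rules[2])(const char *) = { match_integer, match_lower };
    enum token kinds[2] = { TOK_KEYWORD, TOK_IDENT };
    enum token best = TOK_NONE;
    size_t i;

    *len = 0;
    for (i = 0; i < 2; i++) {
        size_t n = rules[i](s);
        if (n > *len) {
            *len = n;
            best = kinds[i];
        }
    }
    return best;
}
```

For "integers" the second rule matches 8 characters and wins, so the result is TOK_IDENT; for "integer" both rules match 7 characters and the first rule, TOK_KEYWORD, is kept.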
1.6 The auxiliary definitions section of the Lex source program
The first part of a Lex source program is the auxiliary definitions section. So far we have covered only how to write the second part, the recognition rules section; we now turn to the first part. For convenience, the user of a Lex source program needs some auxiliary definitions, such as a name that stands for a complex regular expression. An auxiliary definition must appear before the first %% and must start in the first column. Its syntax is:
name translation
For example, the auxiliary definition that uses the name IDENT for the regular expression of an identifier is
IDENT [a-zA-Z][a-zA-Z0-9]*
An auxiliary definition is used in a recognition rule by enclosing name in the operator {}, and Lex automatically replaces it with translation. For example, the auxiliary definition of the identifier above is used as:
{IDENT} action for identifier ...
Below we use auxiliary definitions to write a Lex source program that recognizes integers and real numbers in FORTRAN:
D [0-9]
E [DEde][-+]?{D}+
%%
{D}+ printf("integer");
{D}+"."{D}*({E})? |
{D}*"."{D}+({E})? |
{D}+{E} printf("real");
Note that an auxiliary definition may use earlier auxiliary definitions; for example, D is used in defining E. A definition that is used must already have been defined, and circular definitions are not allowed. The rules above merely illustrate the use of auxiliary definitions; they are not complete rules for recognizing numbers in FORTRAN, because they cannot handle input such as 35.EQ.I, where the 35. would be taken as a real number. How to solve this problem is left to the reader.
In addition to the auxiliary definitions described above, users also need variables in a Lex source program, as well as subroutines that they write themselves. We have seen two commonly used variables, yytext and yyleng, and several subroutines provided by Lex: yymore, yyless, and yywrap. We now describe how users define their own variables and write their own subroutines.
Lex converts the user's Lex source program into the C program yylex. During the conversion, Lex copies the user's own variable definitions and subroutines into yylex. Lex specifies that content in any of the following three cases is copied:
1) A line that starts with a space or tab and is not part of a recognition rule is copied into the program that Lex generates. If such a line appears before the first %%, the definitions it contains are global, that is, every function in the program that Lex generates can use them. If such a line appears immediately after the first %% but before all recognition rules, the definitions are local and are copied into the code that contains the actions. Note that these lines must conform to C syntax and must appear before all recognition rules.
A side effect of this provision is that the user can supply comments for the Lex source program or for the analyzer that Lex generates; the comments must follow C syntax.
2) All lines between a line containing only %{ and a line containing only %} are copied, regardless of where they appear or what they contain. Note that %{ and %} must each occupy a line by themselves. For example:
%{
#define ENDOFFILE 0
#include "head.h"
int flag;
%}
This facility is provided mainly because some lines in C, such as the macro definition and the file inclusion line in the example above, must start in the first column.
3) Anything that appears after the second %% is copied, regardless of its format.
1.7 How to use Lex on a UNIX system
Assume that a Lex source program has been written. How is a lexical analyzer obtained from it on a UNIX system?
Lex automatically converts the Lex source program into a C program stored in a file named lex.yy.c; this C program is then compiled with the C compiler.
For example, given a Lex source program named source, the first step is to convert it to lex.yy.c with the following command:
$ lex source
($ is the UNIX prompt.) lex.yy.c is then compiled with the following command to obtain the executable a.out:
$ cc lex.yy.c -ll
The -ll on the command line above links in the Lex library and is required; see [1].
For more on this section, readers may refer to lex(1) in [4].
Lex works easily with YACC, which is described in the next chapter.
1.8 Examples
This section gives two examples of how Lex source programs are written.
1. Add 3 to every integer in the input string that is divisible by 7 and leave everything else unchanged. Consider the following Lex source program:
%%
    int k;
[0-9]+ {
    sscanf(yytext, "%d", &k);
    if (k%7 == 0)
        printf("%d", k+3);
    else
        printf("%d", k);
}
The program above has shortcomings: for a negative integer it merely adds 3 to the absolute value, and items such as X7 and 49.63 are also modified. A slight modification of the source program avoids these problems:
%%
    int k;
-?[0-9]+ {
    sscanf(yytext, "%d", &k);
    printf("%d", k%7 == 0 ? k+3 : k);
}
-?[0-9.]+ ECHO;
[A-Za-z][A-Za-z0-9]+ ECHO;
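To see what the three rules above do to a concrete input, their combined effect can be imitated in plain C. This is a sketch under stated assumptions: add3_if_multiple_of_7 is a hypothetical function working on string buffers, it picks the longest of the three patterns by hand, and it truncates the output if the buffer is too small.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Rewrites `in` into `out` (capacity `cap`) the way the three rules do:
 * plain integers divisible by 7 get 3 added, numbers containing a dot
 * and names such as x7 are echoed, everything else is copied. */
void add3_if_multiple_of_7(const char *in, char *out, size_t cap)
{
    size_t o = 0;

    if (cap == 0)
        return;
    while (*in != '\0' && o + 1 < cap) {
        size_t sign = (in[0] == '-') ? 1 : 0;
        size_t run = strspn(in + sign, "0123456789.");    /* -?[0-9.]+ */
        size_t digits = strspn(in + sign, "0123456789");  /* -?[0-9]+  */
        size_t len = 1;                                   /* default: one char */

        if (isalpha((unsigned char)in[0])) {
            while (isalnum((unsigned char)in[len]))
                len++;                                    /* name: echo it */
        } else if (run > 0) {
            len = sign + run;
            if (digits == run) {                          /* a plain integer */
                long k = strtol(in, NULL, 10);
                o += (size_t)snprintf(out + o, cap - o, "%ld",
                                      k % 7 == 0 ? k + 3 : k);
                if (o >= cap)
                    o = cap - 1;                          /* output was truncated */
                in += len;
                continue;
            }
        }
        while (len > 0 && o + 1 < cap) {                  /* ECHO or default copy */
            out[o++] = *in++;
            len--;
        }
    }
    out[o] = '\0';
}
```

For the input "14 -7 8 x7 49.63" the result is "17 -4 8 x7 49.63": the negative multiple is handled with its sign, and x7 and 49.63 pass through unchanged.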
2. The next example counts how many words of each length occur in the input string and keeps the results in the array lengs. A word is defined as a string of letters. The source program is as follows:
    int lengs[100];
%%
[a-z]+ lengs[yyleng]++;
. |
\n ;
%%
int yywrap()
{
    int i;
    printf("Length No. words\n");
    for (i=0; i<100; i++)
        if (lengs[i] > 0)
            printf("%5d%10d\n", i, lengs[i]);
    return(1);
}
In the program above, Lex only counts while it reads the input string and produces no output; the statistics are printed when yywrap() is called after the whole input has been read. This is why the user supplies yywrap(). Note that the last statement of yywrap() returns the value 1.
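The counting step of this example can also be followed in ordinary C. The sketch below is an assumption-laden imitation, not generated code: count_word_lengths is a hypothetical function that takes the input as a string, and it adds a bounds check for words of 100 or more letters, which the Lex program above does not have.

```c
#include <stddef.h>

#define MAXLEN 100

/* Fills lengs[n] with the number of words of length n in `in`, where a
 * word is a run of lowercase letters, as in the rule [a-z]+. */
void count_word_lengths(const char *in, int lengs[MAXLEN])
{
    size_t i;

    for (i = 0; i < MAXLEN; i++)
        lengs[i] = 0;
    while (*in != '\0') {
        size_t len = 0;
        while (in[len] >= 'a' && in[len] <= 'z')
            len++;
        if (len == 0) {         /* the ". |" and "\n ;" rules: discard */
            in++;
            continue;
        }
        if (len < MAXLEN)       /* guard added in this sketch only */
            lengs[len]++;
        in += len;
    }
}
```

For the input "to be or not to be\n" the array ends up with lengs[2] equal to 5 and lengs[3] equal to 1; printing the non-zero entries is what the user-supplied yywrap() does at end of input.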
1.9 Further discussion of context-sensitive processing
The notation for context sensitivity in Lex regular expressions was mentioned earlier; this section describes in detail the facilities that Lex provides for handling context. The problem is that some rules call for different actions in different contexts, or that the same string has different interpretations in different contexts. For example, in a programming language the equals sign "=" in the declaration part of a program denotes the initialization of a variable, and the action should update the symbol table; in the statement part it denotes assignment, and the code for an assignment statement should be generated. The meaning of the equals sign must therefore be determined from its context. Lex offers two main methods:
1) Use flags to distinguish between different contexts.
A flag is a user-defined variable. The user sets it to different values in different contexts, so that a recognition rule can determine the current context from the value of the flag and act accordingly.
Example: copy the input string to the output unchanged, except for the word magic: when it appears on a line beginning with the letter a it is changed to first, on a line beginning with b to second, and on a line beginning with c to third.
The Lex source program that uses the flag variable flag is as follows:
    int flag;
%%
^a {flag = 'a'; ECHO;}
^b {flag = 'b'; ECHO;}
^c {flag = 'c'; ECHO;}
\n {flag = 0; ECHO;}
magic {
    switch (flag)
    {
    case 'a': printf("first"); break;
    case 'b': printf("second"); break;
    case 'c': printf("third"); break;
    default: ECHO; break;
    }
}
2) Use start conditions to distinguish between different contexts.
In a Lex source program, users can define different start conditions by name. When a start condition is placed in front of a recognition rule, the rule is active only when Lex is in that start condition; otherwise it is inactive. The current start condition of Lex can be changed at any time by the user program (that is, by Lex actions).
Start conditions are defined by the user in the auxiliary definitions section of the Lex source program, with the syntax
%start name1 name2 name3 ...
where start can be abbreviated as s or S. The start condition names can be given in any order, and when there are many start conditions several %start lines can be used.
A start condition is used in a recognition rule by enclosing it in angle brackets to the left of the regular expression of the rule:
<name1>expression
To enter a start condition such as name1, use the following statement in an action:
BEGIN name1;
It changes the current start condition of Lex to name1.
To return to the normal state, use the statement
BEGIN 0;
which restores Lex to the initial condition of the Lex interpreter.
A rule can also be active under several start conditions, for example
<name1,name2,name3>rule
makes rule active under three different start conditions. To make a rule active under all start conditions, do not prefix it with any start condition.
Example: solve the problem in 1) again, this time with start conditions. The Lex source program is as follows:
%start AA BB CC
%%
^a {ECHO; BEGIN AA;}
^b {ECHO; BEGIN BB;}
^c {ECHO; BEGIN CC;}
\n {ECHO; BEGIN 0;}
<AA>magic printf("first");
<BB>magic printf("second");
<CC>magic printf("third");
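A start condition behaves like the state of a small state machine. The following C sketch imitates the program above on a string; rewrite_magic and the enum start are names invented for the illustration, ASCII letter ordering is assumed, and the output is truncated if the buffer is too small.

```c
#include <stddef.h>
#include <string.h>

enum start { INITIAL, AA, BB, CC };   /* mirrors "%start AA BB CC" */

/* Copies `in` to `out` (capacity `cap`), replacing the word magic
 * according to the letter that begins the current line. */
void rewrite_magic(const char *in, char *out, size_t cap)
{
    enum start state = INITIAL;
    int bol = 1;                      /* at beginning of line? */
    size_t o = 0;

    if (cap == 0)
        return;
    while (*in != '\0' && o + 1 < cap) {
        const char *repl = NULL;

        if (bol && *in >= 'a' && *in <= 'c')
            state = (enum start)(AA + (*in - 'a'));   /* BEGIN AA, BB, or CC */
        else if (*in == '\n')
            state = INITIAL;                          /* BEGIN 0 */
        else if (state != INITIAL && strncmp(in, "magic", 5) == 0)
            repl = state == AA ? "first" : state == BB ? "second" : "third";

        bol = (*in == '\n');
        if (repl != NULL) {
            size_t n = strlen(repl);
            if (n > cap - 1 - o)
                n = cap - 1 - o;                      /* truncate if needed */
            memcpy(out + o, repl, n);
            o += n;
            in += 5;                                  /* skip the word magic */
        } else {
            out[o++] = *in++;                         /* ECHO or default copy */
        }
    }
    out[o] = '\0';
}
```

For the input "a magic\nb magic\nx magic\n" the output is "a first\nb second\nx magic\n": on the third line no start condition is active, so no magic rule applies and the word is copied unchanged.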
1.10 Summary of the Lex source program format
For ease of reference, the format of the Lex source program and the Lex regular expression notation are summarized here.
The general format of a Lex source program is:
{Definitions}
%%
{Rules}
%%
{User subroutines}
The auxiliary definitions section includes the following items:
1) auxiliary definitions, in the form:
name translation
2) code copied verbatim, in the form:
space code
3) code copied verbatim, in the form:
%{
code
%}
4) start conditions, in the form:
%s name1 name2 ...
There are several other items that are rarely used and are omitted here.
The format of the recognition rules section is
expression action
where expression must be separated from action by whitespace, and an action of more than one line must be enclosed in curly braces.
The regular expression operators used by Lex are the following:
x        the character x
"x"      the character x, even if x is an operator
\x       same as above
[xy]     the character x or y
[x-z]    the character x, y, or z
[^x]     any character except x
.        any character except newline
^x       an x at the beginning of a line
<y>x     an x when Lex is in start condition y
x$       an x at the end of a line
x?       an optional x
x*       0 or more x
x+       1 or more x
x|y      x or y
(x)      x
x/y      x, but only if followed by y
{xx}     expansion of the auxiliary definition xx
x{m,n}   m to n occurrences of x