How to Use Lex and Yacc (1)

Source: Internet
Author: User

I. Using the lexical analyzer generator Lex

1.1 Lex Overview

Programming languages have evolved from machine languages to high-level languages such as Pascal and C, freeing programmers from machine-dependent details. However, a programmer writing in a high-level language must still tell the computer in detail how to solve a given problem, which is itself a complicated task.

People would like a very-high-level language in which the programmer only states what problem is to be solved, and the computer system either finds a solution automatically from the problem description or converts the description into a program in a high-level language such as C or FORTRAN that solves the problem. Although this hope has not been fully realized, it has already been achieved in some specific problem domains.

Lex, introduced here, and Yacc, introduced in the next chapter, are two such very-high-level languages in the field of compiler construction. They make it easy to describe a lexical analyzer and a parser, and they automatically generate the corresponding program in a high-level language (C).

Lex is a system that automatically generates lexical analyzers (scanners); its role is shown in Figure 1.1.

The Lex source program is written in a problem-oriented language. The core of this language is the regular expression, which describes the lexical structure of the input. In this language the user can also describe the action to be performed when a token is recognized; for example, when the lexical analyzer of a high-level language recognizes a keyword, it should return the internal code of that keyword to the parser. Lex is not a complete language. It is an extension of a high-level language (called the host language of Lex), so Lex does not define a new language for describing actions; it uses its host language instead. Here we only cover C as the host language. On Unix systems, Ratfor, an improved form of FORTRAN, can also serve as the host language of Lex.

Figure 1.1 Lex

Lex automatically converts the regular expressions describing the lexical structure of the input, together with their actions, into a program in the host language: the lexical analysis program. This program has the fixed name yylex, and here yylex is a C program.

yylex recognizes the tokens in the input string and performs the specified action each time it recognizes one.

Let's look at a simple example: a Lex source program that converts the lowercase letters in the input string into the corresponding uppercase letters.

The program is as follows:

%%
[a-z]    printf("%c", yytext[0] + 'A' - 'a');

The first line, %%, is a separator that marks the start of the recognition rules. The second line is a recognition rule. On the left is the regular expression that recognizes a lowercase letter; on the right is the action taken when a lowercase letter is recognized: convert it into the corresponding uppercase letter.
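To make the effect of this rule concrete, here is a minimal hand-written C sketch of the same behavior. It is not the code Lex generates; the function name upcase_copy and its buffer-based interface are assumptions made for illustration.

```c
#include <stddef.h>

/* Copy `in` to `out`, converting ASCII lowercase letters to uppercase.
 * Assumption of this sketch: `out` has room for strlen(in) + 1 bytes. */
void upcase_copy(const char *in, char *out)
{
    size_t i;
    for (i = 0; in[i] != '\0'; i++) {
        char c = in[i];
        if (c >= 'a' && c <= 'z')
            out[i] = (char)(c + 'A' - 'a');  /* shift by the distance between 'a' and 'A' */
        else
            out[i] = c;                      /* input matched by no rule is copied unchanged */
    }
    out[i] = '\0';
}
```

For instance, calling upcase_copy("abc Def 12", buf) leaves "ABC DEF 12" in buf: only the lowercase letters change, just as text that matches no rule passes through a Lex scanner untouched.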

Lex works by converting the regular expressions in the source program into a corresponding deterministic finite automaton and inserting the actions at the appropriate places in yylex. The control flow is handled by an interpreter for the finite automaton, and the same interpreter is used for all source programs. For details about how Lex works, refer to [3].

1.2 Lex source program format

The general format of the lex source program is:

{Auxiliary definition part}

%%

{Recognition rule part}

%%

{User subroutine part}

Each part enclosed in curly braces is optional. If there is no user subroutine part, the second %% can also be omitted. The first %% is required because it marks the beginning of the recognition rule part. The shortest legal Lex source program is:

%%

It copies the input string to the output file unchanged.

The recognition rules are the core of the Lex source program. They form a table in which the left column holds regular expressions and the right column holds the corresponding actions. The following is a typical recognition rule:

integer    printf("found keyword INT");

This rule looks for the word "integer" in the input string and prints "found keyword INT" whenever the match succeeds.

Note that in a recognition rule the regular expression and the action must be separated by spaces or tabs. If the action is a single C statement, it can be written on the same line, to the right of the regular expression. If the action takes more than one line, it must be enclosed in curly braces; otherwise an error occurs. The rule above can also be written as:

integer    { printf("found keyword INT"); }

The following sections first describe how to write recognition rules and then cover the remaining parts.

1.3 Regular expressions in Lex

A regular expression denotes a set of strings. It consists of text characters and regular operators. Text characters form basic regular expressions, each denoting one specific string; regular operators combine basic regular expressions into complex ones that denote sets of strings.

For example:

ab

denotes only the string ab, while

(ab)+

denotes the set of strings

{ab, abab, ababab, ...}.

There are 16 regular operators in Lex:

" \ [] ^ - ? . * + | () / $ {} % <>

When one of these operators must appear as a text character in a regular expression, it must be enclosed in double quotation marks or preceded by a backslash \. For example,

xyz"++" or xyz\+\+

both denote the string xyz++.

To avoid memorizing all of these operators, we recommend quoting or backslash-escaping any character that is not a letter or a digit.

To represent a double quotation mark itself, write \"; to represent a backslash, write "\" or \\.

As mentioned above, a space in a recognition rule marks the end of the regular expression. Therefore, a space inside a regular expression must be quoted or escaped with a backslash, except when it appears within square brackets.

Several special escape sequences are also recognized:

\n is a newline

\t is a tab

\b is a backspace

The following describes the preceding regular operators based on their functions.

1. Character sets

Square brackets denote a set of characters. The regular expression

[abc]

matches a, b, or c.

Most operators lose their special meaning inside square brackets; the exceptions are \, - and ^.

The operator - indicates a range of characters. For example,

[a-z0-9<>_]

denotes the character set consisting of all lowercase letters, all digits, the angle brackets, and the underscore.

If a character set is to contain - itself, it must be written in the first or last position, as in

[-+0-9]

which matches all digits and the plus and minus signs.

In a character set, the operator ^ must be written in the first position, immediately after the left square bracket. It denotes the set of all characters other than those listed in the brackets, that is, the complement with respect to the computer's character set. For example, [^abc] matches any character except a, b, and c.

The operator \ also acts as an escape inside square brackets.

2. Matching any character

The operator . matches any character except the newline.

In Lex regular expressions, a character can also be written as a backslash followed by an octal number, as in

[\40-\176]

which matches all printable characters in the ASCII character set, from octal 40 (space) to octal 176 (~).

3. Optional expressions

The operator ? marks an optional sub-expression in a regular expression. For example,

ab?c

matches ac or abc; that is, b is optional.

4. Closure operations

The operators * and + are the closure operators of Lex regular expressions. They indicate repetition of a sub-expression: "a*" denotes the set of strings consisting of zero or more a's, and "a+" denotes the set of strings consisting of one or more a's. The following two regular expressions are commonly used:

[a-z]+

[A-Za-z][A-Za-z0-9]*

The first denotes all strings of lowercase letters, and the second denotes all strings of letters and digits that begin with a letter.
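As an illustration of what the second expression accepts, the following C sketch matches it by hand against the start of a string. The function name match_identifier is hypothetical, and the sketch assumes the default "C" locale, in which isalpha and isalnum accept only ASCII letters and digits.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the length of the longest prefix of s that matches
 * [A-Za-z][A-Za-z0-9]*, or 0 if s does not start with a letter. */
size_t match_identifier(const char *s)
{
    size_t n;
    if (!isalpha((unsigned char)s[0]))
        return 0;                            /* the first character must be a letter */
    n = 1;
    while (isalnum((unsigned char)s[n]))     /* then zero or more letters or digits */
        n++;
    return n;
}
```

With this sketch, match_identifier("x25+1") returns 3 (the prefix x25), while match_identifier("9lives") returns 0 because the string starts with a digit.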

5. Alternation and grouping

The operator | indicates alternation:

(ab|cd)

matches ab or cd.

The operator () groups characters. Note the difference between () and []: (ab) denotes the string ab, while [ab] denotes the single character a or b.

Parentheses are used to build complex regular expressions, such as:

(ab|cd+)?(ef)*

which matches abefef, efef, cdef, and cddd, but does not match abc, abcd, or abcdef.
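The listed matches can be checked mechanically. The sketch below uses the POSIX regex API (regcomp, regexec, regfree from <regex.h>) rather than Lex itself, and it anchors the pattern with ^ and $ so that the whole string must match; the function name matches_example is an assumption for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the whole string s matches (ab|cd+)?(ef)*, 0 if it does not,
 * and -1 if the pattern fails to compile. */
int matches_example(const char *s)
{
    regex_t re;
    int rc;

    /* POSIX extended syntax; ^ and $ force a whole-string match. */
    if (regcomp(&re, "^(ab|cd+)?(ef)*$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Note that a Lex scanner matches prefixes of the remaining input rather than whole strings, so this check only mirrors the set of strings the expression denotes.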

6. Context sensitivity

Lex can recognize a limited amount of surrounding context, so it can express context sensitivity to a certain extent.

If the first character of a regular expression is ^, the expression is matched only when it appears at the beginning of a line, that is, at the start of the entire input or immediately after a newline. Note that ^ has another use, as the complement operator in character sets; the two uses cannot conflict.

If the last character of a regular expression is $, the expression is matched only when it appears at the end of a line, that is, when it is immediately followed by a newline.

The operator / specifies that whether a regular expression is matched depends on what follows it. For example, ab/cd matches ab only when ab is followed by cd. $ is actually a special case of /; for example, the following two regular expressions are equivalent: ab$ and ab/\n.

Whether a regular expression is matched, or which action is executed after the match, may also depend on what precedes the expression. The handling of such left context is discussed later, using the operator "<>".

7. Repetition and auxiliary definitions

When {} encloses a pair of numbers, it indicates repetition; when it encloses a name, it indicates the expansion of an auxiliary definition. For example, a{1,5} denotes the set {a, aa, aaa, aaaa, aaaaa}, and {digit} is replaced, at the position where it appears in the regular expression, by the string previously defined under the name digit. Auxiliary definitions are discussed later.

Finally, the symbol % serves as the separator between the sections of the Lex source program.

1.4 Actions in the Lex source program

As mentioned above, when Lex recognizes a token it performs the corresponding action. This section describes the facilities Lex provides for writing actions.

First, note that any part of the input that does not match the regular expression of any recognition rule is copied to the output unchanged. Therefore, if you do not want such text copied to the output, you must provide a recognition rule, with a suitable action, for every possible token. This is especially important when Lex is used to build the lexical analyzer of a programming language. One of the simplest actions is to discard certain strings in the input, which is done with the C null statement ";".

For example, to discard all spaces, tabs, and newlines in the input string, use the following recognition rule:

[ \t\n]    ;

If several adjacent rules have the same action, the action can be written as |, which means that the action of this rule is the same as that of the next rule. For example, the rule above can also be written as:

" "     |
"\t"    |
"\n"    ;

Note that the double quotation marks around \t and \n can be omitted.

The external character array yytext holds the string currently matched by a rule. For example, the regular expression [a-z]+ matches any string of lowercase letters; to find out which specific string it matched, use the following rule:

[a-z]+    printf("%s", yytext);

The action printf("%s", yytext) prints the contents of the character array yytext. It is used so often that Lex provides the macro ECHO for it, so the rule above can be written as:

[a-z]+    ECHO;

Since the default action already copies the input to the output unchanged, what is the purpose of such a rule? This is explained in the section "Ambiguity of recognition rules".

Sometimes it is necessary to know the number of characters in the matched string. The external variable yyleng holds the number of characters currently in yytext. For example, to count the number of words and the number of characters in the input (where a word is assumed to be a string of uppercase or lowercase letters), the following rule can be used:

[a-zA-Z]+    { words++; chars += yyleng; }

Note that the first and last characters of the matched string are

yytext[0] and yytext[yyleng-1]

The following describes the C functions provided by Lex that can be used when writing actions.

1. yymore()

Call yymore() in the current action when the next matched string should be appended to the currently recognized one; that is, the existing contents of yytext are not replaced by the next match but extended with it.

For example, suppose a language specifies that a string is enclosed in double quotation marks and that a double quotation mark inside a string must be preceded by a backslash \. It is not easy to express this definition of a string with a single regular expression. It is simpler to use the following concise regular expression together with yymore():

\"[^"]*    {
               if (yytext[yyleng-1] == '\\')
                   yymore();
               else
                   ... normal user processing
           }

When the input is "abc\"def", the rule first matches the first five characters "abc\ and then calls yymore(), so that the next match, "def, is appended to the first part. Note that the double quotation mark that ends the string is handled by "normal user processing".
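The scanning idea behind this rule can be sketched in plain C without Lex. In the sketch below, the function name scan_string is hypothetical, and, unlike the rule above, the closing quotation mark is consumed inside the same function instead of being left to separate processing.

```c
#include <stddef.h>

/* s must point at an opening double quote. Return the number of characters
 * up to and including the closing quote, treating \" as an escaped quote.
 * Return 0 if s does not start with a quote or the string is unterminated. */
size_t scan_string(const char *s)
{
    size_t n;
    if (s[0] != '"')
        return 0;
    n = 1;
    for (;;) {
        /* Like [^"]*: consume everything up to the next double quote. */
        while (s[n] != '\0' && s[n] != '"')
            n++;
        if (s[n] == '\0')
            return 0;                 /* ran out of input before a closing quote */
        if (s[n - 1] == '\\') {
            n++;                      /* escaped quote: keep it and go on, as yymore() would */
            continue;
        }
        return n + 1;                 /* count the closing quote as well */
    }
}
```

For the input "abc\"def" followed by other text, the sketch returns 10, the length of the whole quoted string. Like the rule it imitates, it does not distinguish a backslash that is itself escaped, so a string ending in \\ is misread.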

2. yyless(n)

If the tail of the currently matched string needs to be processed again, call yyless(n) to return it to the input, where it is matched again next time. The n in yyless(n) is the number of characters that are retained, so the number of characters returned is yyleng-n.

For example, the string "=-a" has two meanings in the C language. Suppose you want to interpret it as "=- a" and print a message at the same time. The following recognition rule can be used:

=-[a-zA-Z]    {
                  printf("Operator (=-) ambiguous\n");
                  yyless(yyleng-1);
                  ... action for =- ...
              }

This rule first prints a message pointing out the ambiguity, then returns the letter after the operator to the input, and finally processes the operator "=-". If instead you want to interpret "=-a" as "= -a", return both the minus sign and the letter to the input for processing next time, using the following rule:

=-[a-zA-Z]    {
                  printf("Operator (=-) ambiguous\n");
                  yyless(yyleng-2);
                  ... action for = ...
              }

3. yywrap()

When Lex reaches the end of the input, it automatically calls yywrap(). If yywrap() returns 1, Lex considers the input fully processed; if it returns 0, Lex assumes that more input is waiting to be processed.

Lex automatically provides a yywrap() that always returns 1. If you want a yywrap() that returns 0, write your own in the user subroutine part; it replaces the one Lex provides. In a user-written yywrap() you can also perform any other actions you want at the end of the input, such as printing a table or outputting statistics. An example that uses yywrap() is given later.

1.5 Ambiguity of recognition rules

Sometimes more than one rule in a Lex program can match the same string. This is rule ambiguity. In this case, Lex applies two principles:

1) The rule that matches the most characters is preferred.

2) Among rules that match the same number of characters, the rule given first is preferred.

For example, suppose two rules are given in the following order:

integer    keyword action ...;
[a-z]+     identifier action ...;

If the input is integers, it is processed as an identifier, because the rule integer can match only 7 characters while [a-z]+ can match 8. If the input is integer, it is processed as a keyword, because both rules match 7 characters but the rule integer is given first.
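The two principles can be illustrated with a small hand-written C sketch that tries both rules at the start of a string and then picks a winner. This is not how Lex is implemented (Lex compiles all rules into one automaton); the names token_kind and classify are hypothetical.

```c
#include <ctype.h>
#include <string.h>

enum token_kind { TOK_NONE, TOK_KEYWORD, TOK_IDENTIFIER };

/* Classify the token at the start of s using two rules, in this order:
 *   rule 1: the literal keyword "integer"
 *   rule 2: [a-z]+
 * *len receives the length of the chosen match. */
enum token_kind classify(const char *s, size_t *len)
{
    size_t kw = 0, id = 0;

    if (strncmp(s, "integer", 7) == 0)
        kw = 7;                                  /* rule 1 matches 7 characters */
    while (islower((unsigned char)s[id]))
        id++;                                    /* rule 2 matches id characters */

    if (kw == 0 && id == 0) {
        *len = 0;
        return TOK_NONE;
    }
    /* Principle 1: the longer match wins.
     * Principle 2: on a tie, the rule given first (the keyword) wins. */
    if (kw >= id) {
        *len = kw;
        return TOK_KEYWORD;
    }
    *len = id;
    return TOK_IDENTIFIER;
}
```

Here classify("integers", &n) yields TOK_IDENTIFIER with n equal to 8, while classify("integer ", &n) yields TOK_KEYWORD with n equal to 7.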

1.6 Auxiliary definitions in the Lex source program

The first part of the Lex source program is the auxiliary definition part. So far we have only covered how to write the second part, the recognition rules. Now let's look at the first part. For convenience, a Lex source program often needs auxiliary definitions, for example a name that stands for a complex regular expression. Auxiliary definitions must appear before the first %% and must start in the first column. The syntax of an auxiliary definition is:

name translation

For example, the regular expression for identifiers can be given the name IDENT as follows:

IDENT    [a-zA-Z][a-zA-Z0-9]*

To use an auxiliary definition in a recognition rule, enclose its name in the operator {}; Lex automatically replaces it with its translation. For example, the identifier definition above is used as follows:

{IDENT}    action for identifier ...

The following Lex source program uses auxiliary definitions to recognize integers and real numbers in FORTRAN:

D    [0-9]
E    [DEde][-+]?{D}+
%%
{D}+                   printf("integer");
{D}+"."{D}*({E})?      |
{D}*"."{D}+({E})?      |
{D}+{E}                printf("real");

Note that an auxiliary definition can use earlier auxiliary definitions; for example, D is used in the definition of E. A definition that is used must already have been defined, and circular definitions are not allowed. The rules above only illustrate the use of auxiliary definitions and do not handle every FORTRAN number correctly: in an expression such as 35.EQ.I, the leading 35. is treated as a real number. Please think about how to solve this problem.
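To see which strings these rules classify as integers or reals, the expressions can be tried out with the POSIX regex API. In the sketch below, the macros DIGIT and EXPONENT play the role of the auxiliary definitions D and E, expanded by the C preprocessor instead of by Lex; the names full_match and classify_number are assumptions, and each pattern is anchored so that the whole token must match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the whole string s matches the POSIX extended regular
 * expression pattern, 0 otherwise (also when the pattern does not compile). */
static int full_match(const char *pattern, const char *s)
{
    regex_t re;
    int rc;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Counterparts of the auxiliary definitions D and E. */
#define DIGIT    "[0-9]"
#define EXPONENT "[DEde][-+]?" DIGIT "+"

/* Return "integer", "real", or "other" for a complete token s. */
const char *classify_number(const char *s)
{
    if (full_match("^" DIGIT "+$", s))
        return "integer";
    if (full_match("^(" DIGIT "+\\." DIGIT "*(" EXPONENT ")?"
                   "|" DIGIT "*\\." DIGIT "+(" EXPONENT ")?"
                   "|" DIGIT "+" EXPONENT ")$", s))
        return "real";
    return "other";
}
```

With this sketch, "35" is classified as integer, "3.14", ".5", "1E10" and "2.5d-3" as real, and "35." also as real, which is exactly the weakness in handling 35.EQ.I.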

Besides auxiliary definitions, a Lex source program also needs variables and some user-written subroutines. We have already seen two common variables, yytext and yyleng, and several subroutines provided by Lex: yymore, yyless, and yywrap. We now describe how to define your own variables and write your own subroutines.

Lex converts a user-written Lex source program into the C program yylex. During the conversion, Lex copies the user's variable definitions and subroutines into the generated program. Lex copies text in the following three cases:

1) Lines that begin with a space or tab and are not part of a recognition rule are copied into the program produced by Lex. If such a line appears before the first %%, the definitions it contains are global, that is, usable by all functions in the generated program. If such lines appear after the first %% but before all recognition rules, their definitions are local and are copied into the code that contains the actions. Note that these lines must be valid C and, in the second case, must appear before all recognition rules.

A side benefit of this provision is that the user can add comments to the Lex source program or to the lexical analyzer generated by Lex. Such comments must, of course, follow C syntax.

2) All lines between a line containing only %{ and a line containing only %} are copied, wherever they appear and whatever they contain. Note that %{ and %} must each occupy a line of their own. For example:

%{
#define ENDOFFILE 0
#include "head.h"
int flag;
%}

This mechanism is provided mainly because some lines in C, such as the macro definition and the include line in the example above, must start in the first column.

3) Everything that appears after the second %% is copied, regardless of its format.

1.7 How to use Lex on UNIX systems

Suppose a Lex source program has been written. How do we obtain a lexical analyzer from it on a UNIX system?

Lex automatically converts the Lex source program into a C program, which is placed in the file lex.yy.c. This C program can be run after being compiled by the C compiler.

For example, suppose there is a Lex source program named source. The first step is to convert it into lex.yy.c with the following command:

$ lex source

($ is the UNIX prompt.) The second step is to compile lex.yy.c with the following command to obtain the executable file a.out:

$ cc lex.yy.c -ll

The -ll at the end of the command line links the Lex library and is required. For details, refer to [1].

For more information about this section, see lex(1) in [4].

Lex can be easily used with yacc, which will be introduced in the next chapter.

1.8 Examples

This section gives two examples that illustrate how to write a Lex source program.

1. Add 3 to every integer in the input that is divisible by 7, and output the rest unchanged. First look at the following Lex source program:

%%
        int k;
[0-9]+  {
            sscanf(yytext, "%d", &k);
            if (k % 7 == 0)
                printf("%d", k + 3);
            else
                printf("%d", k);
        }

This program still has shortcomings. For a negative integer, it adds 3 only to the absolute value, and it also modifies items such as X7 and 49.63.

The following slightly modified version avoids these problems:

%%
                        int k;
-?[0-9]+                {
                            sscanf(yytext, "%d", &k);
                            printf("%d", k % 7 == 0 ? k + 3 : k);
                        }
-?[0-9.]+               ECHO;
[A-Za-z][A-Za-z0-9]+    ECHO;
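The arithmetic performed by the action can be tried in isolation with a small C sketch. It assumes the token has already been isolated by the scanner; the function name rewrite_token and its output-buffer interface are hypothetical.

```c
#include <stdio.h>

/* Rewrite one integer token: if its value is divisible by 7, write the
 * value plus 3 into out, otherwise write the value unchanged.
 * Return 0 on success, -1 if token does not start with an integer. */
int rewrite_token(const char *token, char *out, size_t out_size)
{
    int k;
    if (sscanf(token, "%d", &k) != 1)
        return -1;
    snprintf(out, out_size, "%d", k % 7 == 0 ? k + 3 : k);
    return 0;
}
```

For the token "-21" this produces "-18", because the sign is part of the token. If the scanner matched only the digits 21, as in the first version of the program, the output would be -24, which is the shortcoming with negative integers described above.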

2. The next example counts the number of words of each length in the input. The result is kept in the array lengs. A word is defined as a string of letters. The source program is as follows:

        int lengs[100];
%%
[a-z]+  lengs[yyleng]++;
.       |
\n      ;
%%
yywrap()
{
    int i;
    printf("Length  No. words\n");
    for (i = 0; i < 100; i++)
        if (lengs[i] > 0)
            printf("%5d%10d\n", i, lengs[i]);
    return (1);
}

In this program, Lex only counts while reading the input and produces no output. Once the input has been read completely, yywrap() is called and prints the statistics, which is why the user supplies yywrap(). Note that the last statement of yywrap() returns the value 1.
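The counting logic of this example can also be written directly in C, which may help when checking what the scanner is expected to compute. In this sketch a word is a run of lowercase letters; the names count_word_lengths and MAX_LEN are hypothetical, and words of MAX_LEN or more characters are ignored, which is an assumption of the sketch.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_LEN 100

/* For every word (run of lowercase letters) of length n < MAX_LEN in text,
 * increment lengs[n]. The caller must zero the array beforehand. */
void count_word_lengths(const char *text, int lengs[MAX_LEN])
{
    size_t i = 0;
    while (text[i] != '\0') {
        size_t start = i;
        while (islower((unsigned char)text[i]))
            i++;                          /* consume one word */
        if (i > start) {
            if (i - start < MAX_LEN)
                lengs[i - start]++;
        } else {
            i++;                          /* any other character is discarded */
        }
    }
}
```

For the text "the cat sat on a mat", the sketch records four words of length 3, one of length 2, and one of length 1.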

1.9 More on context sensitivity

In Section 1.3 we introduced how context sensitivity is expressed in Lex regular expressions. Here we describe in detail the facilities Lex provides for handling context. The problem is that some rules must take different actions in different contexts, or that the same string has different interpretations in different contexts. For example, in a programming language the equal sign "=" in the declaration part gives a variable its initial value, so the action should update the symbol table; in the statement part, the equal sign is the assignment operator of an assignment statement, so code for the assignment should be generated. The meaning of the equal sign must therefore be determined from its context. Lex provides two main methods:

1) Use a flag to distinguish different contexts.

A flag is a user-defined variable. Actions set it to different values in different contexts, so that a recognition rule can determine the current context from the value of the flag and take the corresponding action.

For example, copy the input to the output unchanged, except for the word magic: on a line that begins with the letter a, change it to first; on a line that begins with b, change it to second; and on a line that begins with c, change it to third.

The Lex source program using a flag is as follows:

        int flag;
%%
^a      { flag = 'a'; ECHO; }
^b      { flag = 'b'; ECHO; }
^c      { flag = 'c'; ECHO; }
\n      { flag = 0; ECHO; }
magic   {
            switch (flag)
            {
            case 'a': printf("first"); break;
            case 'b': printf("second"); break;
            case 'c': printf("third"); break;
            default: ECHO; break;
            }
        }

2) Use start conditions to distinguish different contexts.

In the Lex source program, start conditions can be defined by name. When a start condition is placed in front of a recognition rule, the rule is active only while Lex is in that start condition; otherwise it is as if the rule did not exist. The current start condition can be changed at any time by the user program (that is, by Lex actions).

Start conditions are declared in the auxiliary definition part of the Lex source program. The syntax is

%Start name1 name2 name3 ...

Here, Start can be abbreviated to S or s. The start condition names can be given in any order, and when there are many of them they can be declared on several %Start lines.

To use a start condition in a recognition rule, enclose it in angle brackets to the left of the regular expression:

<name1>expression

To enter a start condition such as name1, use the following statement in an action:

BEGIN name1;

It changes the current start condition of Lex to name1.

To return to the normal state, use the statement

BEGIN 0;

which resets Lex to the initial condition of the Lex interpreter.

A rule can also be active under several start conditions. For example,

<name1,name2,name3>rule

makes rule active under three different start conditions. To make a rule active under all start conditions, put no start condition in front of it.

For example, the problem from 1) can be solved again, this time with start conditions. The Lex source program is as follows:

%Start AA BB CC
%%
^a           { ECHO; BEGIN AA; }
^b           { ECHO; BEGIN BB; }
^c           { ECHO; BEGIN CC; }
\n           { ECHO; BEGIN 0; }
<AA>magic    printf("first");
<BB>magic    printf("second");
<CC>magic    printf("third");
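The behavior expected from this source program can be written out as an explicit state machine in C. This is a hand-written sketch, not the output of Lex; the names start_cond, COND_NONE, COND_AA, COND_BB, COND_CC and replace_magic are hypothetical, and the caller is assumed to supply an output buffer that is large enough.

```c
#include <string.h>

enum start_cond { COND_NONE, COND_AA, COND_BB, COND_CC };

/* Copy in to out, replacing "magic" by "first", "second" or "third" on
 * lines that begin with a, b or c. Assumption: out is large enough. */
void replace_magic(const char *in, char *out)
{
    enum start_cond state = COND_NONE;
    int at_line_start = 1;
    size_t o = 0;

    while (*in != '\0') {
        if (at_line_start && (*in == 'a' || *in == 'b' || *in == 'c')) {
            /* Like ^a, ^b, ^c: echo the letter and enter its start condition. */
            if (*in == 'a')      state = COND_AA;
            else if (*in == 'b') state = COND_BB;
            else                 state = COND_CC;
            out[o++] = *in++;
            at_line_start = 0;
        } else if (*in == '\n') {
            /* Like \n: echo the newline and return to the initial condition. */
            state = COND_NONE;
            out[o++] = *in++;
            at_line_start = 1;
        } else if (state != COND_NONE && strncmp(in, "magic", 5) == 0) {
            /* Like <AA>magic, <BB>magic, <CC>magic. */
            const char *word = state == COND_AA ? "first"
                             : state == COND_BB ? "second" : "third";
            size_t n = strlen(word);
            memcpy(out + o, word, n);
            o += n;
            in += 5;
            at_line_start = 0;
        } else {
            out[o++] = *in++;             /* default: copy the character unchanged */
            at_line_start = 0;
        }
    }
    out[o] = '\0';
}
```

Given the four input lines "a magic", "b magic", "magic" and "c magicmagic", the sketch produces "a first", "b second", "magic" and "c thirdthird".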

1.10 Summary of the Lex source program format

For convenience, the format of the Lex source program and of Lex regular expressions is summarized here.

The general format of the Lex source program is:

{Definitions}

%%

{Rules}

%%

{User subroutines}

The definitions part contains the following items:

1) Auxiliary definitions, in the format:

name translation

2) Code to be copied directly, in the format:

(space) code

3) Code to be copied directly, in the format:

%{
code
%}

4) Start conditions, in the format:

%S name1 name2 ...

There are a few other items, which are skipped here because they are rarely used.

The format of the recognition rule part is

expression action

The expression must be separated from the action by spaces. If the action takes more than one line, enclose it in curly braces.

The regular operators used by Lex are as follows:

x         the character x
"x"       the character x, even if x is an operator
\x        same as above
[xy]      the character x or y
[x-z]     the character x, y, or z
[^x]      any character except x
.         any character except newline
^x        x at the beginning of a line
<y>x      x when Lex is in start condition y
x$        x at the end of a line
x?        an optional x
x*        0 or more occurrences of x
x+        1 or more occurrences of x
x|y       x or y
(x)       x
x/y       x, but only when followed by y
{xx}      the expansion of the auxiliary definition xx
x{m,n}    m to n occurrences of x
