Use Lex and YACC to compile code, Part 1: Introduction


Most people never need to know what Lex and YACC do. You may need them installed in order to compile and build something you have downloaded, but in most cases using them that way is painless. Occasionally a README will mention the dreaded "shift/reduce" conflict. In any case, these tools remain a valuable part of the UNIX toolkit, and a little understanding of them goes a long way.

In fact, although Lex and YACC are almost always mentioned in the same breath, they can be used independently. There are plenty of interesting small programs built entirely with Lex (see the links in References). Applications that use YACC without Lex are rarer.

Throughout, the names "Lex" and "YACC" also stand for the GNU flex and bison versions of these tools. The code should work with all the major versions, such as MKS YACC. It's one big happy family!

This series has two parts. This first article introduces Lex and YACC in general terms, looking at what they can do and how they do it. The second article shows a real application built with them.

 

Lex and YACC are a matched pair of tools. Lex breaks a file into a stream of tokens, which are roughly analogous to words. YACC takes a stream of tokens and assembles them into higher-level structures, analogous to sentences. YACC is designed to work on the output of Lex, although you can write your own code for that task; likewise, Lex's output is designed to feed some kind of parser.

They are used to read files with a reasonably well-defined structure. For example, you can use Lex and YACC to read the source code of many programming languages, and also to read many data files with sufficiently predictable formats. Lex and YACC can parse reasonably simple and regular grammars; natural languages are beyond their reach, but most computer programming languages are within it.

Lex and YACC are tools for generating programs: their output is code, which must in turn be fed to a compiler. In general, you add your own code that uses the code Lex and/or YACC generate. Some small programs get by without any additional code; in larger and more complex programs, the parser is only a small part.

It is worth looking at each tool in more detail.

A lexical analyzer is not a handy little gadget from a sci-fi show; it is a program that breaks input up into recognized pieces. For example, a simple lexical analyzer might count the words in its input. Lex takes a specification file and builds a corresponding lexical analyzer, written in C.

The best way to understand this is to study an example. Here is a simple Lex program taken from the flex manual:

        int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
main() {
        yylex();
        printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
}

This program has three sections, separated by %% markers. The first and last sections are plain, ordinary C code. The middle section is the interesting one: it consists of a series of rules that Lex translates into the lexical analyzer. Each rule contains a regular expression and the code to run when that expression matches. Any text that matches nothing is simply copied to standard output. So, if your program is trying to parse a language, it is important to make sure all possible input is caught by the lexer; otherwise, whatever is missed is shown to the user as though it were a message.

In fact, the code above is a complete program: if you run it through Lex, compile the result, and run that, it does exactly what it appears to.

In this case, it is easy to see what happens. A newline always matches the first rule; any other character matches the second. Lex tries each rule in turn, matching the longest stream of input it can. If something matches no rule at all, Lex just copies it to standard output; this behavior is usually undesirable. The simple fix is to add, at the very end, a rule that matches anything, which can either do nothing (if you are lazy) or emit some kind of diagnostic. Note that Lex prefers longer matches, even if they come later in the specification. So, given these rules:

u  { printf("You!\n"); }
uu { printf("W!\n"); }

with "uuu" as the input, Lex will match the second rule first, consuming the first two letters, and then match the first rule. However, if something could match either of two rules, the order in the Lex specification determines which is used. Some versions of lex will warn you when a rule can never be matched.
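A catch-all rule of the kind described above, placed last so that every other rule takes priority, might look like this (a sketch; the exact message text is up to you):

```lex
.  { fprintf(stderr, "unrecognized character: %s\n", yytext); }
```

Because Lex prefers longer matches and this rule matches exactly one character, it fires only when nothing else applies.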

What makes Lex practical and interesting is that it can handle much more complicated rules. For example, a rule to recognize a C identifier might look like this:

[a-zA-Z_][0-9a-zA-Z_]* { return IDENTIFIER; }

The syntax used is plain old regular-expression syntax, with a few extensions. One extension is that you can give names to common patterns. Before the first %%, you can define some of them as names:

DIGIT [0-9]
ALPHA [a-zA-Z]
ALNUM [0-9a-zA-Z]
IDENT [0-9a-zA-Z_]

Then, in the rules section, you can refer to them by placing the name in braces:

({ALPHA}|_){IDENT}* { return IDENTIFIER; }

Each rule has corresponding code that is executed when its regular expression matches. The code can do whatever processing is needed, and optionally return a value; the value is used if there is a parser consuming the lexer's output. In the simple line-counting example, no parser is needed. If you want a parser to interpret code in some language, though, you should return something that tells the parser what token you got. You can define the token values yourself, using an enum or a series of #define directives, or you can have YACC generate the list of predefined values for you.

By default, Lex reads from standard input. Pointing it at a different file is easy; reading from a buffer is a little harder. There is no completely standardized way to do it; the simplest is to open a temporary file, write the data into it, and hand that to the lexer. Here is sample code for that task:

/* Needs: <stdio.h>, <stdlib.h>, <string.h>, <unistd.h>, <errno.h> */
int
doparse(char *s) {
        char buf[16];
        int fd;
        if (!s) {
                return 0;
        }
        strcpy(buf, "/tmp/lex.XXXXXX");  /* mkstemp() requires six X's */
        fd = mkstemp(buf);
        if (fd < 0) {
                fprintf(stderr, "couldn't create temporary file: %s\n",
                        strerror(errno));
                return 0;
        }
        unlink(buf);
        write(fd, s, strlen(s));
        lseek(fd, 0, SEEK_SET);
        yyin = fdopen(fd, "r");
        yylex();
        fclose(yyin);
        return 1;
}

This code automatically cleans up the temporary file by unlinking it (leaving it open but deleted). More careful programmers, or programmers not writing for readers with limited space for example code, might also consider honoring the TMPDIR environment variable instead of hard-coding /tmp.

So now you have the input split into a stream of tokens. Next you need some way to recognize higher-level patterns. This is what YACC is for: YACC lets you describe what to do with tokens. A YACC grammar looks something like this:

value:
    VARIABLE
  | NUMBER
  ;
expression:
    value '+' value
  | value '-' value
  ;

This means that an expression can take any of several forms; for instance, a value, a plus sign, and another value could be an expression. The pipe character (|) separates alternatives. The symbols the lexer produces are called terminals (or tokens); the things assembled from them are called non-terminals. So, in this example, NUMBER is a terminal: the lexer produces it. By contrast, value is a non-terminal, created by assembling terminals.

Like Lex files, YACC files are divided into sections by %% markers. Also like a Lex file, a YACC file consists of three parts; the last part is optional, and its content is just plain C code to be incorporated into the generated file.
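A minimal skeleton showing the three sections might look like this; the grammar itself is a placeholder for illustration (a complete program would also need a yylex and a main that calls yyparse):

```yacc
%{
#include <stdio.h>
int yylex(void);
int yyerror(const char *s);
%}
%token NUMBER
%%
input:
    NUMBER { printf("got a number\n"); }
  ;
%%
int yyerror(const char *s) {
    fprintf(stderr, "%s\n", s);
    return 0;
}
```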

YACC recognizes patterns of tokens. For example, as shown above, it can recognize that an expression may consist of a value, a plus or minus sign, and another value. It can also take actions: when the parser reaches a given point in an expression, the code block enclosed in {} at that point is executed. For example, someone might write:

expression:
    value '+' value { printf("Matched a '+' expression.\n"); }

The first section of the YACC file defines the objects the parser will manipulate and produce. In some cases it can be empty, but more often it contains at least a few %token directives. These define the tokens the lexer can return. When YACC is run with the -d flag, it generates a header file defining the constants.

({ALPHA}|_){IDENT}* { return IDENTIFIER; }

So the earlier example, in which the lexer returns IDENTIFIER, might be paired with a YACC grammar containing the line:

%token IDENTIFIER

YACC then creates a header file (named y.tab.h by default) containing a line like the following:

#define IDENTIFIER 257
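The lexer side then includes that header so its rules can return the very same constants; a minimal sketch (assuming the default header name y.tab.h):

```lex
%{
#include "y.tab.h"
%}
%%
[a-zA-Z_][0-9a-zA-Z_]*  { return IDENTIFIER; }
```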

These numbers are outside the range of valid characters; that way, the lexer can return single characters as themselves, or return tokens using the defined values. This can cause problems when moving code from one system to another: it is generally better to re-run Lex and YACC on the new platform than to port the generated code.

By default, the parser generated by YACC starts by trying to parse an instance of the first rule found in the rules section of the file. You can change this with the %start directive to name a different rule, but it usually makes the most sense to put your top-level rule first anyway.

 

The next question is what to do with the components of an expression. The usual answer is to define a data type that can hold the objects YACC will manipulate. This data type is a C union, declared in the first section of the YACC file with the %union declaration. Once tokens are defined, you can give them types. For example, for a toy programming language, you might do this:

%union {
        long value;
}
%token <value> NUMBER
%type <value> expression
 

This says that when the lexer returns NUMBER, the parser may assume that the member named value of the global variable yylval has been given a meaningful value. Your lexer, of course, has to handle this somehow:

[0-9]+  {
        yylval.value = strtol(yytext, 0, 10);
        return NUMBER;
}

YACC lets you refer to the components of an expression by symbolic names. When a non-terminal is parsed, the components that went into it are named $1, $2, and so on; the value it passes back up to a higher-level rule is named $$. For example:

expression: NUMBER '+' NUMBER { $$ = $1 + $3; }

Note that the literal plus sign is $2; it has no meaningful value, but it still occupies a position. There is no need to say "return" or anything else: you just assign to the magic name $$. The %type declaration indicates that the expression non-terminal also uses the value member of the union.

In some cases it is helpful to have multiple types of objects in the %union declaration; then you must make sure that the types you declare with %type and %token are the ones you actually use. If, for example, you declare a token to carry a pointer-typed value, but your lexer assigns to its integer member... things can go badly wrong.

Of course, this leaves one final problem: the top-level expression non-terminal returns its value to no one, so you need some way to handle the values it produces. One approach is to make sure all the work is done along the way, as each rule is matched; another is to build up a single large object (a linked list of entries, for example) and assign a pointer to it to a global variable at the end of the first rule. So, for instance, if you turned the expression parser above into a desk calculator but never acted on the values, you would have a program that parses expressions very carefully and then does nothing with them. Interesting as conceptual art, perhaps, but not very practical.
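The second approach might look like this in the grammar; the result global and the input rule name are assumptions for illustration:

```yacc
%{
long result;    /* filled in when the top-level rule completes */
%}
%%
input:
    expression { result = $1; }
  ;
```

After yyparse returns successfully, the surrounding C code reads result.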

This basic introduction should let you explore Lex and YACC on your own; you might, for example, turn the expression parser into a simple calculator that actually does something with the expressions it parses. With a little study, you will also find Lex and YACC's debugging facilities. In the next installment, we will look at debugging techniques, and build a bigger, more powerful parser that does a real job.

From: http://www.ibm.com/developerworks/cn/linux/l-lexyac.html
