YACC and Lex Quick Start

Introduction to Lex and YACC

Lex and YACC are two very important and powerful tools on UNIX. In fact, if you are proficient with Lex and YACC, their power makes writing compilers for languages such as FORTRAN and C seem almost trivial. Ashish Bansal discusses in detail these two tools for writing your own language and compiler, covering regular expressions, declarations, matching patterns, variables, YACC grammar, and parser code. Finally, he explains how to combine Lex and YACC.

Ashish Bansal, software engineer, Sapient Corporation

November 01, 2000

Lex stands for Lexical Analyzer. YACC stands for Yet Another Compiler Compiler. Let's start with Lex.

Lex

Lex is a tool for generating scanners. A scanner is a program that recognizes lexical patterns in text. These lexical patterns (or regular expressions) are defined in a particular syntax, which we will discuss in a moment.

A regular expression can have an associated action, which may include returning a token. When Lex receives input in the form of a file or text, it attempts to match the text against the regular expressions. It reads the input one character at a time until a pattern is matched. When a pattern is matched, Lex performs the associated action (which may include returning a token). If, on the other hand, no regular expression can be matched, the default rule applies: the unmatched character is copied to the output and scanning continues.

Lex and C are tightly coupled. A .lex file (a Lex file has the extension .lex) is passed through the lex utility, which produces a C output file. This file is then compiled to produce an executable version of the lexical analyzer.

Regular expressions in Lex

A regular expression is a pattern description written in a meta-language. An expression is made up of symbols. Symbols are normally letters and digits, but some other characters have special meanings in Lex. The following two tables define some of the symbols used in Lex and give a few typical examples.

Defining regular expressions in Lex

Character        Meaning
A-Z, 0-9, a-z    Characters and digits that form part of the pattern.
.                Matches any character except \n.
-                Used to specify a range. For example, A-Z means all characters from A to Z.
[ ]              A character set. Matches any one character within the brackets. If the first character is ^, the set is negated. For example, [abC] matches any of a, b, and C.
*                Matches 0 or more occurrences of the preceding pattern.
+                Matches 1 or more occurrences of the preceding pattern.
?                Matches 0 or 1 occurrence of the preceding pattern.
$                Matches the end of a line when it is the last character of the pattern.
{ }              Indicates how many times a pattern may occur. For example, A{1,3} means that A may occur one to three times.
\                Used to escape metacharacters. It removes the special meaning of a character defined in this table, so that the character stands for itself.
^                Negation (as the first character inside [ ]). As the first character of a pattern, it matches the beginning of a line.
|                Logical OR between expressions.
"<some symbols>" The literal meaning of the characters. Metacharacters lose their special meaning inside the quotes.
/                Lookahead. If the pattern contains "/" followed by another expression, only the part before the "/" is matched, and only when the input continues with the part after it. For example, if the input is A01, then A0 is matched by the pattern A0/1.
( )              Groups a series of regular expressions.

Examples of regular expressions

Regular expression   Meaning
joke[rs]             Matches either jokes or joker.
A{1,2}shis+          Matches Ashis, AAshis, Ashiss, AAshiss, and so on: one or two A's, then "shi", then one or more s.
(A[b-e])+            Matches one or more occurrences of A followed by any one character from b to e.

A token declaration in Lex resembles a variable name in C. Each token has an associated expression. (Examples of tokens and expressions are given in the following table.) Using the examples in this table, we will build a word counting program. Our first task is to show how tokens are declared.

Examples of token declarations

Token      Associated expression                 Meaning
number     ([0-9])+                              1 or more digits
chars      [A-Za-z]                              Any letter
blank      " "                                   A space
word       (chars)+                              1 or more chars
variable   (chars)+(number)*(chars)*(number)*

Lex programming

Lex programming can be divided into three steps:

    1. Specify the patterns and their associated actions in a format that Lex can understand.
    2. Run Lex over this file to generate the C code for the scanner.
    3. Compile and link the C code to produce an executable scanner.

Note: If the scanner is part of a parser developed with YACC, only the first two steps are required. For help with this case, read the sections on YACC and on combining Lex with YACC.

Now let's look at the program format that Lex understands. A Lex program is divided into three sections: the first contains the global C and Lex declarations, the second contains the patterns (with their actions written in C), and the third contains supplemental C functions. The third section, for example, usually holds the main() function. The sections are separated by %%. So, returning to the word counting Lex program, let's look at what goes into each section.

Global declarations of C and Lex

In this section we can add C variable declarations. Here we declare an integer variable for the word counting program, to hold the number of words the program has counted. We also make the Lex token declarations.

Declarations for the word count program
       %{
       int wordCount = 0;
       %}
       chars [A-Za-z\_\'\.\"]
       numbers ([0-9])+
       delim [" "\n\t]
       whitespace {delim}+
       words {chars}+
       %%

The double percent sign marks the end of this section and the beginning of the second of the three sections of a Lex program.

Lex pattern matching rules

Let's look at the Lex rules that describe the tokens we want to match. (We will use C to define what to do when a token is matched.) Continuing with our word count program, here are the rules for matching tokens.

Lex rules in the word count program
       {words} { wordCount++; /* increase the word count by one */ }
       {whitespace} { /* do nothing */ }
       {numbers} { /* one may want to add some processing here */ }
       %%

C Code

The third and last section of a Lex program holds C function definitions (and sometimes the main function). Note that this section must include the yywrap() function. Lex provides a set of functions and variables for the user, and yywrap is one of them. Typically, yywrap() is defined as shown in the following example. We will explore it further under Advanced Lex.

C code section of the word count program
       int main()
       {
           yylex(); /* start the analysis */
           printf("No of words: %d\n", wordCount);
           return 0;
       }
       int yywrap()
       {
           return 1;
       }

The previous sections covered the basic elements of Lex programming, which should help you write simple lexical analysis programs. In the Advanced Lex section we will discuss the functions Lex provides so that you can write more complex programs.

Putting it all together

The .lex file is the input to Lex, the scanner generator. It is passed to the lex program as follows:

   $ lex <file name.lex>

This generates the file lex.yy.c, which can be compiled with a C compiler. It can also be used together with a parser to produce an executable, or the Lex library can be included in the link step with the -ll option.

Here are some of Lex's flags:

    • -c indicates C actions and is the default.
    • -t writes the lex.yy.c program to standard output instead of to a file.
    • -v provides a two-line summary of statistics.
    • -n does not print the -v summary.

Advanced Lex

Lex has several functions and variables that provide different information and can be used to build programs that perform complex functions. Some of these variables and functions are listed in the following tables, along with their uses. For an exhaustive list, refer to the Lex or Flex manual (see Resources below).

Lex variables

yyin        Of type FILE*. Points to the current file being parsed by the lexer.
yyout       Of type FILE*. Points to the location where the lexer output is written. By default, yyin and yyout point to standard input and standard output respectively.
yytext      The text of the matched pattern is stored in this variable (char*).
yyleng      Gives the length of the matched pattern.
yylineno    Provides the current line number. (It may not be supported by every lexer.)

Lex functions

yylex()         The function that starts the analysis. It is generated automatically by Lex.
yywrap()        This function is called when the end of the file (or input) is reached. If it returns 1, scanning stops. It can therefore be used to scan multiple files: code written in the third section can point the yyin file pointer (see the table above) at a different file each time and return 0, until all the files have been scanned. Finally, yywrap() returns 1 to indicate the end of scanning.
yyless(int n)   This function can be used to push back all but the first n characters of the token that was read.
yymore()        This function tells the lexer to append the next token to the current token.

This concludes our discussion of Lex. Let's move on to YACC.

YACC

YACC stands for Yet Another Compiler Compiler. The GNU version of YACC is called Bison. It is a tool that translates the grammar of a language into a parser for that language. The grammar is written in Backus-Naur Form (BNF). By convention, a YACC file has the suffix .y. The YACC compiler is invoked from the command line as follows:

       $ yacc <options> <filename ending with .y>

Before we go further, let's review what a grammar is. In the previous section, we saw Lex recognize tokens in an input sequence. If you are looking at a sequence of tokens, you may want to perform an action when a particular sequence occurs. The specification of the valid sequences in this case is called a grammar. The YACC grammar file contains this grammar specification. It also contains what you want to do when a sequence is matched.

To make the concept clearer, let's take English as an example. The set of tokens might be: nouns, verbs, adjectives, and so on. To form a grammatically correct sentence from these tokens, the structure must follow certain rules. A simple sentence might be noun + verb or noun + verb + noun (for example: I care. See Spot run.).

So in our case, the tokens themselves come from the lexer (Lex), and the allowed sequences of those tokens are specified with YACC (this specification of token sequences is the grammar).

Terminal and non-terminal symbols

Terminal symbol: Represents a class of tokens that are syntactically equivalent. There are three types of terminal symbols:

Named tokens: These are defined with the %token identifier. By convention, they are all uppercase.

Character tokens: A character constant written in the same way as in C. For example, '-' is a character token.

Literal string tokens: Written in the same way as a C string constant. For example, "<<" is a string token.

The lexer returns named tokens.

Non-terminal symbol: A symbol that is composed of a group of non-terminal and terminal symbols. By convention, they are all lowercase. In the example that follows, file is a non-terminal symbol and NAME is a terminal symbol.

Building a compiler with YACC involves four steps:

    1. Generate a parser by running YACC on the grammar file.
    2. Specify the grammar:
      • Write the grammar in a .y file (and also specify here the actions to be taken, in C).
      • Write a lexical analyzer to process the input and pass tokens to the parser. This can be done using Lex.
      • Write a function that starts parsing by calling yyparse().
      • Write an error-handling routine (such as yyerror()).
    3. Compile the code produced by YACC along with any other relevant source files.
    4. Link the object files with the appropriate libraries to produce the executable parser.

Writing a grammar with YACC

Like Lex, a YACC program is divided into three sections separated by double percent signs (%%). They are: declarations, grammar rules, and C code. To illustrate the grammar rules, we will parse a file with entries of the form name = age. We assume that the file contains multiple names and ages, separated by whitespace. As we look at each section of a YACC program, we will write the grammar file for our example.

C and YACC declarations

The C declarations may define types and variables used in the actions, as well as macros. Header files can also be included. The YACC declarations section declares the names of the terminal and non-terminal symbols (tokens), and may also describe operator precedence and the data types of the various symbols. The lexer (Lex) generally returns these tokens. All of these tokens must be declared in the YACC declarations.

In the file parsing example we are interested in these tokens: name, equal sign, and age. A name is a value made up entirely of letters. An age is a number. So the declarations section looks like this:

Declarations for the file parsing example
       %{
       typedef char* string;  /* to specify token types as char* */
       #define YYSTYPE string /* a YACC variable which has the value of returned token */
       %}
       %token NAME EQ AGE
       %%

YYSTYPE may look a little strange. But like Lex, YACC has a set of variables and functions that users can use to extend its functionality. YYSTYPE defines the type of yylval (another YACC variable), which is used to copy values from the lexer to the parser. The default type is int. Because strings are copied from the lexer here, the type is redefined as char*. For a detailed discussion of YACC variables, refer to the YACC manual (see Resources).

YACC grammar rules

YACC grammar rules have the following general form:

       result: components { /* action to be taken in C */ }
             ;

In this example, result is the non-terminal symbol that the rule describes. Components are the various terminal and non-terminal symbols put together by the rule. The components can be followed by an action to be performed when that particular sequence is matched. Consider the following example:

       param : NAME EQ NAME {
                   printf("\tName:%s\tValue(name):%s\n", $1, $3); }
             | NAME EQ VALUE {
                   printf("\tName:%s\tValue(value):%s\n", $1, $3); }
             ;

If the sequence NAME EQ NAME in the example above is matched, the action in the corresponding { } brackets is executed. Another useful thing here is the use of $1 and $3, which refer to the values of the tokens NAME and NAME (or VALUE in the second alternative). The lexer returns these values through the YACC variable yylval. The Lex code for the token NAME looks like this:

       char [A-Za-z]
       name {char}+
       %%
       {name} { yylval = strdup(yytext);
                return NAME; }

The rules section of the file parsing example looks like this:

Grammar for file parsing
       file : record file
            | record
            ;
       record : NAME EQ AGE {
                    printf("%s is now %s years old!!!", $1, $3); }
              ;
       %%

Additional C Code

Now let's look at the last section of the grammar file, the additional C code. (This section is optional, if someone wants to skip it.) A function such as main() calls yyparse(), which is the YACC counterpart of Lex's yylex(). In general, YACC also expects you to supply the code for the yyerror(char *msg) function. yyerror(char *msg) is called whenever the parser encounters an error. The error message is passed as a parameter. A simple yyerror(char *) might look like this:

       int yyerror(char* msg)
       {
           printf("Error: %s encountered at line number:%d\n", msg, yylineno);
           return 0;
       }

yylineno provides the line number information.

This section also contains the main function of the file parsing example:

Additional C code
       int main()
       {
           yyparse();
           return 0;
       }
       int yyerror(char* msg)
       {
           printf("Error: %s encountered \n", msg);
           return 0;
       }

To generate the code, you might use the following command:

       $ yacc -d <filename.y>

This generates the output files y.tab.h and y.tab.c, which can be compiled with any standard C compiler on UNIX (such as gcc).

Other common command-line options
    • '-d', '--defines': Write an extra output file containing macro definitions for the token type names defined in the grammar, the semantic value type YYSTYPE, and a few external variable declarations. If the parser output file is named 'name.c', then the '-d' file is named 'name.h'. If you want to put the definition of yylex in a separate source file, you need 'name.h', because yylex must be able to refer to the token type codes and the variable yylval.
    • '-b file-prefix', '--file-prefix=prefix': Specify a prefix to use for all YACC output file names. The names are chosen as if the input file were named 'prefix.c'.
    • '-o outfile', '--output-file=outfile': Specify the name of the parser output file. The other output file names are constructed from outfile, as described under the '-d' option.

The YACC library is usually included automatically in the compile step. However, it can also be included explicitly by specifying the -ly option in the compile step. In that case the compile command line is:

       $ cc <source file names> -ly

Combining Lex with YACC

So far we have discussed Lex and YACC separately. Now let's look at how they are used together.

A program typically calls the yylex() function repeatedly, and each call returns one token. Scanning ends only when the end of the file is reached or an erroneous token is encountered.

A parser generated by YACC calls yylex() to obtain tokens. yylex() can be generated by Lex or written entirely by hand. For a Lex-generated lexer to be used with YACC, a token must be returned every time a pattern is matched. So the general form of a Lex action for a matched pattern is:

       {pattern} { /* do something */
                   return TOKEN_NAME; }

YACC then receives the returned token. When YACC compiles a .y file with the -d option, a header file is generated that contains a #define for each token. If Lex and YACC are used together, this header file must be included in the C declarations section of the corresponding Lex file.

Let's go back to the file parsing example of names and ages and look at the code of the Lex and YACC files.

name.y - grammar file
       %{
       #include <stdio.h>
       typedef char* string;
       #define YYSTYPE string
       int yylex(void);
       int yyerror(char *msg);
       %}
       %token NAME EQ AGE
       %%
       file : record file
            | record
            ;
       record : NAME EQ AGE {
                    printf("%s is %s years old!!!\n", $1, $3); }
              ;
       %%
       int main()
       {
           yyparse();
           return 0;
       }
       int yyerror(char *msg)
       {
           printf("Error encountered: %s \n", msg);
           return 0;
       }

name.lex - Lex file for the parser
       %{
       #define YYSTYPE char*   /* must match the YYSTYPE used in name.y */
       #include "y.tab.h"
       #include <stdio.h>
       #include <string.h>
       extern char* yylval;
       %}
       char [A-Za-z]
       num [0-9]
       eq [=]
       name {char}+
       age {num}+
       %%
       {name} { yylval = strdup(yytext);
                return NAME; }
       {eq} { return EQ; }
       {age} { yylval = strdup(yytext);
               return AGE; }
       %%
       int yywrap()
       {
           return 1;
       }

For reference, here is y.tab.h, the header file generated by YACC.

y.tab.h - header file generated by YACC
       # define NAME 257
       # define EQ 258
       # define AGE 259

This concludes our discussion of Lex and YACC. What language do you want to compile today?

Resources
    • You can refer to the original English text of this article on the developerWorks global site.
    • Lex and Yacc, Levine, Mason and Brown, O'Reilly & Associates, 2nd Ed.
    • Program Development in UNIX, J. T. Shen, Prentice-Hall India.
    • Compilers: Principles, Techniques and Tools, Aho, Sethi and Ullman, Addison-Wesley Pub. Co., 1985, 11.
    • A guide to Lex and YACC and compiler writing.
    • JLex, a Java version of Lex.
    • An example of using Lex and YACC to formalize a grammar.