Using LEX and YACC (II)

Source: Internet
Author: User
Tags: character set, error handling, generator

2. Using YACC, the automatic generator of parsers

2.1 YACC Overview

Formal languages have strictly defined grammatical structures, and processing them begins with analyzing that structure. YACC is an automatic generator of parsers. Strictly speaking, Lex is also an automatic generator of parsers for formal languages, but the languages Lex can handle are limited to regular languages. Because the lexical structure of a high-level language can be expressed with regular expressions, Lex serves only as a generator of lexical analyzers. YACC can handle context-free languages that can be described by LALR(1) grammars. We will also see that YACC has some ability to resolve ambiguity in a grammar.

YACC is widely used, mainly for the automatic construction of compilers for programming languages. For example, the Portable C Compiler was written with YACC, and many database query languages are implemented with YACC. For this reason YACC is also called a "compiler compiler".

The workflow of YACC is shown in Figure 2.1.


Figure 2.1 YACC schematic diagram

In Figure 2.1, the "YACC source program" is the syntax description of a language, which the user writes in the BNF-like notation provided by YACC. YACC automatically converts this source program into a parser, yyparse, that uses the LR parsing method. As with Lex, the host language of YACC is C, so yyparse is a C function; the user performs parsing by calling yyparse from the main program.

Parsing must build on lexical analysis, so the generated parser needs a lexical analyzer to work with it. yyparse requires that the lexical analyzer be named yylex. Users can write yylex by hand or generate it with Lex. Since the lexical analyzer that Lex produces is named exactly yylex, Lex and YACC are convenient to use together; this is described in detail in Section 2.5.3. Note that the lexical analyzer can also be included in the YACC source program.
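The interface that yyparse expects from yylex can be sketched in C. The following is a minimal hand-written lexer, not the output of Lex; the token names NUMBER and NAME, their codes, and the helper set_input are assumptions made for illustration only. Each call returns the code of the next token, 0 marks the end of input, and a literal character such as '+' is returned as its own character code.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token codes; YACC would normally generate these itself. */
enum { NUMBER = 257, NAME = 258 };

int yylval;                     /* semantic value of the last token (int by default) */
static const char *input = "";  /* text that still has to be scanned */

/* Illustrative helper: point the lexer at a string to scan. */
void set_input(const char *text) { input = text; }

/* Return the code of the next token, or 0 at end of input. */
int yylex(void)
{
    while (*input == ' ' || *input == '\t')
        input++;                        /* skip blanks */
    if (*input == '\0')
        return 0;                       /* end marker */
    if (isdigit((unsigned char)*input)) {
        char *end;
        yylval = (int)strtol(input, &end, 10);
        input = end;
        return NUMBER;
    }
    if (isalpha((unsigned char)*input)) {
        while (isalnum((unsigned char)*input))
            input++;
        return NAME;
    }
    return (unsigned char)*input++;     /* literal character token */
}
```

A parser generated by YACC would call yylex repeatedly in this way until it receives 0.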

Besides the grammar rules, a YACC source program also contains the semantic actions to be carried out when these rules are recognized; a semantic action is a program fragment written in C. The output of parsing can be a syntax tree, generated target code, or simply information about whether the input string conforms to the grammar. Which kind of output is produced is determined by the semantic actions and by the code in the program section.

The following sections describe how a YACC source program is written and the commands for using YACC on UNIX systems.

2.2 General format of YACC source program

A YACC source program generally consists of three parts: the description section, the grammar rules section, and the program section. They are arranged in the following order and format:

Description section

%%

Grammar rules section

%%

Program section

Of the three parts above, the description section and the program section are optional. When there is no program section, the second %% can also be omitted, but the first %% is required. The composition and writing of each part are described in detail below.

2.3 Writing the description section of a YACC source program

The description section of a YACC source program defines the terminal symbols used in the grammar rules, the data types and variables used in semantic actions, the union type of semantic values, and the precedence of the operators in the grammar rules. These contents are organized as follows:

%{
Header file list
Macro definitions
Data type definitions
Global variable definitions
%}
Start symbol definition
Semantic value type definition
Terminal symbol definition
Operator precedence and associativity definition

2.3.1 Header file list

YACC copies this part directly into the generated C program y.tab.c, so it must be written according to the rules of the C language. The header file list is a series of C #include statements, each starting in the first column of its line. For example:


%{
# include
# include
# include
# include "Header.h"
.
.
.
%}

2.3.2 Macro Definition

This part uses C #define statements to define the macros used in the program. For example:


%{
.
.
.
# define EOF 0
# define max(x, y) ((x) > (y) ? (x) : (y))
.
.
.
%}


2.3.3 Data type definition

This part defines the data types used in the semantic actions or in the program section, for example:


%{
.
.
.
typedef struct interval {
    double lo, hi;
} INTERVAL;
.
.
.
%}


2.3.4 global variable definition

External variables and the global variables used in the YACC source program are defined in this part, for example:


%{
.
.
.
extern int nfg;
double dreg[26];
INTERVAL vreg[26];
.
.
.
%}


Type declarations for functions that do not return int also belong in this part; see Example 2 in Section 2.6.

Again, the contents of the four parts above, which lie between %{ and %}, are copied verbatim by YACC into y.tab.c, so they must fully comply with C syntax. In addition, the delimiters %{ and %} should each occupy a line of their own; that is, it is best not to write:
%{ int x; %}

2.3.5 Start symbol definition

The start symbol of a context-free grammar is a special nonterminal from which all derivations begin. In YACC, the start symbol is defined with the statement:

%start nonterminal ...

Without this declaration, YACC automatically takes the left-hand nonterminal of the first grammar rule in the grammar rules section as the start symbol.

2.3.6 semantic value type definition

The parser yyparse generated by YACC uses the LR parsing method. During parsing it maintains, in addition to a state stack, a semantic value stack that holds the semantic values of the nonterminals and terminal symbols it has analyzed. Some of these semantic values are returned by the lexical analyzer and some are assigned in semantic actions; both are explained in detail when semantic actions are introduced.

If the type of the semantic value is not defined, YACC takes it to be int; that is, any semantic values assigned to grammar symbols must be integers, otherwise type errors occur. Users often want semantic values of a more complex type, however, such as a double-precision floating-point number, a string, or a pointer. In that case the semantic value type can be declared in the description section.

Because different grammar symbols may have semantic values of different types, the semantic value type is declared as a union that includes all the possible types, each with its own member name. The declaration also requires the user to state which union member each grammar symbol (terminal or nonterminal) uses for its semantic value, so that the member name does not have to be written every time the semantic value is accessed. The following example is taken from Example 2 in Section 2.6.

%union {
    int ival;
    double dval;
    INTERVAL vval;
}

%token <ival> DREG VREG
%token <dval> CONST
%type <dval> dexp
%type <vval> vexp
...


In the above definition, the lines starting with %union define the union type of the semantic value; it has three members, named ival, dval, and vval.

Lines starting with %token define terminal symbols (see 2.3.7), so DREG, VREG, and CONST are terminal symbols, and the name in angle brackets is the union member that gives the type of their semantic values. For example, the semantic values of the two terminal symbols DREG and VREG are integers (int), with member name ival.

Lines starting with %type declare the semantic value type of nonterminals. For example, the semantic value of the nonterminal dexp is a double-precision floating-point number. Note that nonterminals in YACC do not otherwise need to be declared, but when the description section defines a semantic value type and the semantic value of a nonterminal will be accessed, the nonterminal must be declared in the way shown above.
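As a rough illustration of what the union declaration means on the C side, the sketch below shows a union with the members ival, dval, and vval and a yylval variable of that type. The type name YYSTYPE follows common YACC practice; the two helper functions are hypothetical and only stand in for the places where a lexical analyzer would store a value.

```c
/* The interval type from 2.3.3, used by one union member. */
typedef struct interval {
    double lo, hi;
} INTERVAL;

/* Roughly what the %union declaration becomes in the generated C code. */
typedef union {
    int      ival;
    double   dval;
    INTERVAL vval;
} YYSTYPE;

YYSTYPE yylval;   /* filled in by the lexical analyzer for each terminal symbol */

/* Hypothetical helper: store the value of a terminal symbol whose
   declared member is ival, such as a register index. */
void set_int_value(int reg_index) { yylval.ival = reg_index; }

/* Hypothetical helper: store the value of a terminal symbol whose
   declared member is dval, such as a floating-point constant. */
void set_double_value(double number) { yylval.dval = number; }
```

Because all members share the same storage, the lexical analyzer and the semantic actions must agree on which member a given symbol uses; that agreement is exactly what the %token and %type declarations record.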

2.3.7 Terminal symbol definition

All terminal symbols that appear in the grammar rules section of a YACC source program (except literal characters) must be defined in this part. A definition looks like this:

%token DIGIT LETTER

Each terminal symbol definition line begins with %token; note that there is no space between % and token. Several terminal symbols can be defined on one line, separated by spaces. A terminal symbol name can consist of letters, digits, and underscores, but must begin with a letter. The rules for nonterminal names are the same. There can be more than one terminal symbol definition line.

YACC requires each terminal symbol to have a unique number. When terminal symbols are defined in the way shown above, the numbers are assigned internally by YACC, starting at 257 and increasing by 1 each time. This rule does not apply to literal character terminal symbols, however. For example, in the following grammar rules, '+' and ';' are literal character terminal symbols:

stats : stats ';' stat ;
expr : expr '+' expr ;

A literal character terminal symbol is enclosed in single quotation marks when it appears in a rule. Such symbols do not need to be defined with a %token statement, and YACC numbers them with the value of the character in its character set (such as ASCII). Note that the semicolon at the end of each of the two grammar rules above is punctuation in the YACC meta-language, not a literal character terminal symbol.
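The numbering scheme just described keeps named terminal symbols and literal characters apart. The C sketch below assumes the codes that the numbering rule would give to DIGIT and LETTER; the helper is_literal_token is hypothetical and not part of YACC.

```c
/* Codes that the numbering rule above would assign to
   "%token DIGIT LETTER": named terminal symbols count upward from 257. */
enum { DIGIT = 257, LETTER = 258 };

/* Illustrative helper: a code below 256 is the character value of a
   literal character terminal symbol such as '+' or ';'. */
int is_literal_token(int code)
{
    return code > 0 && code < 256;
}
```

Since every character value fits below 256, a literal such as '+' can never collide with a named terminal symbol numbered by YACC.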

YACC also allows users to define terminal symbol numbers themselves. In that case the format of the terminal symbol definition is:

%token terminal-name integer

where "terminal-name" is the terminal symbol being defined and "integer" is its number. Each such line defines one terminal symbol. Note that different terminal symbols cannot have the same number. For example:


%token BEGIN 100
%token END 101
%token IF 105
%token THEN 200
...


In 2.3.6 we said that if the user defines a semantic value type, the semantic value type of each terminal symbol that carries a meaningful semantic value is given by a member name of the union. Besides the method described in 2.3.6, the definition of a terminal symbol and the declaration of its semantic value type can be written separately, for example:


%token DREG VREG CONST
%type <ival> DREG VREG
%type <dval> CONST


2.3.8 Operator precedence and associativity definition

Consider the following grammar for expressions:


%token NAME
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | NAME
     ;


This grammar is ambiguous. For example, the sentence a+b-c can be interpreted as (a+b)-c or as a+(b-c). Both interpretations are reasonable, but the grammar does not say which one is meant. Worse, if the sentence

a+b*c

is interpreted as (a+b)*c, the result is semantically wrong.

YACC allows the user to specify the precedence and associativity of operators, which removes the ambiguity of the grammar above. For example, declare that '+' and '-' have the same precedence and are both left-associative, so that a+b-c can only be interpreted as (a+b)-c. Then declare that the precedence of '*' is higher than that of '+' and '-', so that a+b*c is correctly interpreted as a+(b*c). The correct form of the grammar above is therefore:


%token NAME
%left '+' '-'
%left '*'
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | NAME
     ;


In the description section, the lines beginning with %left define the associativity of operators. %left declares the operators that follow it to be left-associative, %right declares them right-associative, and %nonassoc declares them non-associative. Precedence is implicit: in the description section, operators on an earlier line have lower precedence than operators on a later line, and operators on the same line have the same precedence. In the grammar above, therefore, '+' and '-' have the same precedence, which is lower than that of '*', and all three operators are left-associative.
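The effect of these declarations on evaluation order can be checked with a small C program. Note that the sketch uses precedence climbing, a hand-written technique that differs from the LALR(1) tables YACC generates; it only mirrors the same precedence levels and left associativity. The function names and the restriction to single-digit operands without spaces are assumptions made to keep the example short.

```c
#include <ctype.h>

static const char *p;   /* current position in the expression text */

/* Binding strength of a binary operator; 0 means "not an operator".
   Mirrors the two %left lines: '+' and '-' first, then '*'. */
static int precedence(char op)
{
    if (op == '+' || op == '-') return 1;
    if (op == '*') return 2;
    return 0;
}

/* Parse a single-digit operand. */
static int primary(void)
{
    return isdigit((unsigned char)*p) ? *p++ - '0' : 0;
}

/* Parse operators whose precedence is at least min_prec (min_prec >= 1). */
static int expr(int min_prec)
{
    int left = primary();
    while (precedence(*p) >= min_prec) {
        char op = *p++;
        /* Left associativity: the right operand may only contain
           operators that bind strictly tighter than op. */
        int right = expr(precedence(op) + 1);
        if (op == '+') left += right;
        else if (op == '-') left -= right;
        else left *= right;
    }
    return left;
}

/* Evaluate an expression made of single digits and the operators + - *. */
int evaluate(const char *text)
{
    p = text;
    return expr(1);
}
```

With these rules, evaluate("8-3-2") groups as (8-3)-2 and yields 3, and evaluate("2+3*4") groups as 2+(3*4) and yields 14.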

An expression sometimes uses a unary operator that is written with the same symbol as a binary operator; for example, unary minus uses the same symbol '-' as subtraction. The precedence of the unary operator should be higher than that of the corresponding binary operator, and at least equal to that of '*'. This can be specified with YACC's %prec clause, as in the following grammar:


%token NAME
%left '+' '-'
%left '*' '/'
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '-' expr %prec '*'
     | NAME
     ;

In the above grammar, to make the precedence of unary '-' equal to that of '*', we use the clause

%prec '*'

It states that the rightmost operator or terminal symbol in the grammar rule containing the clause has the same precedence as the symbol after %prec. Note that the %prec clause must appear before the semicolon that ends the grammar rule, and that it does not change the precedence of '-' as a binary operator.

Any of the eight kinds of definitions described above may be omitted when it is not needed.

2.4 Writing the grammar rules section of a YACC source program

The grammar rules section is the core of a YACC source program; it defines the grammar of the language to be processed and the semantic actions to be taken. The following describes the writing format of grammar rules, semantic actions, and the specific measures YACC takes to resolve ambiguities and conflicts. Finally, error handling is described.

2.4.1 Writing format of grammar rules

Each grammar rule consists of a left-hand side and a right-hand side, separated by a colon ':', and the rule ends with a semicolon ';'. The format of a grammar rule is therefore as follows:

nonterminal : body ;

or

nonterminal : body

Here nonterminal is a nonterminal symbol, and the right-hand side body is a string of terminal symbols and nonterminals, which may be empty. Here are a few examples:

stat : WHILE bexp DO stat
     ;
stat : IF bexp THEN stat
     ;
stat : /* empty */
     ;

The right-hand side of the third grammar rule above is empty; the text enclosed in /* and */ is a comment. Grammar rules with the same left-hand nonterminal can be written together, separated by a vertical bar '|', with a semicolon after the last rule, for example:

stat : WHILE bexp DO stat
     | IF bexp THEN stat
     | /* empty */
     ;

There are several suggestions for writing the grammar rules section:

1. Use lowercase strings for nonterminals and uppercase strings for terminal symbols.

2. Group productions with the same left-hand side together, as in the previous example.

3. Align the right-hand side of every rule as far as possible, for example starting at the first tab stop. A YACC source program written in this style is clear, readable, and easy to modify and check for errors.

4. If a production (grammar rule) requires recursion, use left recursion wherever possible. For example:

seq : item
    | seq ',' item
    ;

This is because left recursion lets the parser reduce as early as possible, so the state stack does not overflow.
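The difference in stack usage can be made concrete with a simplified simulation in C. The sketch counts how many symbols sit on the stack of a shift-reduce parser while it reads n comma-separated items, once for the left-recursive rules above and once for a right-recursive variant. It is an illustration under the assumption that every item occupies a single stack entry; the function names are hypothetical and no real YACC tables are involved.

```c
/* Largest stack height while reading n comma-separated items with
       seq : item | seq ',' item ;        (left recursion)          */
int max_depth_left_recursive(int n)
{
    int depth = 0, max = 0, i;
    for (i = 0; i < n; i++) {
        if (i > 0)
            depth++;              /* shift ','                       */
        depth++;                  /* shift item                      */
        if (depth > max)
            max = depth;
        depth = 1;                /* reduce at once: stack holds seq */
    }
    return max;
}

/* The same input with
       seq : item | item ',' seq ;        (right recursion)
   Nothing can be reduced until the last item has been shifted.      */
int max_depth_right_recursive(int n)
{
    int depth = 0, max = 0, i;
    for (i = 0; i < n; i++) {
        if (i > 0)
            depth++;              /* shift ','                       */
        depth++;                  /* shift item                      */
        if (depth > max)
            max = depth;
    }
    /* Only now do the reductions start; each one replaces
       "item ',' seq" by "seq" and shrinks the stack by two. */
    while (depth > 1)
        depth -= 2;
    return max;
}
```

For a list of 100 items the left-recursive form never needs more than 3 stack entries, while the right-recursive form needs 199, which is why long lists can overflow a fixed-size stack.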

2.4.2 Semantic actions

When the parser recognizes the right-hand side of a grammar rule in the input, it reduces by that rule; YACC executes the user-supplied semantic action before the reduction. A semantic action may return the semantic value of a grammar symbol, obtain the semantic values of other grammar symbols, or perform other appropriate processing such as building a syntax tree, generating target code, or printing information.

The semantic value of a terminal symbol is returned by the lexical analyzer through the global variable yylval, which YACC defines automatically: if the lexical analyzer assigns a value to yylval when it recognizes a terminal symbol, that value automatically becomes the semantic value of the terminal symbol. When the semantic value type is not int, take care that the type of the value stored in yylval matches the semantic value type of the corresponding terminal symbol.

Semantic actions are written in C, placed after the corresponding grammar rule, and enclosed in curly braces. For example:

A : '(' B ')'
        { hello(1, "abc"); }
  ;
XXX : YYY ZZZ
        { printf("a message\n");
          flag = 25; }
  ;


To access the semantic values of grammar symbols, the user writes pseudo-variables beginning with $ in semantic actions. These pseudo-variables are provided internally by YACC and need not be defined by the user. The pseudo-variable $$ stands for the semantic value of the left-hand nonterminal, and the semantic values of the grammar symbols on the right-hand side of the production are $1, $2, ... in order from left to right. For example, in the following production:

A : B C D
  ;

the semantic value of A is $$, and the semantic values of B, C, and D are $1, $2, and $3 respectively.
To illustrate the role of pseudo-variables, consider the following example. Given the production

expr : '(' expr ')'
     ;

the value of expr on the left-hand side should equal the value of the expr on the right-hand side. The semantic action that expresses this requirement is:
expr : '(' expr ')'
        { $$ = $2; }
     ;
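What the action $$ = $2 does to the semantic value stack can be sketched in C as follows. This is a simplified model, not the code YACC actually generates: the array value_stack, the helper push_value, and the function reduce_parenthesized are all hypothetical names, and semantic values are assumed to be of type int.

```c
#define STACK_SIZE 100

static int value_stack[STACK_SIZE];  /* semantic values of the symbols on the parse stack */
static int top = 0;                  /* number of values currently stored */

/* Push the semantic value of a symbol that has just been shifted. */
void push_value(int value) { value_stack[top++] = value; }

/* Reduce by  expr : '(' expr ')'  with the action  { $$ = $2; }.
   $1, $2, $3 are the top three stack entries, counted from the left. */
int reduce_parenthesized(void)
{
    int *rhs = &value_stack[top - 3];  /* rhs[0] is $1, rhs[1] is $2, rhs[2] is $3 */
    int result = rhs[1];               /* $$ = $2 */
    top -= 3;                          /* pop the right-hand side */
    value_stack[top++] = result;       /* push $$ for the left-hand side */
    return result;
}
```

After pushing the values for '(', an expr with value 42, and ')', the reduction leaves a single entry with the value 42 on the stack.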

If the pseudo-variable $$ is not assigned in the semantic action after a production, YACC automatically sets it to the value of the first grammar symbol on the right-hand side of the production (that is, $1).

It is often necessary to insert semantic actions between the grammar symbols on the right-hand side of a production, so that the parser can carry out these actions as soon as it has recognized part of the rule. Consider the following example:

A : B
        { $$ = 1; }
    C
        { x = $2; y = $3; }
  ;

In this example x ends up with the value 1 and y with the semantic value of the symbol C. Note that the action $$=1 after B does not set the semantic value of the symbol A to 1, because YACC implements the example above as follows:

$ACT : /* empty */
        { $$ = 1; }
     ;
A : B $ACT C
        { x = $2; y = $3; }
  ;
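The same rewriting can be mimicked in plain C to show why x receives 1. In this sketch the embedded action is modelled as a function whose return value fills the second slot of the right-hand side; the struct rhs_values and the functions embedded_action and reduce_A are hypothetical names used only for illustration.

```c
/* One slot for each symbol on the right-hand side of
       A : B $ACT C   { x = $2; y = $3; }                    */
struct rhs_values { int b, act, c; };

int x, y;

/* The embedded action { $$ = 1; } sets the value of $ACT, not of A. */
int embedded_action(void) { return 1; }

/* Hypothetical driver: b_value and c_value stand in for the semantic
   values of B and C delivered by the rest of the parser. */
void reduce_A(int b_value, int c_value)
{
    struct rhs_values v;
    v.b = b_value;               /* $1 */
    v.act = embedded_action();   /* $2: value set by the embedded action */
    v.c = c_value;               /* $3 */
    x = v.act;                   /* x = $2; */
    y = v.c;                     /* y = $3; */
}
```

Calling reduce_A(7, 9) leaves x equal to 1 and y equal to 9, regardless of the value of B.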
