Language Service Series for VSX Development (6): Lex and YACC

Source: Internet
Author: User

 

Preface

Once we know how to build our own language service, the remaining problem is the real work of scanning and parsing. Every good language service depends on a good scanner and a good parser. There are many ways to write scanners and parsers, and I think Lex and YACC are among the most commonly used; otherwise Microsoft would not have implemented C#-based versions of Lex and YACC (I am referring to mplex and mppg). So in this article we will learn about Lex and YACC. Part of this article is taken from: http://www.ibm.com/developerworks/cn/linux/sdk/lex/

 

Introduction to Lex and YACC

Lex stands for Lexical Analyzer, and YACC stands for Yet Another Compiler Compiler. They are tools rather than languages. Lex and YACC are two very important and powerful tools on Unix. In fact, once you are proficient with Lex and YACC, their power makes building compilers for FORTRAN or C feel like child's play. On Linux, flex replaces Lex and bison replaces YACC. In language service (LS) development, mplex and mppg are used. The source code we write for these tools is called "Lex and YACC style code".

 

Lex Overview

Lex is a tool for generating scanners. When the generated scanner receives file or text input, it tries to match the text against regular expressions. It reads the input one character at a time until it finds a matching pattern. When a pattern matches, Lex executes the associated action (which may return a token).
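The match-then-act loop described above can be sketched in Python. This is only an illustration under assumptions, not how Lex or mplex is implemented: the rule set (digits, lowercase letters, uppercase letters, anything else) and the names RULES and scan are hypothetical, and ties are resolved by taking the longest match and, for equal lengths, the earliest rule, which is the usual Lex convention.

```python
import re

# (pattern, token name) pairs in rule order; None means "match but return nothing".
# The rules and token names are assumptions made for this illustration.
RULES = [
    (re.compile(r"[0-9]+"), "NUMBER"),
    (re.compile(r"[a-z]+"), "LOWLETTER"),
    (re.compile(r"[A-Z]+"), "CAPLETTER"),
    (re.compile(r"."), None),
]


def scan(text):
    """Yield (token, lexeme) pairs: longest match wins, earliest rule wins ties."""
    pos = 0
    while pos < len(text):
        best_match, best_token = None, None
        for regex, token in RULES:
            m = regex.match(text, pos)
            if m and (best_match is None or m.end() > best_match.end()):
                best_match, best_token = m, token
        if best_match is None:
            # No rule applies (for example a newline, which "." does not match).
            pos += 1
            continue
        if best_token is not None:
            yield best_token, best_match.group()
        pos = best_match.end()


print(list(scan("abc 123 XYZ")))
```

Here the spaces are consumed by the last rule, which has no token attached, so only the three runs of letters and digits are reported.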

Regular Expressions in Lex

A regular expression is a pattern description written in a metalanguage. An expression is built from symbols. Symbols are usually letters and digits, but some other symbols have special meanings in Lex, as they do in other regular expression dialects. The following two tables define the symbols used in Lex and give a few typical examples.

Defining Regular Expressions in Lex

Character           Description
A-Z, 0-9, a-z       Letters and digits that make up part of a pattern.
.                   Matches any character except \n.
-                   Specifies a range. For example, A-Z means all characters from A to Z.
[]                  A character class. Matches any one character inside the brackets. If the first character is ^, the class is negated. For example, [abC] matches a, b, or C.
*                   Matches zero or more occurrences of the preceding pattern.
+                   Matches one or more occurrences of the preceding pattern.
?                   Matches zero or one occurrence of the preceding pattern.
$                   Matches the end of a line when it is the last character of the pattern.
{}                  Specifies how many times a pattern may occur. For example, A{1,3} means that A may occur one to three times.
\                   Escapes metacharacters. It overrides the special meaning a character has in this table so that only its literal meaning is used.
^                   Negation, when it is the first character inside [].
|                   Logical OR between expressions.
"<some symbols>"    The literal meaning of the characters. Metacharacters inside the quotes lose their special meaning.
/                   Trailing context (lookahead). If "/" in the pattern is followed by another expression, only the part before "/" is matched, and only when the part after it follows in the input. For example, for the input A01, the pattern A0/1 matches A0.
()                  Groups a series of regular expressions.
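Most of these symbols behave the same way in Python's re module, so a few rows of the table can be checked directly. The sketch below is an approximation: Python has no Lex-style quoted literals or "/" operator, so the trailing-context row is imitated with a lookahead (?=...), and the helper names matches and trailing_context are made up for this example.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def trailing_context(text: str):
    """Imitate the Lex pattern A0/1: consume A0 only if 1 follows."""
    m = re.match(r"A0(?=1)", text)
    return m.group() if m else None


print(matches(r"[abC]", "b"))      # character class
print(matches(r"[^abC]", "b"))     # negated class rejects b
print(matches(r"A{1,3}", "AAA"))   # one to three occurrences
print(matches(r"A{1,3}", "AAAA"))  # four is too many
print(trailing_context("A01"))     # only the part before the lookahead is consumed
```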

 

Examples of Regular Expressions

Regular expression    Description
joke[rs]              Matches jokes or joker.
A{1,2}shis+           Matches Ashis, AAshis, Ashiss, and so on: one or two A, then shi, then one or more s.
(A[b-e])+             Matches one or more repetitions of A followed by a single character from b to e, for example Ab or AcAd.
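The example patterns can be tried out with Python's re module, which treats these particular patterns the same way Lex does. In this small sketch the patterns are written without spaces (joke[rs], A{1,2}shis+, (A[b-e])+), and the helper name accepts is hypothetical. Note that the trailing + in the second pattern requires at least one final s, so Ashi on its own is rejected.

```python
import re


def accepts(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Candidate inputs for each example pattern.
SAMPLES = {
    r"joke[rs]": ["jokes", "joker", "joke"],
    r"A{1,2}shis+": ["Ashis", "AAshis", "Ashiss", "Ashi"],
    r"(A[b-e])+": ["Ab", "AcAd", "A", "Af"],
}

for pattern, texts in SAMPLES.items():
    for text in texts:
        print(f"{pattern:14} {text:8} {accepts(pattern, text)}")
```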

Token declarations in Lex resemble variable names in C. Each token has an associated expression. (The following table gives examples of tokens and their expressions.)

Token Declaration Examples

Token       Associated expression                    Description
number      ([0-9])+                                 One or more digits
chars       [A-Za-z]                                 Any letter
blank       " "                                      A space
word        (chars)+                                 One or more chars
variable    (chars)+(number)*(chars)*(number)*

 
Lex Programming

Now let's look at the program format that Lex understands. A Lex program is divided into three sections: the first holds the global C and Lex declarations, the second holds the patterns (together with their C code), and the third holds supplementary C functions. The sections are separated by %%. With that in mind, let's look again at our earlier Lex code.

 
    %using Babel;
    %using Babel.Parser;

    %namespace Babel.Lexer

    %%

    [0-9]+    { return (int)Tokens.NUMBER; }
    [a-z]+    { return (int)Tokens.LOWLETTER; }
    [A-Z]+    { return (int)Tokens.CAPLETTER; }
    .         ;

    %%

    /* .... */

The code is divided into three parts by two %% separators. In the first part, the using and namespace declarations take the place of C header file declarations. The second part consists of regular expressions paired with C# code, and specifies the action to execute when a pattern matches. The last part is supplementary C# code, which is empty here. We can also rewrite the code using token declarations:

    %using Babel;
    %using Babel.Parser;

    %namespace Babel.Lexer

    number          [0-9]
    lowletter       [a-z]
    capletter       [A-Z]
    anycharacter    [.\n]

    %%

    {number}+         { return (int)Tokens.NUMBER; }
    {lowletter}+      { return (int)Tokens.LOWLETTER; }
    {capletter}+      { return (int)Tokens.CAPLETTER; }
    {anycharacter}    ;

    %%

    /* .... */

As you can see, I used something like variables, declaring a few patterns in the first section. These "variables" are called token declarations. In the second section, you reference them with {}. In addition, in the declaration part of the first section you can use %{ and %} to delimit a block of C# code; mplex copies this code to the target file without parsing or modifying it. The first section can also declare states for the scanner to use. In advanced applications, using states is almost unavoidable.
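One way to picture token declarations is as textual substitution: each {name} in a rule is replaced by the pattern declared for that name. The Python sketch below assumes flex-like behavior, where the substituted pattern is wrapped in parentheses; mplex's exact expansion rules are not covered here, and the names DEFINITIONS and expand are hypothetical.

```python
import re

# Named patterns, as they would appear in the first section of the Lex file.
DEFINITIONS = {
    "number": "[0-9]",
    "lowletter": "[a-z]",
    "capletter": "[A-Z]",
}


def expand(pattern: str) -> str:
    """Replace each {name} reference with its parenthesised definition."""
    def substitute(match):
        name = match.group(1)
        if name in DEFINITIONS:
            return "(?:" + DEFINITIONS[name] + ")"
        return match.group(0)  # unknown name: leave the text untouched

    # Names start with a letter, so repeat counts such as {1,3} are not affected.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)


print(expand("{number}+"))  # prints (?:[0-9])+
```

Wrapping the definition in a group keeps a following operator such as + applied to the whole definition rather than to its last character.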

 

YACC Overview

YACC is a tool that translates the complete grammar of any programming language into a parser for that language. The grammar is written in BNF (Backus-Naur Form). By convention, YACC files have a .y suffix, and mppg also recognizes the .y suffix.

Related Concepts

Grammar

Before going further, consider what a grammar is. In the previous section we saw that Lex recognizes tokens in an input sequence. When you look at a sequence of tokens, you may want to perform an action whenever a particular sequence appears. The specification of the valid sequences is called a grammar. A YACC grammar file contains this specification, along with what you want to do when a sequence is matched. To make the concept concrete, take English as an example. The set of tokens might be nouns, verbs, adjectives, and so on. To build a grammatically correct sentence from these tokens, its structure must follow certain rules. A simple sentence might be noun + verb, or noun + verb + noun (for example, "I care." or "See Spot run."). So in our case, the tokens themselves come from the lexer (Lex), and the allowed token sequences are specified with YACC (this specification of token sequences is the grammar).

Terminal Symbols

A terminal symbol represents a class of equivalent tokens in the grammar. There are three kinds of terminal symbols:

Named tokens: these are defined with the %token identifier. By convention, they are all uppercase.

Character tokens: these are written the same way as C character constants. For example, '-' is a character token.

String tokens: these are written the same way as C string constants. For example, "<" is a string token.

Lex returns named tokens.

Nonterminal Symbols

A nonterminal symbol is a symbol made up of a group of nonterminal and terminal symbols. By convention, nonterminals are all lowercase.

YACC Programming

Like a Lex program, a YACC program is divided into three sections by double percent signs (%%): declarations, grammar rules, and C code.

Let's review our earlier code:

    %using Microsoft.VisualStudio.TextManager.Interop
    %namespace Babel.Parser
    %valuetype LexValue
    %partial

    /* %expect 5 */

    %union {
        public string str;
    }

    %{
        ErrorHandler handler = null;
        public void SetHandler(ErrorHandler hdlr) { handler = hdlr; }
        internal void CallHdlr(string msg, LexLocation val)
        {
            handler.AddError(msg, val.sLin, val.sCol, val.eCol - val.sCol);
        }
    %}

    %token NUMBER
    %token CAPLETTER
    %token LOWLETTER

    %%

    program
        : declarations
        ;

    declarations
        : declarations declaration
        | declaration
        ;

    declaration
        : NUMBER
        | CAPLETTER
        | LOWLETTER
        ;

    %%

In the code above, the declaration section contains a lot that we will not explain for now. Look only at the lines of the form %token NUMBER. These are also called token declarations; the difference is that the token declarations here are shared by Lex and YACC. Any token returned by Lex must be declared here. Note that the tokens returned in the Lex code are exactly the ones declared here. The second part is the grammar. Whether or not you understand Backus-Naur Form, you should be able to sense a "tree". At the top is program, which grows many branches that branch again. The final leaves are usually the tokens defined in the first part.

    declaration: NUMBER | CAPLETTER | LOWLETTER;

This rule says that a declaration can be a NUMBER, a CAPLETTER, or a LOWLETTER. Here NUMBER, CAPLETTER, and LOWLETTER are terminals, and declaration is a nonterminal. As you can see, a nonterminal can contain terminals or other nonterminals. In addition, just as in the second part of a Lex file, the second part here can define actions to run when a sequence is matched. No actions are defined here. In a real application, features such as brace matching and error checking are implemented by defining actions in this section.
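To make the "tree" concrete, here is a hand-written Python recognizer for the three rules above. It is only a sketch: parsers generated by YACC-style tools are table-driven and work bottom-up, whereas this function simply walks the token list once. Token names are given as uppercase strings, the nonterminals are written in lowercase, and the function name parse_program is made up for the example.

```python
TERMINALS = {"NUMBER", "CAPLETTER", "LOWLETTER"}


def parse_program(tokens):
    """Return a nested-tuple parse tree for the token list, or raise SyntaxError."""
    if not tokens:
        raise SyntaxError("a program needs at least one declaration")
    tree = None
    for position, token in enumerate(tokens):
        if token not in TERMINALS:
            raise SyntaxError(f"unexpected token {token!r} at position {position}")
        declaration = ("declaration", token)
        # declarations: declarations declaration | declaration  (left recursive)
        if tree is None:
            tree = ("declarations", declaration)
        else:
            tree = ("declarations", tree, declaration)
    return ("program", tree)


print(parse_program(["NUMBER", "LOWLETTER"]))
```

Because the rule for declarations is left recursive, each additional token wraps the tree built so far, so the first declaration ends up deepest in the tree.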

 

Combining Lex and YACC

The key to combining the two is %token. In parser.cs, the output file generated from parser.y, the tokens defined this way are compiled into the Tokens enumeration:

 
    public enum Tokens { error = 1, EOF = 2, NUMBER = 3, CAPLETTER = 4, LOWLETTER = 5 };

The enumeration values 1 and 2 belong to the two default tokens; our custom tokens start from 3. The error token is commonly used in default error handling. We can also use it as a terminal in the "grammar tree" to make our language service more robust and to provide special error handling.
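The shared numbering can be mimicked in Python to see how an integer returned by the scanner maps back to a token name. This is an illustrative sketch: the enum copies the values listed above (with the custom tokens written in uppercase), and the helper describe is hypothetical, not part of mppg's output.

```python
from enum import IntEnum


class Tokens(IntEnum):
    error = 1       # default token used for error handling
    EOF = 2         # default end-of-input token
    NUMBER = 3      # custom tokens start here
    CAPLETTER = 4
    LOWLETTER = 5


def describe(code: int) -> str:
    """Map an integer returned by the scanner back to its token name."""
    token = Tokens(code)  # raises ValueError for codes that were never declared
    kind = "custom" if token > Tokens.EOF else "default"
    return f"{token.name} ({kind})"


print(describe(3))
print(describe(1))
```

A code that was never declared raises ValueError here, which mirrors the rule that every token the scanner returns must be declared with %token.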

I strongly recommend using only named tokens when developing a language service, rather than character tokens and string tokens.

Lex and YACC cooperate in other places as well, which readers will come to understand later.

 

Summary

This article introduced Lex and YACC and, using the code from the earlier example, described the basic patterns and syntax of Lex and YACC programming. Microsoft's mplex and mppg implement almost all of Lex and YACC, with a few small extensions. The corresponding documentation can be found under C:\Program Files\Microsoft Visual Studio 2008 SDK\VisualStudioIntegration\ExtraDocumentation. I am trying to translate these documents and will post the translations in a later installment. They are required reading for a good understanding of mplex and mppg. In addition, I suggest that beginners first practice with C-based Lex and YACC using some reference materials, and start language service development only after learning Lex and YACC. My blog post "Building a Lex and YACC Debugging Environment in Visual Studio 2008" can help you set up a programming environment. Readers can also try reading the Lex and YACC files of ManagedMyC; some of the constructs there are worth studying.

Other Lex and YACC resources:

http://www.ibm.com/developerworks/cn/linux/sdk/lex/

Xiong Chunlei: Lex and YACC from Entry to Mastery

O'Reilly: Lex and YACC, 2nd edition
