Building Your Own Compiler (2): Regular Languages and Regular Expressions


Starting with this article, we begin to unveil the inner workings of the compiler. The first module we meet is the lexical analyzer, also known as the lexical scanner; in code I usually just call it the Scanner. Yesterday I explained why lexical analysis should be separated out. Let's revisit that question today. Consider the following line of C# code:

 

 string str = "Hello World";

 

Even without syntax highlighting, this code clearly divides into several parts: the keyword string, the variable name str, the equal sign =, the string literal "Hello World", and finally a semicolon. In modern languages such as C#, source code can be split cleanly into meaningful fragments like these, called lexemes. Compared with the grammar needed to describe the whole C# language, the rules that describe each kind of lexeme are simple. In plain words, the lexemes in the code above can be described as follows:

 

Type | Rule | Example
--- | --- | ---
String keyword | The letters s-t-r-i-n-g, in that order | string
Identifier (variable name) | Starts with a letter, followed by zero or more letters or digits; must not collide with a keyword | str
Equal sign | One = symbol | =
String literal | Starts with a double quotation mark, contains any characters that are not double quotation marks, and ends with a double quotation mark | "Hello World"
Semicolon | One ; symbol | ;

 

As we can see, lexemes fall into a small number of types according to their shape, and the next phase, syntax analysis, can take the type of each lexeme, which we call a token, directly as its input. With relatively simple rules, a source file that starts out as a raw string of characters can be broken down into a stream of tokens, which greatly simplifies the syntax analysis that follows. This is the fundamental reason for making lexical analysis a separate module.

 

However, the rules in the table above are written in plain words. We would like to describe them in a formal language so that a computer can process them automatically. Regular expressions are an ideal choice.

 

In everyday programming we are used to matching strings with regular expressions. Still, I want to reintroduce them from their most basic concepts to give you a deeper understanding. First, we need to redefine the word "language". A language is a set of strings whose characters are drawn from a finite character set. In other words, a language is always defined over a finite character set, but the language itself may be a finite or an infinite set. For example, "the C# language" is the set of all strings that conform to the C# grammar, which is obviously an infinite set. We can also define very simple languages: the language {a} has only one member, the single-character string a. We use braces {} to denote a set of strings. A regular expression is a special kind of expression used to describe a certain class of languages. Regular expressions have two basic elements:

  1. The expression ε denotes the language that contains only the zero-length string. You can think of it as {String.Empty}. We usually write String.Empty as ε, read "epsilon".
  2. For any character a in the character set, the expression a denotes the language that contains only the one-character string a, that is, {a}.

Regular expressions also define three basic operations:

  1. The union of two regular expressions, written X | Y, denotes the union of the language of X and the language of Y. For example, a | b yields the language {a, b}. This is analogous to addition.
  2. The concatenation of two regular expressions, written XY, joins each string in the language of X with each string in the language of Y; all the resulting strings form the new language. For example, if X = a | b and Y = c | d, then XY denotes {ac, bc, ad, bd}: X denotes {a, b}, Y denotes {c, d}, and joining each string of X with each string of Y gives four results. This is analogous to multiplication.
  3. The Kleene closure of a regular expression, written X*, concatenates zero, one, two, ... any number of copies of X and takes the union of all the results. That is, X* = ε | X | XX | XXX | .... For example, the regular expression a* denotes the infinite language {ε, a, aa, aaa, aaaa, ...}. It amounts to repeating a language any number of times.
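Because a language is just a set of strings, the three operations can be tried out directly on sets. The following is a minimal sketch under stated assumptions: the class Lang and its methods are hypothetical names made up for illustration (they are not part of VBF.Compilers), and the closure is cut off after a fixed number of repetitions because the real X* is infinite.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of VBF.Compilers.
public static class Lang
{
    // X | Y: set union of the two languages.
    public static HashSet<string> Union(IEnumerable<string> x, IEnumerable<string> y)
        => new HashSet<string>(x.Concat(y));

    // XY: every string of X followed by every string of Y.
    public static HashSet<string> Concatenate(IEnumerable<string> x, IEnumerable<string> y)
        => new HashSet<string>(x.SelectMany(a => y.Select(b => a + b)));

    // X* truncated to at most maxRepeats copies of X,
    // i.e. ε | X | XX | ... up to maxRepeats factors.
    public static HashSet<string> Star(IEnumerable<string> x, int maxRepeats)
    {
        var xs = x.ToList();
        var result = new HashSet<string> { "" };   // zero copies: ε
        var power = new HashSet<string> { "" };    // X repeated 0 times
        for (int i = 0; i < maxRepeats; i++)
        {
            power = Concatenate(power, xs);        // X repeated i + 1 times
            result.UnionWith(power);
        }
        return result;
    }
}
```

With X = Lang.Union(new[] { "a" }, new[] { "b" }) and Y = Lang.Union(new[] { "c" }, new[] { "d" }), Lang.Concatenate(X, Y) contains ac, ad, bc and bd, and Lang.Star(new[] { "a" }, 3) contains the empty string, a, aa and aaa.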

When the three operations are written together, closure binds more tightly than concatenation, and concatenation binds more tightly than union. Those are all the rules of regular expressions! Not hard to understand, right? Next, let's use regular expressions to describe the rule for each kind of lexeme.

 

First comes the keyword string, which we described above as "the letters s-t-r-i-n-g, in that order". As a regular expression, that is the concatenation of the letters s, t, r, i, n, g, so the regular expression is simply string. You may find this example a little boring.

Let's look at the next one: the identifier. In plain words it is "starts with a letter, followed by zero or more letters or digits". "Starts with a letter" means it may start with any letter from a to z, which is a union in regular expression terms: a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z. Writing every regular expression this way would be maddening, so we introduce a square bracket syntax that denotes the union of the characters inside it. For example, [abc] means a | b | c, and the range a through z is abbreviated as a-z, so "starts with a letter" becomes the regular expression [a-z]. Next, the "zero or more" in "followed by zero or more letters or digits" translates into a Kleene closure, which I am sure everyone can write: [a-z0-9]*. Finally, the two parts are joined by concatenation, so the regular expression describing the identifier "language" is [a-z][a-z0-9]*. The * also tells us that "identifier" is an infinite language: there are infinitely many possible identifiers, which is exactly what we expect.

 

As the examples above show, any regular expression can be built from the two basic elements and the three basic operations. To describe real lexical rules, however, we need some convenient auxiliary syntax, like the square brackets above. We define the following extended operations:

  1. Square brackets denote the union of the characters inside them. [abc] is equal to a | b | c.
  2. Square brackets that start with a ^ character exclude the listed characters and denote the union of all remaining characters in the character set. [^ab] is the union of all characters except a and b.
  3. The dot . denotes the union of all characters in the character set. The expression .* therefore denotes every string that can be formed from the character set.
  4. X? means X | ε: either X or the empty string.
  5. X+ means XX*: X must be repeated at least once.
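The extended operations are only shorthand for the basic ones, and this can be spot-checked with a few sample strings. The sketch below uses the .NET System.Text.RegularExpressions library purely as a convenient checker; ExtensionSyntaxCheck and AgreeOn are hypothetical names, and agreement on a handful of samples is of course an illustration, not a proof.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ExtensionSyntaxCheck
{
    // True when both patterns accept exactly the same strings among the samples.
    public static bool AgreeOn(IEnumerable<string> samples, string left, string right)
    {
        // \A ... \z force a match of the whole string, not just a substring.
        var l = new Regex(@"\A(?:" + left + @")\z");
        var r = new Regex(@"\A(?:" + right + @")\z");
        return samples.All(s => l.IsMatch(s) == r.IsMatch(s));
    }
}
```

For samples such as "", "a", "b", "c", "d", "x", "xx" and "xxx", AgreeOn reports that [abc] agrees with a|b|c, x? with x| (x or the empty string), and x+ with xx*, while x+ and x* disagree because only x* accepts the empty string.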

Anyone who has used regular expressions will be familiar with these operations. The regular expressions in .NET provide many more extensions, but since we are not using the .NET regular expression library this time, we will not list the rest.

 

Every language that can be expressed by a regular expression is called a regular language. Unfortunately, not all languages are regular. C#, and indeed practically every programming language, as well as HTML, XML and JSON, are not regular languages, so regular expressions cannot be used to define the full grammar of such languages. They are, however, very well suited to defining lexical rules: the tokens of most programming languages can be described with simple regular expressions. Here are the regular expression definitions of the tokens from the table above.

 

Type | Regular expression | Example
--- | --- | ---
String keyword | string | string
Identifier (variable name) | [a-z][a-z0-9]* | str
Equal sign | = | =
String literal | "[^"]*" | "Hello World"
Semicolon | ; | ;
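To see these five rules split the sample line into tokens, here is a rough sketch that runs them with the .NET Regex class. This is only an illustration and deliberately not the approach of the series, which builds its own scanner. TinyScanner, the token type names and the whitespace rule are assumptions added for the example. At each position the longest match wins, and on a tie the earlier rule wins, which is how the keyword string takes priority over the identifier rule.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative only: uses System.Text.RegularExpressions rather than the
// VBF.Compilers scanner; TinyScanner is a hypothetical name.
public static class TinyScanner
{
    // Rules in priority order; \G anchors each match at the start position.
    private static readonly (string Type, Regex Pattern)[] Rules =
    {
        ("StringKeyword", new Regex(@"\Gstring")),
        ("Identifier",    new Regex(@"\G[a-z][a-z0-9]*")),
        ("EqualSign",     new Regex(@"\G=")),
        ("StringLiteral", new Regex("\\G\"[^\"]*\"")),
        ("Semicolon",     new Regex(@"\G;")),
        ("WhiteSpace",    new Regex(@"\G\s+")),  // assumption: not in the table, skipped
    };

    public static List<(string Type, string Lexeme)> Scan(string source)
    {
        var tokens = new List<(string Type, string Lexeme)>();
        int pos = 0;
        while (pos < source.Length)
        {
            string bestType = null;
            int bestLength = 0;
            foreach (var (type, pattern) in Rules)
            {
                Match m = pattern.Match(source, pos);
                // Longest match wins; on a tie the earlier rule is kept.
                if (m.Success && m.Length > bestLength)
                {
                    bestType = type;
                    bestLength = m.Length;
                }
            }
            if (bestType == null)
                throw new FormatException($"Unexpected character at position {pos}");
            if (bestType != "WhiteSpace")
                tokens.Add((bestType, source.Substring(pos, bestLength)));
            pos += bestLength;
        }
        return tokens;
    }
}
```

Calling TinyScanner.Scan("string str = \"Hello World\";") yields five tokens in order: the string keyword, the identifier str, the equal sign, the string literal and the semicolon.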

 

The regular expressions we are used to are written as strings. This time, however, we have to process regular expressions ourselves, and the string form clearly makes that harder, since we would first have to parse the regular expression string. So in the lexical analysis library of VBF.Compilers I introduced a way to express regular expressions as objects. I defined a RegularExpression base class and wrote a subclass for each regular expression element or operation.

Here, AlternationExpression is the union operation and ConcatenationExpression is the concatenation operation. EmptyExpression represents the empty string ε, KleeneStarExpression represents the Kleene closure (so now you know that the Kleene closure is also called the Kleene star; it is written as a star, after all), and SymbolExpression represents a single character. SymbolExpression stores the character it represents, while AlternationExpression stores two RegularExpression instances, one for each side of the union. Any regular expression can therefore be represented as a tree of RegularExpression objects. For example, the regular expression (a|b)* can be expressed as:

 

 RegularExpression re = new KleeneStarExpression(
     new AlternationExpression(
         new SymbolExpression('a'),
         new SymbolExpression('b')));

 

Doesn't it look a bit like LINQ to XML? It is a little longer than the string form (audience: a lot longer...), but we no longer need to parse a string to recover the structure, which helps a great deal in the next step. All right, I admit that writing everything out this long is unpleasant, so I defined some static helper methods and operator overloads. The regular expression above can then be written like this:

 

var re = (RE.Symbol('a') | RE.Symbol('b')).Many();

 

Here, RE is an alias declared with the statement using RE = VBF.Compilers.Scanners.RegularExpression;. It is still longer than the string form, but that is tolerable given that no string parsing is needed. Once we have covered syntax analysis, I will show you how to write a parser for regular expression strings.
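To make the idea of an object tree with operator overloads concrete, here is a minimal sketch. It is not the actual VBF.Compilers source: the class names, Symbol and Many follow the article, while the constructors, the properties, the Concat and Empty helpers and the ToString output are assumptions made for illustration.

```csharp
using System;

// A minimal sketch, NOT the actual VBF.Compilers source.
public abstract class RegularExpression
{
    public static RegularExpression Symbol(char c) => new SymbolExpression(c);
    public static RegularExpression Empty() => new EmptyExpression();

    // X | Y builds an alternation (union) node.
    public static RegularExpression operator |(RegularExpression left, RegularExpression right)
        => new AlternationExpression(left, right);

    // Concat is a hypothetical name for the concatenation helper.
    public RegularExpression Concat(RegularExpression next)
        => new ConcatenationExpression(this, next);

    // X.Many() builds a Kleene star node.
    public RegularExpression Many() => new KleeneStarExpression(this);
}

public sealed class EmptyExpression : RegularExpression
{
    public override string ToString() => "ε";
}

public sealed class SymbolExpression : RegularExpression
{
    public char Character { get; }
    public SymbolExpression(char c) { Character = c; }
    public override string ToString() => Character.ToString();
}

public sealed class AlternationExpression : RegularExpression
{
    public RegularExpression Left { get; }
    public RegularExpression Right { get; }
    public AlternationExpression(RegularExpression left, RegularExpression right)
    { Left = left; Right = right; }
    public override string ToString() => $"({Left}|{Right})";
}

public sealed class ConcatenationExpression : RegularExpression
{
    public RegularExpression Left { get; }
    public RegularExpression Right { get; }
    public ConcatenationExpression(RegularExpression left, RegularExpression right)
    { Left = left; Right = right; }
    public override string ToString() => $"({Left}{Right})";
}

public sealed class KleeneStarExpression : RegularExpression
{
    public RegularExpression Inner { get; }
    public KleeneStarExpression(RegularExpression inner) { Inner = inner; }
    public override string ToString() => $"{Inner}*";
}
```

With these definitions, (RegularExpression.Symbol('a') | RegularExpression.Symbol('b')).Many() builds a KleeneStarExpression whose Inner is an AlternationExpression, and its ToString() gives (a|b)*. The structure is available directly from the objects, with no string parsing involved.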

 

The next question is how to actually use the rules expressed as regular expressions to perform lexical analysis. Regular expressions help us understand the rules for tokens, but they cannot be used directly to process an input string. For that we need to introduce the concept of finite automata. Stay tuned for the next article.

In the meantime, don't forget to follow the VBF project at https://github.com/Ninputer/VBF and my microblog at http://weibo.com/ninputer. Thank you for your support!
