Build Your Own Compiler (II): Regular Languages and Regular Expressions

Last Update:2015-03-01 Source: Internet

Author: User

Starting with this article, we formally lift the veil on the compiler. The first module we meet is the lexical analyzer, also known as the lexical scanner; in code I usually just call it the scanner. In the previous article I briefly explained why lexical analysis is separated out as its own stage, and today we revisit that question. Take a look at the following C# code:

string str = "Hello world";

Even without syntax highlighting, this code can clearly be divided into several parts. First comes the keyword string, then the variable name str, then the equals sign =, then the string literal "Hello world", and finally the semicolon. The source code of modern languages such as C# can be cleanly broken into fragments that each have a definite meaning; we call these fragments lexemes. Compared with the grammar needed to describe the entire C# language, the different kinds of lexemes can be described by relatively simple rules. For example, the lexemes in the code above can be described in plain language as follows:

Type	Rule	Example
Keyword string	Exactly the letters s-t-r-i-n-g, in that order	string
Identifier (variable name)	Starts with a letter, followed by zero or more letters or digits, and must not clash with a keyword	str
Equals sign	A single = symbol	=
String literal	Starts with a double quote, contains any characters that are not double quotes, and ends with a double quote	"Hello world"
Semicolon	A single ; symbol	;

We can see that lexemes fall into a few types according to their characteristics, and the next stage, syntax analysis, can work directly on the type of each lexeme, which we call a token. (The term is translated in several ways, such as "word", "mark", or "symbol"; this article uses "token" throughout.) So with relatively simple rules we can break a source file, which is originally just a string of characters, into a stream of tokens, and that greatly simplifies the syntax analysis that follows. This is the fundamental reason for making lexical analysis a separate module.
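To make the idea of a token stream concrete, here is a small sketch of what a scanner could hand to the parser for the statement above. The names TokenType, Token and TokenStreamExample are assumptions made up for this illustration; nothing here is taken from VBF.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical names for illustration; the article has not defined any token types yet.
public enum TokenType { Keyword, Identifier, EqualsSign, StringLiteral, Semicolon }

public sealed class Token
{
    public readonly TokenType Type;
    public readonly string Lexeme;
    public Token(TokenType type, string lexeme) { Type = type; Lexeme = lexeme; }
}

public static class TokenStreamExample
{
    // The stream a scanner would produce for: string str = "Hello world";
    public static List<Token> SampleStatement()
    {
        return new List<Token>
        {
            new Token(TokenType.Keyword, "string"),
            new Token(TokenType.Identifier, "str"),
            new Token(TokenType.EqualsSign, "="),
            new Token(TokenType.StringLiteral, "\"Hello world\""),
            new Token(TokenType.Semicolon, ";"),
        };
    }

    // The parser mostly looks at the sequence of types, not at the exact characters.
    public static List<TokenType> TypesOf(IEnumerable<Token> tokens)
    {
        return tokens.Select(t => t.Type).ToList();
    }
}
```

Calling TokenStreamExample.TypesOf(TokenStreamExample.SampleStatement()) yields the five types in source order, which is all the next stage needs in order to recognize a variable declaration.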

However, the rules in the table above are written in plain language. We want to describe them in a formal language so that a computer can process them automatically, and regular expressions are an ideal choice.

Most of you have probably met regular expressions in day-to-day programming, using them to match strings and so on, and may already know their syntax. This time, though, I want to reintroduce regular expressions from their most basic concepts so that you understand them more deeply. First we need to redefine the concept of a "language". A language is a set of strings whose characters are drawn from a finite character set. In other words, a language is always defined over a finite set of characters, but the language itself may be either a finite or an infinite set. For example, "the C# language" is the set of all strings that satisfy the C# grammar, which is obviously an infinite set. You can also define very simple languages: the language {a} has only one member, the string a. From here on, curly braces {} always denote a set of strings. A regular expression is a notation for describing a particular class of languages. Regular expressions have two basic elements:

The expression ε denotes the language that contains only the zero-length string, which you can think of as {String.Empty}. We usually write String.Empty as ε, pronounced "epsilon".
For any character a in the character set, the expression a denotes the language that contains only the one-character string a, that is, {a}.

Regular expressions also define three basic operations:

The union of two regular expressions, written x|y, denotes the union of the language of x and the language of y. For example, the language of a|b is {a, b}. This is similar to addition.
The concatenation of two regular expressions, written xy, denotes the language formed by appending each string in the language of y to each string in the language of x and collecting all the results. For example, if x = a|b and y = c|d, then the language of xy is {ac, ad, bc, bd}: x denotes {a, b}, y denotes {c, d}, and concatenating each string of x with each string of y gives four results. This is similar to multiplication.
The Kleene closure of a regular expression, written x*, denotes the union of x concatenated with itself zero times, once, twice, and so on without limit. That is, x* = ε | x | xx | xxx | xxxx | ... For example, the regular expression a* denotes the infinite language {ε, a, aa, aaa, aaaa, ...}. This amounts to repeating a language any number of times.
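The three operations can be sketched directly on sets of strings. This is only an illustration, assuming a language is modeled as a HashSet of strings; LanguageOps and its method names are made up here, and the closure is cut off after a fixed number of repetitions because the real x* is infinite.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of VBF: a language is modeled as a set of strings.
public static class LanguageOps
{
    // x|y: every string that is in either language.
    public static HashSet<string> Union(IEnumerable<string> x, IEnumerable<string> y)
    {
        var result = new HashSet<string>(x);
        result.UnionWith(y);
        return result;
    }

    // xy: every string of x followed by every string of y.
    public static HashSet<string> Concat(IEnumerable<string> x, IEnumerable<string> y)
    {
        var ys = y.ToList();
        var result = new HashSet<string>();
        foreach (var a in x)
            foreach (var b in ys)
                result.Add(a + b);
        return result;
    }

    // x*, truncated to at most maxRepeats copies of x.
    public static HashSet<string> StarUpTo(IEnumerable<string> x, int maxRepeats)
    {
        var xs = x.ToList();
        var result = new HashSet<string> { "" };    // zero copies: the empty string ε
        var current = new HashSet<string> { "" };
        for (int i = 0; i < maxRepeats; i++)
        {
            current = Concat(current, xs);           // one more copy of x appended
            result.UnionWith(current);
        }
        return result;
    }
}
```

With x = Union({a}, {b}) and y = Union({c}, {d}), Concat(x, y) contains exactly ac, ad, bc and bd, and StarUpTo({a}, 3) contains the empty string, a, aa and aaa.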

When these three operations are written together, the Kleene closure has higher precedence than concatenation, and concatenation has higher precedence than union. For example, a|bc* is read as a|(b(c*)). Those are all the rules of regular expressions! Not hard to understand, is it? Now let's use regular expressions to describe the rules for the lexemes we saw earlier.

First, the keyword string. We described it as "exactly the letters s-t-r-i-n-g, in that order". As a regular expression, that is the concatenation of the letters s, t, r, i, n and g, so the regular expression is simply string. You may find this example rather boring, so let's look at the next one: the identifier.

In plain language an identifier "starts with a letter, followed by zero or more letters or digits". "Starts with a letter" means the first character can be any letter from a to z, which is the union operation in regular expressions: a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z. Writing every regular expression like this would drive anyone crazy, so we introduce square brackets: the characters written inside square brackets are combined with the union operation. For example, [abc] means a|b|c. We also abbreviate the 26 letters a through z as a-z, so "starts with a letter" becomes the regular expression [a-z]. Next comes the second part, "followed by zero or more letters or digits". "Zero or more" translates into the Kleene closure, so I trust you can write the result yourself: [a-z0-9]*. Finally, the two parts are joined by concatenation, so the regular expression that describes the "language" of identifiers is [a-z][a-z0-9]*. The * operation also means that "identifier" is an infinite language: there are infinitely many possible identifiers. That's all there is to it. Easy to understand, right?
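As a quick sanity check of the identifier rule, the sketch below tries it against a few strings. It uses .NET's built-in Regex class purely for convenience, not the scanner this series goes on to build, and the class name IdentifierRule is an assumption made up for the example. Note that the rule as written says nothing about keywords, so "string" itself also passes.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class IdentifierRule
{
    // \A and \z anchor the pattern to the whole input, so the entire string must be an identifier.
    private static readonly Regex Identifier = new Regex(@"\A[a-z][a-z0-9]*\z");

    public static bool IsIdentifier(string text)
    {
        return Identifier.IsMatch(text);
    }
}
```

IsIdentifier accepts str and x1, and rejects 1x (starts with a digit), the empty string (the first letter is mandatory) and Str (the rule only lists lowercase letters).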

As the examples show, any regular expression can be built from the two basic elements and the three basic operations. But to actually write the rules for lexemes, we need some convenient auxiliary syntax, like the square brackets above. We define the following extended operations for regular expressions:

Square brackets denote the union of the characters inside them. [abc] is equivalent to a|b|c.
Square brackets that begin with the ^ character denote the union of all characters in the character set except those listed in the brackets. [^ab] denotes the union of every character other than a and b.
The dot . denotes the union of all characters in the character set. So the expression .* denotes every string that can be formed from the character set.
x? denotes x|ε, that is, a choice between x and the empty string.
x+ denotes xx*, which requires x to be repeated at least once.
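The equivalences above can be spot-checked by comparing two patterns on a handful of sample strings. This is a sketch that leans on .NET's Regex class for convenience; ExtendedSyntaxCheck and AgreeOn are names invented for the example, and agreement on samples is only evidence, not a proof that two expressions denote the same language.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ExtendedSyntaxCheck
{
    // True when both patterns accept or reject every sample identically (whole-string match).
    public static bool AgreeOn(string patternA, string patternB, params string[] samples)
    {
        var a = new Regex(@"\A(?:" + patternA + @")\z");
        var b = new Regex(@"\A(?:" + patternB + @")\z");
        return samples.All(s => a.IsMatch(s) == b.IsMatch(s));
    }
}
```

For example, AgreeOn("x+", "xx*", "", "x", "xxx", "xy") compares the shorthand with its expansion, while AgreeOn("x+", "x*", "") shows that the two really differ on the empty string.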

Anyone who has used regular expressions should be familiar with these operations. In fact, the regular expressions in .NET provide much more extended syntax, but we are not using .NET's regular expressions this time, so the rest of that syntax is not listed here.

We call every language that can be described by a regular expression a regular language. Unfortunately, not all languages are regular. C#, like practically every programming language, as well as HTML, XML, JSON and so on, is not a regular language, for instance because these languages allow arbitrarily deep nesting, which regular expressions cannot describe. So you cannot use regular expressions to define the complete rules of those languages. Regular expressions are, however, very well suited to defining the rules of lexical analysis: the lexemes of most programming languages can be described by simple regular expressions. Here are the regular expressions for the lexemes shown earlier.

Type	Regular expression	Example
Keyword string	string	string
Identifier (variable name)	[a-z][a-z0-9]*	str
Equals sign	=	=
String literal	"[^"]*"	"Hello world"
Semicolon	;	;
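To see the table at work, here is a rough sketch that drives a scanner straight from those five rules. It relies on .NET's Regex class as a stand-in for the machinery this series will build, and TableScanner, Rule and Scan are names invented for the example. It also assumes two conventions the table does not state: at each position the longest match wins, with ties going to the earlier rule (which is how the keyword beats the identifier), and whitespace between tokens is skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch only: TableScanner and its rule names are made up for this example.
public static class TableScanner
{
    // One rule per table row. Order only matters to break ties between equally long matches.
    private static readonly (string Type, Regex Pattern)[] Rules =
    {
        Rule("Keyword", "string"),
        Rule("Identifier", "[a-z][a-z0-9]*"),
        Rule("Equals", "="),
        Rule("StringLiteral", "\"[^\"]*\""),
        Rule("Semicolon", ";"),
    };

    private static (string Type, Regex Pattern) Rule(string type, string pattern)
    {
        // \G pins the match to the start position passed to Regex.Match.
        return (type, new Regex(@"\G(?:" + pattern + ")"));
    }

    public static List<(string Type, string Text)> Scan(string source)
    {
        var tokens = new List<(string Type, string Text)>();
        int pos = 0;
        while (pos < source.Length)
        {
            if (char.IsWhiteSpace(source[pos])) { pos++; continue; }   // skip blanks between tokens

            string bestType = null;
            string bestText = "";
            foreach (var rule in Rules)
            {
                Match m = rule.Pattern.Match(source, pos);
                // Longest match wins; on a tie the earlier rule is kept.
                if (m.Success && m.Length > bestText.Length)
                {
                    bestType = rule.Type;
                    bestText = m.Value;
                }
            }
            if (bestType == null)
                throw new FormatException("No rule matches at position " + pos);

            tokens.Add((bestType, bestText));
            pos += bestText.Length;
        }
        return tokens;
    }
}
```

Scanning the sample statement string str = "Hello world"; gives five tokens, one per table row, while an input such as strings comes back as a single Identifier because the identifier rule matches more characters than the keyword rule.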

The regular expressions above are written as strings, the form we are all familiar with. But this time we are going to process regular expressions ourselves, and writing them as strings obviously makes that harder, because the regular expression string itself has to be parsed first. So in the lexical analysis library of VBF.Compilers I introduced a way of representing regular expressions as objects. I defined a RegularExpression base class and then wrote a subclass for each regular expression element or operation.

Among these classes, AlternationExpression is the union operation, ConcatenationExpression is the concatenation operation, EmptyExpression of course represents the empty string ε, KleeneStarExpression represents the Kleene closure operation (so now you know the Kleene closure is also called the Kleene star), and SymbolExpression represents a single character. A SymbolExpression stores the character it represents, while an AlternationExpression stores two RegularExpression instances, one for each operand of the union. In this way any regular expression can be represented as a tree of RegularExpression objects. For example, the regular expression (a|b)* can be expressed as:

RegularExpression re = new KleeneStarExpression(new AlternationExpression(new SymbolExpression('a'), new SymbolExpression('b')));

A bit like LINQ to XML, isn't it? It is somewhat longer than the string form (audience: it's much longer...), but we no longer have to parse a string to get at its structure, which helps a great deal in the next step. All right, I admit it is too long to write, so I defined some static helper methods and operator overloads. The regular expression above can then be written as:

var re = (RE.Symbol('a') | RE.Symbol('b')).Many();

Here RE is an alias declared with using RE = VBF.Compilers.Scanners.RegularExpression; It is still longer than the string form, but bear with it for the convenience of not having to parse strings. Once we have covered syntax analysis, I will walk you through developing a parser for regular expression strings.
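To give a feel for the object representation without depending on VBF itself, here is a self-contained sketch of such a class tree, including a Symbol helper, a | operator and a Many method in the spirit of the shorthand above. Every name in it (Rx, SymbolRx and so on) is a hypothetical stand-in rather than VBF's actual class, and the Matches method is a naive walk over possible end positions that I added only so the tree can be tried out; it is not the automaton-based approach this series builds next.

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch only: Rx and its subclasses are hypothetical stand-ins, not VBF's actual classes.
public abstract class Rx
{
    // Every position at which a match that begins at 'start' can end.
    public abstract IEnumerable<int> Ends(string s, int start);

    // The whole string is in the language if some match ends exactly at its end.
    public bool Matches(string s) { return Ends(s, 0).Contains(s.Length); }

    // Helpers that shorten tree construction.
    public static Rx Symbol(char c) { return new SymbolRx(c); }
    public static Rx operator |(Rx x, Rx y) { return new AlternationRx(x, y); }
    public Rx Many() { return new KleeneStarRx(this); }
}

public sealed class EmptyRx : Rx            // ε
{
    public override IEnumerable<int> Ends(string s, int start) { yield return start; }
}

public sealed class SymbolRx : Rx           // a single character
{
    private readonly char _c;
    public SymbolRx(char c) { _c = c; }
    public override IEnumerable<int> Ends(string s, int start)
    {
        if (start < s.Length && s[start] == _c) yield return start + 1;
    }
}

public sealed class AlternationRx : Rx      // x|y
{
    private readonly Rx _x, _y;
    public AlternationRx(Rx x, Rx y) { _x = x; _y = y; }
    public override IEnumerable<int> Ends(string s, int start)
    {
        return _x.Ends(s, start).Concat(_y.Ends(s, start));
    }
}

public sealed class ConcatenationRx : Rx    // xy
{
    private readonly Rx _x, _y;
    public ConcatenationRx(Rx x, Rx y) { _x = x; _y = y; }
    public override IEnumerable<int> Ends(string s, int start)
    {
        return _x.Ends(s, start).SelectMany(mid => _y.Ends(s, mid));
    }
}

public sealed class KleeneStarRx : Rx       // x*
{
    private readonly Rx _x;
    public KleeneStarRx(Rx x) { _x = x; }
    public override IEnumerable<int> Ends(string s, int start)
    {
        var seen = new HashSet<int> { start };          // zero repetitions
        var work = new Queue<int>(seen);
        while (work.Count > 0)
        {
            int pos = work.Dequeue();
            foreach (int next in _x.Ends(s, pos))
                if (seen.Add(next)) work.Enqueue(next); // expand each position only once
        }
        return seen;
    }
}
```

With these definitions, (Rx.Symbol('a') | Rx.Symbol('b')).Many() builds the same tree as new KleeneStarRx(new AlternationRx(new SymbolRx('a'), new SymbolRx('b'))), and calling Matches on either accepts abba and rejects abc.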

The next question is how to use regular expression rules for lexical analysis. Regular expressions help us state the rules for tokens, but they cannot be used to analyze a string directly. For that we will introduce the concept of the finite automaton, which does the real work of processing the input string. Stay tuned for the next article.

And don't forget to follow the VBF project: https://github.com/Ninputer/VBF and my Weibo: http://weibo.com/ninputer Thank you for your support!
