Developing a compiler's Thompson construction in Java: Lexical parsing of regular expressions

Last Update:2016-04-29 Source: Internet

Author: User



    Thompson construction: lexical parsing of regular expressions

Hello everyone, and welcome to Coding Disney. Readers of this blog can visit my NetEase Cloud Classroom course to watch the code being debugged and executed on video:
http://study.163.com/course/courseMain.htm?courseId=1002830012

In the previous section we developed the closure replacement function. This section continues the development of the Thompson construction algorithm. Our goal is to convert a set of regular expressions into an NFA (nondeterministic finite automaton). Whether we look at the regular expressions or at the resulting finite state automaton, the essence of both is to classify input text. For example, take the regular expressions:
{D}+
({D}+|{D}*\.{D}+|{D}+\.{D}*)(e{D}+)?

These expressions determine whether the input is an integer or a floating-point string. The corresponding finite state automaton recognizes integer and floating-point strings in the same way. The two differ in form, but their content is the same.

Recall the simple compiler we started this series with, which converts an arithmetic expression into assembly-like pseudo-code. In the same way, the arithmetic expression and the pseudo-code look different, but both are essentially descriptions of how numbers are calculated. Converting information from one form into another, where the form differs but the content stays the same, is exactly what compilation is. Thus, converting regular expressions into a finite state automaton is itself a compilation process, and the program we are going to implement is in fact a compiler.

As we mentioned earlier, knowing the architecture of a compiler helps you understand how it runs. Basically, the lexical parser reads the content to be compiled and splits it into specific pieces, each of which is classified as a token. The syntax parser reads the input through the lexical parser and, when it encounters a particular token, performs the corresponding operation. The compiler we are developing now follows a similar process. The focus of this section is to develop a lexical parser for regular expressions.

The characteristics of regular expressions:
We analyzed regular expressions in earlier chapters, so here is a brief review. A regular expression is a combination of ordinary characters and special characters. Special characters carry special meanings. In regular expressions, the special characters are:

? { } [ ] ( ) . ^ $ " \

When interpreting a regular expression, an ordinary character matches itself: for example, the character a matches the ASCII character 'a'. When ordinary characters are combined with special characters, they must be interpreted differently: for example, [a-z] matches any lowercase letter. When combined with the escape character, ordinary letters also take on a different meaning, for example:

\b represents backspace, the Backspace key on the keyboard
\n represents a line feed (newline)
\r represents a carriage return (Enter)
\s represents a space
\e represents the ESC key on the keyboard
\DDD where each D is a digit, represents a three-digit octal number
\xDDD represents a three-digit hexadecimal number
\^C a backslash, a caret, and C standing for any letter; it represents the key combination Ctrl plus that letter
\c an escape character followed by any other character c matches that character itself. For example, \. matches the character '.'; without the backslash, . is the wildcard of regular expressions. \* matches the character '*'; without the backslash, * denotes the closure operation. In other words, a special character preceded by the escape character no longer has its special meaning.

There are other special symbols as well. ^ anchors the match at the beginning: ^[a-z] matches any string that begins with a lowercase letter, and [a-z]$ matches any string that ends with a lowercase letter. { marks the beginning of a macro name and } marks its end.

Special characters inside double quotation marks "" lose their special meaning. For example, the regular expression "+?" simply matches the literal characters + and ?.

Lexical parser implementation:
With these rules in hand, we can begin to design the lexical parser. For regular expressions, lexical parsing is relatively simple: we only need to read one character at a time and return the token that corresponds to it. In the code, we give each special character its own token, and all ordinary characters share a single token called L, which stands for Literal, that is, a character constant. In the code, an enum class defines the token values of the special characters:
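The original enum listing did not survive in this copy of the article. As a rough sketch of what such an enum could look like: only L, ANY and EOS are named in the text, and every other constant name and every description string below is a hypothetical choice made for illustration.

```java
final class TokenEnumSketch {
    // Only L, ANY and EOS appear in the article; the remaining names are assumptions.
    enum Token {
        EOS("end of the current regular expression"),
        ANY("'.' wildcard"),
        AT_BOL("'^' beginning-of-line anchor"),
        AT_EOL("'$' end-of-line anchor"),
        CCL_START("'[' start of character class"),
        CCL_END("']' end of character class"),
        OPEN_CURLY("'{' start of macro name"),
        CLOSE_CURLY("'}' end of macro name"),
        OPEN_PAREN("'('"),
        CLOSE_PAREN("')'"),
        CLOSURE("'*' closure"),
        PLUS_CLOSE("'+' one or more"),
        OPTIONAL("'?' zero or one"),
        L("literal character");

        private final String description;

        Token(String description) {
            this.description = description;
        }

        String description() {
            return description;
        }
    }

    public static void main(String[] args) {
        // Print every token together with its human-readable meaning.
        for (Token t : Token.values()) {
            System.out.println(t + ": " + t.description());
        }
    }
}
```

Attaching a description to each constant is only a convenience for printing; a plain list of constants would serve the lexical parser equally well.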

Since the characters we deal with are basically ASCII characters, there are only 128 of them in total. We use an array of length 128 to hold the token for each character. In the code this array is called Tokenmap. When the lexical parser reads a character, it takes the character's ASCII code value and uses it as an index into Tokenmap to obtain the corresponding token. For example, if the input character is ".", its ASCII code value is 46, so tokenmap[46] yields the value ANY of the enum Token.
Let's look at the initialization code for Tokenmap:

At construction time, the parser first assigns the initial value Token.L to every element of Tokenmap, and then sets the proper token for each special character.
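The initialization described above can be sketched roughly as follows. This is a minimal illustration, not the article's code: the camelCase spelling tokenMap, the method name buildTokenMap, and all token names other than L and ANY are assumptions, and the sketch stores enum values directly instead of bytes.

```java
import java.util.Arrays;

final class TokenMapSketch {
    // L and ANY come from the article; the other names are hypothetical.
    enum Token {
        ANY, AT_BOL, AT_EOL, CCL_START, CCL_END, OPEN_CURLY, CLOSE_CURLY,
        OPEN_PAREN, CLOSE_PAREN, CLOSURE, PLUS_CLOSE, OPTIONAL, L
    }

    static final int ASCII_COUNT = 128;

    static Token[] buildTokenMap() {
        Token[] tokenMap = new Token[ASCII_COUNT];
        // Step 1: every character starts out as an ordinary literal.
        Arrays.fill(tokenMap, Token.L);
        // Step 2: overwrite the entries of the special characters.
        tokenMap['.'] = Token.ANY;
        tokenMap['^'] = Token.AT_BOL;
        tokenMap['$'] = Token.AT_EOL;
        tokenMap['['] = Token.CCL_START;
        tokenMap[']'] = Token.CCL_END;
        tokenMap['{'] = Token.OPEN_CURLY;
        tokenMap['}'] = Token.CLOSE_CURLY;
        tokenMap['('] = Token.OPEN_PAREN;
        tokenMap[')'] = Token.CLOSE_PAREN;
        tokenMap['*'] = Token.CLOSURE;
        tokenMap['+'] = Token.PLUS_CLOSE;
        tokenMap['?'] = Token.OPTIONAL;
        return tokenMap;
    }

    public static void main(String[] args) {
        Token[] tokenMap = buildTokenMap();
        System.out.println((int) '.' + " -> " + tokenMap['.']); // 46 -> ANY
        System.out.println((int) 'a' + " -> " + tokenMap['a']); // 97 -> L
    }
}
```

Indexing the array with a char works because Java widens the char to its numeric code, so the lookup costs a single array access per input character.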

The advance interface of the Lexer class reads a character and returns the token for that character:

In the advance function, we first obtain from Exprhandler the regular expression string that has already been preprocessed. The characters are then taken from that string one by one and parsed individually, and charindex holds the index within the string of the character currently being parsed. EOS indicates that the current regular expression string has been parsed completely. The next time advance is entered and the current token is found to be EOS, a new regular expression string is read from Exprhandler.

Continuing further down:

When the character we read is a double quote, we need to record that fact, because every character inside double quotes, ordinary or special, is treated as an ordinary character. If we read the escape character, it has to be handled accordingly. The function that handles the escape character is Handleesc. At the end of the advance function, we check whether the character just parsed is inside double quotes or has been escaped. A character that follows the escape character, or that sits inside double quotes, is treated as an ordinary character. Otherwise, the token for the character is looked up in Tokenmap.
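The flow of advance described above might be sketched as follows. This is a simplified illustration under several assumptions: it works on a single string instead of reading from Exprhandler, it uses a small switch in place of Tokenmap, it treats a backslash as "take the next character literally" rather than calling Handleesc, and it does not process escapes inside quotes. The names AdvanceSketch, tokenize and lexeme are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AdvanceSketch {
    enum Token { EOS, ANY, CLOSURE, PLUS_CLOSE, OPTIONAL, L }

    private final String expr;        // one already preprocessed regular expression
    private int charIndex = 0;        // index of the character being parsed
    private boolean inQuotes = false; // true while between a pair of double quotes
    private char lexeme;              // character value of the most recent token

    AdvanceSketch(String expr) {
        this.expr = expr;
    }

    Token advance() {
        // A double quote only toggles the quote state; it yields no token itself.
        while (charIndex < expr.length() && expr.charAt(charIndex) == '"') {
            inQuotes = !inQuotes;
            charIndex++;
        }
        if (charIndex >= expr.length()) {
            return Token.EOS;
        }
        char c = expr.charAt(charIndex++);
        boolean escaped = false;
        if (c == '\\' && !inQuotes && charIndex < expr.length()) {
            c = expr.charAt(charIndex++); // simplified stand-in for Handleesc
            escaped = true;
        }
        lexeme = c;
        // Quoted or escaped characters are always ordinary literals.
        if (inQuotes || escaped) {
            return Token.L;
        }
        switch (c) {                      // stand-in for the Tokenmap lookup
            case '.': return Token.ANY;
            case '*': return Token.CLOSURE;
            case '+': return Token.PLUS_CLOSE;
            case '?': return Token.OPTIONAL;
            default:  return Token.L;
        }
    }

    static List<String> tokenize(String expr) {
        AdvanceSketch lexer = new AdvanceSketch(expr);
        List<String> out = new ArrayList<>();
        for (Token t = lexer.advance(); t != Token.EOS; t = lexer.advance()) {
            out.add(lexer.lexeme + ":" + t);
        }
        return out;
    }

    public static void main(String[] args) {
        // The regular expression a"+?"\.b* written as a Java string literal.
        System.out.println(tokenize("a\"+?\"\\.b*"));
        // prints [a:L, +:L, ?:L, .:L, b:L, *:CLOSURE]
    }
}
```

In the example, + and ? become literals because they sit inside quotes, the dot becomes a literal because it is escaped, and only the trailing * keeps its special meaning.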

Let's take a look at Handleesc, the function dedicated to the escape character. If the regular expression [\b] is parsed, the character \ and the character b are interpreted together, and the function that interprets them is Handleesc:

You can see that \ and b are interpreted together as '\b', which in the ASCII table is the backspace character, the Backspace key on the keyboard. Continuing further down:

If the expression contains the string \^a then, as mentioned earlier, it represents Ctrl+A on the keyboard. The code comments explain how such characters are converted. The code that follows interprets \xDDD as a three-digit hexadecimal number. In a later debugging demo, we will see in detail how this code works.
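The escape handling described above could look roughly like the sketch below. It is an illustration under assumptions, not the article's Handleesc: the signature, the int[] return value (character value plus the index after the sequence), the helper readNumber, and the decision to accept up to three digits are all choices made here.

```java
final class EscapeSketch {
    /**
     * Interprets the escape sequence whose backslash is at s.charAt(pos).
     * Returns {character value, index of the first character after the sequence}.
     */
    static int[] handleEsc(String s, int pos) {
        int i = pos + 1;                          // skip the backslash
        if (i >= s.length()) {
            return new int[] {'\\', i};           // a trailing backslash stands for itself
        }
        char c = s.charAt(i++);
        switch (c) {
            case 'b': return new int[] {'\b', i}; // backspace, ASCII 8
            case 'n': return new int[] {'\n', i}; // line feed
            case 'r': return new int[] {'\r', i}; // carriage return
            case 's': return new int[] {' ', i};  // space
            case 'e': return new int[] {0x1B, i}; // ESC
            case '^':
                if (i >= s.length()) {
                    return new int[] {'^', i};
                }
                // Ctrl+letter: 'A' is 65 and '@' is 64, so Ctrl+A becomes 1.
                return new int[] {Character.toUpperCase(s.charAt(i)) - '@', i + 1};
            case 'x':
            case 'X':
                return readNumber(s, i, 16);      // \xDDD, hexadecimal
            default:
                if (c >= '0' && c <= '7') {
                    return readNumber(s, i - 1, 8); // \DDD, octal
                }
                return new int[] {c, i};          // any other character matches itself
        }
    }

    // Reads up to three digits of the given radix starting at index start.
    private static int[] readNumber(String s, int start, int radix) {
        int value = 0;
        int i = start;
        for (int count = 0; count < 3 && i < s.length(); count++, i++) {
            int digit = Character.digit(s.charAt(i), radix);
            if (digit < 0) {
                break;                            // not a digit in this radix
            }
            value = value * radix + digit;
        }
        return new int[] {value, i};
    }

    public static void main(String[] args) {
        System.out.println(handleEsc("\\b", 0)[0]);    // 8
        System.out.println(handleEsc("\\^a", 0)[0]);   // 1
        System.out.println(handleEsc("\\101", 0)[0]);  // 65
        System.out.println(handleEsc("\\x041", 0)[0]); // 65
    }
}
```

Returning the next index alongside the value lets the caller resume scanning right after the escape sequence, whatever its length.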

In the main function, calling runlexerexample lets you debug the lexical parser:

In runlexerexample, the lexical parser reads a regular expression from the console, parses its characters one after another, and prints the meaning of each character's token to the console. For example, when the regular expression obtained after macro replacement is [0-9]+, runlexerexample outputs the following results:
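The original output listing is not preserved here. As a stand-in, the following sketch shows the kind of per-character report such a driver could produce for [0-9]+. The output format, the method name describe, and the token names other than L are assumptions, and the expression is passed as a string rather than read from the console.

```java
import java.util.Arrays;

final class RunLexerExampleSketch {
    // Only L is named in the article; the other token names are hypothetical.
    enum Token { CCL_START, CCL_END, DASH, PLUS_CLOSE, L }

    static String describe(String expr) {
        Token[] tokenMap = new Token[128];
        Arrays.fill(tokenMap, Token.L);
        tokenMap['['] = Token.CCL_START;
        tokenMap[']'] = Token.CCL_END;
        tokenMap['-'] = Token.DASH;
        tokenMap['+'] = Token.PLUS_CLOSE;

        StringBuilder out = new StringBuilder();
        for (char c : expr.toCharArray()) {
            // Characters outside the ASCII range are treated as literals.
            Token t = c < 128 ? tokenMap[c] : Token.L;
            out.append(c).append(" -> ").append(t).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("[0-9]+"));
        // [ -> CCL_START
        // 0 -> L
        // - -> DASH
        // 9 -> L
        // ] -> CCL_END
        // + -> PLUS_CLOSE
    }
}
```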

In the next section, we will walk through a debugging demonstration of the code, which will give us a deeper understanding of how the lexical parser is implemented.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
