DIY Development Compilers (iv): Using the DFA Transition Table to Create a Scanner


In the previous article we introduced two kinds of automata models, deterministic finite automata (DFA) and nondeterministic finite automata (NFA), and finally the algorithm that converts a regular expression, via an NFA, into a DFA. Some readers said it was still hard to understand how an NFA is converted into a DFA, so I would like to begin this article with one more example of what happens after the conversion. First, look at the following NFA, which was converted from a set of regular expressions used in lexical analysis. It merges the NFAs of four token types: IF, ID, NUM, and error. Accordingly, its four accepting states represent the four different kinds of words that can be recognized.

Using the method we learned last time, we need to build a DFA in which every state is a subset of the NFA's state set. First we define the ε-closure of any set of states. It is called a closure because it is closed under ε transitions: applying an ε transition to any state inside the ε-closure still yields a state inside the closure. Next, starting from the ε-closure of the initial state, we compute, for each possible input character, the set of NFA states that the automaton can move to. The formula for this step is:
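
    DFAedge(d, c) = ε-closure( ∪ { edge(s, c) : s ∈ d } )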

The union symbol ∪ means: for every state s in the NFA state set d, find the set edge(s, c) of all states that s can reach on the symbol c, then take the union of all these sets; finally, take the ε-closure of that union. Put simply, DFAedge(d, c) is the set of all NFA states reachable from the state set d after reading the symbol c. This set becomes one state of the DFA, reached from d along an edge labeled c. We first take the ε-closure of the NFA's initial state as the DFA's initial state, and then, from every known set of NFA states, repeatedly compute the new state sets reachable on each input character, until no new set of NFA states can be found. This algorithm does test your thinking a little, so I suggest you draw a few simple NFAs and work through the formula above until it becomes clear. Below is the DFA converted from the NFA above, to give everyone an intuitive feel for the conversion.
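
In code, the two operations this algorithm relies on can be sketched as follows. This is a minimal C# illustration under an assumed NFA representation (one transition list per state, with null standing for the ε symbol); it is not the VBF implementation:

    using System;
    using System.Collections.Generic;

    // A minimal NFA representation: for each state, a list of transitions
    // (Symbol, Target), where Symbol == null stands for an ε transition.
    class Nfa
    {
        public List<List<(char? Symbol, int Target)>> Transitions = new();

        // ε-closure(d): every state reachable from d via ε edges only.
        public HashSet<int> EpsilonClosure(IEnumerable<int> states)
        {
            var closure = new HashSet<int>(states);
            var work = new Stack<int>(closure);
            while (work.Count > 0)
            {
                int s = work.Pop();
                foreach (var (symbol, target) in Transitions[s])
                    if (symbol == null && closure.Add(target))
                        work.Push(target);
            }
            return closure;
        }

        // DFAedge(d, c) = ε-closure( ∪ { edge(s, c) : s ∈ d } )
        public HashSet<int> DfaEdge(HashSet<int> d, char c)
        {
            var moved = new HashSet<int>();
            foreach (int s in d)
                foreach (var (symbol, target) in Transitions[s])
                    if (symbol == c)
                        moved.Add(target);
            return EpsilonClosure(moved);
        }
    }

The full conversion starts with EpsilonClosure of the NFA's initial state and keeps applying DfaEdge to newly discovered state sets until no new set appears.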

As you can see, each state of the converted DFA is a set of several original NFA states. Whenever one of the states in such a set is an accepting state of the NFA, we make the corresponding DFA state an accepting state. Note that a set may contain more than one NFA accepting state. For example, the DFA state that accepts IF is the NFA state set {3, 6, 7, 8}, where state 3 is the NFA state that accepts IF and state 8 is the NFA state that accepts ID. Why do we choose to have this DFA state accept IF rather than ID? Because "if" is a keyword and ID is an identifier, we must give the keyword higher priority than the identifier; otherwise the lexer could never recognize the IF keyword at all. In other words, when designing a lexical analyzer we want every reserved keyword to have priority over ID, and this priority is reflected in how the accepting states of the DFA are chosen.

Once the NFA->DFA conversion is complete, the DFA states no longer need to keep the information about the original NFA state sets, and we can abstract the whole DFA into a table in which each row is a DFA state and each column is an input character. Here is the first DFA we introduced in the previous article:

Written in the form of a state transition table, this DFA becomes:

    State   a b c d e f g h i j k l m n o p q r s t u v w x y z  "
      0     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  0
      1     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  2
      2     2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2  3
      3     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  0

This table may differ a little from what you expected. First, it has one more state than the DFA state diagram: state 0, which moves to state 0 on every input character. We call this the stopped state (a dead state). In practice we use state 0 as the stopped state; its meaning is that once the state machine reaches it, it is "dead" and can never leave again. We will see in a moment why we need this stopped state.

Now look at state 1. In this state, when the input character is the quotation mark ("), the machine moves to state 2, which matches the DFA state diagram above; on any of the characters a-z it moves to state 0, the stopped state. This means that in state 1 the characters a-z are undefined and cause the DFA to die. Next is state 2, which should be just what you expected: on a-z it goes back to state 2, and on the quotation mark it goes to state 3. Last is state 3: since state 3 has no outgoing edges, any input character in state 3 stops the machine.

Finally, back to why we need the stopped state: we need it to decide whether a word has been recognized. The steps for lexical analysis using the DFA state transition table are as follows (a C# sketch of this loop follows the list):

    1. Start the DFA in state 1 (not state 0, remember!).
    2. Read one character of the input string and look up, in the table, the next state for that character.
    3. Move to that next state. In addition, if the new state is not 0, remember it in a separate variable.
    4. Keep reading input characters and making transitions until the current state becomes 0 (the stopped state).
    5. Check the state remembered in the separate variable: if it is an accepting state, report that a word has been successfully scanned; if it is not, report a lexical error.
    6. To continue scanning, reset the DFA to state 1 (not 0) and start again from step 2.
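
To make these steps concrete, here is a minimal C# sketch of the driver loop. The two-dimensional table, the accept-state array, and the method name are assumptions for illustration; this is not VBF's actual code.

    using System;

    static class TableScanner
    {
        // Scan one word from input, starting at startPos, following the
        // six steps above. transitions[state, ch] gives the next state;
        // state 0 is the stopped state.
        public static void ScanOneWord(int[,] transitions, bool[] isAccepting,
                                       string input, int startPos)
        {
            int state = 1;       // step 1: start in state 1, not 0
            int lastState = 0;   // step 3: the separate variable
            int pos = startPos;

            while (state != 0 && pos < input.Length)      // step 4
            {
                state = transitions[state, input[pos++]]; // step 2
                if (state != 0)
                    lastState = state;                    // step 3
            }

            if (isAccepting[lastState])                   // step 5
                Console.WriteLine("Scanned a word, accepted in state " + lastState);
            else
                Console.WriteLine("Lexical error");
            // Step 6: reset to state 1 and call again to scan the next word.
        }
    }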

In other words, we always wait until the DFA runs into the stopped state, that is, until it has died, before deciding whether a word was successfully scanned, because we want the lexical analyzer to make the longest match. For example, suppose we are scanning the C# code "string1 = null;". Here string1 is an identifier naming a variable. If the lexical scanner reported the keyword string as soon as it had scanned those six characters, the logic would be wrong. If instead we wait until the DFA reaches the stopped state, we can recognize the longest possible word. Concretely: after the scanner has read string, the DFA is still alive; it then reads the next character, 1, and its state changes from one that accepts the keyword string to one that accepts an identifier; the next character is a space, and since no legal word continues string1 with a space, the DFA moves to the stopped state. Finally we check that the last state before stopping was the one that accepts an identifier, so we report that the identifier string1 was successfully scanned. This is how the longest match is achieved.

In the VBF.Compilers.Scanners library, I use a two-dimensional array to store the DFA state transition table. The file FiniteAutomationEngine.cs contains the code that stores the transition table and performs the state transition operation. On top of it, the Scanner class implements the real lexical analysis logic. If you are interested in the algorithm described above, you can look directly at the implementation of these two classes.

Next we need to consider a very practical question. If the lexical analyzer is to scan strings over the Unicode character set (UTF-16), the DFA transition table can become very large. In fact, to support Chinese characters in comments, strings, or identifiers, the transition table needs more than 40,000 columns, and in the worst case 65,536 columns. A single state then occupies sizeof(int) x 65536 = 256KB of memory. For a language like C#, the DFA can have several hundred states; even Minisharp, the very small C# subset we are building, has 140 states, so the DFA transition table alone would occupy 35MB of memory. Computers today often have 8GB of RAM, but the CPU's L2 and L3 caches are usually only a few MB, and if the transition table cannot fit into the cache, efficiency suffers badly.

Look again at the DFA state transition table listed above and notice that the columns for the characters a-z are all exactly the same: in state 1 they go to state 0, in state 2 to state 2, and in state 3 to state 0. We say that characters whose columns in the transition table are identical belong to the same equivalence class. If the transition table uses one column per equivalence class instead of one column per character, its size shrinks dramatically. We then add a mapping table, indexed by character, that maps any character to its equivalence class in O(1) time. For example, after applying equivalence classes, the DFA shown above becomes:

Equivalence class table:

    Character:  a b c d e f g h i j k l m n o p q r s t u v w x y z  "
    Class:      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  1

Transition table:

    State   class 0   class 1
      0        0         0
      1        0         2
      2        2         3
      3        0         0

In this way, besides the compressed transition table we only need one extra row: the equivalence class table. And because there can be at most 65,536 equivalence classes (one class per character), this table can be declared as a ushort[65536], which occupies only 128KB of memory. After compression, the Minisharp language has only 57 equivalence classes, so its 140 states fit in less than 32KB of memory (140 x 57 x sizeof(int) is about 31KB). That can now be loaded entirely into the CPU's L2 cache, so the goal is achieved perfectly!
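
With this layout, a state transition becomes just two array lookups. A minimal sketch (the names and array types follow the ushort[65536] idea above but are otherwise illustrative, not VBF's internal code):

    // Two-level lookup: character -> equivalence class -> next state.
    // charClass maps every UTF-16 code unit to its equivalence class
    // (e.g. a ushort[65536]); transitions is indexed by [state, class].
    static int NextState(ushort[] charClass, int[,] transitions, int state, char c)
    {
        ushort cls = charClass[c];       // O(1): character to its class
        return transitions[state, cls];  // O(1): class to the next state
    }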

In the VBF.Compilers class library, to make the NFA->DFA algorithm efficient, an equivalence class table is already computed at the NFA stage. Of course, the classes computed at the NFA stage are not as precise as those computed at the DFA stage, so the equivalence class table is computed once more after the conversion to a DFA. This double compression reduces the time needed to convert an NFA covering a large part of the Unicode character set from hours to a few hundred milliseconds.

Next, let's take a brief look at how to use the VBF.Compilers.Scanners library. To write your own lexical analyzer, you need to reference the libraries VBF.Compilers.Common.dll and VBF.Compilers.Scanners.dll. The Common library contains a class that stores compilation errors, and one important class: SourceReader. SourceReader can take any TextReader as input, and while reading it keeps track of the current line and column in the source code; the lexical analyzer therefore relies on this class for source code input. To define a lexical analyzer, you need the most basic class of all: the Lexicon class. It acts like a dictionary that holds the definitions of all the words, and internally it performs the regular expression to DFA conversion. The following code demonstrates the use of the Lexicon class:

    using RE = VBF.Compilers.Scanners.RegularExpression;
    ...

    Lexicon lexicon = new Lexicon();
    LexerState lexer = lexicon.DefaultLexer;

    Token IF = lexer.DefineToken(RE.Literal("if"));
    Token ELSE = lexer.DefineToken(RE.Literal("else"));
    Token ID = lexer.DefineToken(RE.Range('a', 'z').Concat(
        (RE.Range('a', 'z') | RE.Range('0', '9')).Many()));
    Token NUM = lexer.DefineToken(RE.Range('0', '9').Many1());
    Token WHITESPACE = lexer.DefineToken(RE.Symbol(' ').Many());

    ScannerInfo info = lexicon.CreateScannerInfo();

Let's go through this code line by line. First, the Lexicon class is created directly with new, without any parameters. Then comes this line:

    LexerState lexer = lexicon.DefaultLexer;

This line of code calls Lexicon's DefaultLexer property and gets back a LexerState object, which represents the overall state of a lexical analyzer. We use this object to define words by their regular expressions. By default, DefaultLexer is the only LexerState, and you do not have to create any new LexerState objects. But if we need certain lexemes to appear as different token types in different contexts, we can define new LexerStates. For example, the word "get" is usually an identifier, but in the context of a property definition it becomes a keyword; LexerState supports this scenario by allowing derived child states. For the time being, though, we only consider the DefaultLexer case.

Once you have the DefaultLexer, you can define words with its DefineToken method. DefineToken accepts a RegularExpression object as its parameter. The static and instance methods of the RegularExpression class (abbreviated to RE in the code) represent the basic operations of regular expressions plus several commonly used extended operations. The following table lists the common usages of RegularExpression:

    Usage                          Example                      Regular expression
    | operator / Union method      x | y  or  x.Union(y)        x|y
    >> operator / Concat method    x >> y  or  x.Concat(y)      xy
    Many method                    x.Many()                     x*
    Many1 method                   x.Many1()                    x+
    Optional method                x.Optional()                 x?
    Range static method            RE.Range('0', '9')           [0-9]
    CharSet static method          RE.CharSet("abc")            [abc] (union of the characters)
    Literal static method          RE.Literal("abc")            abc (concatenation of the characters)
    Repeat method                  x.Repeat(5)                  xxxxx
    CharsOf static method          RE.CharsOf(c => c == 'a')    [a] (character set defined by a lambda predicate)
    Symbol static method           RE.Symbol('a')               a
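
These operations compose freely. As a small illustration of my own (a hypothetical token, not one of the library's samples), the following builds a regular expression for decimal numbers with an optional fraction part, using only operations from the table:

    // [0-9]+(\.[0-9]+)? expressed with the combinators above.
    var digit = RE.Range('0', '9');
    var number = digit.Many1().Concat(
        (RE.Symbol('.') >> digit.Many1()).Optional());
    Token NUMBER = lexer.DefineToken(number);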

You can study the code above together with this table to learn the various usages of RegularExpression. Note that the order in which tokens are defined determines their priority: whatever is defined earlier has higher priority. To guarantee that reserved keywords take priority, all keywords must be defined before the identifier token ID. After all the words are defined, we call Lexicon's CreateScannerInfo method to obtain a ScannerInfo object. This object contains the converted DFA and the parameters the various lexical analyzer classes need in order to work. Next, we create the Scanner object from the ScannerInfo object; see the following code:

    Scanner scanner = new Scanner(info);

    string source = "asdf04a 1107 else";
    StringReader sr = new StringReader(source);
    scanner.SetSource(new SourceReader(sr));
    scanner.SetSkipTokens(WHITESPACE.Index);

    Lexeme l1 = scanner.Read();
    Console.WriteLine(l1.TokenIndex); // equals ID.Index
    Console.WriteLine(l1.Value);      // "asdf04a"

    Lexeme l2 = scanner.Read();
    Console.WriteLine(l2.TokenIndex); // equals NUM.Index
    Console.WriteLine(l2.Value);      // "1107"

    Lexeme l3 = scanner.Read();
    Console.WriteLine(l3.TokenIndex); // equals ELSE.Index
    Console.WriteLine(l3.Value);      // "else"

    Lexeme l4 = scanner.Read();
    Console.WriteLine(l4.TokenIndex); // equals info.EndOfStreamTokenIndex
    Console.WriteLine(l4.Value);      // null

When creating the Scanner object, you pass in the ScannerInfo object built in the previous step, and then you can set the source code input. Here we use a StringReader to read a string as the source. Note the Scanner's SetSkipTokens method, which sets the words the scanner skips automatically: because we do not want the lexical analyzer to return lexemes made of whitespace, we configure it to skip the whitespace token. When operating the Scanner class, all token-related operations are done through Token.Index (an integer), because Scanner internally stores only each token's index within the Lexicon, which reduces memory usage and improves efficiency.

Once everything is ready, you can call the scanner.Read() method to perform lexical analysis! Each call to scanner.Read() returns the next lexeme (a Lexeme object). From the lexeme's properties we can get its token type (again in the form of a Token.Index integer), its string representation (the Value property), and rich information about its position in the source code. When the scanner reaches the end of the file or string, Read() returns a special lexeme that represents the end of the stream. The token index of this special lexeme can be queried on the ScannerInfo object (each ScannerInfo has its own EndOfStreamTokenIndex). You can try running the code above and modifying the lexical definitions or the source string to observe the various behaviors of the Scanner class. In addition, the VBF.Compilers.Scanners library provides two scanners with special capabilities, PeekableScanner and ForkableScanner, which we will use in future chapters.
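
For example, a simple loop that scans to the end of the input, using only the members shown above, could look like this sketch:

    // Read lexemes until the special end-of-stream lexeme appears.
    Lexeme lexeme = scanner.Read();
    while (lexeme.TokenIndex != info.EndOfStreamTokenIndex)
    {
        Console.WriteLine("{0}: {1}", lexeme.TokenIndex, lexeme.Value);
        lexeme = scanner.Read();
    }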

So far, we have fully discussed the techniques required for lexical analysis and the support VBF provides for them. In the next article we will work out the lexical definition of the Minisharp language and truly implement Minisharp's lexical analyzer. There you will also learn how to build regular expressions that support Chinese identifiers and comments. Stay tuned!

Also, don't forget to follow my VBF project at Https://github.com/Ninputer/VBF and my Weibo at Http://weibo.com/ninputer. Thank you for your support!
