Last time we talked about using regular expressions to describe the rules for words in lexical analysis. Regular expressions are easy to understand, but they cannot be used directly to parse strings. We also need a model that is suitable for conversion into a computer program. The model introduced today is called a finite automaton (FA), also known as a finite state machine. A finite automaton first contains a finite number of states, and it also contains transitions from one state to another. A finite automaton looks like a directed graph: the states are the nodes of the graph, and the state transitions are its edges. In addition, there must be one initial state and at least one accepting state. The following figure shows a finite automaton. The state with an arrow pointing to it from outside is the initial state, and the state drawn as a black circle is the accepting state:
Now let's look at how a finite automaton processes an input string:
- At the beginning, the automaton is in the initial state.
- Read the first character of the input string. The automaton looks up the edge leaving the current state that matches the input character and follows that edge to the next state.
- Read the next character and repeat the second step: look up a matching edge from the current state and make the transition.
- After the whole string has been read, if the automaton is in an accepting state, we say that the automaton accepts the string.
Assume that the input string is "hello" (including the quotation marks). At the beginning, the automaton is in state 1. After the opening quotation mark is read, it moves to state 2 along the quotation-mark edge. Then, for each letter of hello, it returns to state 2 along the a-z edge. Finally, the closing quotation mark is read and it moves to state 3. Because state 3 is an accepting state, the automaton accepts this string. If the string is "abc (without the closing quotation mark), the automaton is in state 2 when the input ends, and state 2 is not an accepting state, so the automaton does not accept the string "abc. The set of all strings accepted by an automaton is called the language of that automaton. The concept of language here is the same as the one we used when introducing regular expressions: it is a set of strings over a finite character set.
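The steps above can be sketched in a few lines of C#. This is a minimal illustration and not VBF code: the state numbers 1 to 3 come from the walkthrough, while the names Step and Accepts and the exact edge labels (a quotation mark and the range a-z) are assumptions read off the prose.

```csharp
using System;
using System.Collections.Generic;

// Assumed automaton for the walkthrough above:
// state 1 --"--> state 2, state 2 --[a-z]--> state 2, state 2 --"--> state 3.
const int InitialState = 1;
var acceptStates = new HashSet<int> { 3 };

// Returns the next state, or null when no edge matches the character.
int? Step(int state, char c)
{
    if (state == 1 && c == '"') return 2;
    if (state == 2 && c >= 'a' && c <= 'z') return 2;
    if (state == 2 && c == '"') return 3;
    return null;
}

// Feeds the whole input to the automaton and reports whether it ends in an accepting state.
bool Accepts(string input)
{
    int state = InitialState;
    foreach (char c in input)
    {
        int? next = Step(state, c);
        if (next == null) return false; // no matching edge: reject
        state = next.Value;
    }
    return acceptStates.Contains(state);
}

Console.WriteLine(Accepts("\"hello\"")); // True
Console.WriteLine(Accepts("\"abc"));     // False
```

Because every state has at most one edge per character, Step never has to choose between alternatives, which is what makes this kind of automaton easy to program.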
The automaton shown above is a deterministic finite automaton (DFA). Its defining feature is that, for any given symbol, each state has at most one outgoing edge labeled with that symbol. In other words, the same symbol cannot appear on two edges leaving the same state. There is also another kind, the non-deterministic finite automaton (NFA), which allows several edges with the same symbol to leave one state, and even allows edges labeled with the ε (empty) symbol. That is, an NFA can move to the next state along an ε edge without reading any character. The following figure shows a non-deterministic finite automaton:
When two edges carry the same symbol, which one does an NFA choose? And when there is an ε edge, does it follow it or not? The answer is that the NFA automatically guesses which edge to choose, and it guesses right every time. For example, if the input string is aa, it takes the right path and accepts the string; if the input string is aaa, it takes the left path and accepts the string. It will never take the right path on the input aaa and then decide to reject it. Since our computers have no such ability to guess, one may wonder how an NFA can be of practical use. Some people, when new to these concepts, may also feel that an NFA is more powerful than a DFA because of its ability to guess. But in fact, DFA, NFA, and regular expressions are equivalent. For any NFA there is a DFA that accepts the same language and a regular expression that defines the same language. Likewise, for any regular expression there is an NFA and a DFA that accept the language it defines. Although the three models are defined differently, they all describe the same class of languages, the regular languages. Fortunately, simple rules are enough to convert any regular expression to an NFA, and any NFA can be converted to a DFA. In this way, we can turn regular expressions into DFAs, which are easy to program, for lexical analysis. (Note: there are also regular expression engines that parse strings by simulating an NFA directly. Interested readers can look up the relevant material on their own.)
Now let's look at how to convert a regular expression to an NFA. The regular expressions we learned last time have two basic elements, the character expression and the ε expression, and three basic operations: alternation, concatenation, and Kleene closure. First, let's look at the most basic one, the ε expression. Its NFA looks like this:
Next is the character expression a, whose NFA looks like this:
Every regular expression can be converted into an NFA with one entry edge and one accepting state. Let's assume that the NFA of a general regular expression looks like this:
Then, for two regular expressions X and Y, the NFA of the alternation X | Y is: (in practice, some of the ε edges can often be simplified away)
For the concatenation of two regular expressions, the NFA of XY is:
For the Kleene closure of a regular expression, the NFA of Y* is:
Applying the above rules recursively converts any regular expression to an NFA. Let's try it on the regular expression for identifiers that we studied earlier, [a-z][a-z0-9]*. Using the above rules, the resulting NFA is:
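As a rough sketch of these rules in C#, the builders below compose NFA fragments for the identifier expression [a-z][a-z0-9]*. This is an illustration under stated assumptions, not the VBF implementation: the tuple-based fragment representation, the builder names, and the CharSet helper are hypothetical, and the ε wiring follows the common textbook construction, which may differ in detail from the figures. A small simulator that tracks every state the NFA could be in is included only to check the result.

```csharp
using System;
using System.Collections.Generic;

// All edges live in one list: (from, label, to). A null label is an ε edge.
var edges = new List<(int From, char? Label, int To)>();
int stateCount = 0;
int NewState() => stateCount++;

// Each builder returns a fragment: its entry state and its single tail (accepting) state.
(int Entry, int Tail) Empty()
{
    int s = NewState(), t = NewState();
    edges.Add((s, null, t));
    return (s, t);
}

(int Entry, int Tail) Symbol(char c)
{
    int s = NewState(), t = NewState();
    edges.Add((s, c, t));
    return (s, t);
}

// Character set such as [a-z]: one labeled edge per character.
(int Entry, int Tail) CharSet(string chars)
{
    int s = NewState(), t = NewState();
    foreach (char c in chars) edges.Add((s, c, t));
    return (s, t);
}

// X | Y: a new entry branches into both fragments via ε, and both tails join via ε.
(int Entry, int Tail) Alternation((int Entry, int Tail) x, (int Entry, int Tail) y)
{
    int s = NewState(), t = NewState();
    edges.Add((s, null, x.Entry));
    edges.Add((s, null, y.Entry));
    edges.Add((x.Tail, null, t));
    edges.Add((y.Tail, null, t));
    return (s, t);
}

// XY: link the tail of X to the entry of Y with an ε edge.
(int Entry, int Tail) Concatenation((int Entry, int Tail) x, (int Entry, int Tail) y)
{
    edges.Add((x.Tail, null, y.Entry));
    return (x.Entry, y.Tail);
}

// Y*: one new state is both entry and tail; it loops through Y and back via ε edges.
(int Entry, int Tail) KleeneStar((int Entry, int Tail) y)
{
    int loop = NewState();
    edges.Add((loop, null, y.Entry));
    edges.Add((y.Tail, null, loop));
    return (loop, loop);
}

// All states reachable from the seed states along ε edges only.
HashSet<int> Closure(IEnumerable<int> seed)
{
    var result = new HashSet<int>(seed);
    var work = new Stack<int>(result);
    while (work.Count > 0)
    {
        int s = work.Pop();
        foreach (var e in edges)
            if (e.From == s && e.Label == null && result.Add(e.To)) work.Push(e.To);
    }
    return result;
}

// Simulates the NFA by tracking the set of all states it could be in.
bool Accepts((int Entry, int Tail) nfa, string input)
{
    var current = Closure(new[] { nfa.Entry });
    foreach (char c in input)
    {
        var moved = new List<int>();
        foreach (var e in edges)
            if (e.Label == c && current.Contains(e.From)) moved.Add(e.To);
        current = Closure(moved);
    }
    return current.Contains(nfa.Tail);
}

const string Letters = "abcdefghijklmnopqrstuvwxyz";
const string Digits = "0123456789";
var identifier = Concatenation(CharSet(Letters), KleeneStar(CharSet(Letters + Digits)));

Console.WriteLine(Accepts(identifier, "x1")); // True
Console.WriteLine(Accepts(identifier, "1x")); // False
Console.WriteLine(Accepts(Empty(), ""));      // True
```

Keeping every fragment to exactly one entry and one tail state is what lets the builders nest freely: each rule only needs to know those two states of its operands.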
During lexical analysis, we need to convert the regular expressions of all words into NFAs, and then join all of these NFAs with alternation (the "or" relationship) to form the final NFA required for lexical analysis.
Let's take a look at how the above logic is implemented in VBF.Compilers. Last time, we defined a RegularExpression base class and its five subclasses, which correspond to the basic elements and basic operations of regular expressions respectively. Since converting a regular expression to an NFA is a relatively independent operation, we use the Visitor pattern and define an abstract class as the visitor:
public abstract class RegularExpressionConverter<T>
{
    protected RegularExpressionConverter() { }

    public T Convert(RegularExpression expression)
    {
        if (expression == null)
        {
            return default(T);
        }

        return expression.Accept(this);
    }

    public abstract T ConvertAlternation(AlternationExpression exp);
    public abstract T ConvertSymbol(SymbolExpression exp);
    public abstract T ConvertEmpty(EmptyExpression exp);
    public abstract T ConvertConcatenation(ConcatenationExpression exp);
    public abstract T ConvertAlternationCharSet(AlternationCharSetExpression exp);
    public abstract T ConvertStringLiteral(StringLiteralExpression exp);
    public abstract T ConvertKleeneStar(KleeneStarExpression exp);
}
Then we add an abstract Accept method to the RegularExpression class so that each subclass can implement it separately. For example, the Accept method of the KleeneStarExpression class can be written as follows:
internal override T Accept<T>(RegularExpressionConverter<T> converter)
{
    return converter.ConvertKleeneStar(this);
}
Finally, we write an NFAConverter class that implements the abstract class RegularExpressionConverter<NFAModel>. NFAModel is our NFA object model, which defines the concepts of state nodes and edges. The following is the rule for translating the Kleene closure in NFAConverter:
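public override NFAModel ConvertKleeneStar(KleeneStarExpression exp)
{
    // Convert the expression inside the closure first.
    var innerNFA = Convert(exp.InnerExpression);

    // The new state is both the entry target and the tail of the result.
    var newTail = new NFAState();
    var entry = new NFAEdge(newTail);

    // Empty edge from the inner tail back to the new state.
    innerNFA.TailState.AddEmptyEdgeTo(newTail);
    // From the new state, enter the inner NFA through its entry edge.
    newTail.AddEdge(innerNFA.EntryEdge);

    var kleenStarNFA = new NFAModel();

    kleenStarNFA.AddStates(innerNFA.States);
    kleenStarNFA.AddState(newTail);
    kleenStarNFA.EntryEdge = entry;
    kleenStarNFA.TailState = newTail;

    return kleenStarNFA;
}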
The code should be quite intuitive: it simply repeats the logic of the drawings above. First, convert the expression inside the Kleene closure into an NFA, and then create the auxiliary surrounding states and the corresponding state transitions.
With the algorithm for converting regular expressions to an NFA in hand, we also need to convert the NFA to a DFA. This conversion algorithm is called "subset construction". As mentioned earlier, when several edges with the same symbol leave one state, an NFA can automatically guess which edge to follow. The idea of subset construction is not to guess which state the NFA will move to, but to assume that the NFA can be in all possible states at the same time. For example, let's reconsider the NFA shown earlier. At the start, the initial state of the NFA has two ε transitions. We assume that the NFA can be in the initial state and in all the target states of these ε transitions at the same time; that is, its initial state is actually a set of three states:
We call this set of three states the ε-closure of the initial state. Next, if the character a is read, the NFA can start from any state in the ε-closure of the current state and make a transition along an edge labeled a. This gives the next set of states of the NFA:
Next, the character a is read again. In the same way, we start from the current state set and find the next state set:
If the string ends here, the NFA state set contains an accepting state, so the NFA accepts the string "aa". In other words, this time we successfully parsed the string aa without any ability to guess. From this we can see that there must be a DFA each of whose states is a subset of the NFA states. Next I will briefly describe the subset construction algorithm from the tiger book. Let edge(s, c) denote the set of all NFA states reachable from state s along an edge labeled with the character c. For a set of states S, closure(S) is the set of states reachable from the states in S along ε edges only, without consuming any character. It can be computed by iteration:
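Written out as a formula, following the usual textbook definition, closure(S) is the smallest set T that satisfies the first equation below; the iteration then grows T until it stops changing. The symbol names T and T' are conventional notation rather than something taken from the text above.

```latex
T = S \cup \Bigl( \bigcup_{s \in T} \mathrm{edge}(s, \epsilon) \Bigr)

% Iteration:
T \leftarrow S; \quad
\textbf{repeat} \;\; T' \leftarrow T; \;\;
T \leftarrow T' \cup \Bigl( \bigcup_{s \in T'} \mathrm{edge}(s, \epsilon) \Bigr)
\;\; \textbf{until} \;\; T = T'
```

Since T only ever grows and the NFA has finitely many states, the loop must terminate.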
Next we define what happens after a character is read: starting from the states in the NFA state set d and reading the symbol c, the new set of NFA states that can be reached is written DFAedge(d, c), which is defined as
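In formula form, using the edge and closure functions defined above, this is commonly stated as:

```latex
\mathrm{DFAedge}(d, c) = \mathrm{closure}\Bigl( \bigcup_{s \in d} \mathrm{edge}(s, c) \Bigr)
```

That is, first follow every edge labeled c out of any state in d, then add everything reachable through ε edges.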
Finally, assuming that the character set of the language is Σ, the algorithm that constructs the entire DFA is:
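One possible way to write closure, DFAedge, and the construction loop in C# is sketched below. It is a hedged illustration rather than the VBF.Compilers code: the sample NFA (runs of a whose length is a multiple of 2 or 3), its state numbers, and all names are assumptions made for the example. Unlike the textbook version, it leaves missing transitions undefined instead of adding an explicit empty "dead" state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical NFA: edges are (from, label, to); a null label is an ε edge.
var nfaEdges = new List<(int From, char? Label, int To)>
{
    (1, null, 2), (1, null, 5),
    (2, 'a', 3), (3, 'a', 4), (4, 'a', 2), // left loop: length is a multiple of 3
    (5, 'a', 6), (6, 'a', 5),              // right loop: length is a multiple of 2
};
var nfaAccept = new HashSet<int> { 2, 5 };
char[] alphabet = { 'a' };

// edge(s, c): all NFA states reachable from s along one edge labeled c.
IEnumerable<int> Edge(int s, char? c) =>
    nfaEdges.Where(e => e.From == s && e.Label == c).Select(e => e.To);

// closure(S): keep adding ε targets until the set stops growing.
SortedSet<int> Closure(IEnumerable<int> seed)
{
    var t = new SortedSet<int>(seed);
    bool changed = true;
    while (changed)
    {
        changed = false;
        foreach (int s in t.ToList())
            foreach (int target in Edge(s, null))
                if (t.Add(target)) changed = true;
    }
    return t;
}

// DFAedge(d, c): closure of everything reachable from d on c.
SortedSet<int> DfaEdge(IEnumerable<int> d, char c) =>
    Closure(d.SelectMany(s => Edge(s, c)));

// Canonical key for a set of NFA states, e.g. "1,2,5".
string Key(SortedSet<int> set) => string.Join(",", set);

// Subset construction: DFA state 0 is the closure of the NFA initial state.
var states = new List<SortedSet<int>> { Closure(new[] { 1 }) };
var index = new Dictionary<string, int> { [Key(states[0])] = 0 };
var trans = new Dictionary<(int, char), int>();

for (int j = 0; j < states.Count; j++)
{
    foreach (char c in alphabet)
    {
        var e = DfaEdge(states[j], c);
        if (e.Count == 0) continue; // no move: leave the transition undefined
        if (!index.TryGetValue(Key(e), out int i))
        {
            i = states.Count; // a state set we have not seen before
            states.Add(e);
            index[Key(e)] = i;
        }
        trans[(j, c)] = i;
    }
}

// A DFA state accepts if it contains at least one accepting NFA state.
bool DfaAccepts(string input)
{
    int state = 0;
    foreach (char c in input)
        if (!trans.TryGetValue((state, c), out state)) return false;
    return states[state].Overlaps(nfaAccept);
}

Console.WriteLine(states.Count);      // 7
Console.WriteLine(DfaAccepts("aa"));  // True
Console.WriteLine(DfaAccepts("aaa")); // True
Console.WriteLine(DfaAccepts("a"));   // False
```

The loop only explores state sets that are actually reachable from the start set, so the resulting DFA here has 7 states rather than one for each of the 64 possible subsets of the 6 NFA states.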
Once the basic principle of subset construction is understood, the pseudocode above can easily be turned into real code. The NFA-to-DFA conversion code in VBF.Compilers is long, so I will not post it here. If you are interested, you can download it from GitHub.
Once a regular expression has been converted to a DFA by way of an NFA, scanning real strings is quite easy. In the next article, we will introduce the specific approach and how to handle the Unicode character set. I will then introduce the basic usage of the VBF.Compilers.Scanners class library. If you don't want to implement the entire algorithm yourself, you can follow my next article and use the VBF library to build a lexical analyzer. So please look forward to the next article!
Also, don't forget to follow my VBF project: https://github.com/Ninputer/VBF and my Weibo: http://weibo.com/ninputer. Thank you for your support!