The regular expression obtained in the previous section can now be used to construct an NFA. An NFA is easy to derive from a regular expression, and it also helps in understanding the pattern that the regular expression represents.
1. NFA Representation
Here, an NFA has at least two states, a head state and a tail state, as shown in Figure 1. The NFA corresponding to a regular expression $t$ is written $N(t)$; its head state is $H$ and its tail state is $T$. The figure shows only the head and tail states and omits the other states and the transitions between them, because the recursive algorithm described below only needs to know the head and tail states of an NFA.
Figure 1 NFA Representation
I use the following Nfa class to represent an NFA. It contains only the head state, the tail state, and a method for adding a new state.
namespace Cyjb.Compiler.Lexer {
    class Nfa {
        // Gets or sets the head state of the NFA.
        NfaState HeadState { get; set; }
        // Gets or sets the tail state of the NFA.
        NfaState TailState { get; set; }
        // Creates a new state in the current NFA.
        NfaState NewState() {}
    }
}
An NFA state needs only three essential properties: a symbol index, its state transitions, and a state type. The symbol index is meaningful only for accepting states, where it identifies the regular expression that the accepting state corresponds to; for all other states it is set to -1.
The state transitions describe how to move from the current state to the next state. By the definition of an NFA, a state may have multiple $\epsilon$-transitions and multiple character transitions (edges labeled with a character). Here, however, each state has at most one character transition, which is a property of the NFA construction algorithm given later.
The state type exists to support lookahead (trailing context). It is one of the enumeration values Normal, TrailingHead, or Trailing, and is described in detail where lookahead is discussed.
The NfaState class is defined as follows:
namespace Cyjb.Compiler.Lexer {
    class NfaState {
        // Gets the NFA that contains the current state.
        Nfa Nfa;
        // Gets the index of the current state.
        int Index;
        // Gets or sets the symbol index of the current state.
        int SymbolIndex;
        // Gets or sets the type of the current state.
        NfaStateType StateType;
        // Gets the set of character classes of the character class transition.
        ISet<int> CharClassTransition;
        // Gets the target state of the character class transition.
        NfaState CharClassTarget;
        // Gets the list of ε-transitions.
        IList<NfaState> EpsilonTransitions;
        // Adds a single-character transition to the given state.
        void Add(NfaState state, char ch);
        // Adds a character class transition to the given state.
        void Add(NfaState state, string charClass);
        // Adds an ε-transition to the given state.
        void Add(NfaState state);
    }
}
The two extra members Nfa and Index in the NfaState class exist only to make states more convenient to work with. The $\epsilon$-transitions are stored directly as a list, while the character transition is split into two properties: CharClassTarget is the target state and CharClassTransition is the set of character classes. Character classes are explained in detail below.
The NfaState class also defines three Add methods, which add a single-character transition, a character class transition, and an $\epsilon$-transition respectively.
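To make the shape of these two classes more concrete, here is a minimal, self-contained sketch. It is not the actual Cyjb.Compiler.Lexer code: the names SketchNfa and SketchState are hypothetical, the state type and the char and string overloads of Add are left out, and the character classes are passed in directly as a set of indices.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the article's NfaState class.
public class SketchState {
    public int Index;                       // position of the state inside its NFA
    public int SymbolIndex = -1;            // -1 for every non-accepting state
    public ISet<int> CharClassTransition;   // character classes on the single character edge
    public SketchState CharClassTarget;     // target of that character edge
    public IList<SketchState> EpsilonTransitions = new List<SketchState>();

    // Sets the (single) character class transition of this state.
    public void Add(SketchState target, ISet<int> charClasses) {
        CharClassTarget = target;
        CharClassTransition = charClasses;
    }

    // Adds an epsilon transition to the given state.
    public void Add(SketchState target) {
        EpsilonTransitions.Add(target);
    }
}

// Hypothetical, simplified stand-in for the article's Nfa class.
public class SketchNfa {
    private readonly List<SketchState> states = new List<SketchState>();
    public SketchState HeadState { get; set; }
    public SketchState TailState { get; set; }
    public int Count { get { return states.Count; } }

    // Creates a new state and records its index.
    public SketchState NewState() {
        SketchState state = new SketchState { Index = states.Count };
        states.Add(state);
        return state;
    }
}
```

Storing the target and the class set as two separate members only works because each state has at most one character transition; a general NFA state would need a list of labeled edges instead.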
2. Construct NFA from a regular expression
The recursive algorithm used here is the McNaughton-Yamada-Thompson algorithm (also known as Thompson's construction), which is easier to understand than the Glushkov construction.
2.1 Basic Rules
For the regular expression $\epsilon$, construct the NFA shown in Figure 2(a). For a regular expression $\bf{a}$ that consists of a single character $a$, construct the NFA shown in Figure 2(b).
Figure 2 Basic Rules
The first basic rule is not actually used here, because the regular expression definition does not include $\epsilon$. The second rule is used in the CharClassExp class, the regular expression that represents a character class. The code is as follows:
void BuildNfa(Nfa nfa) {
    nfa.HeadState = nfa.NewState();
    nfa.TailState = nfa.NewState();
    // Add a character class transition.
    nfa.HeadState.Add(nfa.TailState, charClass);
}
2.2 Induction Rules
With the two basic rules above, the induction rules below can be used to construct more complex NFAs.
Assume that the NFAs of the regular expressions $s$ and $t$ are $N(s)$ and $N(t)$ respectively.
1. For $r = s|t$, construct the NFA shown in Figure 3: add a new head state $H$ and a new tail state $T$, add an $\epsilon$-transition from $H$ to the head state of each of $N(s)$ and $N(t)$, and add an $\epsilon$-transition from the tail state of each of $N(s)$ and $N(t)$ to the new tail state $T$. After leaving $H$, the automaton can therefore match either $N(s)$ or $N(t)$ and end up in $T$.
Figure 3 induction rule AlternationExp
Note that the states of $N(s)$ and $N(t)$ must not interfere with each other: there must be no transitions between them, otherwise the recognition result may not be what is expected.
The code in the AlternationExp class is as follows:
void BuildNfa(Nfa nfa) {
    NfaState head = nfa.NewState();
    NfaState tail = nfa.NewState();
    left.BuildNfa(nfa);
    head.Add(nfa.HeadState);
    nfa.TailState.Add(tail);
    right.BuildNfa(nfa);
    head.Add(nfa.HeadState);
    nfa.TailState.Add(tail);
    nfa.HeadState = head;
    nfa.TailState = tail;
}
2. For $r = st$, construct the NFA shown in Figure 4: the head state of $N(s)$ becomes the head state of $N(r)$, the tail state of $N(t)$ becomes the tail state of $N(r)$, and an $\epsilon$-transition is added from the tail state of $N(s)$ to the head state of $N(t)$.
Figure 4 inductive rule ConcatenationExp
The code in the ConcatenationExp class is as follows:
void BuildNfa(Nfa nfa) {
    left.BuildNfa(nfa);
    NfaState head = nfa.HeadState;
    NfaState tail = nfa.TailState;
    right.BuildNfa(nfa);
    tail.Add(nfa.HeadState);
    nfa.HeadState = head;
}
LiteralExp can be viewed as the concatenation of several CharClassExp instances, so its NFA is constructed by applying this rule repeatedly.
3. For $r = s*$, construct the NFA shown in Figure 5: add a new head state $H$ and a new tail state $T$, then add four $\epsilon$-transitions. However, $s*$ is not defined explicitly in the regular expression definition, so the rule for RepeatExp is given next instead.
Figure 5 Rule for $s*$
4. For $r = s\{m,n\}$, construct the NFA shown in Figure 6: add a new head state $H$ and a new tail state $T$, create $n$ copies of $N(s)$ and connect them in sequence, and, starting with the $m$-th copy (the copy with index $m-1$ when counting from zero), add an $\epsilon$-transition from the tail state of each copy to $T$ (if $m = 0$, also add an $\epsilon$-transition from $H$ to $T$). This guarantees that at least $m$ and at most $n$ copies of $N(s)$ are traversed.
Figure 6 inductive rule RepeatExp
If $n = \infty$, however, the NFA shown in Figure 7 is needed. In this case only $m$ copies of $N(s)$ are created, and an $\epsilon$-transition is added from the tail state of the last $N(s)$ back to its head state, similar to $s*$, which allows an unbounded number of repetitions. With $m = 0$ the result is the same as $s*$.
Figure 7 Inductive rule RepeatExp, $n = \infty$
Combining the two rules above gives the construction method of the RepeatExp class:
void BuildNfa(Nfa nfa) {
    NfaState head = nfa.NewState();
    NfaState tail = nfa.NewState();
    NfaState lastHead = head;
    // An unbounded upper limit needs special handling.
    int times = maxTimes == int.MaxValue ? minTimes : maxTimes;
    if (times == 0) {
        // Construct the inner NFA at least once.
        times = 1;
    }
    for (int i = 0; i < times; i++) {
        innerExp.BuildNfa(nfa);
        lastHead.Add(nfa.HeadState);
        if (i >= minTimes) {
            // Add a transition to the final tail state.
            lastHead.Add(tail);
        }
        lastHead = nfa.TailState;
    }
    // Add the transition for the last copy.
    lastHead.Add(tail);
    // No upper limit.
    if (maxTimes == int.MaxValue) {
        // Add a loop on the last copy for unbounded repetition.
        nfa.TailState.Add(nfa.HeadState);
    }
    nfa.HeadState = head;
    nfa.TailState = tail;
}
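The repeat rule can also be tried in isolation. The sketch below is a simplification under stated assumptions, not the RepeatExp class itself: RState, RNfa and RepeatSketch are hypothetical names, the inner expression is fixed to a single character, and Matches is a plain set-of-states simulation added only to check the result.

```csharp
using System.Collections.Generic;

// Hypothetical minimal state: at most one character edge plus epsilon edges.
public class RState {
    public char? Symbol;                    // label of the character edge, if any
    public RState Target;                   // target of the character edge
    public List<RState> Epsilon = new List<RState>();
}

public class RNfa {
    public List<RState> States = new List<RState>();
    public RState Head, Tail;
    public RState NewState() { var s = new RState(); States.Add(s); return s; }
}

public static class RepeatSketch {
    // Basic rule: head --ch--> tail.
    static void BuildChar(RNfa nfa, char ch) {
        nfa.Head = nfa.NewState();
        nfa.Tail = nfa.NewState();
        nfa.Head.Symbol = ch;
        nfa.Head.Target = nfa.Tail;
    }

    // Follows the repeat rule for ch{min,max}; int.MaxValue means "no upper bound".
    public static RNfa BuildRepeat(char ch, int min, int max) {
        var nfa = new RNfa();
        RState head = nfa.NewState(), tail = nfa.NewState(), last = head;
        int times = max == int.MaxValue ? min : max;
        if (times == 0) { times = 1; }
        for (int i = 0; i < times; i++) {
            BuildChar(nfa, ch);
            last.Epsilon.Add(nfa.Head);              // chain the next copy
            if (i >= min) { last.Epsilon.Add(tail); } // enough copies seen: may stop here
            last = nfa.Tail;
        }
        last.Epsilon.Add(tail);
        if (max == int.MaxValue) { nfa.Tail.Epsilon.Add(nfa.Head); } // loop on the last copy
        nfa.Head = head;
        nfa.Tail = tail;
        return nfa;
    }

    // Epsilon closure of a set of states.
    static HashSet<RState> Closure(IEnumerable<RState> seeds) {
        var seen = new HashSet<RState>(seeds);
        var stack = new Stack<RState>(seen);
        while (stack.Count > 0) {
            foreach (RState next in stack.Pop().Epsilon) {
                if (seen.Add(next)) { stack.Push(next); }
            }
        }
        return seen;
    }

    // True if the whole input is accepted by the NFA.
    public static bool Matches(RNfa nfa, string input) {
        HashSet<RState> current = Closure(new[] { nfa.Head });
        foreach (char c in input) {
            var moved = new List<RState>();
            foreach (RState s in current) {
                if (s.Symbol == c) { moved.Add(s.Target); }
            }
            current = Closure(moved);
        }
        return current.Contains(nfa.Tail);
    }
}
```

With this sketch, BuildRepeat('a', 2, 3) produces 8 states and accepts "aa" and "aaa" but neither "a" nor "aaaa", while BuildRepeat('a', 0, int.MaxValue) behaves like a* and also accepts the empty string.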
5. The lookahead expression $r = s/t$ is a special case. Here $N(s)$ and $N(t)$ are simply concatenated (as in rule 2). Because a successful match of the lookahead part $t$ requires going back to the end of $s$ (which is the text that is actually matched), the tail state of $N(s)$ is marked as type TrailingHead and the tail state of $N(t)$ is marked as type Trailing. How these marks are processed is described in the next article, when the NFA is converted to a DFA.
2.3 Example of Constructing an NFA from a Regular Expression
The following example shows how the regular expression (a|b)*baa is turned into the corresponding NFA. Figure 8 lists every step in detail.
Figure 8 Example of constructing the NFA for the regular expression (a|b)*baa
The final NFA is shown in Figure 8 and has 14 states in total. Each part of the regular expression can be identified in the NFA. The NFA constructed here is not the simplest one, so it differs from the NFA in the previous article, "C# lexical analyzer (3) Regular Expression". However, the NFA is only an intermediate step needed to construct a DFA, so there is no need to simplify it.
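The state count can be reproduced by replaying the rules in code. The following sketch is only an illustration under stated assumptions: EState, ENfa and ExampleSketch are hypothetical names, each regular expression node is modeled as a delegate instead of an expression class, and s* is written out as the repeat rule with $m = 0$ and no upper bound.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal state: at most one character edge plus epsilon edges.
public class EState {
    public char? Symbol;
    public EState Target;
    public List<EState> Epsilon = new List<EState>();
}

public class ENfa {
    public List<EState> States = new List<EState>();
    public EState Head, Tail;
    public EState NewState() { var s = new EState(); States.Add(s); return s; }
}

public static class ExampleSketch {
    // Every builder leaves the fragment it built in nfa.Head / nfa.Tail.
    public static Action<ENfa> Ch(char c) {
        return nfa => {
            nfa.Head = nfa.NewState();
            nfa.Tail = nfa.NewState();
            nfa.Head.Symbol = c;
            nfa.Head.Target = nfa.Tail;
        };
    }

    public static Action<ENfa> Alt(Action<ENfa> left, Action<ENfa> right) {
        return nfa => {
            EState head = nfa.NewState(), tail = nfa.NewState();
            left(nfa);
            head.Epsilon.Add(nfa.Head);
            nfa.Tail.Epsilon.Add(tail);
            right(nfa);
            head.Epsilon.Add(nfa.Head);
            nfa.Tail.Epsilon.Add(tail);
            nfa.Head = head;
            nfa.Tail = tail;
        };
    }

    public static Action<ENfa> Cat(Action<ENfa> left, Action<ENfa> right) {
        return nfa => {
            left(nfa);
            EState head = nfa.Head, tail = nfa.Tail;
            right(nfa);
            tail.Epsilon.Add(nfa.Head);
            nfa.Head = head;
        };
    }

    // s* as the repeat rule with m = 0 and no upper bound: four epsilon edges.
    public static Action<ENfa> Star(Action<ENfa> inner) {
        return nfa => {
            EState head = nfa.NewState(), tail = nfa.NewState();
            inner(nfa);
            head.Epsilon.Add(nfa.Head);      // enter the loop body
            head.Epsilon.Add(tail);          // skip it entirely
            nfa.Tail.Epsilon.Add(tail);      // leave after an iteration
            nfa.Tail.Epsilon.Add(nfa.Head);  // repeat
            nfa.Head = head;
            nfa.Tail = tail;
        };
    }

    // Builds the NFA for (a|b)*baa.
    public static ENfa BuildExample() {
        var nfa = new ENfa();
        Action<ENfa> regex = Cat(Cat(Cat(Star(Alt(Ch('a'), Ch('b'))), Ch('b')), Ch('a')), Ch('a'));
        regex(nfa);
        return nfa;
    }

    // Simulates the NFA by tracking the set of reachable states.
    public static bool Matches(ENfa nfa, string input) {
        HashSet<EState> current = Closure(new List<EState> { nfa.Head });
        foreach (char c in input) {
            var moved = new List<EState>();
            foreach (EState s in current) {
                if (s.Symbol == c) { moved.Add(s.Target); }
            }
            current = Closure(moved);
        }
        return current.Contains(nfa.Tail);
    }

    static HashSet<EState> Closure(List<EState> seeds) {
        var seen = new HashSet<EState>(seeds);
        var stack = new Stack<EState>(seeds);
        while (stack.Count > 0) {
            foreach (EState next in stack.Pop().Epsilon) {
                if (seen.Add(next)) { stack.Push(next); }
            }
        }
        return seen;
    }
}
```

The five character fragments contribute 10 states, and the alternation and the repetition add 2 each, so BuildExample() returns an NFA with 14 states; concatenation adds none. The sketch accepts "baa" and "abbaa" and rejects "ba".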
3. Division of character classes
Although an NFA has now been obtained, some details remain. For example, what NFA should be constructed for the regular expression [a-z]z? Since one transition can correspond to only one character, one possibility is the NFA shown in Figure 9.
Figure 9 NFA constructed for [a-z]z
This needs 26 transitions between the first two states and one transition between the last two states. What if the regular expression covers a wider range of characters, such as the whole Unicode range? Adding more than 60,000 transitions is clearly unacceptable in both time and space. Character classes are therefore used to reduce the number of transitions required.
A character class is an equivalence class of characters: all characters that belong to the same character class have the same state transitions. In other words, the automaton does not need to distinguish between the characters within one character class, because they always lead to the same state.
In the regular expression [a-z]z above, the characters a-y do not need to be distinguished, because they always lead to the same state. The character z must be split off into a character class of its own, because the transition between states 1 and 2 makes z behave differently from the other characters. This yields two character classes: the first contains the characters a-y and the second contains the character z. The resulting NFA is shown in Figure 10.
Figure 10 NFA for [a-z]z constructed using character classes
With character classes, the number of transitions required drops to three. Character classes are therefore essential when dealing with a large alphabet: they speed up processing and reduce memory consumption.
The division of character classes is the process of partitioning the Unicode characters into different character classes. My current algorithm is an online algorithm: whenever a new transition is added, the existing character classes are checked to decide whether any of them needs to be split, and the character classes corresponding to the new transition are obtained at the same time. The character classes of a transition are stored in an ISet<int>, because one transition may correspond to several character classes.
Initial: a single character class that covers the entire Unicode range.
Input: the newly added transition $t$
Output: the character classes $cc_t$ of the new transition
for each (existing character class $CC$) {
    $cc_1 = \left\{ c \mid c \in t \ \&\ c \in CC \right\}$
    if ($cc_1 = \emptyset$) { continue; }
    $cc_2 = \left\{ c \mid c \in CC \ \&\ c \notin t \right\}$
    divide $CC$ into $cc_1$ and $cc_2$
    $cc_t = cc_1 \cup cc_t$
    $t = \left\{ c \mid c \in t \ \&\ c \notin CC \right\}$
    if ($t = \emptyset$) { break; }
}
Note that whenever an existing character class $CC$ is divided into two sub-classes $cc_1$ and $cc_2$, the character classes of every existing transition that includes $CC$ must be updated to include both $cc_1$ and $cc_2$, so that the newly created sub-class is covered as well.
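As an illustration only, the partitioning loop can be sketched with plain HashSet<char> objects over a toy alphabet. CharClassSketch and AddTransition are hypothetical names, the real implementation works on the whole Unicode range, and the update of older transitions described in the note above is deliberately left out.

```csharp
using System.Collections.Generic;
using System.Linq;

public class CharClassSketch {
    // Existing character classes; initially one class covering the whole (toy) alphabet.
    public List<HashSet<char>> Classes = new List<HashSet<char>>();

    public CharClassSketch(IEnumerable<char> alphabet) {
        Classes.Add(new HashSet<char>(alphabet));
    }

    // Returns the indices of the classes that together cover the characters of the new transition.
    public HashSet<int> AddTransition(IEnumerable<char> transition) {
        var t = new HashSet<char>(transition);
        var result = new HashSet<int>();
        int count = Classes.Count;            // classes appended below need not be revisited
        for (int i = 0; i < count && t.Count > 0; i++) {
            HashSet<char> cc = Classes[i];
            var cc1 = new HashSet<char>(cc.Where(c => t.Contains(c)));  // cc1 = t ∩ CC
            if (cc1.Count == 0) { continue; }
            t.ExceptWith(cc1);                // t = t \ CC
            if (cc1.Count == cc.Count) {
                result.Add(i);                // CC lies entirely inside t: no split needed
            } else {
                cc.ExceptWith(cc1);           // CC keeps cc2 = CC \ t
                Classes.Add(cc1);             // cc1 becomes a new class
                result.Add(Classes.Count - 1);
            }
        }
        return result;
    }
}
```

Replaying [a-z]z on the alphabet a-z: adding the transition for [a-z] returns class 0 unchanged, and adding the transition for z then splits class 0 into a-y (still index 0) and z (new index 1). At that point the earlier [a-z] transition has to be updated from {0} to {0, 1}, which is exactly the bookkeeping the note above describes.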
I have implemented this algorithm in the CharClass class, which takes full advantage of the efficient set operations of the CharSet class.
HashSet<int> GetCharClass(string charClass) {
    int cnt = charClassList.Count;
    HashSet<int> result = new HashSet<int>();
    CharSet set = GetCharClassSet(charClass);
    if (set.Count == 0) {
        // Contains no character classes.
        return result;
    }
    CharSet setClone = new CharSet(set);
    for (int i = 0; i < cnt && set.Count > 0; i++) {
        CharSet cc = charClassList[i];
        set.ExceptWith(cc);
        if (set.Count == setClone.Count) {
            // The current character class does not overlap with set.
            continue;
        }
        // Get the part of set that overlaps with the current character class.
        setClone.ExceptWith(set);
        if (setClone.Count == cc.Count) {
            // The current character class is completely covered by set, so it can be added directly.
            result.Add(i);
        } else {
            // Remove the split-off part from the current character class.
            cc.ExceptWith(setClone);
            // Update the character classes.
            int newCC = charClassList.Count;
            result.Add(newCC);
            charClassList.Add(setClone);
            // Update the old character class ......
        }
        // Copy set again.
        setClone = new CharSet(set);
    }
    return result;
}
4. Multiple regular expressions, line anchors, and contexts
The algorithm above converts a single regular expression into the corresponding NFA. Handling multiple regular expressions is just as simple: add a new head state as in Figure 11 and add an $\epsilon$-transition from it to the head state of the NFA of each regular expression. The final NFA has one start state and $n$ accepting states.
Figure 11 NFA of multiple Regular Expressions
The end-of-line anchor can be treated directly as a predefined lookahead: r$ can be regarded as r/\n or r/\r?\n (the latter supports both Windows and Unix line endings), and it is in fact handled in the same way.
A regular expression with a line-start anchor matches only at the beginning of a line. Such regular expressions can be handled separately: when matching starts at the beginning of a line, the regular expressions anchored to the line start are used; when matching starts at any other position, the other regular expressions are used.
Of course, regular expressions that are not anchored to the line start can also match when matching starts at the beginning of a line. All regular expressions are therefore divided into two sets: one contains all regular expressions and is used when matching starts at the beginning of a line; the other contains only the regular expressions that are not anchored to the line start and is used when matching starts anywhere else. An NFA is then constructed for each of the two sets.
My lexical analyzer also supports contexts. One or more contexts can be specified for each regular expression, and the regular expression takes effect only in those contexts. The context mechanism allows more precise control over matching and makes it possible to build more powerful lexical analyzers, for example ones that process escape characters while matching a string.
Contexts are implemented in the same way as the line-start anchor above: the regular expressions of each context are put into one group, and an NFA is constructed for each group. A regular expression that belongs to several contexts is copied into each of the corresponding groups.
Assume that $N$ contexts are defined. Together with the line-start anchor, the regular expressions must be divided into $2N$ sets, with an NFA constructed for each set. This inevitably wastes some memory, but string matching is very fast, and the memory overhead can be reduced to some extent through compression. The alternative is to store, in every state, the information needed to enforce the context and line-start restrictions. The NFA would then be smaller, but storing that information in each state also costs extra memory, and matching would involve a lot of backtracking (a performance killer), so the result may well be no better.
Although $2N$ NFAs are needed conceptually, in practice only one NFA with $2N$ start states is constructed. Each start state corresponds to the set of regular expressions of one context, either at the beginning of a line or not. This ensures that all $2N$ NFAs share the same character classes; otherwise later processing would be very cumbersome.
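One way to picture the $2N$ groups is the sketch below. It is an assumption-based illustration, not the analyzer's real data model: RuleSketch and GroupingSketch are hypothetical names, contexts are plain integers from 0 to $N-1$, and the layout that places the two groups of context $c$ at indices $2c$ and $2c+1$ is chosen here only for the example.

```csharp
using System.Collections.Generic;

// Hypothetical description of one lexer rule.
public class RuleSketch {
    public string Pattern;
    public bool AtLineStart;                // true if the rule is anchored to the beginning of a line
    public HashSet<int> Contexts = new HashSet<int>();
}

public static class GroupingSketch {
    // Assumed layout: index 2*c is context c away from a line start,
    // index 2*c + 1 is context c at a line start.
    public static int GroupIndex(int context, bool atLineStart) {
        return context * 2 + (atLineStart ? 1 : 0);
    }

    // Distributes the rules over the 2N groups, one start state per group.
    public static List<List<RuleSketch>> Group(IList<RuleSketch> rules, int contextCount) {
        var groups = new List<List<RuleSketch>>();
        for (int i = 0; i < contextCount * 2; i++) {
            groups.Add(new List<RuleSketch>());
        }
        foreach (RuleSketch rule in rules) {
            foreach (int context in rule.Contexts) {
                // Every rule of the context may match at the beginning of a line.
                groups[GroupIndex(context, true)].Add(rule);
                // Only rules without a line-start anchor may match elsewhere.
                if (!rule.AtLineStart) {
                    groups[GroupIndex(context, false)].Add(rule);
                }
            }
        }
        return groups;
    }
}
```

At match time the lexer would then pick the start state with GroupIndex(currentContext, atLineStart). A rule that belongs to several contexts simply appears in several groups, which matches the copying described above.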
The NFA corresponding to the regular expressions has now been constructed. In the next article, I will describe how to convert the NFA into an equivalent DFA.