Contents
- 2.1 Subset construction
- 2.2 Subset construction example
- 2.3 Subset construction with multiple start states
- 2.4 Symbol indexes of DFA states
- 2.5 Implementing subset construction
- 2.6 DFA dead states
- 3.1 DFA minimization
- 3.2 DFA minimization example
- 3.3 Character class minimization
Series navigation
- (1) Introduction to lexical analysis
- (2) Input buffering and code location
- (3) Regular expressions
- (4) Constructing the NFA
- (5) Converting to a DFA
- (6) Constructing a lexical analyzer
In the previous article we obtained an NFA equivalent to a regular expression. This article describes how to convert the NFA to a DFA, and how to simplify the DFA and its character classes.
1. DFA representation
A DFA is represented much like an NFA, but it is considerably simpler: it only needs a method for adding a new state. The code of the Dfa class is as follows:
    namespace Cyjb.Compiler.Lexer
    {
        class Dfa
        {
            // Creates a new state in the current DFA (body omitted).
            DfaState NewState() { }
        }
    }
A DFA state is also relatively simple. It has only two essential attributes: the symbol index and the state transitions.
The symbol index identifies the regular expression that the current accepting state corresponds to. However, one DFA state may correspond to several NFA states (see the subset construction below), so the symbol index of a DFA state is an array. For ordinary (non-accepting) states, the symbol index is an empty array.
The state transitions describe how to move from the current state to the next state. Because the NFA has already been partitioned into character classes, the DFA simply uses an array to record the transition for each character class (a DFA has no $\epsilon$ transitions, and each character class has at most one transition).
The NFA state definition also has a state type attribute, but the DFA state does not. This is because states of the Trailing type are handled when the DFA matches a string (described in the next article), and states of the TrailingHead type are merged with states of the Normal type when the DFA is constructed (see section 2.4).
The following is the definition of the DfaState class:
    namespace Cyjb.Compiler.Lexer
    {
        class DfaState
        {
            // Gets the DFA that contains the current state.
            Dfa Dfa { get; private set; }
            // Gets or sets the index of the current state.
            int Index { get; set; }
            // Gets or sets the symbol indexes of the current state.
            int[] SymbolIndex { get; set; }
            // Gets or sets the target state for a given character class (accessor bodies omitted).
            DfaState this[int charClass] { get; set; }
        }
    }
The two additional attributes defined on the DFA state, Dfa and Index, exist only to make the state more convenient to use.
2. Converting an NFA to a DFA
2.1 Subset construction
An NFA is converted to a DFA with the subset construction algorithm. The process is similar to the NFA matching process described in section 3.1 of "C# Lexical Analyzer (3): Regular Expressions". NFA matching always works with sets of NFA states, and subset construction makes one DFA state correspond to one set of NFA states. That is, the state the DFA reaches after reading the input string $a_1a_2 \cdots a_n$ corresponds to the set of states the NFA can reach after reading the same string $a_1a_2 \cdots a_n$.
The subset construction algorithm needs the following operations:

| Operation | Description |
| --- | --- |
| $\epsilon\text{-}closure(s)$ | The set of NFA states reachable from the NFA state $s$ using only $\epsilon$ transitions. |
| $\epsilon\text{-}closure(T)$ | The set of NFA states reachable from some NFA state $s$ in $T$ using only $\epsilon$ transitions, that is, $\cup_{s \in T} \epsilon\text{-}closure(s)$. |
| $move(T, a)$ | The set of NFA states reachable from some state $s$ in $T$ through a transition labelled $a$. |
What we need to find is the set of all states that an NFA $N$ may be in after reading an input string.
First, before reading the first character, $N$ can be in any state in $\epsilon\text{-}closure(s_0)$, where $s_0$ is the start state of $N$. So $\epsilon\text{-}closure(s_0)$ is the start state of the DFA.
Suppose that after reading the input string $x$, $N$ can be in any state of the set $T$, and the next input character is $a$. Then $N$ can move immediately to any state in $move(T, a)$, and from there, through $\epsilon$ transitions, to any state in $\epsilon\text{-}closure(move(T, a))$. In this way, each distinct $\epsilon\text{-}closure(move(T, a))$ represents one DFA state. If this description is hard to follow, refer to the example in section 2.2.
This gives the following algorithm (in the algorithm, $T[a] = U$ means that state $T$ has a transition on character class $a$ to state $U$):
    Input: an NFA $N$
    Output: a DFA $D$ equivalent to $N$
    initially, $\epsilon\text{-}closure(s_0)$ is the only state in $D$, and it is unmarked
    while (there is an unmarked state $T$ in $D$) {
        mark $T$
        foreach (character class $a$) {
            $U = \epsilon\text{-}closure(move(T, a))$
            if ($U$ is not in $D$) {
                add $U$ to $D$ as an unmarked state
            }
            $T[a] = U$
        }
    }
If an NFA state is an accepting state, every DFA state that contains it is also an accepting state, and the symbol indexes of that DFA state include the symbol index of the NFA state. One DFA state may correspond to several NFA states, which is why the symbol index in the DfaState definition above is an array.
Computing $\epsilon\text{-}closure(T)$ is a simple graph search starting from a set of states, and it can be implemented with DFS. The algorithm is as follows ($\epsilon\text{-}closure(s)$ is equivalent to $\epsilon\text{-}closure(\{s\})$):
    Input: a set of NFA states $T$
    Output: $\epsilon\text{-}closure(T)$
    push all states of $T$ onto the stack
    $\epsilon\text{-}closure(T) = T$
    while (the stack is not empty) {
        pop the top element $t$ from the stack
        foreach ($u$ : $t$ has an $\epsilon$ transition to $u$) {
            if ($u \notin \epsilon\text{-}closure(T)$) {
                $\epsilon\text{-}closure(T) = \epsilon\text{-}closure(T) \cup \left\{u\right\}$
                push $u$ onto the stack
            }
        }
    }
The algorithm for computing $move(T, a)$ is even simpler, with only one loop:

    Input: a set of NFA states $T$ and a character class $a$
    Output: $move(T, a)$
    $move(T, a) = \emptyset$
    foreach ($u \in T$) {
        if ($u$ has a transition on character class $a$ to a target $t$) {
            $move(T, a) = move(T, a) \cup \left\{t\right\}$
        }
    }
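The three procedures above can be sketched in C# roughly as follows. This is a minimal illustration, not the article's implementation: `SimpleNfa`, `SubsetConstruction`, the integer state numbering (state 0 as the only start state), the `-1` marker for a missing transition, and the string key used to look up a state set are all assumptions made for the sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal NFA: states are 0..n-1 and state 0 is the start state.
class SimpleNfa
{
    public int ClassCount;        // number of character classes
    public List<int>[] Epsilon;   // Epsilon[s]: targets of epsilon transitions from s
    public List<int>[][] Next;    // Next[s][a]: targets of s on character class a

    public SimpleNfa(int stateCount, int classCount)
    {
        ClassCount = classCount;
        Epsilon = new List<int>[stateCount];
        Next = new List<int>[stateCount][];
        for (int s = 0; s < stateCount; s++)
        {
            Epsilon[s] = new List<int>();
            Next[s] = new List<int>[classCount];
            for (int a = 0; a < classCount; a++) Next[s][a] = new List<int>();
        }
    }
}

static class SubsetConstruction
{
    // epsilon-closure(T): depth-first search along epsilon transitions.
    public static SortedSet<int> EpsilonClosure(SimpleNfa nfa, IEnumerable<int> T)
    {
        var closure = new SortedSet<int>(T);
        var stack = new Stack<int>(closure);
        while (stack.Count > 0)
        {
            int t = stack.Pop();
            foreach (int u in nfa.Epsilon[t])
                if (closure.Add(u)) stack.Push(u);   // Add returns false if u is already present
        }
        return closure;
    }

    // move(T, a): every target of a transition on class a from a state in T.
    public static SortedSet<int> Move(SimpleNfa nfa, IEnumerable<int> T, int a)
    {
        var result = new SortedSet<int>();
        foreach (int u in T) result.UnionWith(nfa.Next[u][a]);
        return result;
    }

    // Returns the DFA transition table (one row per DFA state, -1 = no transition).
    // dfaSets[i] is the set of NFA states represented by DFA state i.
    public static List<int[]> Build(SimpleNfa nfa, out List<SortedSet<int>> dfaSets)
    {
        var sets = new List<SortedSet<int>>();
        var index = new Dictionary<string, int>();   // sorted set written as text -> DFA state
        var table = new List<int[]>();
        var unmarked = new Queue<int>();

        var start = EpsilonClosure(nfa, new[] { 0 });
        index[string.Join(",", start)] = 0;
        sets.Add(start);
        table.Add(new int[nfa.ClassCount]);
        unmarked.Enqueue(0);

        while (unmarked.Count > 0)
        {
            int t = unmarked.Dequeue();              // dequeuing plays the role of marking T
            for (int a = 0; a < nfa.ClassCount; a++)
            {
                var u = EpsilonClosure(nfa, Move(nfa, sets[t], a));
                if (u.Count == 0) { table[t][a] = -1; continue; }
                string key = string.Join(",", u);
                if (!index.TryGetValue(key, out int id))
                {
                    id = sets.Count;
                    index[key] = id;
                    sets.Add(u);
                    table.Add(new int[nfa.ClassCount]);
                    unmarked.Enqueue(id);
                }
                table[t][a] = id;
            }
        }
        dfaSets = sets;
        return table;
    }
}
```

The sketch skips empty target sets instead of creating a state for them, and it does not track accepting states or symbol indexes.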
2.2 Subset construction example
Here, the NFA constructed from the regular expression (a|b)*baa is used as an example and converted to a DFA. The input alphabet is $\Sigma = \{a, b\}$.
Figure 1 NFA for the regular expression (a|b)*baa
Figure 2 DFA construction example
Figure 3 The final DFA
2.3 Subset construction with multiple start states
The NFA constructed in the previous article has multiple start states (to support contexts and beginning-of-line anchors), but this does not affect subset construction. The construction begins at a start state and keeps building DFA states by following the NFA's transitions, so you only need to run it once for each start state to build several DFAs correctly, without worrying about them interfering with each other. For convenience, these DFAs are still stored in a single DFA and are distinguished only by their start states.
2.4 Symbol indexes of DFA states
A DFA state corresponds to a set of NFA states, so you can simply collect the symbol indexes of all those NFA states. However, as mentioned above, NFA states of the TrailingHead type are merged with Normal NFA states when the DFA is constructed. This merging refers to merging the symbol indexes.
The merge itself is simple. A state of the Normal type uses its symbol index directly, while a state of the TrailingHead type uses the value int.MaxValue - symbolIndex as the symbol index in the DFA state, so that the two kinds of state can be told apart (because the number of defined symbols is small, you do not need to worry about collisions or negative values).
Finally, sort the symbol indexes of the DFA state in ascending order. This guarantees that the symbol indexes of the Normal type always come before those of the TrailingHead type, which makes them easier to process and makes the later lexical analysis more efficient.
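The merge described in this section might look like the sketch below. The `SymbolIndexMerge.Merge` helper and its tuple input are hypothetical names chosen for illustration, and the convention that a negative symbol index marks a non-accepting NFA state is an assumption of the sketch rather than something taken from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SymbolIndexMerge
{
    // Each item describes one NFA state in the set: its symbol index and
    // whether it is of the TrailingHead type. A negative symbol index means
    // "not an accepting state" (an assumption made for this sketch).
    public static int[] Merge(IEnumerable<(int SymbolIndex, bool IsTrailingHead)> nfaStates)
    {
        var merged = new SortedSet<int>();   // removes duplicates and keeps ascending order
        foreach (var (symbolIndex, isTrailingHead) in nfaStates)
        {
            if (symbolIndex < 0) continue;
            // TrailingHead indexes are mapped to the top of the int range,
            // so they always sort after the Normal indexes.
            merged.Add(isTrailingHead ? int.MaxValue - symbolIndex : symbolIndex);
        }
        return merged.ToArray();
    }
}
```

To recover the original index of a TrailingHead entry, subtract the stored value from `int.MaxValue` again.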
2.5 Implementing subset construction
The C# implementation of subset construction follows the pseudocode above closely. One problem remains, however: how to find the DFA state for a given set of NFA states efficiently. Because a set of NFA states is stored as a HashSet<NfaState>, I simply use a dictionary keyed by that set (a Dictionary<HashSet<NfaState>, DfaState>). This requires a custom weak hash function, so that the hash value of a set depends only on the elements in the set and not on their order.
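An order-independent hash of this kind can be sketched as an equality comparer that is passed to the dictionary. `SetComparer<T>` is a hypothetical name, and combining the element hashes with XOR is just one simple choice, not necessarily what the library does. .NET also provides `HashSet<T>.CreateSetComparer()`, which returns a ready-made comparer for set equality.

```csharp
using System;
using System.Collections.Generic;

// Compares sets by content; the hash depends only on the elements,
// never on the order in which they are enumerated.
class SetComparer<T> : IEqualityComparer<HashSet<T>>
{
    public bool Equals(HashSet<T> x, HashSet<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SetEquals(y);
    }

    public int GetHashCode(HashSet<T> set)
    {
        int hash = set.Count;
        foreach (T item in set)
        {
            // XOR is commutative and associative, so enumeration order does not matter.
            hash ^= item == null ? 0 : item.GetHashCode();
        }
        return hash;
    }
}
```

A dictionary created with `new Dictionary<HashSet<int>, string>(new SetComparer<int>())` then finds an entry for any set with the same elements. A set must not be modified after it has been used as a key, because its hash value would change.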
The method defined in the Nfa class is as follows:

    /// <summary>
    /// Constructs a DFA from the current NFA using subset construction.
    /// </summary>
    /// <param name="headCnt">The number of head nodes.</param>
    internal Dfa BuildDfa(int headCnt)
    {
        Dfa dfa = new Dfa(charClass);
        // Mapping between DFA states and NFA states:
        // one DFA state corresponds to one set of NFA states.
        Dictionary<DfaState, HashSet<NfaState>> stateMap =
            new Dictionary<DfaState, HashSet<NfaState>>();
        // Mapping from a set of NFA states to the corresponding DFA state
        // (the inverse of the table above).
        Dictionary
        // (the listing is truncated here)
In this implementation, the symbol index of the DFA start state is set to an empty array, so the empty string $\epsilon$ is never matched (other matches are not affected). In other words, the DFA always matches at least one character. This makes sense in lexical analysis, because a token cannot be an empty string.
2.6 DFA dead states
Strictly speaking, the automaton produced by the algorithm above may not be a DFA, because a DFA requires every state to have exactly one transition on each character class. The DFA generated above may have no transition on some character classes, because the implementation ignores a transition whose target set of NFA states is empty. For a strict DFA, these missing transitions should go to the dead state $\emptyset$ (all transitions of the dead state, on every character class, lead back to itself).
In lexical analysis, however, you need to know when the input can no longer be accepted by the DFA, so that you can tell whether a complete token has been matched. Transitions to the dead state are therefore removed in lexical analysis. If there is no transition on an input symbol, the token is considered matched at that point (namely the token of the last accepting state that was reached).
3. Simplifying the DFA
3.1 DFA minimization
Although the construction above yields a usable DFA, it may not be optimal. For example, the following two equivalent DFAs both recognize the regular expression (a|b)*baa, but they have different numbers of states.
Figure 4 Two equivalent DFAs
Clearly, the fewer states a DFA has, the more efficient matching is. We therefore need an algorithm that minimizes the number of DFA states, that is, simplifies the DFA.
The idea behind simplifying a DFA is to look for equivalent states: states that are either all accepting or all non-accepting, and that always move to equivalent states on every input. Once all equivalent states have been found, each group of them can be merged into a single state, which minimizes the number of DFA states.
There are two ways to find equivalent states: splitting and merging.
- The splitting method first treats all accepting states and all non-accepting states as two sets of equivalent states, and then repeatedly splits off subsets of states that are not equivalent, until none of the remaining sets can be split.
- The merging method first treats all states as non-equivalent, then finds two (or more) equivalent states among them and merges them into one state.
Both methods can simplify a DFA, but the merging method is more complicated, so we use the splitting method here.
The DFA minimization algorithm is as follows:
    Input: a DFA $D$
    Output: the minimal DFA $D'$ equivalent to $D$
    construct the initial partition $\Pi$ of the states of $D$; it has two groups:
        the accepting states and the non-accepting states
    while (true) {
        foreach (group $G \in \Pi$) {
            split $G$ into smaller groups, so that two states $s$ and $t$ are in the same
            group if and only if, for every input symbol, $s$ and $t$ move to the same group of $\Pi$
        }
        save the new groups in $\Pi_{new}$
        if ($\Pi_{new} \ne \Pi$) {
            $\Pi = \Pi_{new}$
        } else {
            $\Pi_{final} = \Pi$
            break;
        }
    }
    in each group of $\Pi_{final}$, choose one state as the representative of the group;
        these representatives are the states of $D'$
    the start state of $D'$ is the representative of the group that contains the start state of $D$
    the accepting states of $D'$ are the representatives of the groups that contain an accepting state of $D$
    if $s$ is a (non-representative) state of a group $G$ in $\Pi_{final}$, every transition
        to $s$ in $D'$ is changed into a transition to the representative of $G$
Because accepting and non-accepting states are separated at the very beginning, no group ever contains both accepting and non-accepting states.
In a real implementation, note that a DFA state may correspond to several terminal symbols. When building the initial partition, accepting states that correspond to different terminal symbols must therefore also be placed in different groups.
3.2 DFA minimization example
The following uses the DFA in Figure 4(a) as an example of DFA minimization.
The initial partition has two groups, $\{A, B, C, D\}$ and $\{E\}$, which are the non-accepting group and the accepting group respectively.
In the first round, consider the group $\{A, B, C, D\}$. On character a, states $A$, $B$, and $C$ all move to states inside the group, while state $D$ moves to the group $\{E\}$, so $D$ must be split off. On character b, all states move to states inside the group, so b does not distinguish them. The group $\{E\}$ contains only one state and cannot be split further. After this round, $\Pi_{new} = \left\{\{A, B, C\}, \{D\}, \{E\}\right\}$.
In the second round, consider the group $\{A, B, C\}$. On character a, states $A$ and $B$ move to states inside the group, while state $C$ moves to the group $\{D\}$, so $C$ must be split off; character b again does not distinguish the states. The groups $\{D\}$ and $\{E\}$ cannot be split. After this round, $\Pi_{new} = \left\{\{A, B\}, \{C\}, \{D\}, \{E\}\right\}$.
In the third round, the only group that could be split is $\{A, B\}$. On both a and b, the two states move to the same group, so the group is not split. Therefore $\Pi_{final} = \left\{\{A, B\}, \{C\}, \{D\}, \{E\}\right\}$.
Finally, the minimal DFA is constructed. It has four states, corresponding to the four groups of $\Pi_{final}$. Choose $A$, $C$, $D$, and $E$ as the representatives of the groups; $A$ is the start state and $E$ is the accepting state. Every transition to $B$ is changed into a transition to $A$. The resulting DFA transition table is:

| DFA state | Transition on a | Transition on b |
| --- | --- | --- |
| A | A | C |
| C | D | C |
| D | E | C |
| E | A | C |
Finally, renumber the states to obtain the DFA shown in Figure 4(b).
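The splitting rounds of this example can be reproduced with a small partition-refinement sketch. It is only an illustration under stated assumptions: `DfaMinimizer.Partition` is a hypothetical helper, states are plain integers, `-1` marks a missing transition, and a single integer `label` per state stands in for the symbol indexes that decide the initial partition. The sketch assumes at least one state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DfaMinimizer
{
    // transitions[s, a]: target of state s on character class a, or -1 if there is none.
    // label[s]: what state s accepts (for example -1 for non-accepting states);
    // states with different labels start out in different groups.
    // Returns group[s], the index of the group that state s ends up in.
    public static int[] Partition(int[,] transitions, int[] label)
    {
        int n = transitions.GetLength(0), classes = transitions.GetLength(1);
        int[] group = AssignIds(label.Select(l => l.ToString()));
        while (true)
        {
            // The key of a state is its own group plus the group reached on each class.
            var keys = new string[n];
            for (int s = 0; s < n; s++)
            {
                var parts = new List<int> { group[s] };
                for (int a = 0; a < classes; a++)
                {
                    int t = transitions[s, a];
                    parts.Add(t < 0 ? -1 : group[t]);
                }
                keys[s] = string.Join(",", parts);
            }
            int[] next = AssignIds(keys);
            // Groups are only ever split, so an unchanged group count means
            // the partition did not change in this round.
            if (next.Max() == group.Max()) return next;
            group = next;
        }
    }

    // Gives equal keys the same id; ids are numbered in order of first appearance.
    static int[] AssignIds(IEnumerable<string> keys)
    {
        var ids = new Dictionary<string, int>();
        var result = new List<int>();
        foreach (string k in keys)
        {
            if (!ids.TryGetValue(k, out int id))
            {
                id = ids.Count;
                ids.Add(k, id);
            }
            result.Add(id);
        }
        return result.ToArray();
    }
}
```

Because Figure 4(a) is not reproduced here, any transition table fed to this sketch for the example has to be reconstructed from the rounds described above. Building $D'$ from the result means picking one state per group and redirecting transitions to it.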
3.3 Character class minimization
After the DFA has been minimized, the character classes should be minimized as well. The DFA minimization process merges equivalent states, and this may make some character classes equivalent, as shown in Figure 5.
Figure 5 Equivalent character classes
Finding equivalent character classes is easier than finding equivalent states. Write the simplified DFA as a table; for the DFA in Figure 5 it looks like this:
| DFA state | Transition on a | Transition on b | Transition on c |
| --- | --- | --- | --- |
| A | B | B | $\emptyset$ |
| B | B | B | C |
| C | $\emptyset$ | $\emptyset$ | $\emptyset$ |
The first column of the table is the DFA state, and the last three columns are the transitions on the different character classes. Rows two through four of the table correspond to the transitions of states A, B, and C respectively. If two columns of the table are identical, the corresponding character classes are equivalent. In the table above, the columns for a and b are identical, so a and b can be merged into one character class.
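Comparing columns can be sketched as follows. `CharClassMerger.MergeColumns` is a hypothetical helper, and the table encoding (integer states, `-1` for $\emptyset$) is an assumption made for the sketch.

```csharp
using System;
using System.Collections.Generic;

static class CharClassMerger
{
    // transitions[s, a]: target of state s on character class a (-1 = no transition).
    // Returns map[a], the new index of character class a; classes whose columns
    // are identical receive the same new index.
    public static int[] MergeColumns(int[,] transitions)
    {
        int states = transitions.GetLength(0), classes = transitions.GetLength(1);
        var seen = new Dictionary<string, int>();   // column written as text -> new class index
        var map = new int[classes];
        for (int a = 0; a < classes; a++)
        {
            var column = new int[states];
            for (int s = 0; s < states; s++) column[s] = transitions[s, a];
            string key = string.Join(",", column);
            if (!seen.TryGetValue(key, out int id))
            {
                id = seen.Count;
                seen.Add(key, id);
            }
            map[a] = id;
        }
        return map;
    }
}
```

The returned map can then be applied to the character-to-class table, and the duplicate columns can be dropped from the transition table.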
The implementation code for simplifying the DFA and the character classes is fairly long, so I will not post it here. For details, see the Dfa class.
The simplified DFA is generally stored as a transition table (that is, in the form of the table above). The following three arrays are enough to represent the DFA completely.
    int[] charClass;
    int[,] transitions;
    int[][] symbolIndex;
Here, charClass is the character class mapping table: an array of length 65536 that maps each character to its character class. transitions is the DFA transition table; its number of rows equals the number of DFA states, and its number of columns equals the number of character classes. symbolIndex holds the symbol indexes of each state.
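To show how the three arrays work together, here is a hedged sketch of a table-driven longest match. It is not the analyzer described in the next article: `TableMatcher.LongestMatch` is a hypothetical helper, and it assumes that state 0 is the start state, that `-1` in `transitions` marks a missing transition (the removed dead state), and that only the first (smallest) symbol index of an accepting state is reported.

```csharp
using System;

static class TableMatcher
{
    // Returns the length of the longest prefix of input (starting at start) that
    // ends in an accepting state, or -1 if there is none. symbol receives the
    // first symbol index of that accepting state, or -1.
    public static int LongestMatch(int[] charClass, int[,] transitions,
        int[][] symbolIndex, string input, int start, out int symbol)
    {
        int state = 0;        // assumption: state 0 is the start state
        int length = -1;
        symbol = -1;
        for (int i = start; i < input.Length; i++)
        {
            // Map the character to its class, then follow the transition table.
            state = transitions[state, charClass[input[i]]];
            if (state < 0) break;   // no transition: nothing longer can be accepted
            if (symbolIndex[state].Length > 0)
            {
                // Remember the last accepting state that was reached.
                length = i - start + 1;
                symbol = symbolIndex[state][0];
            }
        }
        return length;
    }
}
```

Stopping at a missing transition and falling back to the last accepting state is the behaviour described in section 2.6.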
Of course, the DFA transition table and the symbol indexes can also be compressed to save memory, but that is left for later.
The next article will describe how to construct a lexical analyzer based on the DFA.
The DFA construction and related methods are in the LexerRule class. Other related code can be found here, and some basic classes (such as the input buffer) are here.