Chen Zixian vczh@163.com http://www.cppblog.com/vczh/
1. Overview of the problem
As the structure of computer languages grows more complicated, people have gradually recognized the importance of studying lexical analysis independently in order to build good compilers. But the use of lexical analyzers is not limited to compilers. Recall a problem our teachers liked to assign when we were first learning to program: given a string containing an arithmetic expression, write a program to compute its result. Besides that, we sometimes design fairly complex configuration files, such as XML; a parser must first perform lexical analysis on the file, breaking the long string into relatively short tokens (strings tagged with an attribute), before analyzing its structure. Furthermore, a console application may need to analyze the commands the user types in. If the commands are complex enough, performing lexical analysis on them first will greatly simplify the work that follows.
Of course, most of these problems have long been solved, and history has produced a variety of specialized or general-purpose tools (lex, regular expression engines, and so on) for this class of problems. Still, we need to understand the principles behind lexical analysis, whether to write configurations for such tools more effectively or to build a similar tool ourselves in some special situation. This article presents the principles for constructing a general lexical analysis tool; because the implementation code would be too long, the article does not include a full implementation.
What exactly does it mean to "break a string into tokens"? Let us start with arithmetic expressions. An arithmetic expression is a string of characters, but the objects we actually care about are the operators, the parentheses, and the numbers. So the job of lexical analysis is to break a string into the tokens we care about. For example, (11+22)*(33+44) is a valid arithmetic expression; the lexical analyzer turns it into (left parenthesis, "("), (number, "11"), (level-one operator, "+"), (number, "22"), (right parenthesis, ")"), (level-two operator, "*"), (left parenthesis, "("), (number, "33"), (level-one operator, "+"), (number, "44"), (right parenthesis, ")"). When checking the structure of the expression we only need each token's attribute (left parenthesis, right parenthesis, number, operator, and so on); we only need the token's actual content when we perform the calculation. If the expression may contain spaces, we simply treat spaces as one more token type and discard all tokens with the space attribute after lexical analysis; the subsequent steps need not change.
Note, however, that the result of lexical analysis has no hierarchical structure; all the tokens are peer objects. It is only when we evaluate the expression that we treat + and * as operators of different levels, and similar structures can nest. Lexical analysis produces no information about nesting; it can only capture repetitive structure.
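As a concrete illustration (a minimal hand-written sketch, not the general tool this article goes on to build; the token type names are mine), a tokenizer for arithmetic expressions might look like this:

```python
# A minimal hand-written tokenizer for arithmetic expressions.
# Token type names (number, op1, op2, lparen, rparen) are hypothetical.
def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():                       # spaces are simply skipped here;
            i += 1                            # they could also be emitted as
            continue                          # "space" tokens and dropped later
        if c.isdigit():                       # a run of digits is a number
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("number", text[i:j]))
            i = j
        elif c in "+-":                       # level-one operators
            tokens.append(("op1", c)); i += 1
        elif c in "*/":                       # level-two operators
            tokens.append(("op2", c)); i += 1
        elif c == "(":
            tokens.append(("lparen", c)); i += 1
        elif c == ")":
            tokens.append(("rparen", c)); i += 1
        else:
            raise ValueError("unexpected character: " + c)
    return tokens
```

Running it on the expression above yields exactly the (attribute, content) pairs listed: `tokenize("(11+22)*(33+44)")` starts with `("lparen", "(")`, `("number", "11")`, `("op1", "+")`, and so on.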
2. Regular expressions
We now need a tool for describing token types, so let us first look at the structure of common tokens. To represent a set of strings that share some common property, we write rules that denote such string sets. Every member of a rule's set is considered a token of that rule's type.
First, a rule can designate a particular character, or the empty string, as the entire content of a token type. In the arithmetic expressions above, the token type "left parenthesis" corresponds only to the character "("; no other character or string can be a token of this type.
Second, rules can be concatenated. Concatenation means that a prefix of the string must conform to the first rule, a prefix of the remainder must conform to the second rule, a prefix of what then remains must conform to the third rule, and so on, until the final part conforms in its entirety to the last rule. If we treat the string "function" as a token type, we can view the rule as the concatenation of 8 rules: "f", "u", "n", "c", "t", "i", "o", "n". The prefix "f" of "function" conforms to the rule "f", the prefix "u" of the remainder "unction" conforms to the rule "u", and so on, until the final part "n" conforms entirely to the rule "n".
Third, rules can be combined in parallel. Parallel means that if a string conforms to any one of a series of rules, we say the string conforms to the parallel combination of those rules; the parallel combination thus constitutes a new rule. A typical example is deciding whether a string is a keyword. A keyword can be "if", or "else", or "while", and so on. Of course a single string cannot match all of these rules at once, but as long as it matches one of them, we call it a keyword. The rule "keyword" is therefore the parallel combination of the rules "if", "else", "while", and the rest.
Fourth, a rule can be optional. An optional rule is really a special form of parallel combination. Suppose we need the parallel combination of the rules "abc" and "abcde". The two rules share the common prefix "abc", which happens to be one of the rules itself, so we can rewrite the combination as "abc" concatenated with the parallel combination of "" and "de". The rule "" specifies the empty string, so its parallel combination with "de" can be viewed as an optional rule "de".
Fifth, rules can be repeated. A bounded number of repetitions can be expressed with concatenation, but if we place no limit on the number of repetitions, concatenation cannot express the rule, so we introduce "repetition". A typical example is a programming language identifier, which may be the name of a variable or of something else. A language usually places no upper bound on the length of a variable name, so to express this rule we take the parallel combination of the 52 letters and then repeat that rule.
Of the 5 methods of constructing rules above, the last 4 combine existing rules into larger rules. To give such rules a formal representation, we introduce a notation with the following syntax:
1: A character is enclosed in double quotes; the empty string is written ε.
2: Two rules written one after the other represent their concatenation.
3: Two rules separated by | represent their parallel combination.
4: A rule surrounded by [] is optional; a rule surrounded by {} is repeated.
5: A rule surrounded by () is treated as a whole, usually to override the precedence of the | operator.
For example, the rule for a real number can be written as follows:
{"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"} ["." {"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"}]
But how do we express "any character that is not a digit"? The set of characters is finite, so in principle we could use a parallel combination of rules, but there really are too many characters (the ASCII character set has 128 characters, and UTF-16 has 65,536), so people devised various simplified notations. A famous one is the BNF notation. BNF is often used in theoretical work; in practice the more common tool is the regular expression.
In a regular expression, characters need not be enclosed in double quotes, but characters that already have a defined meaning (such as "|") must be written with an escape character (such as "\|"). Next, x? stands for [x], x+ stands for x{x}, and x* stands for {x}. Character sets can be written as ranges: [0-9] stands for "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9", and [^0-9] means "any character other than a digit". Regular expressions offer many other shorthand notations, but since this article is not Mastering Regular Expressions, we keep only the essential operations needed to describe the principles of lexical analysis.
Regular expressions are highly expressive: the rule for decimals can be written [0-9]+\.[0-9]+, and a C-language comment can be written /\*([^*]|\*+[^*/])*\*+/.
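Both rules can be checked against an off-the-shelf engine. Here is a quick sanity check with Python's re module, used purely to verify the two expressions above:

```python
import re

# The decimal rule and the C comment rule from the text.
decimal = re.compile(r"[0-9]+\.[0-9]+")
comment = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")

assert decimal.fullmatch("3.14")                     # digits, dot, digits
assert not decimal.fullmatch("314")                  # the dot is required
assert comment.fullmatch("/* hello * world */")      # inner * is allowed
assert not comment.fullmatch("/* unterminated")      # must end with */
```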
3. Finite state automata
Regular expressions are easy for people to read, but hard for machines to read; moreover, matching directly against a regular expression is both laborious and slow. So we need another representation designed for machines. In a later section we will give an algorithm that converts a regular expression into a machine-readable form: the finite state automaton described in this section.
"Finite state automaton" sounds frightening, but the idea is not as complicated as the name suggests. The concept of a state machine is used in a wide variety of fields: UML in software engineering has state diagrams, digital logic has state transition diagrams, and these diagrams are all essentially state machines. I will use an example to explain what a "state" actually means.
Suppose we need to check whether the number of a's and the number of b's in a string are both even. Of course we could describe this with a regular expression, but for this problem a regular expression is far less convenient than a state machine. We design a set of states and designate one element of the set as the "start state"; this is the state of the analyzer before it begins work, and the analyzer resets to the start state each time it starts a new job. Every time the analyzer reads a character, it changes state, and we specify how. After reading all the characters, the analyzer necessarily rests in some state. If that state is one we expect, we say the analyzer accepts the string; otherwise we say it rejects the string.
How do we design the states and their transitions to implement the analyzer? If the string contains only a's and b's, the analyzer needs only four states: "odd a, odd b", "odd a, even b", "even a, odd b", and "even a, even b". We name these states ab, aB, Ab, and AB, where an uppercase letter means even and a lowercase letter means odd. Before work begins, the analyzer has read the empty string, so the start state should naturally be AB. After the analyzer reads all the characters, we expect the numbers of a's and b's both to be even, so the end state should also be AB. We therefore draw the following state diagram:
Figure 3.1
A state diagram for checking whether a string contains an even number of a's and an even number of b's
In this state diagram, a short arrow points at AB, indicating that AB is the start state. AB is also drawn with a thick border, indicating that AB is an acceptable end state; a state diagram may have one or more end states. In this example the start state and the end state happen to be the same state. The arrow labeled a from AB to aB means: if the analyzer is in state AB and reads the character a, it moves to state aB.
Let us apply this state diagram to two strings, "abaabbba" and "aababbaba". The first string is acceptable; the second is not (it contains 5 a's and 4 b's).
When parsing the first string, the state machine passes through the states:
AB [a]aB [b]ab [a]Ab [a]ab [b]aB [b]ab [b]aB [a]AB
When parsing the second string, the state machine passes through the states:
AB [a]aB [a]AB [b]Ab [a]ab [b]aB [b]ab [a]Ab [b]AB [a]aB
The first string "abaabbba" leaves the state machine in state AB, so it is acceptable. The second string "aababbaba" leaves the state machine in state aB, so it is unacceptable.
Inside a machine we can represent this state diagram in a simpler way: simply record the states, the set of arrows between states, the start state, and the end states. For the diagram above, this representation looks as follows:
Start state: AB
End state set: {AB}
(AB,a,aB)
(AB,b,Ab)
(aB,a,AB)
(aB,b,ab)
(Ab,a,ab)
(Ab,b,AB)
(ab,a,Ab)
(ab,b,aB)
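This tabular representation translates directly into code. A minimal sketch of a table-driven runner for the even-a/even-b DFA above:

```python
# The even-a/even-b DFA from Figure 3.1, stored as plain data.
# State names: first letter is the parity of a, second the parity of b;
# uppercase means even, lowercase means odd.
start = "AB"
ends = {"AB"}
trans = {
    ("AB", "a"): "aB", ("AB", "b"): "Ab",
    ("aB", "a"): "AB", ("aB", "b"): "ab",
    ("Ab", "a"): "ab", ("Ab", "b"): "AB",
    ("ab", "a"): "Ab", ("ab", "b"): "aB",
}

def accepts(s):
    state = start
    for c in s:
        state = trans[(state, c)]   # deterministic: exactly one target state
    return state in ends

assert accepts("abaabbba")          # 4 a's and 4 b's: accepted
assert not accepts("aababbaba")     # 5 a's and 4 b's: rejected
```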
When a state diagram is used to represent a state machine, there is sometimes a question of determinism versus nondeterminism. "Deterministic" means that from any state, reading a character leads to exactly one determined state. The difference has an intuitive description: any state in a state diagram may have any number of edges leading to other states, and if two of those edges carry the same character, then reading that character can take this state into two different states, making the state machine nondeterministic. As shown in the figures:
Figure 3.2
A deterministic state machine for the regular expression ba*b
Figure 3.3
A nondeterministic state machine for the regular expression ba*b
The start state of the state machine in Figure 3.3 can jump to either of two middle states upon reading the character b, so that machine is nondeterministic. By contrast, the state machine in Figure 3.2, although functionally identical to the one in Figure 3.3, is deterministic. We can also use a special kind of edge for state transitions: an ε edge indicates that a state can jump to another state without reading any character. The following figure shows a nondeterministic state machine with ε edges that is equivalent to Figure 3.3:
Figure 3.4
A nondeterministic state machine with ε edges for the regular expression ba*b
In textbooks, a deterministic finite automaton (the kind of state machine discussed so far) is called a DFA, a nondeterministic finite automaton is called an NFA, and a nondeterministic automaton with ε edges is called an ε-NFA. These terms will be used below for the various kinds of finite state automata.
A common question is why ε edges exist at all. In fact, when people draw state machines by hand, they can sometimes draw a deterministic machine directly; for more complex cases they can draw a nondeterministic one; and in some extreme cases ε edges let us express our intent more concisely. But the biggest reason for ε edges is that they allow a concise algorithm for converting a regular expression into an ε-NFA.
4. From regular expressions to ε-NFA
As described in section 2, the basic element of a regular expression is the character set, and through concatenation, parallel combination, repetition, and the optional operation we can build more complex regular expressions. If state machines built from regular expressions can be combined with the same operations, the construction becomes simple. Below we discuss each of these 5 constructions in turn. Every ε-NFA built by the algorithms described here has exactly one end state.
1: Character Set
A character set is the most basic element of a regular expression, so on the state diagram side it is likewise the basic building block. For a character set C, the rule that accepts exactly C is turned into a state diagram of the following form:
Figure 4.1
The start state of this diagram is start and the end state is end. The start state jumps to the end state upon reading a character in C, and accepts nothing else.
2: Concatenation
If we write A⊙B for the concatenation of rules A and B, it is easy to see that concatenation is associative, that is, (A⊙B)⊙C = A⊙(B⊙C). So to concatenate n rules, we need only concatenate the first n-1, treat the result as a whole, and concatenate it with the last rule. Thus if we know how to concatenate two rules, we know how to concatenate n rules.
To convert the concatenation of two rules into a state diagram, we first convert each rule into its own state diagram, then let the end state of the first diagram jump to the start state of the second. This jump must read no characters; that is, the two states are equivalent. Then, whenever the first diagram reaches its end state, that state serves as the start state of the second diagram and the second rule is checked next. We therefore connect the two diagrams with an ε edge:
Figure 4.2
3: Parallel combination
Parallel combination works much like concatenation. So that the start state, upon reading a character, can jump into whichever branches that character might begin, we first construct the state diagram of every branch, then connect the new start state to the start states of all branches. Likewise, once a branch has accepted a string, the machine must end in the end state of the whole diagram, so we connect the end state of every branch to the end state of the combined rule, as follows:
Figure 4.3
4: Repetition
For repetition we create two states: the first is the start state and the second is the end state. Once the machine is in the end state, whenever it encounters another string acceptable to the repeated rule, it returns to the end state again. In this way a state diagram expresses repetition. The construction looks as follows:
Figure 4.4
5: Optional
Building a state diagram for the optional operation is simpler. When a character is read, if a prefix of the input is accepted by the optional rule, the machine should walk through that rule's state diagram; if instead the input matches what follows the optional rule, the machine should skip it; and if both are possible, both paths must remain open. To achieve this, we connect the start state of the rule's state diagram to its end state with an ε edge:
Figure 4.5
If you need "zero or more repetitions", that is, repetition combined with optional, you can simply remove the start state of Figure 4.4 and let the end state play both the start and end roles, leaving everything else unchanged.
So far we have mapped the 5 ways of constructing rules onto 5 ways of constructing state diagrams. For any regular expression, we need only decompose it into a nesting of these 5 constructions, then map each step onto the corresponding state diagram construction, and we obtain an ε-NFA. As an example, let us express "a string containing only an even number of a's and an even number of b's" as a regular expression and convert it into an ε-NFA.
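The five constructions can be sketched as combinators that build and glue small automata. This is a rough Thompson-style sketch, not the article's own implementation; all function names are mine. An automaton is a triple (start, end, edges), with the label None standing for an ε edge:

```python
import itertools

# A Thompson-style sketch of the five constructions. An automaton is a
# triple (start, end, edges); edges maps (state, label) -> set of target
# states, and the label None stands for an epsilon edge.
_ids = itertools.count()

def _new():
    return next(_ids)

def _merge(*edge_dicts):
    out = {}
    for d in edge_dicts:
        for key, targets in d.items():
            out.setdefault(key, set()).update(targets)
    return out

def char(c):                # construction 1: a single character
    s, e = _new(), _new()
    return (s, e, {(s, c): {e}})

def concat(a, b):           # construction 2: A then B, glued by an epsilon edge
    return (a[0], b[1], _merge(a[2], b[2], {(a[1], None): {b[0]}}))

def union(a, b):            # construction 3: A or B (parallel combination)
    s, e = _new(), _new()
    glue = {(s, None): {a[0], b[0]}, (a[1], None): {e}, (b[1], None): {e}}
    return (s, e, _merge(a[2], b[2], glue))

def star(a):                # construction 4: zero or more repetitions of A
    s, e = _new(), _new()
    glue = {(s, None): {e}, (e, None): {a[0]}, (a[1], None): {e}}
    return (s, e, _merge(a[2], glue))

def optional(a):            # construction 5: A or the empty string
    return (a[0], a[1], _merge(a[2], {(a[0], None): {a[1]}}))

def run(nfa, text):         # simulate the epsilon-NFA on a string
    start, end, edges = nfa
    def closure(states):    # all states reachable via epsilon edges only
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for t in edges.get((q, None), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    cur = closure({start})
    for c in text:
        cur = closure({t for q in cur for t in edges.get((q, c), ())})
    return end in cur
```

For example, `concat(char("b"), concat(star(char("a")), char("b")))` yields an ε-NFA for ba*b with a single end state, as required above.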
First we analyze the problem. If a string contains only an even number of a's and an even number of b's, its length must be even, so we can split it into two-character segments, of which there are only four kinds: aa, bb, ab, and ba. Segments aa and bb never affect the parity of the counts of a and b, no matter how many times they appear (reason: in addition modulo 2, 0 is the identity, that is, x+0 = 0+x = x). As for ab and ba, if a string begins and ends with ab or ba and its middle is any combination of aa and bb, the string again has an even number of a's and an even number of b's. This gives two ways of building strings with even counts of a and b, and applying concatenation, parallel combination, optionality, and repetition to such strings still yields strings with even counts of a and b. We can therefore write the regular expression in the following form:
((aa|bb)|((ab|ba)(aa|bb)*(ab|ba)))*
Based on the method described above, we can convert this regular expression into the following state machine:
Figure 4.6
So far we have a way to convert a regular expression into an ε-NFA. But an ε-NFA alone is not enough: its nondeterminism is too great, and running it directly produces large temporary state sets at every step, which hurts efficiency badly. We therefore also need a way to eliminate a state machine's nondeterminism.
5. Eliminating nondeterminism
The ε-edge elimination algorithm
We have now seen three kinds of finite state automata: ε-NFA, NFA, and DFA. We need to convert an ε-NFA into a DFA, and since a DFA cannot contain ε edges, the first task is to eliminate them. The ε-edge elimination algorithm rests on a simple idea: if state A reaches state B through an ε edge, then A can go to B without reading any character; if B must read the character x to reach state C, then A can also reach C by reading x, because the path from A to C is A → B → C and the step from A to B reads nothing.
This suggests a natural approach: to remove the ε edges leaving state A, find all states reachable from A through ε edges alone, and copy the non-ε edges leaving those states onto A. What remains is to delete all ε edges, together with any states that the deletion makes unreachable. To describe the algorithm more concretely, we construct an ε-NFA from the regular expression (ab|cd)* and apply the ε-edge elimination algorithm to it.
The state diagram for the regular expression (ab|cd)* is as follows:
Figure 5.1
1: Find all valid states.
A valid state is one that still exists after the ε-edge elimination algorithm finishes. We can compute all the valid states before starting the algorithm proper: a valid state is characterized by having a non-ε incoming edge, and the start state is also valid. An end state is not necessarily valid, but if a valid state can reach an end state through ε edges alone, that valid state must be marked as an end state. Consequently, the NFA produced by applying ε-edge elimination to an ε-NFA may have several end states, though it still has only one start state.
Applying the test "has a non-ε incoming edge, or is the start state" to every state in Figure 5.1, we obtain all the valid states, shown in the following figure:
Figure 5.2
The invalid states are marked as deleted
If a state has both ε and non-ε incoming edges, it is still valid, because every valid state will receive new outgoing and incoming edges in the next step.
2: Add all the necessary edges
Next we apply an algorithm to every valid state, in two steps: first compute the state's ε closure, then treat that closure as a whole and copy all edges leaving the closure onto the current state. From the marking step we know the valid state set of the diagram in Figure 5.1 is {s/e 3 5 7 9}. We apply the algorithm to these states in turn, beginning by computing the ε closure of state s/e. The ε closure of a state is the set of all states reachable from it through ε edges alone. The following figure marks the ε closure of state s/e:
Figure 5.3
We exclude state s/e itself from its ε closure here: copying its own edges would add nothing, since the non-ε edges leaving a state already belong to the non-ε edges leaving its ε closure. The next step is to find the non-ε edges leaving the ε closure of s/e. In Figure 5.3 it is easy to see that the edges labeled a and c leaving state 1 and state 6 (see the state labels of Figure 5.1) are the ones we are looking for. We copy these edges onto state s/e, keeping their target states unchanged, and obtain the following figure:
Figure 5.4
The application of the algorithm to s/e is now finished; applying it to each of the remaining valid states {3 5 7 9} in turn yields the following figure:
Figure 5.5
The red edges are the newly added edges
3: Delete all ε edges and invalid states
This is the final step of the ε-edge elimination algorithm: simply delete every ε edge and every invalid state. Applying this step to the state machine in Figure 5.5 gives the following state diagram:
Figure 5.6
Note that the surviving edges are not only the newly added ones: by definition, every non-ε edge leaving a valid state is an edge that must not be deleted.
By applying ε-edge elimination to the state machine of Figure 5.1 we obtained the DFA of Figure 5.6. But ε-edge elimination does not always produce a DFA directly from an ε-NFA; whether it does depends on the regular expression itself, and we will not explore here which regular expressions have this property. Since the result may be merely an NFA, we also need an algorithm that converts an NFA into a DFA.
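The three steps of the algorithm can be sketched as follows. This is a simplified sketch under my own encoding of automata, not the article's code: an automaton is (start, ends, edges), where edges maps (state, label) to a set of target states and the label None stands for an ε edge.

```python
# A sketch of the epsilon-edge elimination algorithm of this section.
def eliminate_epsilon(start, ends, edges):
    states = ({start} | {q for (q, _) in edges}
              | {t for ts in edges.values() for t in ts})

    def closure(q):          # all states reachable from q via epsilon edges only
        stack, seen = [q], {q}
        while stack:
            p = stack.pop()
            for t in edges.get((p, None), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    # Step 1: valid states are the start state plus every state that has
    # a non-epsilon incoming edge.
    valid = {start} | {t for (q, a), ts in edges.items()
                       if a is not None for t in ts}
    # Step 2: copy every non-epsilon edge leaving a state's closure onto the
    # state itself, and mark the state as an end state if its closure
    # touches an end state.
    new_edges, new_ends = {}, set()
    for q in valid & states:
        cl = closure(q)
        if cl & ends:
            new_ends.add(q)
        for (p, a), ts in edges.items():
            if p in cl and a is not None:
                new_edges.setdefault((q, a), set()).update(ts)
    # Step 3: epsilon edges and invalid states are simply left out.
    return start, new_ends, new_edges
```

For instance, the ε-NFA 0 -a→ 1 -ε→ 2 -b→ 3 (end state 3) collapses to the NFA with edges 0 -a→ 1 and 1 -b→ 3; state 2 is invalid and disappears.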
From NFA to DFA
An NFA is a nondeterministic state machine and a DFA is a deterministic one. The key difference is that upon reading a character from a state, a deterministic machine arrives at a single state, while a nondeterministic machine arrives at a set of states. If we treat the NFA's start state S as the set {S}, then for any state set S', given an input character, the corresponding target state set T' can be computed from the NFA. So to construct the DFA we map the start state to {S}, find all the state sets the NFA can ever be in, and turn each such set into one state of the DFA. Because the NFA has finitely many states, the power set of its state set is also finite, so this construction is guaranteed to terminate.
To visualize the algorithm, we construct a regular expression, give the NFA it converts to, and apply the DFA construction algorithm to that NFA. Suppose a string contains only the characters a, b, and c, and we must judge whether it starts with abc and ends with cba. The regular expression is as follows:
abc(a|b|c)*cba
Using the algorithms above, we obtain the NFA shown in the following figure:
Figure 5.7
Now we construct the DFA. The algorithm is as follows:
1: Put {S} into queue L and set D, where S is the start state of the NFA. Queue L holds the DFA states that have been created but not yet processed; set D holds all DFA states created so far. As discussed above, each DFA state corresponds to a set of NFA states.
2: Take a state out of queue L. Compute the set of characters accepted by the edges leaving this state; then, for each character in that set, find the edges accepting it and compute the set T of those edges' target states. If T ∈ D, the current character leads to an already known DFA state; otherwise it leads to a DFA state not yet created, and T is put into both L and D. This step contains two nested loops: the outer loop walks over the set of accepted characters, and the inner loop collects, for each accepted character, the NFA states making up the target DFA state by traversing all outgoing edges.
3: If L is not empty, go back to step 2.
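Steps 1-3 translate almost line for line into code. A sketch of the subset construction, assuming the NFA is already free of ε edges and encoded as (start, ends, edges), where edges maps (state, character) to a set of target states:

```python
from collections import deque

# A sketch of the subset construction in steps 1-3 above.
def nfa_to_dfa(start, ends, edges):
    init = frozenset({start})
    queue, dfa_states = deque([init]), {init}    # queue L and set D
    dfa_edges, dfa_ends = {}, set()
    while queue:                                 # step 3: loop while L is not empty
        S = queue.popleft()                      # step 2: take a state set from L
        if S & ends:                             # a DFA state containing an NFA
            dfa_ends.add(S)                      # end state is itself an end state
        chars = {a for (q, a) in edges if q in S}
        for a in chars:                          # outer loop: accepted characters
            T = frozenset().union(               # inner loop: collect targets
                *(edges.get((q, a), set()) for q in S))
            dfa_edges[(S, a)] = T
            if T not in dfa_states:              # a DFA state not yet created
                dfa_states.add(T)
                queue.append(T)
    return init, dfa_ends, dfa_edges
```

On a small nondeterministic machine for ba*b (state 0 reads b into either 1 or 2, 1 loops on a and reads b into 3, 2 reads b into 3, end state 3), the construction produces the DFA states {0}, {1,2}, {1}, and {3}.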
Now we begin to apply the DFA construction algorithm to the state machine in Figure 5.7.
First, after executing step 1, we have:
Figure 5.8
From top to bottom are the current contents of queue L, set D, and the DFA under construction. Executing the algorithm until state {3} enters set D, we get:
Figure 5.9
Now {3} is removed from queue L, and the character set accepted by the edges leaving state set {3} is computed as {a b c}. From {3}, reading a reaches {4}, reading b reaches {5}, and reading c reaches {6 7}. Because {4}, {5}, and {6 7} do not belong to D, they are put into queue L and set D:
Figure 5.10
Taking {4} out of the queue and processing it, we get:
Figure 5.11
Clearly, processing state {4} finds no new DFA state. Processing {5} and {6 7} next, we get:
Figure 5.12
No new DFA state is found while processing {5}, and processing {6 7} on the inputs a and c also finds none, but on input b the state {6 7} yields the new DFA state {5 8}. After the algorithm finishes, we obtain a DFA:
Figure 5.13
The application of the DFA construction algorithm to the state machine of Figure 5.7 is now complete, and the DFA of Figure 5.13 is functionally equivalent to the NFA of Figure 5.7. In this DFA the start state is 0 and the end state is 4E. Any DFA state that contains an NFA end state must itself be an end state.
6. Using regular expressions to construct a lexical analyzer
The algorithms for deciding whether a string conforms to a rule end here. Returning to the problem we started with: given a set of rules, we must break a long string into tokens and compute the rule corresponding to each token. Before solving this, let us consider which strings the lexical analyzer should accept at all.
Suppose we have rules A, B, C, and D for four token types. The strings accepted by the lexical analyzer are then arbitrary combinations of A, B, C, and D, that is, (A|B|C|D)*; if the input must be non-empty, they are (A|B|C|D)+. But applying (A|B|C|D)+ as a single rule to the input only tells us whether the string as a whole is acceptable to the lexical analyzer, so the method needs modification.
First, following the method above, convert the regular expression of each token type's rule into a DFA, then combine them using the parallel construction, but without applying "repetition". This time, however, we make a small modification: for each state of the new DFA, we record which states of the original rules' DFAs it contains.
Here is an example. Suppose we need a lexical analyzer for a simple language whose rules are as follows:
i:[a-zA-Z_][a-zA-Z_0-9]*
n:[0-9]+
r:[0-9]+\.[0-9]+
o:[=>+\-*/|&]
Construct the four DFAs from the rules and combine them:
Figure 6.1
We construct the DFA for i|n|r|o and mark which of its states contain end states of the original DFAs. One benefit of this arrangement: when we feed a string into such a DFA, we run until the whole string is accepted or the machine fails. If the string is accepted, we record the current end state; if it fails, we record the last end state the machine passed through while parsing. The token type associated with the original DFA's end state contained in that recorded state is exactly the information we need. If no end state was ever reached, the input string cannot be accepted by the lexical analyzer.
For example, use the state machine above to analyze "123.abc".
Starting from state 0, the machine passes through the n states and then fails upon reaching state 2; the last end state, n, and the string "123" accepted up to that point are recorded, giving the result (n, "123"). Parsing then restarts from ".abc"; it fails on the very first character, so "." is reported as an unacceptable string. Finally "abc" is fed in; the machine passes through states 0, 1, and the i states, the string ends and is accepted, and (i, "abc") is output.
Why do we not use "repetition" when constructing this state machine? Because extra work must be done each time a token is recognized; if we built the lexical analyzer's state machine with "repetition", we would not know the exact moment at which each token is recognized.
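The "record the last end state" strategy can be emulated with per-rule matching and longest-match selection. This sketch substitutes Python's re engine for the hand-built combined DFA, a simplification of the method above; the error-token handling is my own addition:

```python
import re

# Longest-match lexing over the i/n/r/o rules of this section.
# Each rule is tried at the current position; the longest match wins,
# which plays the role of "the farthest end state the DFA passed".
RULES = [
    ("r", re.compile(r"[0-9]+\.[0-9]+")),    # reals listed before integers,
    ("n", re.compile(r"[0-9]+")),            # so ties break toward r
    ("i", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("o", re.compile(r"[=>+\-*/|&]")),
]

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, rule in RULES:
            m = rule.match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:                     # no rule accepts any prefix:
            tokens.append(("error", text[pos]))
            pos += 1                         # report one unacceptable character
        else:
            tokens.append(best)
            pos += len(best[1])
    return tokens

assert lex("123.abc") == [("n", "123"), ("error", "."), ("i", "abc")]
```

This reproduces the "123.abc" walkthrough above: "123." cannot complete rule r, so the longest successful match is n with "123", then "." fails outright, then "abc" matches i.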
The algorithm is essentially complete, but a few small problems deserve attention. In general the rules making up a lexical analyzer rarely conflict, but occasionally the string sets represented by two rules intersect. With the DFA at hand, we can detect such rule conflicts easily.
Suppose our lexical analyzer has two rules A and B. When we construct the analyzer A|B, we obtain new states containing end states of DFA(A) and DFA(B). We need only check these states for the following properties to determine the relationship between A and B. Here DFA(A) is the state machine of rule A, DFA(B) that of rule B, and DFA(L) that of the lexical analyzer A|B:
1: If DFA(L) has one or more states containing end states of both DFA(A) and DFA(B), then the string sets of A and B intersect.
2: If DFA(L) has no state containing end states of both DFA(A) and DFA(B), then the string sets of A and B do not intersect.
3: If some states of DFA(L) contain end states of DFA(A), and every one of those states also contains an end state of DFA(B), then A is a subset of B.
4: If some states of DFA(L) contain end states of DFA(A), but not all of those states contain an end state of DFA(B), then A is not a subset of B.
In the lexical analyzer of Figure 6.1 we can see clearly that no two of the four rules i, n, r, and o intersect. Let us deliberately construct conflicting rules and see what the lexical analyzer's DFA looks like:
Suppose the lexical analyzer contains the following rules:
a: "if"
b:[a-z]+
Constructing the DFA for a|b, we obtain the following state machine:
Figure 6.2
As Figure 6.2 shows, this state diagram satisfies condition 3 above: every state containing an end state of a also contains an end state of b, so a is a subset of b. Obviously "if" is a subset of [a-z]+. When rules conflict like this, we can either report an error or choose between them according to a specified priority.
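A sketch of the priority-based resolution just mentioned: the keyword rule is listed first, and ties in match length are broken in favor of the earlier rule. The rule names a and b follow the example above; the helper next_token is mine:

```python
import re

# Resolving the "if" vs. [a-z]+ conflict by priority: when both rules
# match a prefix of the same length, the rule listed first wins.
RULES = [
    ("a", re.compile(r"if")),      # keyword rule, higher priority
    ("b", re.compile(r"[a-z]+")),  # identifier rule
]

def next_token(text, pos=0):
    best = None
    for name, rule in RULES:
        m = rule.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())   # strict > keeps the earlier rule on ties
    return best

assert next_token("if") == ("a", "if")    # exact keyword: rule a wins the tie
assert next_token("ifx") == ("b", "ifx")  # longer identifier beats the keyword
```

Note that longest match still takes precedence over priority: "ifx" is a single identifier, not the keyword "if" followed by "x".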