Now that the core of the DFA has been constructed, the final step is to build a complete lexical analyzer on top of it.
Since a lexical definition file like Flex's is not yet supported, the rules still have to be defined in program code. This makes customizing the lexical analyzer less flexible, but the basic functionality is sufficient.
1. Definition of lexical rules
All the rules used by the lexical analyzer are defined in the Grammar<T> class, where the generic parameter T is the type of the identifiers used by the lexical analyzer (it must be an enumerated type). The rule-defining methods are the DefineContext method, which defines a context, the DefineRegex method, which defines a regular expression, and the DefineSymbol method, which defines a terminal symbol.
A lexer context defined by calling the DefineContext method is represented by an instance of the LexerContext class, whose basic definition is as follows:
// The index of the current context.
int Index;
// The label of the current context.
string Label;
// The type of the current context.
LexerContextType ContextType;
In the lexical analyzer, contexts can only be switched by label, so the LexerContext class itself is declared internal.
A context is either inclusive or exclusive, equivalent to the %s and %x definitions in Flex (see Flex's Start Conditions). Briefly: during lexical analysis, if the current context is exclusive, only the rules defined for the current context are active, and all other rules (those not defined for the current context) are inactive. If the current context is inclusive, rules that do not specify any context are active as well.
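The inclusive/exclusive distinction above can be sketched as a small filter over the rule list. This is only an illustration under stated assumptions: the Rule class, the ContextKind enum and the ActiveRules method are hypothetical names invented for this sketch and are not part of Grammar<T> or LexerContext.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the context type (inclusive = %s, exclusive = %x).
enum ContextKind { Inclusive, Exclusive }

// Hypothetical rule model: Contexts is empty when the rule names no context.
sealed class Rule
{
    public string Name { get; }
    public string[] Contexts { get; }

    public Rule(string name, params string[] contexts)
    {
        Name = name;
        Contexts = contexts;
    }
}

static class ContextDemo
{
    // Returns the names of the rules that are active in the context 'label'.
    public static List<string> ActiveRules(IEnumerable<Rule> rules, string label, ContextKind kind)
    {
        return rules
            .Where(r =>
                // A rule defined for the current context is always active.
                Array.IndexOf(r.Contexts, label) >= 0
                // A rule without any context is active only in inclusive contexts.
                || (kind == ContextKind.Inclusive && r.Contexts.Length == 0))
            .Select(r => r.Name)
            .ToList();
    }
}
```

With one rule that names no context and one rule defined for a context labelled "str", an exclusive "str" context activates only the second rule, while an inclusive "str" context activates both.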
The default context label is "Initial".
The DefineRegex method of Grammar<T> corresponds to the definitions section in Flex: it defines commonly used regular expressions so that the rule definitions become simpler. For example, you can define
grammar.DefineRegex("digit", "[0-9]");
After that, "{digit}" can be used directly inside other regular expression definitions to refer to this predefined regular expression.
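How a "{name}" reference can be resolved is sketched below as plain textual substitution. This is a minimal sketch and not how Grammar<T> necessarily does it: the RegexDefinitions class and its Define and Expand methods are hypothetical, and the sketch uses .NET regular expression syntax (System.Text.RegularExpressions) rather than the regular expression classes of this series.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that stores named patterns and expands {name} references.
sealed class RegexDefinitions
{
    private readonly Dictionary<string, string> defs = new Dictionary<string, string>();

    // Registers a named pattern; references to earlier names are expanded first.
    public void Define(string name, string pattern)
    {
        defs[name] = Expand(pattern);
    }

    // Replaces each {name} with the definition of name, wrapped in a
    // non-capturing group so that a following quantifier applies to all of it.
    // Quantifiers such as {2,3} are left alone because names must start with
    // a letter or an underscore.
    public string Expand(string pattern)
    {
        return Regex.Replace(pattern, @"\{([A-Za-z_][A-Za-z0-9_]*)\}", m =>
        {
            string name = m.Groups[1].Value;
            if (!defs.TryGetValue(name, out string body))
                throw new KeyNotFoundException("Undefined regex name: " + name);
            return "(?:" + body + ")";
        });
    }
}
```

Wrapping the substituted text in a group matters: without it, "{digit}+" would still work for "[0-9]", but a definition containing an alternation would bind incorrectly to the quantifier that follows the reference.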
Finally, the DefineSymbol method, which corresponds to the rules section in Flex, defines the regular expression of a terminal symbol and the action associated with it.
The action of a terminal symbol is expressed as an Action<ReaderController<T>>; the ReaderController<T> class provides the Accept, Reject, More and other methods.
The Accept method accepts the current lexical unit and returns a Token object. The Reject method rejects the current match and falls back to the next-best rule; using it slows down all matching in the lexical analyzer, so it should be used with care. The More method tells the lexical analyzer that, on the next successful match, the newly matched text should be appended to the current text instead of replacing it.
The Accept and Reject methods are mutually exclusive: only one of them may be called for each successful match. If neither is called, the lexical analyzer treats the current match as successful but does not return a Token; it simply continues with the next lexical unit.
2. Implementation of the lexical analyzer
2.1 Basic lexical analyzer
Because several rules can conflict with each other (for example, one string may match more than one regular expression), a way of resolving such conflicts has to be fixed before the lexical analyzer itself is explained. The rules used here are the same as in Flex:
Always select the longest match.
If the longest match is matched by more than one regular expression, always select the regular expression that was defined first.
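The two rules above can be illustrated with a small sketch that tries every pattern at the current position and keeps the best one. It is an assumption-laden illustration, not the mechanism used in this series: ConflictDemo and Pick are hypothetical names, and the sketch relies on .NET's backtracking Regex class, which only finds the longest match for simple patterns like the ones in the usage note (a DFA finds it for any pattern).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ConflictDemo
{
    // Returns the index of the winning pattern and the length of its match,
    // or (-1, 0) if no pattern matches a non-empty string at 'start'.
    // Patterns are listed in definition order.
    public static (int Rule, int Length) Pick(IReadOnlyList<string> patterns, string input, int start)
    {
        int bestRule = -1, bestLength = 0;
        for (int i = 0; i < patterns.Count; i++)
        {
            // \G anchors the match exactly at 'start'.
            Match m = new Regex(@"\G(?:" + patterns[i] + ")").Match(input, start);
            // Rule 1: a longer match wins. Rule 2: the strict '>' keeps the
            // earlier-defined pattern when two matches have the same length.
            if (m.Success && m.Length > bestLength)
            {
                bestRule = i;
                bestLength = m.Length;
            }
        }
        return (bestRule, bestLength);
    }
}
```

With the patterns "if", "[a-z]+" and "[0-9]+" in that order, the input "if" is matched by the first two patterns with the same length, so the first one (index 0) wins; for the input "iffy" the second pattern matches four characters and wins because its match is longer.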
The basic lexical analyzer is very simple. It implements only the most basic functionality and supports neither lookahead nor the Reject action, but in most cases this is enough.
Such a lexical analyzer is little more than a DFA driver: it keeps feeding characters from the input stream into the DFA and records the most recent accepting state it has passed through. When the DFA reaches the dead state, the lexeme found is the text up to that most recent accepting state, which guarantees that it is the longest match. If that state corresponds to several symbols, only the first one is taken; because the symbol indexes are sorted in ascending order, the first one is the symbol that was defined first.
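The driver loop described above can be sketched with a small hand-written DFA. Everything in this sketch is hypothetical and chosen only for illustration: the DfaDriver class, its five states and the three symbols (0 = the keyword "if", 1 = identifier, 2 = number) do not come from the DFA classes built earlier in this series.

```csharp
using System;

static class DfaDriver
{
    const int Dead = -1;

    // Symbols accepted in each state, sorted in ascending (definition) order.
    static readonly int[][] AcceptSymbols =
    {
        new int[0],      // state 0: start state, not accepting
        new[] { 1 },     // state 1: "i"          -> identifier
        new[] { 0, 1 },  // state 2: "if"         -> keyword, identifier
        new[] { 1 },     // state 3: other letters -> identifier
        new[] { 2 },     // state 4: digits        -> number
    };

    // Hand-written transition function of the example DFA.
    static int Next(int state, char c)
    {
        bool letter = c >= 'a' && c <= 'z';
        bool digit = c >= '0' && c <= '9';
        switch (state)
        {
            case 0: return c == 'i' ? 1 : letter ? 3 : digit ? 4 : Dead;
            case 1: return c == 'f' ? 2 : letter ? 3 : Dead;
            case 2:
            case 3: return letter ? 3 : Dead;
            case 4: return digit ? 4 : Dead;
            default: return Dead;
        }
    }

    // Reads from 'start' until the dead state (or the end of input) and returns
    // the symbol and length of the last accepting state seen, or (-1, 0).
    public static (int Symbol, int Length) NextToken(string input, int start)
    {
        int state = 0, lastSymbol = -1, lastLength = 0;
        for (int i = start; i < input.Length; i++)
        {
            state = Next(state, input[i]);
            if (state == Dead) break;
            if (AcceptSymbols[state].Length > 0)
            {
                // Remember the most recent accepting state; take its first symbol.
                lastSymbol = AcceptSymbols[state][0];
                lastLength = i - start + 1;
            }
        }
        return (lastSymbol, lastLength);
    }
}
```

For the input "if", the driver stops in state 2, which accepts both the keyword and the identifier, and returns symbol 0 because it comes first. For "iffy 12", a call at position 0 returns symbol 1 with length 4, since the space leads to the dead state and the last accepting state seen was state 3.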