In the previous six articles, I covered the algorithms behind the lexical analyzer in detail. They focused on implementation specifics and may have seemed scattered. This article describes how to define a lexical analyzer as a whole, and how to implement your own. Section II explains how to define a lexical analyzer and can serve as a user guide; if you do not care about the implementation details, Section II alone is enough.

I. Changes to the class library

First, a summary of the changes I made to the class library. The lexical analysis interface has changed considerably since the C# lexical analyzer series, so the changes deserve some explanation.

1. Token identifiers. The token was initially defined as a Token struct with an int property as its identifier, which is common practice in many lexical analyzers. However, this proved inconvenient for the later syntax analysis. Since generating lexer code from a definition file is not yet supported, the lexical analyzer can only be defined in program code, and a bare int carries no semantics: as a token identifier it is both inconvenient and error-prone. I then tried strings as identifiers; that solves the semantics problem, but it is still error-prone and more complex to implement (a string dictionary must be maintained). The solution that is both simple and semantic is an enumeration: enumeration names provide the semantics, enumeration values convert to integers, and the compiler checks them, eliminating spelling errors entirely. The token is therefore now defined as a Token<T> class, and many related classes carry the generic parameter T as well.

2. Namespaces. The namespaces used to be Cyjb.Compiler and Cyjb.Compiler.Lexer; they are now Cyjb.Compilers and Cyjb.Compilers.Lexers — plural names simply suit namespaces better.

3. Context switching. Previously, the lexer context could be switched by index, by label, or by the LexerContext instance itself. Now switching is done only through labels, which is easier to implement and costs little in usability.

4. DFA representation. The DFA representation in the LexerRule class was a bit crude, and hard to understand for anyone unfamiliar with the implementation. The interfaces of the LexerRule<T> class have now been re-planned and are easier to follow.

II. Defining a lexical analyzer

This section is the guide to using the lexical analyzer in the Cyjb.Compilers class library. It contains complete documentation, examples, and the relevant caveats. The source code of the class library can be found in the Cyjb.Compilers project; for the class library documentation, see the wiki.

1. Define the token identifiers. As mentioned above, an enumeration type serves as the token identifier. The fields of this enumeration can be defined freely, without restrictions. However, to make the later syntax analysis easier, the enumeration values should be consecutive integers starting from 0; non-consecutive values may waste space during syntax analysis. The special value -1 indicates the end of the file (EndOfFile); it is available as the Token<T>.EndOfFile field, and the Token<T>.IsEndOfFile property tells whether a token represents the end of the file. Taking the calculator as the running example again, an enumeration is defined as the identifier type; it is clearly easier to use than a plain integer.
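The identifier enumeration for the calculator can be sketched as follows. The member names here are illustrative assumptions; the original article's enum may differ, but the values are consecutive integers starting from 0, as required.

```csharp
// Token identifiers for the calculator example (names are assumptions).
enum Calc
{
    Id,     // a number, e.g. "123"
    Add,    // +
    Sub,    // -
    Mul,    // *
    Div,    // /
    Pow,    // ^
    LBrace, // (
    RBrace  // )
}
```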
2. Define the lexical analyzer contexts. Every lexical analyzer definition begins with the Cyjb.Compilers.Grammar<T> class, so the first step is to instantiate a Grammar<T> object. A lexer context controls which rules are in effect, and contexts come in two kinds: inclusive and exclusive. In an inclusive context, all rules of the current context are active, and so are all rules that specify no context at all. In an exclusive context, only the rules that name that context are active; every other rule is inactive. Exclusive and inclusive contexts are defined through a pair of methods on Grammar<T>, whose label parameter is the context's tag. The default context is labeled "Initial", and that label can be used to switch back to the default context. Note that, for implementation reasons, all contexts must be defined before any terminal symbols. As an example, the original article defines an inclusive context Inc and an exclusive context Exc.

3. Define regular expressions. Named regular expressions are defined through a dedicated method on Grammar<T>. A regular expression can be constructed with the methods of the Cyjb.Compilers.RegularExpressions.Regex class, or given directly as a string; for the syntax rules, see "C# lexical analyzer (3): regular expressions". Note that the regular expressions defined here exist only to simplify the terminal definitions, letting you reuse common or complex regular expressions; they have no other function. They may not contain the lookahead symbol (/), the line-start anchor (^), the line-end anchor ($), or a context prefix (<context>). For example, after defining a regular expression named digit, you can later write "{digit}" wherever a number is needed, instead of spelling out "[0-9]+" every time.
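A sketch of these two steps, under the assumption that the defining members are named DefineInclusiveContext, DefineContext, and DefineRegex — the actual method names in Cyjb.Compilers may differ:

```csharp
// Create the grammar; Calc is the token-identifier enum.
// All member names below are assumptions about the library's API.
Grammar<Calc> grammar = new Grammar<Calc>();

// Contexts must be defined before any terminal symbols.
grammar.DefineInclusiveContext("Inc"); // inclusive: context-free rules stay active
grammar.DefineContext("Exc");          // exclusive: only rules tagged <Exc> are active

// A named regular expression; "{digit}" can now stand in for "[0-9]+".
grammar.DefineRegex("digit", "[0-9]+");
```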
4. Define the terminal symbols. Terminal symbols are defined with the overloads of the Grammar<T>.DefineSymbol method. These overloads fall into three groups. The first group accepts a T id as the identifier of the corresponding token, together with the regular expression and contexts; when the terminal is matched, a Token<T> carrying that id is returned automatically. The second group takes an extra action parameter, a delegate with a single ReaderController<T> argument, which is invoked whenever the terminal is matched; through the properties and methods of ReaderController<T>, the action can steer the lexing process. The last group omits the identifier id, so no Token<T> can be returned automatically; for these overloads, the action to execute when the terminal is matched must be specified.

When a terminal symbol is successfully matched, its action is executed — a delegate of type Action<ReaderController<T>>. The ReaderController<T> class exposes the information about the current match, including the context, the identifier, the source file, and the matched text. Its main methods are Accept, More, and Reject, together with the context-switching methods PushContext and PopContext. The Accept method accepts the current match, making the lexer return a Token<T> describing it. The More method tells the lexer to keep the matched text: if the text of this match is "foo" and the text of the next match is "bar", calling More during this match makes the text of the next match "foobar". The Reject method rejects the current match and continues matching with the next-best rule; for details, see section 2.4, "Supporting Reject actions", of "C# lexical analyzer (6): constructing the lexical analyzer".
The Accept and Reject methods cannot both be called within one match; they are mutually exclusive actions. If neither is called during a match, the lexical analyzer does nothing: it discards the match result and simply performs the next match. As for context control, a simple use is to switch between rule sets via contexts to implement small "sub-grammars"; for details, see the "escaped string" example in section 3.3 of "C# lexical analyzer (6): constructing the lexical analyzer".

The calculator's terminal definitions follow this pattern. In particular, Id is defined by referencing the regular expression digit, with its own action that converts the matched text to double and stores it in the Token<T>.Value property; and in the last statement, whitespace is discarded by defining an empty action for it.

5. Construct the lexical analyzer. With the definition complete after the four steps above, the next step is to construct the lexical analyzer itself. A set of methods directly builds the corresponding token reader (an instance of a TokenReader<T> subclass). Calling a GetReader overload asserts that no action contains a Reject, and returns a simpler but more efficient lexer implementation; calling a GetRejectableReader overload asserts that actions may Reject, and returns a more powerful but less efficient one. The rules are:

- If the rules contain neither lookahead nor Reject actions, an instance of SimpleReader<T> is returned.
- If they contain only fixed-length lookahead (no variable-length lookahead and no Reject actions), an instance of FixedTrailingReader<T> is returned.
- If they contain only Reject actions (no lookahead), an instance of RejectableReader<T> is returned.
- If they contain variable-length lookahead, or both Reject actions and lookahead (of either kind), an instance of RejectableTrailingReader<T> is returned.

For the implementation details, see "C# lexical analyzer (6): constructing the lexical analyzer". All token readers inherit from the TokenReader<T> class, which mainly offers two methods, PeekToken and ReadToken; they do what their names say, reading the next token from the input stream without (Peek) or with (Read) advancing the character position in the input. TokenReader<T> also implements the IEnumerable<T> interface, so tokens can be read with a foreach statement. However, TokenReader<T> does not store the previously read tokens: enumerating it actually calls ReadToken, so a TokenReader<T> can only be enumerated forward, and only once. After the enumeration completes, the TokenReader<T> has reached the end of the stream; to enumerate several times, cache the tokens in an array first. The original article then constructs the calculator's token reader and prints every token that is read; in the output, the stream always ends with the special value -1, indicating the end of the file.

III. Custom lexical analyzers

The Cyjb.Compilers project already provides a complete lexical analyzer implementation. In practice, however, all sorts of requirements inevitably arise, and the provided implementation may not meet them; in that case, you have to complete the lexical analyzer yourself. After the lexical analyzer has been defined, the Grammar<T>.LexerRule property yields a Cyjb.Compilers.Lexers.
LexerRule<T> object, which stores all the information a lexical analyzer needs and no longer depends on the original Grammar<T> object; it is the core of a custom lexical analyzer. The original article concludes with a class diagram of the classes related to the LexerRule<T> object; the four classes shown there represent the core information of the lexical analyzer, that is, the generated DFA data.
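Putting the steps of Section II together, a complete, hedged sketch of the calculator lexer follows. The member names (DefineRegex, the action API, the way the source text is passed to GetReader) and the namespaces are assumptions; the real Cyjb.Compilers API may differ.

```csharp
using System;
using Cyjb.Compilers;        // assumed namespace of Grammar<T>
using Cyjb.Compilers.Lexers; // assumed namespace of TokenReader<T> and Token<T>

enum Calc { Id, Add, Sub, Mul, Div, Pow, LBrace, RBrace }

static class CalcLexerDemo
{
    static void Main()
    {
        Grammar<Calc> grammar = new Grammar<Calc>();
        grammar.DefineRegex("digit", "[0-9]+");
        grammar.DefineSymbol(Calc.Id, "{digit}", c =>
        {
            c.Value = double.Parse(c.Text); // assumed member for Token<T>.Value
            c.Accept();
        });
        grammar.DefineSymbol(Calc.Add, @"\+");
        grammar.DefineSymbol(Calc.Sub, @"-");
        grammar.DefineSymbol(Calc.Mul, @"\*");
        grammar.DefineSymbol(Calc.Div, @"/");
        grammar.DefineSymbol(Calc.Pow, @"\^");
        grammar.DefineSymbol(Calc.LBrace, @"\(");
        grammar.DefineSymbol(Calc.RBrace, @"\)");
        grammar.DefineSymbol(@"\s+", c => { }); // discard whitespace

        // No Reject actions are used, so the simpler reader suffices.
        TokenReader<Calc> reader = grammar.GetReader("1 + 20 * 3");
        foreach (Token<Calc> token in reader)
        {
            Console.WriteLine(token);
        }
        // The stream ends with the special EndOfFile value (-1).
    }
}
```

Note that the reader is enumerated only once here; to process the tokens again, cache them in an array first, as described above.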