C# Lexical Analyzer (6): Constructing the Lexical Analyzer

Contents
    • 2.1 Basic lexical analyzer
    • 2.2 Lexical analyzer supporting fixed-length lookahead
    • 2.3 Lexical analyzer supporting variable-length lookahead
    • 2.4 Lexical analyzer supporting the Reject action
    • 3.1 Calculator
    • 3.2 Strings
    • 3.3 Escaped strings

Series navigation

    1. (1) Introduction to lexical analysis
    2. (2) Input buffering and code locating
    3. (3) Regular expressions
    4. (4) Constructing the NFA
    5. (5) Converting to a DFA
    6. (6) Constructing the lexical analyzer

The core DFA has now been constructed successfully. The last step is to build a complete lexical analyzer on top of that DFA.

Because it does not support lexical definition files the way flex does, the rules still have to be defined in program code, but the basic features are sufficient.

1. Definition of lexical rules

All the rules used by the lexical analyzer are stored in the Grammar class. Its main properties include:

    // The default lexical analyzer context.
    LexerContext InitialContext;
    // The list of lexical analyzer contexts.
    LexerContextCollection Contexts;
    // The list of defined regular expressions.
    IDictionary<string, Regex> Regexs;
    // The list of defined terminal symbols.
    SymbolCollection<TerminalSymbol> TerminalSymbols;

This class also contains some helper methods: DefineContext defines a context, DefineRegex defines a regular expression, and DefineSymbol defines a terminal symbol.

A context of the lexical analyzer is represented by the LexerContext class. Its basic definition is as follows:

 
    // The index of the current context.
    int Index;
    // The label of the current context.
    string Label;
    // The type of the current context.
    LexerContextType ContextType;

In the lexical analyzer, the context can be switched by index, by label, or by LexerContext object.

The context type is either inclusive or exclusive, equivalent to the %s and %x definitions in flex (see Start Conditions in the flex manual). During lexical analysis, if the current context is exclusive, only the rules defined in the current context are active, and all other rules (those not defined in the current context) are disabled. If the current context is inclusive, rules that do not specify any context are active as well.

The default context has index 0 and the label "initial".
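The activation rule described above can be sketched as a small predicate. This is only an illustration, not the library's code: the names ContextKind, ContextRules and IsActive are hypothetical, and the sketch assumes the default context is treated as inclusive, as in flex.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the context type (inclusive or exclusive).
public enum ContextKind { Inclusive, Exclusive }

public static class ContextRules
{
    // ruleContexts: labels of the contexts a rule was declared in
    // (an empty set means the rule did not specify any context).
    public static bool IsActive(ISet<string> ruleContexts, string current, ContextKind kind)
    {
        // A rule declared for the current context is always active.
        if (ruleContexts.Contains(current))
        {
            return true;
        }
        // Rules without any context are active only in inclusive contexts.
        return kind == ContextKind.Inclusive && ruleContexts.Count == 0;
    }
}
```

For example, a rule declared without a context is active in the inclusive default context but is disabled once an exclusive context such as "str" becomes current.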

The Regexs property of Grammar is equivalent to the definitions section in flex (see Definitions Section). It lets you define commonly used regular expressions to simplify the definition of rules. For example, you can define

 
    grammar.DefineRegex("digit", "[0-9]");

In the definition of a regular expression, you can then write "{digit}" to reference the previously defined regular expression.
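One way to picture such references is as textual macro expansion. The sketch below is an assumption about the general idea, not the library's implementation: RegexMacros and Expand are hypothetical names, and wrapping each definition in parentheses is a choice made here to keep operator precedence intact.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexMacros
{
    // Replaces every {name} that is a key of `definitions` with "(definition)".
    // Quantifiers such as {2} are left alone because a name must not start with a digit.
    // Note: a definition that references itself would recurse forever in this sketch.
    public static string Expand(string pattern, IDictionary<string, string> definitions)
    {
        return Regex.Replace(pattern, @"\{([A-Za-z_][A-Za-z0-9_]*)\}", m =>
        {
            string name = m.Groups[1].Value;
            return definitions.TryGetValue(name, out string body)
                ? "(" + Expand(body, definitions) + ")"
                : m.Value; // unknown name: keep the text unchanged
        });
    }
}
```

With "digit" defined as "[0-9]", expanding "{digit}+" yields "([0-9])+".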

Finally, TerminalSymbols corresponds to the rules section in flex (see Rules Section). It defines the regular expression of each terminal symbol and the corresponding action.

Here, actions are defined as Action<ReaderController>. The ReaderController class provides methods such as Accept, Reject, and More.

The Accept method accepts the current lexical unit and returns a Token object. The Reject method rejects the current match and looks for the next-best rule. This operation slows down all matching in the lexical analyzer and must be used with caution. The More method tells the lexical analyzer that, when the next match succeeds, the current text should not be replaced; instead, the new match is appended to it.

The Accept and Reject methods conflict with each other, so only one of them can be called for each successful match. If neither is called, the lexical analyzer treats the current match as successful but returns no token and simply continues with the next lexical unit.

2. Implementation of the lexical analyzer

2.1 Basic lexical analyzer

Multiple rules may conflict with each other; for example, one string may match several regular expressions. Therefore, before defining the lexical analyzer, a rule for resolving such conflicts must be fixed. The same rules as in flex are used here:

    1. Always select the longest match.
    2. If the longest match matches multiple regular expressions, always select the first defined regular expression.

The basic lexical analyzer is very simple. It implements only the most basic functionality and supports neither lookahead nor the Reject action. In most cases, however, this is enough.

Such a lexical analyzer is almost equivalent to a DFA executor: it keeps reading characters from the input stream, feeds them to the DFA engine, and records the last accepting state. When the DFA engine reaches the dead state, the lexeme found is the one corresponding to the last accepting state seen (this guarantees that the lexeme is the longest match). If that state accepts several symbols, only the first one is taken (the symbol indexes are sorted in ascending order, so the first symbol is the one defined first).

A simple algorithm is as follows:

 
    Input: DFA $d$
    $s = s_0$
    c = nextChar();
    while (c != EOF) {
        $s = d[s, c]$
        if ($s \in FinalStates$) {
            $s_{last} = s$
        }
        c = nextChar();
    }
    $s_{last}$ is the matched lexeme.

The code implementing this algorithm is in the SimpleReader class. Its core is as follows:

    // The last matched symbol and text index.
    int lastAccept = -1, lastIndex = Source.Index;
    while (true) {
        int ch = Source.Read();
        if (ch == -1) {
            // End of file.
            break;
        }
        state = base.LexerRule.Transitions[state, base.LexerRule.CharClass[ch]];
        if (state == LexerRule.DeadState) {
            // There is no suitable transition, so exit.
            break;
        }
        int[] symbolIndex = base.LexerRule.SymbolIndex[state];
        if (symbolIndex.Length > 0) {
            lastAccept = symbolIndex[0];
            lastIndex = Source.Index;
        }
    }
    if (lastAccept >= 0) {
        // Rewind the stream to the position of the last accepting state.
        Source.Unget(Source.Index - lastIndex);
        DoAction(base.LexerRule.Actions[lastAccept], lastAccept, Source.Accept());
    }
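To see the longest-match loop in isolation, here is a minimal self-contained sketch. It does not use the library's classes: the DFA is built by hand for two assumed rules (#0 the keyword "if", #1 the pattern [a-z]+), and the names LongestMatchDemo, Next and Match are hypothetical.

```csharp
using System;

public static class LongestMatchDemo
{
    const int Dead = -1;

    // Hand-built DFA for two rules: #0 "if", #1 [a-z]+.
    // State 0 is the start state; state 1 = "i"; state 2 = "if"; state 3 = any other identifier.
    static int Next(int state, char ch)
    {
        if (ch < 'a' || ch > 'z') return Dead;
        switch (state)
        {
            case 0: return ch == 'i' ? 1 : 3;
            case 1: return ch == 'f' ? 2 : 3;
            default: return 3; // states 2 and 3
        }
    }

    // Symbols accepted in each state, sorted in ascending order,
    // so the first entry is the rule that was defined first.
    static readonly int[][] SymbolIndex =
    {
        new int[0], new[] { 1 }, new[] { 0, 1 }, new[] { 1 },
    };

    // Returns the accepted symbol and the lexeme length, or (-1, 0) if nothing matches.
    public static (int Symbol, int Length) Match(string input, int start)
    {
        int state = 0, lastAccept = -1, lastIndex = start;
        for (int i = start; i < input.Length; i++)
        {
            state = Next(state, input[i]);
            if (state == Dead) break;
            if (SymbolIndex[state].Length > 0)
            {
                // Remember the most recent accepting state and where it ended.
                lastAccept = SymbolIndex[state][0];
                lastIndex = i + 1;
            }
        }
        return (lastAccept, lastIndex - start);
    }
}
```

Matching "if x" from position 0 selects rule #0 with length 2 (both rules match "if", and the first defined one wins), while "iffy" selects rule #1 with length 4 because the longer match takes priority.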

2.2 Lexical analyzer supporting fixed-length lookahead

Next, the basic lexical analyzer above is extended to support fixed-length lookahead (trailing context).

A lookahead rule has the form $r = s/t$. If the length of the strings matched by $s$ or by $t$ is fixed, the rule is called fixed-length lookahead; if neither length is fixed, it is called variable-length lookahead.

For example, regular expressions such as abcd or [a-z]{2} match strings of fixed length, 4 and 2 respectively. The regular expression [0-9]+ matches strings of no fixed length: any length greater than or equal to one is possible.

Fixed-length and variable-length lookahead are distinguished because the fixed-length case is easier to handle. For example, for the regular expression a*/bcd, once the pattern has been recognized, backing up three characters gives the end position of a*.

For the rule abc/d*, once the pattern has been recognized, keeping only the first three characters gives the end position of abc.

The length of the strings that each lookahead rule can match is computed in advance and stored in the Trailing array (of type int?[]): null means the rule has no lookahead, a positive number means the front part ($s$) has that fixed length, a negative number means the back part ($t$) has a fixed length equal to its absolute value, and 0 means neither length is fixed.

Therefore, after a normal match you only need to check the Trailing value. If it is null, the rule has no lookahead and nothing needs to be done. If it is a positive number n, the first n characters of the matched string are taken as the actual match. If it is a negative number -n, the last n characters are dropped and the rest is taken as the actual match. The implementation can be found in the FixedTrailingReader class.
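The check described above can be sketched as follows. This is an illustration under stated assumptions, not the FixedTrailingReader code: FixedTrailing and Lexeme are hypothetical names, and the sketch works on an already matched string instead of the input stream.

```csharp
using System;

public static class FixedTrailing
{
    // trailing: null  = the rule has no lookahead;
    //           n > 0 = the front part has fixed length n;
    //           n < 0 = the lookahead part has fixed length -n;
    //           0     = neither length is fixed (not handled in this sketch).
    public static string Lexeme(string matched, int? trailing)
    {
        if (trailing == null) return matched;
        int n = trailing.Value;
        if (n > 0) return matched.Substring(0, n);                   // keep the first n characters
        if (n < 0) return matched.Substring(0, matched.Length + n);  // drop the last -n characters
        throw new NotSupportedException("Variable-length lookahead needs the state stack.");
    }
}
```

For the rule a*/bcd (trailing value -3), the matched text "aaabcd" yields the lexeme "aaa"; for abc/d* (trailing value 3), "abcddd" yields "abc".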

2.3 Lexical analyzer supporting variable-length lookahead

Variable-length lookahead is more complicated to handle. Because the position of the lookahead head cannot be determined (it has no definite length), a stack must be used to save all the accepting states encountered, and the stack is then searched until a state containing int.MaxValue - symbolIndex is found (this is how lookahead head states are marked; for details, see section 2.4, DFA state conversion, in C# Lexical Analyzer (5)).

Note that variable-length lookahead has a restriction. For a regular expression such as ab*/bcd*, the end position of ab* cannot be found exactly; the position of the last b is found instead, so the final lexeme is not the desired one. The reason is that a DFA is used for string matching. The problem occurs whenever the end of the first part can match the start of the second part, so such regular expressions should be avoided.

The implementation can be found in the VariableTrailingReader class. The code that searches the state stack for the desired head state is as follows:

    // stateStack is the state stack.
    int target = int.MaxValue - acceptState;
    while (true) {
        astate = stateStack.Pop();
        if (ContainsTrailingHead(astate.SymbolIndex, target)) {
            // The target state is found.
            break;
        }
    }

    // ContainsTrailingHead relies on the symbol indexes being sorted to avoid unnecessary comparisons.
    bool ContainsTrailingHead(int[] symbolIndex, int target) {
        // Search the current state from back to front.
        for (int i = symbolIndex.Length - 1; i >= 0; i--) {
            int idx = symbolIndex[i];
            // Earlier entries cannot be lookahead heads, so exit directly.
            if (idx < base.LexerRule.SymbolCount) {
                break;
            }
            if (idx == target) {
                return true;
            }
        }
        return false;
    }

When searching the stack for a lookahead head, there is no need to worry about not finding one: during DFA execution, the lookahead head state always appears before the accepting state of the lookahead rule.

2.4 Lexical analyzer supporting the Reject action

The Reject action instructs the lexical analyzer to skip the currently matched rule and look for the next-best rule that matches the same input (or a prefix of it).

For example:

    g.DefineSymbol("a", c => { Console.WriteLine(c.Text); c.Reject(); });
    g.DefineSymbol("ab", c => { Console.WriteLine(c.Text); c.Reject(); });
    g.DefineSymbol("abc", c => { Console.WriteLine(c.Text); c.Reject(); });
    g.DefineSymbol("abcd", c => { Console.WriteLine(c.Text); c.Reject(); });
    g.DefineSymbol("bcd", c => { Console.WriteLine(c.Text); });
    g.DefineSymbol(".", c => { });

When the string "abcd" is matched, the output is:

    abcd
    abc
    ab
    a
    bcd

The matching process is as follows:

First, the 4th rule "abcd" is matched; the string "abcd" is output and the match is rejected.

The lexical analyzer therefore tries the next-best rule, the 3rd rule "abc"; it outputs the string "abc" and rejects.

Next, it tries the next-best rule, the 2nd rule "ab"; it outputs the string "ab" and rejects.

It continues with the next-best rule, the 1st rule "a"; it outputs the string "a" and rejects.

Then it tries the next-best rule, the 6th rule ".", which matches the string "a" successfully without producing any output.

Finally, the remaining string "bcd" exactly matches the 5th rule, so "bcd" is output directly.

To achieve this, a stack is again used to save all the accepting states encountered. If the current match is rejected, the next-best match is found by walking down the stack. The implementation can be found in the RejectableReader class.
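The order in which candidates are tried can be reproduced with a naive simulation. This sketch deliberately swaps the DFA state stack for a sorted candidate list and handles only literal rules plus ".", so it illustrates the behavior rather than the RejectableReader implementation; RejectDemo and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RejectDemo
{
    // Literal rules in definition order; "." stands for "any single character".
    static readonly string[] Rules = { "a", "ab", "abc", "abcd", "bcd", "." };
    // Whether the action of each rule calls Reject (the first four do in the example).
    static readonly bool[] Rejects = { true, true, true, true, false, false };

    // Length of the match of the given rule at pos, or 0 if it does not match.
    static int MatchLength(int rule, string input, int pos)
    {
        string r = Rules[rule];
        if (r == ".") return pos < input.Length ? 1 : 0;
        return string.CompareOrdinal(input, pos, r, 0, r.Length) == 0 ? r.Length : 0;
    }

    // Returns the lines that the actions would print.
    public static List<string> Run(string input)
    {
        var output = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            // Candidates: longest match first, then the rule defined first.
            var candidates = Enumerable.Range(0, Rules.Length)
                .Select(rule => (Rule: rule, Length: MatchLength(rule, input, pos)))
                .Where(c => c.Length > 0)
                .OrderByDescending(c => c.Length)
                .ThenBy(c => c.Rule)
                .ToList();
            int consumed = 1; // fallback if every candidate rejects: skip one character
            foreach (var c in candidates)
            {
                // Every rule except "." prints the matched text.
                if (Rules[c.Rule] != ".") output.Add(input.Substring(pos, c.Length));
                if (!Rejects[c.Rule]) { consumed = c.Length; break; }
            }
            pos += consumed;
        }
        return output;
    }
}
```

Calling RejectDemo.Run("abcd") returns the five lines abcd, abc, ab, a, bcd, the same sequence as the walkthrough above.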

The four sections above describe the basic structure of the lexical analyzer and some of its features. The lexical analyzer that implements all of these features can be found in the RejectableTrailingReader class.

3. Examples of lexical analysis

The following examples show some practical uses of the lexical analyzer and can serve as a reference.

3.1 Calculator

The first example gives the complete code of a lexical analysis program for a calculator. The later examples contain only the rule definitions.

    Grammar g = new Grammar();
    g.DefineSymbol("[0-9]+");
    g.DefineSymbol("\\+");
    g.DefineSymbol("\\-");
    g.DefineSymbol("\\*");
    g.DefineSymbol("\\/");
    g.DefineSymbol("\\^");
    g.DefineSymbol("\\(");
    g.DefineSymbol("\\)");
    // Eat all the whitespace.
    g.DefineSymbol("\\s", c => { });
    LexerRule lexer = g.CreateLexer();
    string text = "456 + (98 - 56) * 89 / -845 + 2^3";
    TokenReader reader = lexer.GetReader(text);
    while (true) {
        try {
            Token token = reader.ReadToken();
            if (token.Index == Token.EndOfFileIndex) {
                break;
            } else {
                Console.WriteLine(token);
            }
        } catch (SourceException se) {
            Console.WriteLine(se.Message);
        }
    }
    // The output is:
    // Token #0 456
    // Token #1 +
    // Token #6 (
    // Token #0 98
    // Token #2 -
    // Token #0 56
    // Token #7 )
    // Token #3 *
    // Token #0 89
    // Token #4 /
    // Token #2 -
    // Token #0 845
    // Token #1 +
    // Token #0 2
    // Token #5 ^
    // Token #0 3

3.2 Strings

The following example matches any string literal, both regular strings and verbatim strings (strings of the form @"text"). Because every string in the code is itself written as a verbatim string, there are many double quotes, so count them carefully.

    g.DefineRegex("regular_string_character", @"[^""\n\r\u0085\u2028\u2029]|(\\.)");
    g.DefineRegex("regular_string_literal", @"\""{regular_string_character}*\""");
    g.DefineRegex("verbatim_string_characters", @"[^""]|\""\""");
    g.DefineRegex("verbatim_string_literal", @"@\""{verbatim_string_characters}*\""");
    g.DefineSymbol("{regular_string_literal}|{verbatim_string_literal}");
    string text = @"""abcd\n\r""""aabb\""ccd\u0045\x47""@""abcd\n\r""@""aabb\""""ccd\u0045\x47""";
    // The output is:
    // Token #0 "abcd\n\r"
    // Token #0 "aabb\"ccd\u0045\x47"
    // Token #0 @"abcd\n\r"
    // Token #0 @"aabb\""ccd\u0045\x47"

3.3 Escaped strings

The following example uses contexts not only to match any string literal but also to process its escape sequences.

    g.DefineContext("str");
    g.DefineContext("vstr");
    g.DefineSymbol(@"\""", c => { c.PushContext("str"); textBuilder.Clear(); });
    g.DefineSymbol(@"@\""", c => { c.PushContext("vstr"); textBuilder.Clear(); });
    g.DefineSymbol(@"<str>\""", c => { c.PopContext(); c.Accept(0, textBuilder.ToString(), null); });
    g.DefineSymbol(@"<str>\\u[0-9]{4}", c => textBuilder.Append((char)int.Parse(c.Text.Substring(2), NumberStyles.HexNumber)));
    g.DefineSymbol(@"<str>\\x[0-9]{2}", c => textBuilder.Append((char)int.Parse(c.Text.Substring(2), NumberStyles.HexNumber)));
    g.DefineSymbol(@"<str>\\n", c => textBuilder.Append('\n'));
    g.DefineSymbol(@"<str>\\\""", c => textBuilder.Append('\"'));
    g.DefineSymbol(@"<str>\\r", c => textBuilder.Append('\r'));
    g.DefineSymbol(@"<str>.", c => textBuilder.Append(c.Text));
    g.DefineSymbol(@"<vstr>\""", c => { c.PopContext(); c.Accept(0, textBuilder.ToString(), null); });
    g.DefineSymbol(@"<vstr>\""\""", c => textBuilder.Append('"'));
    g.DefineSymbol(@"<vstr>.", c => textBuilder.Append(c.Text));
    string text = @"""abcd\n\r""""aabb\""ccd\u0045\x47""@""abcd\n\r""@""aabb\""""ccd\u0045\x47""";
    // The output is (the first token ends with an actual line feed and carriage return):
    // Token #0 abcd
    // Token #0 aabb"ccdEG
    // Token #0 abcd\n\r
    // Token #0 aabb\"ccd\u0045\x47

As you can see, the output here is exactly the result of processing the escape sequences in the output of section 3.2. Note that the c.Accept() call is used here to modify the text of the token being returned. Because several layers of escaping are involved, pay attention to the number of double quotes and backslashes when designing the rules.

The complete lexical analyzer has now been constructed, and this series ends here. The related code can be found here, and some basic classes (such as the input buffer) are here.
