Thank you for your support. Today we finally take a practical step: we will use the lexical analysis library of VBF.Compilers to build the lexical analyzer for miniSharp, a small language. miniSharp is a subset of C#, and a miniSharp program means exactly what the same program means in C#. However, miniSharp supports only a few language features, which makes writing a compiler for it much easier. miniSharp has the following characteristics:
- A program consists of a single source file, and no other DLLs can be referenced (not even the .NET class library).
- No namespace.
- The first class must be a static class, and only one static method Main can be defined in it as the program entry.
- Only classes can be defined; enums, structs, interfaces, and delegates are not supported.
- Only private fields and non-static methods are supported. Virtual methods are not supported.
- The method must have a return value, except for the Main method.
- Only int, bool, int[], and user-defined classes are supported as types.
- Only one library function, System.Console.WriteLine, is supported, and its parameter must be an int.
- Only if-else statements, while statements, assignment statements, variable declaration statements, and WriteLine statements are supported.
- Only the operators +, -, *, /, >, <, ==, &&, ||, and ! are supported.
- Each method can have only one return statement, which must be the last statement of the method.
- No other C# features are supported.
You probably think this language has been cut down far too much: without generics, lambda expressions, or LINQ, where is the challenge? But I advise you not to bite off more than you can chew. With a large language, the syntax analysis and semantic analysis stages alone take a long time, and you may get stuck on problems for a long while, so you never get to experience the compiler's back-end techniques quickly. So let's start with a simple language and proceed step by step. The basic principles are the same; once you are familiar with them, you can add whatever features you like. NOTE: The design of miniSharp is based on the miniJava language from the Java edition of the Tiger Book (Modern Compiler Implementation in Java).
Today, let's start with the lexical analysis of miniSharp. The tokens of miniSharp fall into the following five categories, listed in order of priority:
- Keywords
- Identifiers
- Integer literals
- Punctuation symbols
- Whitespace, line breaks, and comments
Keywords need no explanation. Identifiers deserve careful thought: we want miniSharp, like C#, to support variable and method names written in Chinese, so the rule cannot simply be "an underscore or letter, followed by underscores, letters, or digits". Following the C# language specification, we need to define identifiers in terms of Unicode character categories. Integer literals and punctuation are straightforward. Finally, we will discuss the lexical rules for whitespace, line breaks, and comments.
First, we create one token type for each keyword in miniSharp. Keywords cannot be used as identifiers, so they are reserved words. The regular expression for a keyword is simply the concatenation of its characters, so we use the Literal method of RegularExpression (aliased as RE in the code) to define them:
    var lex = lexicon.DefaultLexer;
    //keywords
    K_CLASS = lex.DefineToken(RE.Literal("class"));
    K_PUBLIC = lex.DefineToken(RE.Literal("public"));
    K_STATIC = lex.DefineToken(RE.Literal("static"));
    K_VOID = lex.DefineToken(RE.Literal("void"));
    K_MAIN = lex.DefineToken(RE.Literal("Main"));
    K_STRING = lex.DefineToken(RE.Literal("string"));
    K_RETURN = lex.DefineToken(RE.Literal("return"));
    K_INT = lex.DefineToken(RE.Literal("int"));
    K_BOOL = lex.DefineToken(RE.Literal("bool"));
    K_IF = lex.DefineToken(RE.Literal("if"));
    K_ELSE = lex.DefineToken(RE.Literal("else"));
    K_WHILE = lex.DefineToken(RE.Literal("while"));
    K_SYSTEM = lex.DefineToken(RE.Literal("System"));
    K_CONSOLE = lex.DefineToken(RE.Literal("Console"));
    K_WRITELINE = lex.DefineToken(RE.Literal("WriteLine"));
    K_LENGTH = lex.DefineToken(RE.Literal("Length"));
    K_TRUE = lex.DefineToken(RE.Literal("true"));
    K_FALSE = lex.DefineToken(RE.Literal("false"));
    K_THIS = lex.DefineToken(RE.Literal("this"));
    K_NEW = lex.DefineToken(RE.Literal("new"));
Here, lexicon is an instance of the Lexicon class introduced in the previous article.
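To picture why keywords sit above identifiers in the priority list, here is a minimal sketch of the usual "longest match, earlier rule wins ties" convention. It uses .NET's System.Text.RegularExpressions instead of VBF, and the class name PrioritySketch, the method Classify, and the shortened rule table are all hypothetical; it illustrates the general idea and is not a description of how VBF resolves conflicts internally.

```csharp
using System.Text.RegularExpressions;

public static class PrioritySketch
{
    // Hypothetical rule table in priority order: earlier entries win ties.
    // \G anchors each pattern at the position where matching starts.
    private static readonly (string Name, Regex Pattern)[] Rules =
    {
        ("K_IF", new Regex(@"\Gif")),
        ("K_INT", new Regex(@"\Gint")),
        ("ID", new Regex(@"\G[\p{L}\p{Nl}_][\p{L}\p{Nl}_0-9]*")),
        ("INTEGER_LITERAL", new Regex(@"\G[0-9]+")),
    };

    // Returns the name of the token starting at 'start', or null if no rule matches.
    public static string Classify(string text, int start = 0)
    {
        string best = null;
        int bestLength = 0;
        foreach (var rule in Rules)
        {
            Match m = rule.Pattern.Match(text, start);
            // Strictly longer matches replace the current best, so on equal
            // length the rule listed first (the keyword) is kept.
            if (m.Success && m.Length > bestLength)
            {
                best = rule.Name;
                bestLength = m.Length;
            }
        }
        return best;
    }
}
```

With this table, "if" is classified as a keyword because the keyword rule comes first and matches the same length as the identifier rule, while "iffy" is an identifier because the identifier rule matches more characters.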
Next we focus on the lexical rule for identifiers. We do not support C#'s identifiers that start with @, so there is only one case to consider. The C# specification states that the first character of an identifier must be a letter character or an underscore. A "letter character" here is not just an upper- or lowercase ASCII letter, but any character in the Unicode categories Lu, Ll, Lt, Lm, Lo, and Nl. Their meanings are as follows:
- Lu: uppercase letters, in any language.
- Ll: lowercase letters, in any language.
- Lt: titlecase letters.
- Lm: modifier letters.
- Lo: other letters, such as Chinese and Japanese characters.
- Nl: letter numbers, that is, numerals written with letters rather than decimal digits, such as Roman numerals.
From the second character onward, in addition to letter characters and the underscore, the following classes of characters are also allowed:
- Combining characters (Unicode categories Mn and Mc)
- Decimal digits (Unicode category Nd)
- Connecting characters (Unicode category Pc)
- Formatting characters (Unicode category Cf)
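The full rule described above can be sketched as a small, standalone check built on Char.GetUnicodeCategory. This is only an illustration of the category lists, not part of VBF or of the miniSharp lexer; the class IdentifierRules and its method names are made up for this example, and it works on single UTF-16 chars, so characters outside the Basic Multilingual Plane are not handled.

```csharp
using System;
using System.Globalization;

public static class IdentifierRules
{
    // Letter characters: Unicode categories Lu, Ll, Lt, Lm, Lo, Nl.
    public static bool IsLetterCharacter(char c)
    {
        switch (Char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:   // Lu
            case UnicodeCategory.LowercaseLetter:   // Ll
            case UnicodeCategory.TitlecaseLetter:   // Lt
            case UnicodeCategory.ModifierLetter:    // Lm
            case UnicodeCategory.OtherLetter:       // Lo
            case UnicodeCategory.LetterNumber:      // Nl
                return true;
            default:
                return false;
        }
    }

    // First character: a letter character or an underscore.
    public static bool IsIdentifierStart(char c)
    {
        return c == '_' || IsLetterCharacter(c);
    }

    // Later characters: letter characters plus Mn, Mc, Nd, Pc, Cf.
    public static bool IsIdentifierPart(char c)
    {
        if (IsLetterCharacter(c)) return true;
        switch (Char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:        // Mn
            case UnicodeCategory.SpacingCombiningMark:  // Mc
            case UnicodeCategory.DecimalDigitNumber:    // Nd
            case UnicodeCategory.ConnectorPunctuation:  // Pc (includes '_')
            case UnicodeCategory.Format:                // Cf
                return true;
            default:
                return false;
        }
    }

    public static bool IsIdentifier(string s)
    {
        if (string.IsNullOrEmpty(s) || !IsIdentifierStart(s[0])) return false;
        for (int i = 1; i < s.Length; i++)
        {
            if (!IsIdentifierPart(s[i])) return false;
        }
        return true;
    }
}
```

For instance, IsIdentifier accepts a Chinese name followed by a digit but rejects a name that starts with a digit, and a combining mark is allowed inside an identifier but not as its first character.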
With the VBF.Compilers.Scanners class library, you can use the RegularExpression.CharsOf method, which takes a lambda expression and builds the union of all Unicode characters that satisfy it. My current design of this method is not very efficient, so the identifier syntax of miniSharp is slightly simplified: an identifier starts with a letter character or underscore, followed by zero or more letter characters, underscores, or digits. In other words, the combining, connecting, and formatting characters from the definition above are not supported. The regular expression for identifiers is written as follows:
    var lettersCategories = new[]
    {
        UnicodeCategory.LetterNumber,
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.ModifierLetter,
        UnicodeCategory.OtherLetter,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.UppercaseLetter
    };
    var RE_IdChar = RE.CharsOf(c => lettersCategories.Contains(Char.GetUnicodeCategory(c))) | RE.Symbol('_');
    ID = lex.DefineToken(RE_IdChar >> (RE_IdChar | RE.Range('0', '9')).Many(), "identifier");
As you can see, I used the Char.GetUnicodeCategory method from the .NET class library to determine the Unicode category. In the future, the VBF class library may support Unicode categories directly. Next come integer literals and punctuation symbols. There is not much to say about them, so let's go straight to the code:
    INTEGER_LITERAL = lex.DefineToken(RE.Range('0', '9').Many1(), "integer literal");
    //symbols
    LOGICAL_AND = lex.DefineToken(RE.Literal("&&"));
    LOGICAL_OR = lex.DefineToken(RE.Literal("||"));
    LOGICAL_NOT = lex.DefineToken(RE.Symbol('!'));
    LESS = lex.DefineToken(RE.Symbol('<'));
    GREATER = lex.DefineToken(RE.Symbol('>'));
    EQUAL = lex.DefineToken(RE.Literal("=="));
    ASSIGN = lex.DefineToken(RE.Symbol('='));
    PLUS = lex.DefineToken(RE.Symbol('+'));
    MINUS = lex.DefineToken(RE.Symbol('-'));
    ASTERISK = lex.DefineToken(RE.Symbol('*'));
    SLASH = lex.DefineToken(RE.Symbol('/'));
    LEFT_PH = lex.DefineToken(RE.Symbol('('));
    RIGHT_PH = lex.DefineToken(RE.Symbol(')'));
    LEFT_BK = lex.DefineToken(RE.Symbol('['));
    RIGHT_BK = lex.DefineToken(RE.Symbol(']'));
    LEFT_BR = lex.DefineToken(RE.Symbol('{'));
    RIGHT_BR = lex.DefineToken(RE.Symbol('}'));
    COMMA = lex.DefineToken(RE.Symbol(','));
    COLON = lex.DefineToken(RE.Symbol(':'));
    SEMICOLON = lex.DefineToken(RE.Symbol(';'));
    DOT = lex.DefineToken(RE.Symbol('.'));
A brief note: for the integer literal and for the identifier above, an extra argument is passed to lex.DefineToken. This optional argument is a description string; if it is omitted, the string form of the regular expression is used instead. The regular expression for identifiers expands to more than 40,000 characters, which is far too long to read, so we give it a separate description. This description will be used later when generating compile error messages.
Finally, we write the regular expressions for whitespace, line breaks, and comments. All three follow the C# specification exactly. There are two kinds of comments: single-line comments, which start with // and run to the end of the line, and delimited comments, which start with /* and end with */. It is worth studying how their regular expressions are written:
    var RE_SpaceChar = RE.CharsOf(c => Char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator);
    WHITESPACE = lex.DefineToken(RE_SpaceChar | RE.CharSet("\u0009\u000B\u000C"));
    LINE_BREAKER = lex.DefineToken(
        RE.CharSet("\u000D\u000A\u0085\u2028\u2029") |
        RE.Literal("\r\n"));
    var RE_InputChar = RE.CharsOf(c => !"\u000D\u000A\u0085\u2028\u2029".Contains(c));
    var RE_NotSlashOrAsterisk = RE.CharsOf(c => !"/*".Contains(c));
    var RE_DelimitedCommentSection = RE.Symbol('/') | (RE.Symbol('*').Many() >> RE_NotSlashOrAsterisk);
    COMMENT = lex.DefineToken(
        (RE.Literal("//") >> RE_InputChar.Many()) |
        (RE.Literal("/*") >> RE_DelimitedCommentSection.Many() >> RE.Symbol('*').Many1() >> RE.Symbol('/')));
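To see what the comment expression accepts, it can help to transcribe it into ordinary regex syntax and try a few inputs. The sketch below does this with .NET's System.Text.RegularExpressions rather than VBF; the class CommentSketch and the method MatchLength are hypothetical names for this illustration. The delimited-comment part mirrors the structure above: zero or more sections (either a single '/' or a run of '*' followed by a character that is neither '/' nor '*'), then one or more '*', then the closing '/'.

```csharp
using System.Text.RegularExpressions;

public static class CommentSketch
{
    // \A anchors at the start of the input.
    // First alternative:  "//" followed by any characters except line breaks.
    // Second alternative: "/*", comment sections, one or more '*', then '/'.
    public static readonly Regex Comment = new Regex(
        @"\A(?://[^\r\n\u0085\u2028\u2029]*|/\*(?:/|\**[^/*])*\*+/)");

    // Returns the length of the comment at the start of 'text', or -1 if there is none.
    public static int MatchLength(string text)
    {
        Match m = Comment.Match(text);
        return m.Success ? m.Length : -1;
    }
}
```

Because a '*' inside a section must be followed by something other than '/', the first "*/" always ends the comment: for the input "/* a */ b */" only the first seven characters are matched, and an unterminated "/* never closed" does not match at all.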
Finally, a little more code generates the ScannerInfo from the Lexicon object and then creates the Scanner:
    ScannerInfo info = lexicon.CreateScannerInfo();
    Scanner scanner = new Scanner(info);
    string source = "// any miniSharp source code";
    StringReader sr = new StringReader(source);
    scanner.SetSource(new SourceReader(sr));
    scanner.SetSkipTokens(WHITESPACE.Index, LINE_BREAKER.Index, COMMENT.Index);
That's it! We now have a complete miniSharp lexical analyzer that can scan any miniSharp source code. Note that we have configured it to skip all whitespace, line breaks, and comments, which simplifies the syntax analysis that follows. Feel free to extend the lexical analyzer, for example by adding string literals, more keywords and operators, or even entirely new kinds of tokens. Have fun! Starting with the next article, we move on to another important stage: syntax analysis.
Also, don't forget to follow my VBF project (https://github.com/Ninputer/VBF) and my Weibo (http://weibo.com/ninputer). Thank you for your support!