This is an assignment for the Compiler Principles course: write a compiler for TINY+, an extension of the TINY language. The first stage is to implement lexical analysis for TINY+. First, a brief description of the TINY+ language:
TINY+

We define here a programming language called TINY+, which is a superset of TINY in that it includes declarations, the if statement, the do-while statement, the string type, and so on. The following consists of:
1 lexical conventions of the language, including a description of the tokens of the language
2 an EBNF description of each language construct
3 a description of the main semantics
4 sample programs in TINY+

Part 1 Lexical conventions of TINY+
1. The keywords of the language are the following:
   true false or and not int bool string while do if then else end repeat until read write
   All keywords are reserved and must be written in lowercase.
2. Special symbols are the following:
   > <= >= , ' { } ; := + - * / ( ) < =
3. Other tokens are ID, NUM and STRING, which are defined by the following regular expressions:
   ID = letter (letter | digit)*
      An identifier is a letter followed by letters and digits.
   NUM = digit digit*
   STRING = ' any character except ' '
      A string is enclosed in single quotes '...'. Any character except ' can appear in a string. A string cannot span more than one line.
   letter = a|...|z|A|...|Z
   digit = 0|...|9
   Lowercase and uppercase letters are distinct.
4. White space consists of blanks, newlines and tabs. White space is ignored except that it must separate IDs, NUMs, and keywords.
5. Comments are enclosed in curly brackets {...} and cannot be nested. Comments can span more than one line.
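The keyword rule in Part 1 (reserved, lowercase only, case-sensitive) can be illustrated with a small lookup routine. This is only a sketch: the enum TokenKind, the table reserved_words and the function classify_word are hypothetical names chosen for the example, not the names used in the actual scanner.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; the names are illustrative only. */
typedef enum {
    TK_ID, TK_TRUE, TK_FALSE, TK_OR, TK_AND, TK_NOT, TK_INT, TK_BOOL,
    TK_STRING, TK_WHILE, TK_DO, TK_IF, TK_THEN, TK_ELSE, TK_END,
    TK_REPEAT, TK_UNTIL, TK_READ, TK_WRITE
} TokenKind;

/* The 18 reserved words listed in Part 1. */
static const struct { const char *text; TokenKind kind; } reserved_words[] = {
    {"true", TK_TRUE},     {"false", TK_FALSE},   {"or", TK_OR},
    {"and", TK_AND},       {"not", TK_NOT},       {"int", TK_INT},
    {"bool", TK_BOOL},     {"string", TK_STRING}, {"while", TK_WHILE},
    {"do", TK_DO},         {"if", TK_IF},         {"then", TK_THEN},
    {"else", TK_ELSE},     {"end", TK_END},       {"repeat", TK_REPEAT},
    {"until", TK_UNTIL},   {"read", TK_READ},     {"write", TK_WRITE}
};

/* Returns the keyword's token kind, or TK_ID if the lexeme is not reserved.
   strcmp is case-sensitive, so "While" is treated as an identifier. */
TokenKind classify_word(const char *lexeme) {
    assert(lexeme != NULL);
    size_t n = sizeof reserved_words / sizeof reserved_words[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(lexeme, reserved_words[i].text) == 0)
            return reserved_words[i].kind;
    }
    return TK_ID;
}
```

A linear search is enough for 18 keywords; a sorted table with binary search or a hash table would be the usual choice for a larger set.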
Part 2 Syntax of TINY+
An EBNF grammar for TINY+ is as follows:
1 program -> declarations stmt-sequence
2 declarations -> decl ; declarations | ε
3 decl -> type-specifier varlist
4 type-specifier -> int | bool | string
5 varlist -> identifier { , identifier }
6 stmt-sequence -> statement { ; statement }
7 statement -> if-stmt | repeat-stmt | assign-stmt | read-stmt | write-stmt | while-stmt
8 while-stmt -> while bool-exp do stmt-sequence end
9 if-stmt -> if bool-exp then stmt-sequence [ else stmt-sequence ] end
10 repeat-stmt -> repeat stmt-sequence until bool-exp
11 assign-stmt -> identifier := exp
12 read-stmt -> read identifier
13 write-stmt -> write exp
14 exp -> arithmetic-exp | bool-exp | string-exp | comparison-exp
15 comparison-exp -> arithmetic-exp comparison-op arithmetic-exp
16 comparison-op -> < | = | > | >= | <=
17 arithmetic-exp -> term { addop term }
18 addop -> + | -
19 term -> factor { mulop factor }
20 mulop -> * | /
21 factor -> ( arithmetic-exp ) | number | identifier
22 bool-exp -> bterm { or bterm }
23 bterm -> bfactor { and bfactor }
24 bfactor -> true | false | identifier | ( bool-exp ) | not bfactor | ( comparison-exp )
25 string-exp -> string | identifier

Part 3 Main semantics of TINY+
A program consists of variable declarations and a sequence of statements. Variable declarations may be empty, but there must be at least one statement. All variables must be declared before they are used, and each variable name can be declared only once. The type of variables and expressions may be int, bool or string; type checking must be done on them.

Part 4 Sample programs in TINY+
string str;
int x, fact;
str := 'sample program in TINY+ language - computes factorial';
read x;
if x > 0 and x < 100 then {don't compute if x <= 0}
    fact := 1;
    while x > 0 do
        fact := fact * x;
        x := x - 1
    end;
    write fact
end
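Rules 17 to 21 of the grammar in Part 2 map naturally onto one function per nonterminal (recursive descent). The sketch below is an illustration under stated assumptions, not the syntax analysis module of this compiler: it evaluates the expression directly instead of building a syntax tree, it reads characters straight from a string rather than tokens from the scanner, and it leaves out the identifier alternative of factor. The names Parser, parse_factor, parse_term, parse_arith and eval_arith are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parser state: a cursor into the input and an error flag. */
typedef struct { const char *p; bool ok; } Parser;

static void skip_blanks(Parser *ps) {
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

static long parse_arith(Parser *ps);

/* factor -> ( arithmetic-exp ) | number   (identifier omitted in this sketch) */
static long parse_factor(Parser *ps) {
    skip_blanks(ps);
    if (*ps->p == '(') {
        ps->p++;
        long v = parse_arith(ps);
        skip_blanks(ps);
        if (*ps->p == ')') ps->p++; else ps->ok = false;
        return v;
    }
    if (isdigit((unsigned char)*ps->p)) {
        long v = 0;
        while (isdigit((unsigned char)*ps->p)) {
            v = v * 10 + (*ps->p - '0');
            ps->p++;
        }
        return v;
    }
    ps->ok = false;              /* neither '(' nor a digit */
    return 0;
}

/* term -> factor { mulop factor } */
static long parse_term(Parser *ps) {
    long v = parse_factor(ps);
    for (;;) {
        skip_blanks(ps);
        char op = *ps->p;
        if (op != '*' && op != '/') return v;
        ps->p++;
        long rhs = parse_factor(ps);
        if (op == '*') v *= rhs;
        else if (rhs != 0) v /= rhs;
        else ps->ok = false;     /* division by zero */
    }
}

/* arithmetic-exp -> term { addop term } */
static long parse_arith(Parser *ps) {
    long v = parse_term(ps);
    for (;;) {
        skip_blanks(ps);
        char op = *ps->p;
        if (op != '+' && op != '-') return v;
        ps->p++;
        long rhs = parse_term(ps);
        v = (op == '+') ? v + rhs : v - rhs;
    }
}

/* Parses the whole string; returns true and stores the value on success. */
bool eval_arith(const char *text, long *out) {
    assert(text != NULL && out != NULL);
    Parser ps = { text, true };
    long v = parse_arith(&ps);
    skip_blanks(&ps);
    if (!ps.ok || *ps.p != '\0') return false;
    *out = v;
    return true;
}
```

The { ... } repetition in the grammar becomes a loop in each function, which also gives the operators their usual left associativity; the nesting of term inside arithmetic-exp gives * and / higher precedence than + and -.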
The description above gives the keywords, data definitions, lexical definition, syntax definition, and so on of TINY+. What follows is a record of the implementation process:
TINY+ development environment:
Intel processor
Microsoft Windows 7 Operating System
Visual Studio 2010 Enterprise edition and .NET 4
The TINY+ compiler uses a graphical window instead of console-based input and output, which makes it friendlier to use. The compiler provides the following modules:
Lexical analysis module;
Syntax analysis module;
Lexical/syntax analysis result saving module;
Text editing module;
Advanced function module;
Code input module;
Analysis result output module;
TINY+ compiler core algorithm
Token: the core of this part is the gettoken() function. It runs a loop that repeatedly calls getnextchar() (together with the companion getnextchart() function) to read the TINY+ source code one character at a time. On each iteration it looks up the new state in the state transition table, indexed by (current state, input character), and then consults the advance table, indexed the same way, to decide whether to read a new character; if so, it reads one. The loop condition then examines the state: if it is an accepting state, the current tokenstring is already a complete token and the loop exits; if it is the error state, the loop also exits. If the state is neither accepting nor an error, the loop continues to accept characters.
The code is as follows:
    while (!acceptlist[state] && state != merror) {
        ch = getnextchart(c);    /* ch is the table column used for character c */
        newstate = (satetype)(translatetable[(int)state])[ch];
        /* whether to save the current character to the current tokenstring */
        if (tokenstringindex < maxtokenlenth &&
            newstate != start && newstate != incoment)
            tokenstring[tokenstringindex++] = c;
        /* whether the next character can be accepted */
        if ((advancetable[(int)state])[ch] == 1)
            c = getnextchar();
        oldstate = state;
        state = newstate;
    }
After exiting the loop, check whether the current state is an accepting state. If it is, check whether it is rolldone or notrolldone:
1). rolldone: roll back one character; that is, the last character read is not kept in the token.
2). notrolldone: no rollback is needed.
The current token type is then determined from oldstate, the state held just before the transition to the done state (rolldone or notrolldone).
    if (acceptlist[state]) {
        if (state == rolldone) {
            /* drop the last character and push it back to the input */
            tokenstring[tokenstringindex - 1] = '\0';
            ungetnextchar();
        } else {
            tokenstring[tokenstringindex++] = '\0';
        }
        switch (oldstate) {
        case start:
            if (*tokenstring == -1) {
                currenttoken = endfile;
                /* Chen junbian, Li benqing; Class of 2008, School of Software, South China University of Technology */
                return currenttoken;
            } else
                currenttoken = symbollookup(tokenstring);
            break;
        case innum:
            currenttoken = num;
            break;
        case inid:
            currenttoken = reservedlookup(tokenstring);
            break;
        case instring:
            currenttoken = strings;
            break;
        default:
            currenttoken = symbollookup(tokenstring);
            break;
        }
    } else {
        currenttoken = eorror;
    }
    if (trancescan) {
        fprintf(listing, "%d", lineno);
        printtoken(currenttoken, tokenstring, listing);
        fprintf(listingdetails, "\t%d", lineno);
        printtoken(currenttoken, tokenstring, listingdetails);
    }
    return currenttoken;
Finally, as the end of the code above shows, when tracing is enabled the current tokenstring is written to the output files, and then the type of the current token is returned. The interface provided for external calls is: int scan();
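The loop and the rolldone/notrolldone handling described above can be tried out on a much smaller table. The following is a minimal sketch, not the table used in this compiler: it recognizes only identifiers, numbers and single-character symbols, it reads from a string instead of a file, and all names (State, CharClass, Token, transition, classify, next_token) are hypothetical. One simplification is stated loudly: instead of reading a character and then un-reading it, the sketch simply does not advance the cursor when the table yields ST_ROLLDONE, which has the same effect as the rollback.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* All names are hypothetical; they only echo the roles in the text. */
typedef enum { ST_START, ST_INID, ST_INNUM,             /* working states */
               ST_ROLLDONE, ST_NOTROLLDONE, ST_ERROR    /* final states   */
} State;
typedef enum { CC_LETTER, CC_DIGIT, CC_SPACE, CC_OTHER, CC_COUNT } CharClass;
typedef enum { T_ID, T_NUM, T_SYMBOL, T_ERROR, T_EOF } Token;

/* transition[state][class]; only the three working states need rows. */
static const State transition[3][CC_COUNT] = {
    /* ST_START */ { ST_INID,  ST_INNUM, ST_START,    ST_NOTROLLDONE },
    /* ST_INID  */ { ST_INID,  ST_INID,  ST_ROLLDONE, ST_ROLLDONE    },
    /* ST_INNUM */ { ST_ERROR, ST_INNUM, ST_ROLLDONE, ST_ROLLDONE    },
};

static CharClass classify(char c) {
    unsigned char u = (unsigned char)c;
    if (isalpha(u)) return CC_LETTER;
    if (isdigit(u)) return CC_DIGIT;
    if (u == ' ' || u == '\t' || u == '\n') return CC_SPACE;
    return CC_OTHER;                 /* includes the terminating '\0' */
}

/* Scans one token starting at *cursor, copies its text into lexeme
   (at most cap - 1 characters) and moves *cursor past the token. */
Token next_token(const char **cursor, char *lexeme, size_t cap) {
    assert(cursor != NULL && *cursor != NULL && lexeme != NULL && cap > 0);
    const char *p = *cursor;
    State state = ST_START, prev = ST_START;
    size_t len = 0;

    while (state < ST_ROLLDONE) {    /* loop until a final state is reached */
        char c = *p;
        if (state == ST_START && c == '\0') {
            lexeme[0] = '\0';
            *cursor = p;
            return T_EOF;
        }
        State next = transition[state][classify(c)];
        if (next != ST_ROLLDONE) {   /* c belongs to this token or is white space */
            if (next != ST_START && len + 1 < cap)
                lexeme[len++] = c;   /* white space is skipped, not saved */
            p++;
        }                            /* on ST_ROLLDONE, c is left for the next token */
        prev = state;
        state = next;
    }
    lexeme[len] = '\0';
    *cursor = p;
    if (state == ST_ERROR) return T_ERROR;
    switch (prev) {                  /* the state before "done" decides the type */
    case ST_INID:  return T_ID;
    case ST_INNUM: return T_NUM;
    default:       return T_SYMBOL;  /* a single character read from ST_START */
    }
}
```

Calling next_token repeatedly on "fact1 := 42;" yields an identifier, two single-character symbols, a number, another symbol and then end of input. Multi-character symbols such as := and <=, strings and comments would each need extra states and rows in the table.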
Summary:
In short, a compiler is a program that translates a high-level language into machine language (a low-level language). The main workflow of a modern compiler is: source code → preprocessor → compiler → assembler → object code → linker → executable. As the above shows, implementing a compiler involves many programming techniques, and a capable engineer should know them well. Implementing a compiler calls not only for these techniques but also for a sound development attitude and a good grasp of software design. Through this experiment we therefore not only learned compiler principles but also applied them in our own programming practice. In fact, the textbook knowledge is not only useful for writing compilers; it also comes up often in everyday software development, for example in regular expressions and in processing XML files.
The biggest problem encountered while writing the lexical scanner was the design of the state transition table. The state transition table is the core of the whole program: driven by it, the program recognizes each token character by character and eventually scans the entire TINY+ source program. We therefore spent a lot of time on the table design, discussing it with classmates and thinking it through ourselves, before arriving at a final state transition table.
Due to limits of time and ability, the table still has some problems, and some errors are not identified well. For example, for input such as 112ab the program reports 112a as an error token and b as an ID, which is clearly wrong; the correct result is for the whole of 112ab to be reported as an error. With this in mind, and to make the program easier to modify later, I wrote the state transition table to an XML configuration file that the program reads each time it runs. If the transition logic in the table turns out to be wrong, only the configuration file needs to be changed.
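One possible way to get the desired result for input like 112ab is to let the error state keep consuming letters and digits, so that the whole run becomes a single error token. The sketch below shows the idea as a hand-written two-state machine rather than as rows of the XML table, whose actual layout is not shown here; NumResult and scan_number are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
typedef enum { LEX_NONE, LEX_NUM, LEX_ERROR } NumResult;

/* Scans a token that starts with a digit at text[0].
   *length receives the number of characters in the token.
   Returns LEX_NONE (length 0) if text does not start with a digit. */
NumResult scan_number(const char *text, size_t *length) {
    assert(text != NULL && length != NULL);
    enum { IN_NUM, IN_ERROR } state = IN_NUM;
    size_t i = 0;

    *length = 0;
    if (!isdigit((unsigned char)text[0]))
        return LEX_NONE;

    for (;; i++) {
        unsigned char c = (unsigned char)text[i];
        if (isalpha(c))
            state = IN_ERROR;    /* a letter makes the whole run an error */
        else if (!isdigit(c))
            break;               /* any other character ends the token */
        /* digits keep the current state: IN_NUM stays IN_NUM,
           and IN_ERROR keeps absorbing characters */
    }
    *length = i;
    return state == IN_NUM ? LEX_NUM : LEX_ERROR;
}
```

With this behavior, scan_number applied to "112ab;" reports an error token of length 5, while "112;" is still a number of length 3. In a table-driven scanner the same effect would come from an extra error-collecting state whose letter and digit columns point back to itself.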
The second problem was in the program itself, for example character encoding. This implementation made us realize that compatibility between different development environments needs attention from the very start of a project, or else everyone should agree on a single environment early on. That would have avoided the problems we ran into here. Also, the table-driven method is a very good method for beginners: it reduces the amount of code and the complexity of the program, making the program easier to understand and to complete.