Building a Complete CMM Interpreter in Practice (II): Lexical Analysis

Source: Internet
Author: User

CMM is a subset of C. It has only the following reserved words:

if else while read write int real

It has the following special symbols:

+ - * / = < == <> ( ) ; { } [ ] /* */

Identifier: a string of digits, letters, and underscores that is not a reserved word and whose first character is not a digit.

If you know C, the above is easy to understand, and you can also see that CMM does not contain much, so a CMM interpreter is relatively simple.


The set of special symbols above is fairly small. In my own implementation I also added support for >, >=, <= and so on; the principle is of course the same.


With that brief introduction to CMM done, let's get to the point: lexical analysis.

Suppose we have a .cmm file and read it one character at a time; what we get is a character stream. A person can make sense of a character stream directly, but a program cannot, so we need to convert the characters into a form that is convenient for the program to process: a token stream, or token sequence.

For example, the following code

int a = 10;

We can convert it to INT ID ASSIGN NUMBER SEMI

INT represents the keyword int, ID represents an identifier (here a), ASSIGN represents the assignment operator, NUMBER represents a numeric literal, and SEMI represents the semicolon at the end of the statement.

Once the code has been converted into such a token sequence, further processing becomes much easier.

So the question now is how to turn the character stream into a token stream.

There is a tool called JavaCC in which tokens are defined by regular expressions; JavaCC then processes the input character stream according to those expressions and converts it into a token stream. It is very powerful, and worth trying if you are interested. Here we use a simpler, and admittedly cruder, method: we work out the rules from experience and hand-write code that converts the character stream into a token stream.
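To give a rough feel for the regular-expression approach, here is a small sketch that uses the standard java.util.regex package. It is not JavaCC syntax and it is not part of the interpreter described in this article; the class name `RegexLexerSketch`, the group names, and the restriction to the five token types needed for `int a = 10;` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex standing in for a generated lexer.
class RegexLexerSketch {
    // Each named group is one token type. Alternatives are tried left to
    // right, so the keyword rule must come before the identifier rule.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<INT>int\\b)"
            + "|(?<ID>[A-Za-z_][A-Za-z0-9_]*)"
            + "|(?<NUMBER>[0-9]+)"
            + "|(?<ASSIGN>=)"
            + "|(?<SEMI>;)"
            + "|(?<SKIP>\\s+)");

    // SKIP is deliberately absent: whitespace produces no token.
    private static final String[] NAMES = {"INT", "ID", "NUMBER", "ASSIGN", "SEMI"};

    static List<String> tokenize(String source) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        int pos = 0;
        while (pos < source.length()) {
            m.region(pos, source.length());
            // lookingAt() only matches at the start of the region
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            for (String name : NAMES) {
                if (m.group(name) != null) {
                    types.add(name);
                }
            }
            pos = m.end();
        }
        return types;
    }
}
```

Under these assumptions, `RegexLexerSketch.tokenize("int a = 10;")` yields the type names INT, ID, ASSIGN, NUMBER, SEMI in that order. The rest of the article does the same job by hand, one character at a time.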

Let's start by defining several tokens:

    /** if */
    public static final int IF = 1;
    /** else */
    public static final int ELSE = 2;
    /** while */
    public static final int WHILE = 3;
    /** read */
    public static final int READ = 4;
    /** write */
    public static final int WRITE = 5;
    /** int */
    public static final int INT = 6;
    /** real */
    public static final int REAL = 7;
    /** + */
    public static final int PLUS = 8;
    /** - */
    public static final int MINUS = 9;
    /** * */
    public static final int MUL = 10;
    /** / */
    public static final int DIV = 11;
    /** = */
    public static final int ASSIGN = 12;
    /** < */
    public static final int LT = 13;
    /** == */
    public static final int EQ = 14;
    /** <> */
    public static final int NEQ = 15;
    /** ( */
    public static final int LPARENT = 16;
    /** ) */
    public static final int RPARENT = 17;
    /** ; */
    public static final int SEMI = 18;
    /** { */
    public static final int LBRACE = 19;
    /** } */
    public static final int RBRACE = 20;
    // /** /* */
    // public static final int LCOM = 21;
    // /** *\/ */
    // public static final int RCOM = 22;
    // /** // */
    // public static final int SCOM = 23;
    /** [ */
    public static final int LBRACKET = 24;
    /** ] */
    public static final int RBRACKET = 25;
    /** <= */
    public static final int LET = 26;
    /** > */
    public static final int GT = 27;
    /** >= */
    public static final int GET = 28;
    /** Identifier: digits, letters, or underscores; the first character cannot be a digit */
    public static final int ID = 29;
    /** int literal value */
    public static final int LITERAL_INT = 30;
    /** real literal value */
    public static final int LITERAL_REAL = 31;

The above is an excerpt from my code. Each line defines one token type, and the comment explains what that token type stands for. The few commented-out definitions can be ignored; I kept them only out of personal habit, and they have no effect at all.

As you can see, there are quite a few token types, and some of the names are easy to confuse (for example LT, LET, GT and GET), so be careful.
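The code snippets in this article create tokens with `new Token(type, lineNo)` and `new Token(type, value, lineNo)`, but the Token class itself is not shown. The following is only a minimal sketch of what such a class might look like. The name `SketchToken`, its field names, and the `example()` helper are assumptions for illustration; the numeric constants are copied from the definitions above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the article's Token class (not the original code).
class SketchToken {
    static final int INT = 6, ASSIGN = 12, SEMI = 18, ID = 29, LITERAL_INT = 30;

    final int type;      // one of the token type constants
    final String value;  // source text for identifiers and literals, otherwise null
    final int lineNo;    // line on which the token was found

    // For tokens whose type says everything, such as ';' or '='.
    SketchToken(int type, int lineNo) {
        this(type, null, lineNo);
    }

    // For tokens that also carry text, such as identifiers and literals.
    SketchToken(int type, String value, int lineNo) {
        this.type = type;
        this.value = value;
        this.lineNo = lineNo;
    }

    // The token sequence for "int a = 10;" on line 1, built by hand.
    static List<SketchToken> example() {
        List<SketchToken> tokens = new ArrayList<>();
        tokens.add(new SketchToken(INT, 1));
        tokens.add(new SketchToken(ID, "a", 1));
        tokens.add(new SketchToken(ASSIGN, 1));
        tokens.add(new SketchToken(LITERAL_INT, "10", 1));
        tokens.add(new SketchToken(SEMI, 1));
        return tokens;
    }
}
```

Keeping the line number in every token is what later allows error messages to point at the offending line.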

Now let's look at the conversion method. We read one character at a time and assume a variable c of type char that holds the most recently read character.

Suppose we have just read the first character into c.

If c is one of ; + - * ( ) [ ] { }, we can produce the corresponding token directly. For example, if c == '+', we can say with confidence that there is a token of type PLUS here.
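The single-character case can be sketched as a plain switch. This is an illustration rather than the code from the download; the class name `SingleCharTokens` and the convention of returning -1 for "needs more inspection" are assumptions, while the constant values match the token definitions above.

```java
// Illustrative helper; class and method names are assumptions.
class SingleCharTokens {
    static final int PLUS = 8, MINUS = 9, MUL = 10, LPARENT = 16, RPARENT = 17,
            SEMI = 18, LBRACE = 19, RBRACE = 20, LBRACKET = 24, RBRACKET = 25;

    // Returns the token type for a character that is a complete token by
    // itself, or -1 if the character needs further inspection.
    static int typeOf(char c) {
        switch (c) {
            case '+': return PLUS;
            case '-': return MINUS;
            case '*': return MUL;
            case '(': return LPARENT;
            case ')': return RPARENT;
            case ';': return SEMI;
            case '{': return LBRACE;
            case '}': return RBRACE;
            case '[': return LBRACKET;
            case ']': return RBRACKET;
            default:  return -1;
        }
    }
}
```

Characters such as /, =, < and > deliberately fall through to -1, because one character is not enough to decide which token they start.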

If c is /, we cannot yet tell which token this is, because several tokens begin with /. So we read one more character, again into c (note that c now holds the second character). Depending on that character there are several cases:

①: c == '*'. This is the start of a multi-line comment, that is, /*. The rest of the multi-line comment still has to be consumed; we skip that for now and look at it in the code later.

②: c == '/'. This is the start of a single-line comment, that is, //. The rest of the single-line comment still has to be consumed as well; we skip that for now.

③: Neither of the above holds. This is a division sign, that is, a token of type DIV.

That covers the case where c is /.

Here is the code as well. Because comment handling needs some care, I added comments to the code, so you can skim through it. Comments are handled by reading and discarding characters while constantly checking whether the multi-line comment has ended:

    if (currentChar == '/') { // currentChar is the most recently read character
        readChar(); // read the next character; currentChar is updated
        if (currentChar == '*') { // multi-line comment
            // tokenList.add(new Token(Token.LCOM, lineNo));
            readChar(); // it is a multi-line comment; read one more character and start consuming it
            while (true) { // loop that consumes the characters inside the comment
                if (currentChar == '*') { // a '*' may be the start of the comment terminator
                    readChar();
                    if (currentChar == '/') { // end of the multi-line comment
                        // tokenList.add(new Token(Token.RCOM, lineNo));
                        readChar(); // read the next character and leave the loop
                        break;
                    }
                } else { // not '*': keep reading, which amounts to ignoring the character
                    readChar();
                }
            }
            continue; // the loop can only be left via break, so the comment has ended
        } else if (currentChar == '/') { // single-line comment
            // tokenList.add(new Token(Token.SCOM, lineNo));
            while (currentChar != '\n') { // consume the rest of this line
                readChar();
            }
            continue;
        } else { // division sign
            tokenList.add(new Token(Token.DIV, lineNo));
            continue;
        }
    }
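One thing the snippet above leaves open is what happens when the input ends before the comment does: how readChar() reports end of input is not shown, so both loops could in principle keep running. The following is a sketch of the same comment-skipping logic with explicit end-of-input checks. It works on a String and an index rather than on readChar(), and the class name `CommentSkipper`, the method signature, and the choice to throw on an unterminated comment are assumptions.

```java
// Illustrative sketch; works on a String instead of the article's readChar().
class CommentSkipper {
    // Precondition: source.charAt(start) == '/'.
    // Returns the index of the first character after the comment, or start
    // itself if no comment begins there (the '/' is then a division sign).
    static int skipComment(String source, int start) {
        int n = source.length();
        if (start + 1 >= n) {
            return start; // '/' is the last character of the input
        }
        char next = source.charAt(start + 1);
        if (next == '*') { // multi-line comment
            int i = start + 2;
            while (i + 1 < n) {
                if (source.charAt(i) == '*' && source.charAt(i + 1) == '/') {
                    return i + 2; // just past the closing */
                }
                i++;
            }
            throw new IllegalStateException(
                    "Unterminated comment starting at index " + start);
        }
        if (next == '/') { // single-line comment
            int i = start + 2;
            while (i < n && source.charAt(i) != '\n') {
                i++;
            }
            return i; // the '\n' itself is left for the whitespace handling
        }
        return start; // plain division sign
    }
}
```

Leaving the '\n' unconsumed after a single-line comment means the line counter is still updated by whatever code handles line breaks.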


If c is =, there are two cases; read one more character into c to decide:

①: c is again =. This is the equality operator ==, which gives a token of type EQ.

②: Otherwise, it is a plain assignment operator, which gives a token of type ASSIGN.

If c is >, there are two cases; read one more character into c to decide:

①: c == '='. This is the relational operator >=, which gives a token of type GET.

②: Otherwise, it is a plain greater-than sign, which gives a token of type GT.

If c is <, there are three cases; read one more character into c to decide:

①: c == '='. This is the relational operator <=, which gives a token of type LET.

②: c == '>'. This is the relational operator <>, which gives a token of type NEQ.

③: Otherwise, it is a plain less-than sign, which gives a token of type LT.
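The three operator cases can be written down compactly as follows. This is a sketch under assumptions: it works on a String and an index, and the class name `OperatorScanner` and the `{type, length}` return convention are made up for the example. The constant values are the ones from the token definitions.

```java
// Illustrative sketch of the one-character lookahead for =, > and <.
class OperatorScanner {
    static final int ASSIGN = 12, LT = 13, EQ = 14, NEQ = 15, LET = 26, GT = 27, GET = 28;

    // Returns {tokenType, charactersConsumed} for the operator that starts
    // at source.charAt(i), which must be '=', '>' or '<'.
    static int[] scan(String source, int i) {
        char c = source.charAt(i);
        // '\0' stands for "no next character" at the end of the input.
        char next = (i + 1 < source.length()) ? source.charAt(i + 1) : '\0';
        switch (c) {
            case '=':
                return next == '=' ? new int[] {EQ, 2} : new int[] {ASSIGN, 1};
            case '>':
                return next == '=' ? new int[] {GET, 2} : new int[] {GT, 1};
            case '<':
                if (next == '=') {
                    return new int[] {LET, 2};
                }
                if (next == '>') {
                    return new int[] {NEQ, 2};
                }
                return new int[] {LT, 1};
            default:
                throw new IllegalArgumentException("Not an operator start: " + c);
        }
    }
}
```

The second element of the result tells the caller how far to advance: when it is 1, the character that was looked at belongs to the next token and must not be skipped.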

If c is a digit, that is, '0' <= c <= '9', a number follows. We store c and keep reading the following characters, including a decimal point '.', until a character appears that is neither a digit nor a decimal point. Concatenating the stored characters gives the number we want, and whether a decimal point occurred tells us whether it is an integer or a real. The logic here needs careful thought; here is my code snippet:

    if (currentChar >= '0' && currentChar <= '9') {
        boolean isReal = false; // whether the literal contains a decimal point
        while ((currentChar >= '0' && currentChar <= '9') || currentChar == '.') {
            if (currentChar == '.') {
                if (isReal) {
                    break; // a second decimal point ends the literal
                } else {
                    isReal = true;
                }
            }
            sb.append(currentChar);
            readChar();
        }
        if (isReal) {
            tokenList.add(new Token(Token.LITERAL_REAL, sb.toString(), lineNo));
        } else {
            tokenList.add(new Token(Token.LITERAL_INT, sb.toString(), lineNo));
        }
        sb.delete(0, sb.length());
        continue;
    }

Here sb is a StringBuilder object that acts as a buffer for characters. Its accumulated content is taken out and the buffer emptied at the appropriate time, because my code reuses sb repeatedly.

If c is a letter or an underscore, we store c and keep reading the following characters. When we meet a character that is neither a letter, nor an underscore, nor a digit, we take the accumulated characters out of the buffer and compare them with the reserved words. If the accumulated string is int, there is a token of type INT here; if it matches no reserved word, we treat it as a user-defined identifier, which gives a token of type ID.
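A sketch of the word-scanning step might look like this. It is not the code from the download: the class name `WordScanner`, the lookup table, and the String-and-index interface are assumptions, and reserved words are matched case-sensitively, as in C, which the article does not state explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of reading a word and classifying it.
class WordScanner {
    static final int IF = 1, ELSE = 2, WHILE = 3, READ = 4, WRITE = 5,
            INT = 6, REAL = 7, ID = 29;

    private static final Map<String, Integer> RESERVED = new HashMap<>();
    static {
        RESERVED.put("if", IF);
        RESERVED.put("else", ELSE);
        RESERVED.put("while", WHILE);
        RESERVED.put("read", READ);
        RESERVED.put("write", WRITE);
        RESERVED.put("int", INT);
        RESERVED.put("real", REAL);
    }

    // A word may start with an ASCII letter or an underscore.
    static boolean isWordStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    // After the first character, digits are allowed as well.
    static boolean isWordPart(char c) {
        return isWordStart(c) || (c >= '0' && c <= '9');
    }

    // Returns the index just past the word that starts at index i.
    static int wordEnd(String source, int i) {
        int end = i;
        while (end < source.length() && isWordPart(source.charAt(end))) {
            end++;
        }
        return end;
    }

    // Reserved words map to their own type; everything else is an identifier.
    static int typeOf(String word) {
        Integer type = RESERVED.get(word);
        return type != null ? type : ID;
    }
}
```

Because the whole word is collected before the comparison, a name such as `integer` is classified as ID rather than as INT followed by something else.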

Every character read should fall into one of the cases above. If it matches none of them, we simply discard it, since what remains is whitespace such as line breaks and tabs, which we want to ignore.

Characters to ignore: "\r" "\n" "\f" " ". These are carriage return, line feed, form feed, and space. Of course, I also count lines while ignoring \n.
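Skipping these characters while keeping the line count could be sketched as follows. The class name `WhitespaceSkipper` and its interface are assumptions, and the tab character '\t' is included on the assumption that tabs should be ignored too, although it is not in the list above.

```java
// Illustrative sketch of ignoring whitespace while counting lines.
class WhitespaceSkipper {
    int lineNo = 1; // current line number, starting at 1

    // Skips ignorable characters starting at index i and returns the index
    // of the first character that is not ignored.
    int skip(String source, int i) {
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\n') {
                lineNo++; // a line feed starts a new line
            } else if (c != '\r' && c != '\f' && c != ' ' && c != '\t') {
                break; // first character that belongs to a token
            }
            i++;
        }
        return i;
    }
}
```

Counting only '\n' keeps the line number correct for both Unix ("\n") and Windows ("\r\n") line endings.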

After the processing above we have the first token of the CMM code, and repeating the same process gives the second one. Note, however, that we used an idea similar to LL(1): when we cannot yet tell which token type we are looking at, we look at the next character. Sometimes the two characters combine into one token, and sometimes the first character forms a token by itself, in which case the last character read has not been processed. To get the next token, that last-read character must be treated as its first character; otherwise you will find that characters occasionally go missing.
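The rule about the last-read character is easiest to see in a tiny loop. The sketch below keeps the invariant that `currentChar` always holds the first character not yet turned into a token. To stay short it only distinguishes < and <= and reports every other non-blank character as OTHER; the class name `LookaheadSketch`, the string token names, and the use of -1 for end of input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of one-character lookahead; not a full CMM lexer.
class LookaheadSketch {
    private final String source;
    private int pos = 0;
    private int currentChar; // first unprocessed character, or -1 at end of input

    LookaheadSketch(String source) {
        this.source = source;
        readChar(); // prime the lookahead
    }

    private void readChar() {
        currentChar = pos < source.length() ? source.charAt(pos++) : -1;
    }

    List<String> tokenize() {
        List<String> tokens = new ArrayList<>();
        while (currentChar != -1) {
            if (currentChar == '<') {
                readChar(); // look at the character after '<'
                if (currentChar == '=') {
                    tokens.add("LET");
                    readChar(); // '=' is part of this token, so move past it
                } else {
                    // The character was only looked at, not used. It stays in
                    // currentChar and starts the next token.
                    tokens.add("LT");
                }
            } else if (Character.isWhitespace(currentChar)) {
                readChar();
            } else {
                tokens.add("OTHER(" + (char) currentChar + ")");
                readChar();
            }
        }
        return tokens;
    }
}
```

For the input `a<b<=c`, the `b` that was read while deciding about the first `<` is not lost: it is still in `currentChar` when the loop starts over. Calling readChar() unconditionally at the top of the loop instead would drop it.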


Finally, I have uploaded the code covered in this article. No points are needed to download it.
