(Based on Java) Compile the compiler and interpreter-Chapter 3rd: scanning-Part 2 (serialization)

Source: Internet
Author: User
Document directory
  • Pascal tokens
  • Syntax Diagrams)
  • Word token
  • String token
  • Special symbol token.
  • Number token

>>> Continue Part 1

From this small example, we can see the following key points:

  • The scanner scans and skips the blank characters (such as spaces) between tokens ). When this operation ends, the current character is definitely not a blank character.
  • The non-null character determines the next token type to be extracted and becomes the first character of the token.
  • The scanner constantly creates tokens by scanning and copying source characters until the characters cannot be part of the tokens.
  • Extracting token will swallow all source characters that constitute the token. Therefore, after extraction, the current character is the next character after the token.
A Pascal Scanner

The previous chapter has completed the preliminary work of the pascalscanner class in the frontend. Pascal package. Now the extension method extracttoken () is added with a new method skipwhiltespace (). See listing 3-8.

Listing 3-8: The extract () and skipwhitespace methods of the pascalscanner class are detailed in the source code of this chapter, which is not displayed here.

The extracttoken () method is basically the same as that described in the previous example. First, it scans and skips all blank spaces. The current character (not blank) is the first character of the next token. This token type is determined by the current character. This method creates and returns any pascalwordtoken, pascalnumbertoken, pascalstringtoken, or pascalspecialsymboltoken object. Therefore, after the extracttoken () method reads the first character of each token, the newly created Token Object reads and copies the remaining characters of this token. The extracttoken () method creates and returns an eoftoken object at the end of the source file, or a pascalerrortoken object if it encounters an invalid character that cannot be the first character of any Pascal token. The skipwhitespace method skips all spaces between tokens, which is determined by Java character. iswhitespace. It also ignores Pascal comments, so each comment is equivalent to a space (strange ). Pascal tokens

In the previous chapter, you defined language-independent toketype placeholder interfaces. Now we define an enumeration type pascaltokentype. Its enumerated value indicates all Pascal tokens. Listing 3-9 shows the necessary tokentype implementation.

Listing 3-9: For more information about the pascaltoketype Enumeration type, see the source code in this chapter.

In listing 3-1, the parse () method of pascalparsertd uses the enumerated value error. In listing 3-2, the main class Pascal's internal class parsermessagelistener uses the enumerated value string when processing the token message.

Static set reserved_words text string containing the Pascal keyword. Then, use this set in the pascalwordtoken class to determine whether a word is a keyword or an identifier. Because the enumerated values of each keyword are identified after the keyword, the text of the value is the literal string of the keyword. For example, the value of the enumerated value begin is a string "begin ". Because Pascal words are case-insensitive, you can convert them (enumerated value text) into lowercase to normalize words.

The static hash table special_symbols contains each Pascal special symbol. The key of an item is a text string of a special symbol. Like the constructor of an enumerated value, the value of an item is the enumerated value itself. For example, if the key of an item is ": =", its value is colon_equals. Class pascalken (see listing 3-8) uses a hash table to determine whether to create and return a pascalspecialsymboltoken object. This hash table is also used for the class pascalspecialsymboltoken.

Refer to the pascaltoken class in the source code. It will be the base class of all Pascal token sub-classes. It extends the language-independent framework class token. Although it does not add any domain or method at present, it also brings convenience to subsequent development. (Mark Class, tag class)

Syntax Diagrams)

We will develop the residual pascaltoken subclass in the frontend. Pascal. tokens package to extract various Pascal token classes. But first, you need a good syntax specification for language elements. There are several common methods, but Pascal's relatively simple syntax can be used well.Syntax DiagramThat is, the graphical representation of language syntax rules. (The last chapter represents the syntax of a language in text format, that is, the ebnf paradigm)

Figure 3-2 shows three figures. The first clear letter can be any one of uppercase letters A to Z and lowercase letters A-Z. The second figure shows that a digit can be any character ranging from 0 to 9. The third clear word token is composed of a single letter plus 0 to multiple letters or numbers.

Design Notes

The syntax diagram is easy to understand: Follow the arrow to guide. The forking PATH represents the choice: the letter can be a, B, c, and so on. Other detour paths indicate continuity: after the first letter of the word token, there are 0 or more consecutive letters/numbers.

A circle indicates a literal text, such as a letter or a number 0. (At the beginning of chapter 5th, you will also use a circle to represent the literal text of keywords such as and, or, and .) A rectangle is a reference to another image. For example, a word token reference is a syntax map that represents letters and numbers. The more formal statement is that the prototype box represents the terminator (which cannot be further divided), while the rectangle box represents a non-terminator (which can be further subdivided)

Word token

The extracttoken () method in the pascalscanner class (see listing 3-8) creates a new Pascal word token when the current character is a letter.

   1: else if (Character.isLetter(currentChar)) {

   2:     token = new PascalWordToken(source);

   3: }

The pascalwordtoken class in the frontend. Pascal. tokens package is a subclass of pascaltoken. Listing 3-11 shows its extract () method. For more information, see the source code in this chapter.

Extract () implements the Pascal word token syntax rules shown in the 3-2 syntax diagram. After the first letter is extracted, it swallowed up any followed letter and number to construct the word token text string. When this ends, the currentchar value will not be a letter or number, so this character will be the last character of the token. Then, the extract () method is used to determine whether a word is a keyword. If a word text string appears in the tokentypes. reserved_words set (see listing 3-9), it must be a keyword. Because the Pascal keywords are case-insensitive, the relationship of the test set is completed in lower case (both are compared in lower case), and the token type is the corresponding pascaltokentype enumeration value. If the token type is an identifier, the value is now set to null (this value will be processed in the symbol table section later ).

Listing 3-4 shows some examples of pascalwordtoken. (This is omitted. Run the source code Java-classpath classes Pascal compile scannertest.txt and pay attention to the output from Lines 9 to 12)

Line 1 of the source code contains four keywords begin to test the case sensitivity of pascalwordtoken. Undoubtedly, the token type of begins must be an identifier (begins is not equal to begin ). String token

Figure 3-3 shows the syntax of the Pascal string token

The Pascal string starts with single quotes and ends with single quotes. A single quotation mark is a string of 0 to multiple consecutive characters. In a string, two adjacent single quotes indicate a single quotation mark (single quotes are text characters, not the start and end characters of the Pascal string ). A pascaltoken object is created when the current character is single quotes.
else if (currentChar == '\'') {
token = new PascalStringToken(source);
}
Listing 3-12 shows the extract method in the pascalstringtoken subclass. For more information, see the source code in this chapter. Method extract () to swallow string characters. It replaces all blank characters (such as line terminator and \ t) with a single space. It must detect unexpected termination (if there is a single start quotation mark, there is no ending single quotation mark, the string token will regard all the characters after the start single quotation mark as part of the string, until the end of the file ). If a single quotation mark (we think that the starting and ending single quotation marks are not a string content, but only a string identifier) is encountered in the string, extract will call peekchar () to detect this character (single quotation marks) is the string terminator or the first of two adjacent single quotes. In the latter case, this method removes the pair of adjacent single quotes and attaches a single quotation mark to the string content. If the file is accidentally terminated, the method sets the token type to error and sets the value to the pascalerrorcode enumerated value unexpected_eof. Listing 3-4 shows some examples of pascalstringtoken output: (omitted here. Run the source code Java-classpath classes Pascal compile scannertest.txt and check the output from Lines 13 to 21) pascalstringtoken well handles the Null String (row 016) and adjacent single quotes (row 017 ). It replaces the linefeed in each string with a space (row 018,019,020 ). Special symbol token. If the current character is the value of a certain item in the pascaltokentype. special_symbols hash table, a pascalspecialsymboltoken object is created in the pascalspecialsymboltoken class (see listing 3-8.
   1: }else if (PascalTokenType.SPECIAL_SYMBOLS.containsKey(Character.toString(currentChar))){

   2:     token = new PascalSpecialSymbolToken(source);

   3: }

Listing 3-13 shows the extract method of the pascaltoken subclass pascalspecialsymboltoken. For more information, see the source code in this chapter. This extract () method tries to extract a special token and swallow the current token character. The Pascal special character token consists of 1 or 2 characters. If the token is extracted successfully, the extract method uses the special_symbols hash table to set the token type to an appropriate enumerated value. If there is an error, set the token type to error and set its value to the enumerated value invalid_character of pascalerrorcode. Listing 3-14 shows examples of pascalspecialsymboltoken. (This is omitted. Run the source code Java-classpath classes Pascal compile scannertest.txt and note the output from row 22 to row 25.) The number token diagram 3-4 shows the syntax of the Pascal number token. An unsigned integer is a numerical sequence. A Pascal integer token is an unsigned integer. A Pascal real token starts with an integer, followed by one of the following: I: a decimal point followed by an unsigned integer (decimal part ). II: An E or E, which may exist before + or-, followed by an unsigned integer (exponential part ). III: a decimal point followed by an exponential part.

 

 

Listing 3-15 shows the extract method in the pascalnumbertoken subclass. For more information, see the source code in this chapter.

The wholedigits, fractiondigits, and exponentdigits fields of the string type indicate the sequence before the decimal point, the sequence after the decimal point, and the sequence after E or E respectively. As shown in the syntax diagram 3-4, fractiondigits and exponentdigits may be null. This method (pascalnumbertoken) initially sets the token type to integer. If there is a decimal or exponential part, the type is changed to real. It calls the unsignedintegerdigits method to extract numbers in three parts (integer, decimal, and exponential), which ensures that each part has at least one number. (If there is a fractional part and an exponential part, the response part must have at least one number, but remember that these two parts are optional ).

Listing 3-16 shows the unsignedintegerdigits method of the pascalnumbertoken class. For more information, see the source code in this chapter.

If the extractnumber method encounters '. 'character, it cannot be assumed that this is a decimal point immediately, because it may be .. token (1 .. the first character of the range. One character before calling the peekchar () method is used to determine this situation. After all parts of a number are extracted, it calls the computeintegervalue or computefloatvalue method to calculate the value of the number based on the integer or real type. See listing 3-17 and listing 3-18.

Listing 3-17: The computeintegervalue method in the pascalnumbertoken class. For details, see the source code in this chapter.

The computeintegervalue method is used to calculate the integer of a string of numbers. It is converted back by determining the value (the number of digits in the binary is limited. If it exceeds the maximum value, it may set the current number of digits to 0. For example, if 1111 is added with 1, it will be changed to 0000. This is returned, from the beginning) and smaller than the previous value to check for overflow. If overflow exists, set the token type to error and set the token value to the enumerated value range_integer of pascalerrorcode.

Listing 3-17: For more information about the computefloatvalue method in pascalnumbertoken, see the source code in this chapter.

The computerfloatvalue () method is used to calculate the float value of a numeric string that contains the integer, decimal, exponent, and exponent. It calls computerintegervalue to calculate the integer of the index. If the exponent symbol is '-', it obtains the inverse Number of the integer. If there are decimals, the method will adjust the value minus the length of the fractional part. Finally, if the adjusted exponent value plus the exponent value exceeds max_exponent, the number is out of the range ), method To set the token type to error and set the token value to the enumerated value range_real of pascalerrorcode. (The value of max_exponent depends on the underlying machine architecture and the floating point Implementation Standard. The minimum index is no greater than the inverse Number of the maximum index. For example, the maximum index is 64, then the minimum index is not necessarily-64)

The computerfloatvalue () method is used to calculate the integer and fractional parts together and increase the value multiplication to adjust the exponent as the parameter math. the value obtained by the POW method to obtain the final value of the number token.

Here is an example of how the extractnumber method calculates the number token. If the token string is 31415.926e-4, the extractnumber method passes the following values to the computerfloatvalue method:

wholeDigits:       "31415"
fractionDigits: "926"
exponentDigits: "4"
exponentSign: '-'
The computerfloatvalue method calls computerintegervalue to calculate the value of exponentvalue (see the variable in the Code). Then it is adjusted twice because exponentsign is '-' to reverse the value (exponentvalue, then (or exponentvalue) minus the length of the fractiondigits string 3.
exponentValue:  4 as computed by computeIntegerValue()
exponentValue: -4 after negation since exponentSign is '-'
exponentValue: -7 after subtracting fractionDigits.length()
Finally, computerfloatvalue is used to calculate the value of the wholedigits + fractiondigits Union string. The value is 31415926.0, and then multiplied by math. Pow (10, exponentvalue). The final value of the token is 3.1415926. Listing 3-5 shows examples of pascalnumbertoken. (Omitted here. Run the source code Java-classpath classes Pascal compile scannertest.txt and note the output from lines 26 to 31 ).

Design Notes

Scanners are an important component of the compiler/Interpreter front-end. This chapter describes a lot of code. However, you follow the policy design pattern to implement the subclass of pascaltoken, so that every token subclass knows how to extract the token string from the source program. The implementation method of each subclass has only one responsibility and is highly aggregated. For example, the pascalnumbertoken method is only used to extract the number token. The high aggregation class is less coupled and easy to maintain. If you decide to improve the real number calculation method of the scanner and reduce rounding (similar to rounding) errors, you only need to modify the pascalnumbertoken class. Assuming that pascaltoken is used to extract all Pascal token types (instead of subclass), pascaltoken aggregation is extremely low. It is difficult to change the token extraction method without affecting the other.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.