Developing a compiler in Java: implementing Thompson's construction, part 1: input text preprocessing
This program is the first step in implementing Thompson's construction: preprocessing the input text. The code for this section is available in the attachment to the cloud classroom course.
The directory structure of the code is as follows:
The purpose of our program is to convert regular expressions in text form into an NFA stored as a linked list of nodes; that is, the text:
D [0-9]
{D}+ return ICON
({D}+|{D}*\.{D}+|{D}+\.{D}*)(e{D}+)
is converted into an NFA.
Before the conversion, we need to preprocess the text. The text above actually contains two different parts. The first part is called a macro definition:
D [0-9]
Just as a macro in C is replaced before the code is compiled, the macros in a regular expression must be replaced before the conversion. That is,
{D}+ return ICON
({D}+|{D}*\.{D}+|{D}+\.{D}*)(e{D}+)
becomes:
[0-9]+ return ICON
([0-9]+|[0-9]*\.[0-9]+|[0-9]+\.[0-9]*)(e[0-9]+)
That is, every {D} (the name D in braces) is replaced with [0-9].
Replacing a macro looks like a simple string substitution, but nested macro definitions make it harder. For example:
D [0-9]
A [a-z]
AD {D}|{A}
{AD}\.{AD}+
As you can see, the definition of macro AD is itself built from other macros (D and A).
After macro AD is replaced, D and A must be replaced in turn. That is, the replacement happens in two steps:
1. Replace {AD}\.{AD}+ with ({D}|{A})\.({D}|{A})+
2. Replace ({D}|{A})\.({D}|{A})+ with ([0-9]|[a-z])\.([0-9]|[a-z])+
Therefore, macro replacement must handle such nesting carefully, since the nesting may be many layers deep.
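The layered replacement described above can be sketched as a loop that expands one layer per pass until no braces remain. This is only an illustrative sketch, not the code from the course: the class NestedMacroSketch, its methods, and the MAX_PASSES limit are hypothetical names, the parentheses are written into the definition of AD so that the output matches the two steps above, and braces inside double quotation marks are not treated specially here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: expands nested macro references one layer per pass.
final class NestedMacroSketch {
    // Matches a macro reference such as {D} or {AD}; group 1 is the name.
    private static final Pattern MACRO_REF = Pattern.compile("\\{([^{}]+)\\}");
    // Assumed safety limit so that cyclic definitions cannot loop forever.
    private static final int MAX_PASSES = 32;

    // Replaces every {NAME} currently in the text with its definition (one layer).
    static String expandOnePass(String text, Map<String, String> macros) {
        Matcher m = MACRO_REF.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String body = macros.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("undefined macro: " + m.group(1));
            }
            // quoteReplacement keeps '\' and '$' in the body literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(body));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Repeats the one-layer pass until the text contains no more braces.
    static String expandAll(String text, Map<String, String> macros) {
        String current = text;
        int passes = 0;
        while (current.indexOf('{') >= 0) {
            if (++passes > MAX_PASSES) {
                throw new IllegalStateException("macros nested too deeply or cyclic");
            }
            String next = expandOnePass(current, macros);
            if (next.equals(current)) {
                throw new IllegalArgumentException("unmatched '{' in: " + current);
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> macros = new HashMap<>();
        macros.put("D", "[0-9]");
        macros.put("A", "[a-z]");
        macros.put("AD", "({D}|{A})"); // parentheses assumed in the definition
        String step1 = expandOnePass("{AD}\\.{AD}+", macros);
        System.out.println(step1);                    // ({D}|{A})\.({D}|{A})+
        System.out.println(expandAll(step1, macros)); // ([0-9]|[a-z])\.([0-9]|[a-z])+
    }
}
```

Each pass handles exactly one layer, so a definition nested n layers deep needs n passes.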
A macro definition has the following format:
name <a series of spaces> macro content <a series of spaces or line breaks>
The program therefore parses macro definitions according to this format. Macro definition parsing is handled by the MacroHandler class:
We use a hash table, macroMap, to store all macro definitions. If two macros have the same name, the later one overwrites the earlier one. The input system, inputSystem, reads the macro definitions from the console, and the newMacro function is called to parse the input.
The newMacro function builds a macro definition by parsing one line read from the console. It first skips spaces and empty lines until it meets the first meaningful character. Starting from that character, and following the macro definition format, it collects all characters up to the next space; the resulting string is the macro name.
The content that follows the macro name is the macro content. It is collected into the macroContent variable and stored in the hash table with the macro name as the key.
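The parsing steps just described might look roughly like the sketch below. It is not the MacroHandler code from the course. The class name MacroTableSketch is hypothetical, and the signatures are assumptions: here newMacro takes the line as a String instead of reading it from inputSystem, and the macro content is assumed to end at the first whitespace character, following the format given earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a macro table; signatures are assumed, not the course's.
final class MacroTableSketch {
    private final Map<String, String> macroMap = new HashMap<>();

    // Parses one line of the form: name <spaces> content <spaces or line break>.
    void newMacro(String line) {
        int n = line.length();
        int i = 0;
        // Skip leading whitespace; a blank line defines nothing.
        while (i < n && Character.isWhitespace(line.charAt(i))) i++;
        if (i == n) return;

        // The macro name runs up to the next whitespace character.
        int start = i;
        while (i < n && !Character.isWhitespace(line.charAt(i))) i++;
        String name = line.substring(start, i);

        // Skip the spaces between the name and the content.
        while (i < n && Character.isWhitespace(line.charAt(i))) i++;

        // The content runs up to the next whitespace (assumption, see lead-in).
        start = i;
        while (i < n && !Character.isWhitespace(line.charAt(i))) i++;
        String macroContent = line.substring(start, i);
        if (macroContent.isEmpty()) {
            throw new IllegalArgumentException("macro has no content: " + name);
        }
        // put() overwrites an earlier macro with the same name.
        macroMap.put(name, macroContent);
    }

    // Returns the content stored for a macro name.
    String expandMacro(String name) {
        String content = macroMap.get(name);
        if (content == null) {
            throw new IllegalArgumentException("undefined macro: " + name);
        }
        return content;
    }

    public static void main(String[] args) {
        MacroTableSketch table = new MacroTableSketch();
        table.newMacro("  D   [0-9]  ");
        System.out.println(table.expandMacro("D")); // [0-9]
    }
}
```

Because the table is a plain HashMap, defining D a second time silently replaces the first definition, which matches the overwrite behaviour described above.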
When a regular expression is parsed, its macros must be replaced; that is, the macro content is looked up by macro name. The expandMacro interface returns the macro content for a given name:
Replacing macros in the text of a regular expression:
In the code, the RegularExpressionHandler class performs macro replacement on the regular expression input. The basic process is to read the regular expression text and parse it. If the content contains the { } symbols, the program concludes that macro replacement is required: the string inside { } is taken as the macro name, the macro content is obtained through the expandMacro interface described above, and the macro reference is replaced with that content. If the result still contains macro references after the replacement, the process is repeated. The code for this class is as follows:
Here, input is the input system used to read the regular expressions, and macroHandler is the macro definition processor described above. The class stores all preprocessed regular expressions in an array for later stages of the program. Once initialization is complete, processRegularExprs is called to start the regular expression preprocessing.
The preProcessExpr method reads the regular expression character by character. When it meets a left brace {, it prepares to begin macro replacement. However, if the left brace is inside double quotation marks, for example ["{}"], it is treated as an ordinary character and no replacement takes place. Macro replacement is performed only when the brace is outside double quotation marks.
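The quotation-mark rule can be illustrated with a small character-by-character scan. This is a sketch under stated assumptions, not the preProcessExpr code itself: the class QuoteAwareScanSketch and its preProcess method are hypothetical, the macros are passed in as a plain Map, escaped quotation marks (\") are not handled, and only one layer of macros is expanded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: expands {NAME} only when it is outside double quotes.
final class QuoteAwareScanSketch {
    static String preProcess(String expr, Map<String, String> macros) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false; // true while between a pair of double quotes
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if (c == '"') {
                // A double quote toggles the quoted state and is copied as-is.
                inQuote = !inQuote;
                out.append(c);
                i++;
            } else if (c == '{' && !inQuote) {
                // An unquoted '{' starts a macro reference.
                int close = expr.indexOf('}', i);
                if (close < 0) {
                    throw new IllegalArgumentException("missing '}' in: " + expr);
                }
                String name = expr.substring(i + 1, close);
                String content = macros.get(name);
                if (content == null) {
                    throw new IllegalArgumentException("undefined macro: " + name);
                }
                out.append(content);
                i = close + 1;
            } else {
                // Everything else, including quoted braces, is an ordinary character.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> macros = new HashMap<>();
        macros.put("D", "[0-9]");
        System.out.println(preProcess("[\"{}\"]{D}+", macros)); // ["{}"][0-9]+
    }
}
```

In the usage example, the quoted braces in ["{}"] pass through unchanged while the unquoted {D} is replaced.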
It extracts the string between { and } and uses it as the macro name. The extraction is implemented by the extractMacroNameFromInput interface. After obtaining the macro name, it calls expandMacro to perform the replacement.
The expandMacro method obtains the macro content from macroHandler. Since macro definitions may be nested, after obtaining the content it must check whether other macros appear inside it. indexOf("{") is used to determine whether the content contains a nested macro. If it does, the name inside { } is extracted and that macro is replaced; after the replacement, the content is checked again. This continues until no nested macro remains.
Some checks are made while the macros are being processed. If a left brace { is found without a matching }, the input is invalid and an error message is printed.
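The name extraction and the unmatched-brace check can be sketched together. The class MacroNameSketch and the method extractMacroName below are hypothetical stand-ins for extractMacroNameFromInput, whose real signature is not shown here; this sketch also reports the error by throwing an exception instead of printing a message, which is an assumption made to keep the example short.

```java
// Hypothetical sketch: extracts the macro name that starts at a given '{'.
final class MacroNameSketch {
    static String extractMacroName(String expr, int openIndex) {
        if (openIndex < 0 || openIndex >= expr.length() || expr.charAt(openIndex) != '{') {
            throw new IllegalArgumentException("expected '{' at index " + openIndex);
        }
        // Look for the matching '}' after the opening brace.
        int close = expr.indexOf('}', openIndex + 1);
        if (close < 0) {
            // A '{' without a corresponding '}' means the input is invalid.
            throw new IllegalArgumentException(
                "missing '}' for '{' at index " + openIndex + " in: " + expr);
        }
        String name = expr.substring(openIndex + 1, close);
        if (name.isEmpty()) {
            throw new IllegalArgumentException("empty macro name in: " + expr);
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(extractMacroName("{AD}\\.{AD}+", 0)); // AD
    }
}
```

A caller that prefers the printing behaviour described above could catch the exception and write its message to the console.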
In this section, we have only introduced the code briefly. In the next section, I will step through it in the debugger to show how the code runs, so that you can understand it better.