Series navigation
- (1) Introduction to lexical analysis
- (2) Input buffering and code positioning
- (3) Regular expressions
- (4) Constructing NFAs
- (5) Converting to DFAs
- (6) Constructing the lexical analyzer
I. Input Buffering
Before getting into lexical analysis proper, let's talk about a problem that is rarely mentioned: how to read the character stream from the source file. Why does this matter? Because lexical analysis places a requirement on the character stream: it must support a rollback operation, that is, putting one or more characters back into the stream so they can be read again later.
Let's explain why rollback is needed with a simple example: suppose we need to match the following two patterns:
Figure 1: The stream rollback process
The above is a simple matching process, shown here only to demonstrate rollback; how lexemes are matched will be explained in detail later, when the DFA simulator is implemented.
Now let's look at the input-related classes in C#. Stream supports seeking, but only at the byte level; BinaryReader and TextReader can read characters, but neither supports rollback. So the input buffering class has to be built by ourselves. The general idea is to use a TextReader as the underlying character source and add the rollback capability in our own class.
Compiler textbooks offer a technique called the buffer pair. Simply put, two buffers of N characters each are allocated; N characters are read into one buffer at a time and processed there, and when the data in the current buffer has been consumed, the next N characters are read into the other buffer, which then becomes the current buffer.
This data structure is efficient, and rollback is easy to implement as long as the right pointers are maintained. However, the buffer size is fixed, and new characters overwrite old ones, so if too many characters need to be rolled back (for example, when analyzing a long string), errors can occur. I solved the overwriting problem by using multiple buffers: when the existing buffers are not enough, a new buffer is added instead of overwriting the old data.
Of course, if buffers are only ever added, memory usage can only grow, which makes no sense. I therefore defined three operations that release buffers: Drop, Accept and AcceptToken. Drop marks all data before the current position as invalid (discarded); the buffers occupied by invalidated data are released and can be reused. Accept returns the invalidated data as a string instead of simply discarding it, and AcceptToken returns it as a token, which is convenient for lexical analysis.
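Put as an interface, the reader roughly needs the following operations (a minimal sketch with hypothetical names; the actual SourceReader API may differ):

    // A rollback-capable character reader, sketched as an interface.
    public interface IRollbackReader
    {
        int Read();      // read the next character, or -1 at the end of input
        bool Unget();    // put the last character back; it will be read again later
        void Drop();     // discard everything read so far so its buffers can be reused
        string Accept(); // like Drop, but return the discarded text as a string
        // AcceptToken() would be like Accept, but returning a token (token type omitted here)
    }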
This data structure resembles the deque in the STL. However, we need neither random access nor insertion and deletion in the middle; all operations happen at the two ends of the data. So I simply link multiple buffers into a ring with a doubly linked list and use three pointers, current, first and last, to mark the buffers in the list that hold data, as shown below:
Figure 2: A linked list of multiple buffers. The red parts hold data; the white parts do not.
first points to the buffer holding the oldest data, last to the buffer holding the newest data, and current to the buffer currently being read; current always stays within the range [first, last]. The red region between firstIndex and lastLen is the data that is currently valid, and idx marks the character currently being read. The white parts are buffers that are empty or whose data has been invalidated.
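To make the structure concrete, here is a rough reconstruction of the fields involved (names follow the figure; this is an illustrative sketch, not the actual SourceReader code):

    using System.IO;

    class RollbackReaderSketch
    {
        class BufferNode
        {
            public const int Size = 0x200;        // assumed buffer size
            public char[] Buffer = new char[Size];
            public int Len;                       // number of valid characters in this buffer
            public BufferNode Prev, Next;         // doubly linked ring
        }

        readonly TextReader reader;               // underlying character source
        BufferNode first, current, last;          // oldest, currently read, newest buffer
        int firstIndex;                           // first valid character within 'first'
        int idx;                                  // next character to read within 'current'
        // last.Len plays the role of lastLen in the figure

        public RollbackReaderSketch(TextReader reader)
        {
            this.reader = reader;
            first = current = last = new BufferNode();
            first.Prev = first.Next = first;      // start with a ring of a single node
        }
    }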
When the next character is requested, it is read from current in sequence and idx is advanced. Once all the data in current has been read, current is moved one buffer toward last (note the direction: there may be several buffers between current and last), and idx is adjusted accordingly.
Figure 3: current moving toward last
If more characters are requested but current holds no new data and current is already the same buffer as last, there is no newer data anywhere in the buffers. In that case data is read from the TextReader into the next buffer, and current and last are both advanced (so that last always points to the newest buffer).
Figure 4: current and last advancing together
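Continuing the sketch above, reading a character might look like this (simplified, with no error handling; it follows the steps of figures 3 and 4 rather than the exact SourceReader code):

    // Read the next character, or -1 at the end of input.
    public int Read()
    {
        if (idx == current.Len)                   // current buffer exhausted
        {
            if (current != last)
            {
                current = current.Next;           // newer data is already buffered:
                idx = 0;                          // move one buffer toward 'last'
            }
            else
            {
                // No newer data buffered: fill the next buffer from the TextReader.
                // If every buffer in the ring still holds valid data, splice in a
                // brand-new buffer instead of overwriting the old data.
                BufferNode next = last.Next;
                if (next == first)
                {
                    next = new BufferNode { Prev = last, Next = first };
                    last.Next = next;
                    first.Prev = next;
                }
                next.Len = reader.Read(next.Buffer, 0, BufferNode.Size);
                if (next.Len <= 0) return -1;     // end of input
                current = last = next;
                idx = 0;
            }
        }
        return current.Buffer[idx++];
    }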
Now for the rollback operation. To roll back, we simply move current back toward first (again, there may be several buffers between current and first).
Figure 5: The rollback operation
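In the same sketch, rolling back a single character only needs to move idx (and possibly current) back toward first:

    // Roll back one character; returns false when already at the oldest valid character.
    public bool Unget()
    {
        if (current == first && idx == firstIndex)
            return false;
        if (idx == 0)
        {
            current = current.Prev;               // step back one buffer toward 'first'
            idx = current.Len;
        }
        idx--;
        return true;
    }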
The Drop operation (and likewise Accept and AcceptToken) is also very simple to implement: just move first to current and firstIndex to idx, which treats all data before idx as invalid.
Figure 6: The Drop operation
Note that after a Drop, the invalidated data may be overwritten by new data, so Drop should only be called once the data is definitely no longer needed. Drop itself is very cheap (it moves two references), so there is no need to worry about its impact on performance.
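A sketch of Drop and Accept under the same assumptions (AcceptToken would just wrap the string returned by Accept into a token):

    // Drop: mark everything before the current position as invalid (two reference moves).
    public void Drop()
    {
        first = current;
        firstIndex = idx;
    }

    // Accept: like Drop, but collect and return the dropped characters first.
    public string Accept()
    {
        var text = new System.Text.StringBuilder();
        for (BufferNode node = first; ; node = node.Next)
        {
            int start = (node == first) ? firstIndex : 0;
            int end = (node == current) ? idx : node.Len;
            text.Append(node.Buffer, start, end - start);
            if (node == current) break;
        }
        Drop();
        return text.ToString();
    }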
The advantage of this ring structure is that, apart from filling characters into the buffers, it avoids any extra copying of data. Reading forward, rolling back and dropping are all pure pointer (reference) operations, so they are efficient. If Drop is called promptly, only two buffers are ever in use and no extra memory is consumed; if too many buffers pile up, the extra memory could be released automatically (not handled here).
The disadvantage is that the implementation is more complicated: the relationships between first, current and last, and the range constraints involving firstIndex, idx and lastLen, must be handled carefully, and some operations span multiple buffers.
The complete code can be found in SourceReader.cs.
II. Code Locating
When parsing source code, we obviously need to record the line and column of each token. Nobody likes facing a pile of errors that don't tell you where things went wrong... So I consider code locating an essential part of lexical analysis, and I built it directly into the SourceReader class.
Now, how to locate code. A code location consists of three values: the index, the line number and the column number. The index is a zero-based character index, mainly for the program's convenience; the line and column numbers start from 1, mainly for people to read.
Locating the line is relatively simple: the Unix line break is '\n' and the Windows line break is "\r\n", so we can just count the occurrences of '\n'.
Locating the column comes next. To get good results, two factors must be taken into account: full-width versus half-width characters, and Tab characters.
A Chinese character (full-width) corresponds to two columns and an English character (half-width) to one, so that in a fixed-width font the columns line up vertically. To count columns this way, use Encoding.Default.GetByteCount() instead of the string length. However, I ran into a memory problem with it (see here for details); the GetByteCount method of Encoding.Default.GetEncoder() can be used instead.
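For illustration, counting display columns this way might look like the sketch below (it assumes the default ANSI code page maps full-width characters to two bytes, as on a Chinese-locale system; the actual SourceLocator may differ):

    using System.Text;

    static class ColumnWidth
    {
        static readonly Encoder encoder = Encoding.Default.GetEncoder();

        // Full-width characters (two bytes in the default code page) count as two
        // columns; half-width characters count as one.
        public static int Of(char[] chars, int index, int count)
        {
            return encoder.GetByteCount(chars, index, count, true);
        }
    }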
The width of a Tab character is not fixed (usually 4 or 8, depending on personal preference), so a tabSize value is defined to represent it. Does a single Tab character then simply correspond to tabSize columns? Not quite. That is often the case, but what a Tab character really does is make the next character start at a column equal to an integer multiple of tabSize plus 1. With tabSize set to 4, the behavior is shown below: "a" and "bcc" are each followed by two Tab characters, while "bcccccc" and "bccccccc" are each followed by one Tab character; each Tab is marked with a gray arrow.
Figure 7: Tab character example
So the actual column is computed with the following formula, where currentCol is the column of the Tab character and nextCol is the column of the character that follows it:
nextCol = tabSize * (1 + (currentCol - 1) / tabSize) + 1;
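In code, the formula and a few sample values look like this (a small illustration with tabSize = 4, matching the figure):

    // Column of the character that follows a Tab located at currentCol.
    static int NextColumn(int currentCol, int tabSize)
    {
        return tabSize * (1 + (currentCol - 1) / tabSize) + 1;
    }

    // With tabSize = 4: a Tab in column 1 or column 4 moves the next character
    // to column 5, while a Tab in column 5 moves it to column 9.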
Now that we know how to compute locations, the next question is when to compute them. Computing the position of every character as it is read is somewhat inefficient, because with GetByteCount, processing a long character array in one call is almost twice as fast as processing it one character at a time. Second, what about rollback? Saving the computed positions of all previous characters would be a memory problem; but without saving them, the position of the previous character cannot be derived from the position of the current one (for example, if the current character is in the first column, which column was the previous character in?).
Weighing these considerations, I decided to compute code locations in the Drop operation (and likewise in Accept and AcceptToken). First, as mentioned above, batch computation is slightly more efficient; second, a token's location is only needed after the token has been recognized, which is exactly when Drop or AcceptToken is called. During token recognition itself, location information is of no use.
The code locating functionality is encapsulated separately in the SourceLocator class (SourceLocator.cs).
The next article will introduce the regular expressions used in lexical analysis and how to parse them.
Relevant code can be found here, and some basic classes (such as input buffer) are here.