Series Navigation
- (i) Introduction to lexical analysis
- (ii) Input buffering and code positioning
- (iii) Regular expressions
- (iv) Construction of the NFA
- (v) Conversion to a DFA
- (vi) Constructing the lexical analyzer
- (vii) Summary
1. Input Buffering
Before describing how to perform lexical analysis, let's start with a rarely discussed question: how to read the character stream from a source file. Why does this matter? Because the character stream consumed by a lexical analyzer must support fallback, that is, pushing several characters back into the stream so that they can be read again later.
To see why fallback is needed, consider a simple example. Suppose we need to match two patterns:
Figure 1 The fallback process for a stream
The above is only a simple demonstration of the fallback process; how lexemes are actually matched will be explained in detail later, when the DFA simulator is implemented.
Now look at the input-related classes in C#. Stream supports seeking, but only at the byte level; BinaryReader and TextReader can read characters, but neither supports fallback. So we have to build the input-buffer class ourselves. The general idea is to use a TextReader as the underlying character source and implement fallback support on top of it in our own class.
The classic compilers textbook describes a buffer-pair method: open two buffers of N characters each. N characters are read into a buffer at a time, and all character operations work on that buffer; once the data in the current buffer has been fully processed, the next N characters are read into the other buffer, and processing switches to that buffer.
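As a reference point, the buffer-pair scheme can be sketched roughly as follows. All names here are illustrative, not the article's actual code, and fallback is omitted:

```csharp
using System;
using System.IO;

// A rough sketch of the textbook buffer-pair scheme: two fixed buffers of
// N characters each; when one is exhausted, the other is refilled.
class BufferPairReader
{
    const int N = 8; // deliberately tiny, to force buffer switches
    readonly char[][] buffers = { new char[N], new char[N] };
    readonly int[] lengths = new int[2];
    readonly TextReader reader;
    int cur;  // which buffer (0 or 1) is being scanned
    int pos;  // position inside the current buffer

    public BufferPairReader(TextReader reader)
    {
        this.reader = reader;
        lengths[0] = reader.Read(buffers[0], 0, N);
    }

    // Returns the next character, or -1 at end of input.
    public int Read()
    {
        if (pos >= lengths[cur])
        {
            cur = 1 - cur;                                  // switch buffers...
            lengths[cur] = reader.Read(buffers[cur], 0, N); // ...and refill
            pos = 0;
            if (lengths[cur] <= 0) return -1;
        }
        return buffers[cur][pos++];
    }
}
```

Fallback here would move pos backward, possibly crossing into the other buffer, and that is exactly where the fixed total size of 2N becomes the limitation discussed next.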
This data structure is very efficient, and with a few well-maintained pointers, fallback is easy to implement. However, the buffer size is fixed, and newly read characters overwrite old ones; if too many characters need to be rolled back (for example, while scanning a very long string), errors can occur. I avoid overwriting old characters by using multiple buffers: when the existing buffers are not enough, a new buffer is allocated instead of overwriting old data.
Of course, if buffers were only ever added, memory use would grow without bound, which makes no sense. So I define three operations that release buffers: Drop, Accept and AcceptToken. Drop marks all data before the current position as invalid (discarded); buffers occupied only by invalidated data are freed and can be reused. Accept likewise invalidates the data, but returns it as a string rather than simply discarding it; AcceptToken returns the invalidated data as a Token, which is convenient for lexical analysis.
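The contracts of these three operations can be illustrated on a deliberately simplified reader (one growable buffer instead of the ring of buffers; Token is a stand-in type and all names are mine — the real implementation is in SourceReader.cs):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Stand-in token type, just to show AcceptToken's shape.
struct Token
{
    public string Text;
    public Token(string text) { Text = text; }
}

class SimpleSourceReader
{
    readonly TextReader reader;
    readonly List<char> buffer = new List<char>(); // chars read but not yet dropped
    int start;   // start of the pending (not yet accepted) text
    int index;   // current read position, start <= index <= buffer.Count

    public SimpleSourceReader(TextReader reader) { this.reader = reader; }

    // Returns the next character, or -1 at end of input.
    public int Read()
    {
        if (index == buffer.Count)
        {
            int c = reader.Read();
            if (c == -1) return -1;
            buffer.Add((char)c);
        }
        return buffer[index++];
    }

    // Fall back one character; it will be returned by the next Read.
    public void Unget() { if (index > start) index--; }

    // Mark everything before the current position as discarded.
    public void Drop() { start = index; }

    // Like Drop, but hand the discarded text back as a string.
    public string Accept()
    {
        string text = new string(buffer.GetRange(start, index - start).ToArray());
        start = index;
        return text;
    }

    // Like Accept, but wrap the text in a Token.
    public Token AcceptToken() { return new Token(Accept()); }
}
```

For example, after reading the three characters of "int", Accept() returns "int" and the reader can no longer fall back past that point; the real ring-buffer version adds buffer reuse on top of these same semantics.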
This data structure is similar to the STL deque, but there is no need for random access or for inserting and deleting in the middle; data is only ever touched at the head and the tail. So I simply link multiple buffers into a ring with a doubly linked list, and use three pointers, current, first and last, that point to buffers holding data, as shown below:
Figure 2 A linked list of multiple buffers; the red parts hold data, the white parts are empty
Here first points to the buffer holding the oldest data, last to the buffer holding the newest data, and current to the buffer currently being accessed, which always lies within the [first, last] range. The red region between firstIndex and lastLen holds the valid data, and index marks the character currently being accessed. The white parts are buffers that are empty or whose data has been invalidated.
When the next character is needed, data is read sequentially from current and index advances. If the data in current has been fully read, current moves one step toward last (stepwise, because there may be several buffers between current and last), and index is adjusted accordingly.
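The forward-reading logic over the buffer ring can be sketched like this (field and type names are my paraphrase of the idea, not the code in SourceReader.cs, and fallback is omitted for brevity):

```csharp
using System;
using System.IO;

// One node of the ring: a fixed-size buffer linked to its neighbors.
class BufferNode
{
    public const int BufferLen = 4; // tiny on purpose, to exercise the ring
    public char[] Data = new char[BufferLen];
    public int Len;                 // number of valid chars in this buffer
    public BufferNode Next, Prev;
}

class RingReader
{
    readonly TextReader reader;
    BufferNode first, current, last;
    int firstIndex, idx;            // positions inside first / current

    public RingReader(TextReader reader)
    {
        this.reader = reader;
        first = current = last = new BufferNode();
        first.Next = first.Prev = first; // a ring of one buffer
    }

    // Read the next character, refilling from the TextReader when needed.
    public int Read()
    {
        while (idx >= current.Len)
        {
            if (current == last)
            {
                // No newer buffer in the ring: reuse a freed buffer after
                // last if one exists, otherwise grow the ring, then refill.
                BufferNode target = last.Next != first ? last.Next : InsertAfter(last);
                target.Len = reader.Read(target.Data, 0, BufferNode.BufferLen);
                if (target.Len <= 0) return -1; // end of input
                last = target;
            }
            current = current.Next; // may cross several buffers
            idx = 0;
        }
        return current.Data[idx++];
    }

    // Drop: mark everything before the current position as consumed;
    // buffers before current become reusable by later reads.
    public void Drop() { first = current; firstIndex = idx; }

    BufferNode InsertAfter(BufferNode node)
    {
        var fresh = new BufferNode { Prev = node, Next = node.Next };
        node.Next.Prev = fresh;
        node.Next = fresh;
        return fresh;
    }
}
```

Note how Read only ever moves references and an index; characters are copied exactly once, when the TextReader fills a buffer.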
Figure 3 current moves toward last
If more characters are needed but current holds no new data and already equals last, then there is no newer data anywhere in the buffers, so characters must be read from the TextReader into a new (or reused) buffer, and current and last are advanced (last must always point to the newest buffer).
Figure 4 current and last move forward
Now for the fallback operation. To fall back, simply move current toward first (again stepwise, since there may be several buffers between current and first).
Figure 5 Fallback operation
The Drop operation (and likewise Accept and AcceptToken) is also simple to implement: move first to the current buffer and firstIndex to index, which marks all data before index as invalid.
Figure 6 Drop operation
Note that once a Drop completes, the invalidated data may be overwritten by new data at any time, so make sure the data is truly no longer needed before calling Drop. Drop itself is very cheap, just two reference moves, so its cost is never a concern.
The advantage of this ring structure is that, apart from filling characters into the buffers, it completely avoids extra copying of data: moving forward, falling back and Drop are all pure pointer (reference) operations, which is very efficient. If Drop is called promptly, only two buffers are ever in use and no extra memory is consumed. If too many buffers are in use, the excess memory could also be released proactively (not considered here for now).
The downside is that the implementation is more complicated: the relationships among first, current and last, and the range limits firstIndex, index and lastLen, must be handled carefully, and some operations span more than one buffer.
The complete code can be found in SourceReader.cs.
2. Code Positioning
When parsing source code, it is obviously necessary to record the line and column number of each Token; nobody likes facing a pile of errors that never say exactly where things went wrong... So I consider code positioning an essential part of lexical analysis and build it directly into the SourceReader class.
Here is how code positioning is implemented. A position consists of three components: index, line number and column number. The index is a 0-based character index, mainly for the program's convenience; the line and column numbers are 1-based, mainly for people to read.
Line positioning is simple: the Unix line break is '\n' and the Windows line break is "\r\n", so it suffices to count the occurrences of '\n'.
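As a standalone sketch (not the article's actual code), line lookup by counting '\n' looks like this:

```csharp
using System;

static class LineCounter
{
    // Returns the 1-based line number of the character at the given
    // 0-based index. Counting '\n' handles both Unix ("\n") and Windows
    // ("\r\n") line endings, since "\r\n" contains exactly one '\n'.
    public static int LineOf(string text, int index)
    {
        int line = 1;
        for (int i = 0; i < index; i++)
            if (text[i] == '\n') line++;
        return line;
    }
}
```

In the real reader the count is of course maintained incrementally as characters stream by, rather than rescanned from the start of the text.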
Next is column positioning. To get good results, two factors must be considered: full-width versus half-width characters, and Tab characters.
A Chinese character (a full-width character) occupies two columns and an English character (a half-width character) occupies one, so that columns align vertically in a monospaced font. When computing column counts it is therefore natural to use Encoding.Default.GetByteCount() instead of the string length. However, I ran into a memory problem here (see here for details) and switched to the GetByteCount method of Encoding.Default.GetEncoder().
The width of a Tab character is variable (typically 4 or 8, depending on personal preference), so a tabSize is defined to represent it. Does a Tab character then always occupy tabSize columns? No. That is often the case, but in fact a Tab advances to the next tab stop: the column of the character following a Tab is always an integer multiple of tabSize plus 1. With tabSize = 4 it behaves as shown below, where a and bcc are each followed by two Tab characters, bcccccc and bccccccc are each followed by one, and each Tab is marked with a gray arrow.
Figure 7 Tab Character instance
Therefore, the actual column should be computed with the following formula, where currentCol is the column of the Tab character and nextCol is the column of the character following it:
nextCol = tabSize * (1 + (currentCol - 1) / tabSize) + 1;
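The formula can be wrapped in a tiny helper (the helper name is mine) and checked against the tabSize = 4 behavior of Figure 7:

```csharp
using System;

static class TabStops
{
    // Columns 1, tabSize+1, 2*tabSize+1, ... are the tab stops; a Tab at
    // currentCol advances to the first tab stop strictly after currentCol.
    // Integer division makes (currentCol - 1) / tabSize round down.
    public static int NextTabCol(int currentCol, int tabSize)
    {
        return tabSize * (1 + (currentCol - 1) / tabSize) + 1;
    }
}
```

With tabSize = 4, a Tab anywhere in columns 1 through 4 lands the next character in column 5, and a Tab in columns 5 through 8 lands it in column 9; so a Tab occupies anywhere from 1 to tabSize columns.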
With positions computable, the next question is when to compute them. Computing the position of the current character on every Read is somewhat inefficient, because each call to GetByteCount carries roughly the same overhead whether it processes a long character array or an array of length 1, so calling it once per character is wasteful. And what should happen on fallback? Saving every previous position result would cost too much memory; without saved results, the position of the previous character cannot be derived from the current one (for example, if the current character is in column 1, which column was the previous character in?).
After weighing these factors, I decided to perform the position calculation inside the Drop operation (and likewise Accept and AcceptToken). For one thing, as noted above, batching the calculation is more efficient; for another, a position is generally only needed once a complete Token has been recognized, which is exactly when Drop or AcceptToken is called, and positions are of little use in the middle of recognizing a token.
I encapsulated the code-positioning functionality separately in the SourceLocator.cs class.
The next article will describe the regular expressions used in lexical analysis and how to parse them.
The relevant code can be found here, and some base classes (such as input buffering) are here.
C# Lexical Analyzer (ii): Input Buffering and Code Positioning