C# Lexical Analyzer (2): Input Buffering and Code Locating

Source: Internet
Author: User

I. Input buffer
Before introducing how to perform lexical analysis, let's discuss a problem that is rarely mentioned: how to read the character stream from the source file. Why is this issue so important? Because lexical analysis places a requirement on the character stream: it must support rollback (that is, several characters can be put back into the stream so that they are read again later).

Let's explain why rollback is needed. Consider a simple example in which two patterns must be matched:

 

Figure 1 Stream rollback process

The above is a simplified matching process, intended only to demonstrate rollback. How lexemes are actually matched will be explained in detail later, when the DFA simulator is implemented.

Now let's take a look at the input-related classes in C#. Stream supports seeking, but can only be accessed in bytes. BinaryReader and TextReader support reading characters, but neither supports rollback. Therefore, the input buffer class has to be written by hand. The general idea is to use TextReader as the underlying character input and wrap it in a custom class that adds the rollback capability.
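As a small illustration of the limitation described above, the following sketch shows what TextReader does offer: Read consumes a character and Peek looks exactly one character ahead. The class name PeekDemo is hypothetical and exists only for this example.

```csharp
using System;
using System.IO;

public static class PeekDemo
{
    // Reads from a StringReader and records what Read and Peek return.
    public static string Run()
    {
        using TextReader reader = new StringReader("abc");
        int first = reader.Read();   // consumes 'a'
        int peeked = reader.Peek();  // looks at 'b' without consuming it
        int second = reader.Read();  // consumes the same 'b'
        // TextReader has no method for putting 'a' back, and Peek only
        // sees one character ahead, so multi-character rollback needs
        // a buffer of our own on top of the reader.
        return $"{(char)first}{(char)peeked}{(char)second}";
    }
}
```

Calling PeekDemo.Run() yields "abb": the peeked character is returned again by the next Read, but nothing already consumed can be recovered.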

Compiler-principles textbooks describe a buffer-pair scheme. Simply put, two buffers are allocated, each holding N characters. N characters are read into one buffer at a time, and all character operations are performed on that buffer. Once the data in the current buffer has been processed, N new characters are read into the other buffer, which then becomes the current buffer.
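The buffer-pair idea can be sketched roughly as follows. This is a minimal illustration, not the class developed in this article: the names BufferPairReader, Read, and Unget are assumptions, and absolute positions are used instead of explicit pointers to keep the bookkeeping short.

```csharp
using System;
using System.IO;

// Hypothetical sketch of a buffer pair with bounded rollback.
public sealed class BufferPairReader
{
    private readonly TextReader reader;
    private readonly int size;          // N, the size of each buffer
    private readonly char[][] buffers;  // chunk k is stored in buffers[k % 2]
    private int chunks;                 // number of chunks loaded so far
    private int loaded;                 // total characters loaded so far
    private int pos;                    // absolute position of the next character

    public BufferPairReader(TextReader reader, int size)
    {
        this.reader = reader;
        this.size = size;
        buffers = new[] { new char[size], new char[size] };
    }

    // Returns the next character, or -1 at the end of the input.
    public int Read()
    {
        if (pos == loaded)
        {
            // All loaded data has been read: fill the other buffer,
            // overwriting the chunk that was loaded two fills ago.
            int n = reader.ReadBlock(buffers[chunks % 2], 0, size);
            if (n == 0) return -1;
            chunks++;
            loaded += n;
        }
        char c = buffers[(pos / size) % 2][pos % size];
        pos++;
        return c;
    }

    // Puts back the last `count` characters so they are read again.
    public void Unget(int count)
    {
        // Only the two most recently loaded chunks are still in memory.
        int oldest = Math.Max(0, (chunks - 2) * size);
        if (count < 0 || pos - count < oldest)
            throw new InvalidOperationException("Rollback reaches overwritten data.");
        pos -= count;
    }
}
```

With N = 2 and the input "abcdefgh", rolling back three characters after reading "abc" works, but after reading up to 'e' the chunk holding "ab" has been overwritten, so a rollback of four characters fails. That is the fixed-size limitation this scheme has.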

This data structure is highly efficient, and rollback is easy to implement as long as the appropriate pointers are maintained. However, the buffer size is fixed and new characters overwrite old ones, so if too many characters have to be rolled back (for example, when analyzing long strings), errors may occur. I solved the overwriting problem by using multiple buffers: if the existing buffers are insufficient, a new buffer is allocated instead of overwriting the old data.

If buffers were only ever added, memory usage would grow without bound, which makes no sense. Therefore, I defined three buffer release operations: Drop, Accept, and AcceptToken. Drop marks all data before the current position as invalid (discarded); the buffers occupied by invalid data are released and can be reused. Accept returns the invalidated data as a string instead of simply discarding it. Similarly, AcceptToken returns the invalidated data as a Token, which is convenient for lexical analysis.

This data structure is similar to the deque in the STL. However, there is no need for random access, insertion, or deletion here; operations happen only at the beginning and end of the data. Therefore, I link multiple buffers into a ring with a doubly linked list and use three pointers, current, first, and last, to point to the buffers in the list that contain data, as shown in Figure 2:

 

Figure 2 A linked list composed of multiple buffers. The red part contains data; the white part does not.

first points to the buffer holding the earliest data, last points to the buffer holding the latest data, and current points to the buffer currently being accessed; current always lies within the range [first, last]. The red part, from firstIndex to lastLen, is the valid data. idx marks the character currently being accessed. The white part indicates that a buffer is empty or that its data is invalid.
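The layout described above might be declared along these lines. This is only a sketch of the data structure: the class names BufferNode and BufferRing and the helper methods are assumptions, and ValidLength additionally assumes that every buffer strictly between first and last is completely full.

```csharp
using System;

// One buffer in the ring; a new node initially links to itself.
public sealed class BufferNode
{
    public readonly char[] Data;
    public BufferNode Prev, Next;

    public BufferNode(int size)
    {
        Data = new char[size];
        Prev = Next = this;
    }
}

// Hypothetical container for the three pointers and three indexes.
public sealed class BufferRing
{
    private readonly int size;
    public BufferNode First, Current, Last;
    public int FirstIndex;  // start of valid data inside First
    public int Idx;         // character being accessed inside Current
    public int LastLen;     // end of valid data inside Last

    public BufferRing(int size)
    {
        this.size = size;
        First = Current = Last = new BufferNode(size);
    }

    // Links a new empty buffer into the ring right after Last.
    public BufferNode InsertAfterLast()
    {
        var node = new BufferNode(size) { Prev = Last, Next = Last.Next };
        Last.Next.Prev = node;
        Last.Next = node;
        return node;
    }

    // Counts the valid characters between FirstIndex and LastLen
    // (the red part of Figure 2).
    public int ValidLength()
    {
        if (First == Last) return LastLen - FirstIndex;
        int total = First.Data.Length - FirstIndex;
        for (var n = First.Next; n != Last; n = n.Next) total += n.Data.Length;
        return total + LastLen;
    }
}
```

Because the list is a ring, the node after Last is either First (no free buffer, so a new one must be inserted) or a free buffer that can be reused without allocating.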

When the next character is needed, data is read from current in sequence and idx advances. Once all the data in current has been read, current moves toward last (toward rather than to, because there may be several buffers between current and last), and idx is adjusted accordingly.

 

Figure 3 Moving current toward last

If more characters are requested but current has no unread data and current is already the same as last, the buffers hold no newer data. In that case, data must be read from the TextReader into a new buffer, and current and last are both advanced to it (so that last always points to the newest buffer).

 

Figure 4 Advancing current and last

Now let's take a look at the rollback operation. To roll back, you only need to move current toward first (likewise, there may be several buffers between current and first).

 

Figure 5 rollback operation

The Drop operation (and likewise Accept and AcceptToken) is also very simple to implement. You only need to move first to the position of current and set firstIndex to idx, which means that all data before idx is treated as invalid.

 

Figure 6 Drop operation

Note that after a Drop, the invalidated data may be overwritten by new data, so make sure the data is no longer needed before calling Drop. The Drop operation itself is very cheap (it only moves two references), so there is no need to worry about its effect on performance.
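The read, rollback, and Drop/Accept steps described above can be put together in one simplified sketch. It is not the code in SourceReader.cs: the names RingReader, Read, Unget, and Fill are assumptions, each buffer stores its own length instead of a single lastLen, and AcceptToken is omitted because the Token type is not shown here.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical ring-of-buffers reader with rollback, Drop, and Accept.
public sealed class RingReader
{
    private sealed class Node
    {
        public readonly char[] Data;
        public int Length;  // number of valid characters in Data
        public Node Prev, Next;
        public Node(int size) { Data = new char[size]; Prev = Next = this; }
    }

    private readonly TextReader reader;
    private readonly int size;
    private Node first, current, last;
    private int firstIndex;  // start of valid data in first
    private int idx;         // next character to read in current

    public RingReader(TextReader reader, int size)
    {
        this.reader = reader;
        this.size = size;
        first = current = last = new Node(size);
    }

    // Returns the next character, or -1 at the end of the input.
    public int Read()
    {
        while (idx == current.Length)
        {
            if (current != last) { current = current.Next; idx = 0; }
            else if (!Fill()) return -1;
        }
        return current.Data[idx++];
    }

    // Reads a block from the TextReader into the buffer after last,
    // inserting a new buffer only when no free one is available.
    private bool Fill()
    {
        Node target = last;
        if (last.Length != 0)
        {
            if (last.Next == first)
            {
                var node = new Node(size) { Prev = last, Next = last.Next };
                last.Next.Prev = node;
                last.Next = node;
            }
            target = last.Next;
        }
        int n = reader.ReadBlock(target.Data, 0, size);
        if (n == 0) return false;
        target.Length = n;
        last = current = target;
        idx = 0;
        return true;
    }

    // Rolls back `count` characters by moving current toward first.
    public void Unget(int count)
    {
        while (count > 0)
        {
            int floor = current == first ? firstIndex : 0;
            int step = Math.Min(count, idx - floor);
            idx -= step;
            count -= step;
            if (count == 0) break;
            if (current == first)
                throw new InvalidOperationException("Rollback reaches dropped data.");
            current = current.Prev;
            idx = current.Length;
        }
    }

    // Marks everything before the current position as invalid.
    public void Drop() { first = current; firstIndex = idx; }

    // Returns the data before the current position, then drops it.
    public string Accept()
    {
        var sb = new StringBuilder();
        Node node = first;
        int start = firstIndex;
        for (; node != current; node = node.Next, start = 0)
            sb.Append(node.Data, start, node.Length - start);
        sb.Append(current.Data, start, idx - start);
        Drop();
        return sb.ToString();
    }
}
```

With a buffer size of 4 and the input "abcdefghij", reading five characters and calling Accept returns "abcde"; the buffer that held "abcd" is then free, so the next fill reuses it and the ring stays at two buffers.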

The advantage of this ring data structure is that, apart from filling the buffers with characters, it completely avoids additional copying of data. Forward reads, rollback, and Drop are all pure pointer (reference) operations, so they are very efficient. When Drop is called promptly, only two buffers are ever in use and no extra memory is occupied. When too many buffers have been allocated, the extra memory could also be released automatically (not considered here).

The disadvantage is that the implementation is more complicated. You need to handle the relationship between first, current, and last carefully, respect the range restrictions on firstIndex, idx, and lastLen, and sometimes operate on several buffers at once.

The complete code can be seen in SourceReader.cs.

 
