Lexical analyzer completed!

Source: Internet
Author: User

I have implemented a lexical analyzer based on a deterministic finite automaton (DFA, hereinafter referred to as the state machine). It does not yet support generating the state machine directly from regular expressions; the state machine has to be built by hand.

The entire program is written in standard C and is portable across platforms. It is divided into four modules: the buffer, the state machine, the symbol table, and the word analyzer.

Buffer: The buffer reduces the number of file accesses and so speeds up the program. If memory were unlimited, we could simply load the whole file at once, but memory is always limited, so the buffer cannot be too large. This causes a problem: we may reach the end of the buffer in the middle of a word. If we simply read the next chunk of the source file into the buffer, the original contents are overwritten and the first half of the current word is lost. The method I used is the buffer pair introduced in the Dragon Book. The buffer is loaded in units of half its size. When reading reaches the end of the first half, the next chunk is loaded into the second half; when reading reaches the end of the second half, the next chunk is loaded into the first half and the pointer wraps around to the start of the buffer. The buffer is thus used circularly. Of course, this requires that no word be longer than half the buffer, which is easy to satisfy: even on a small device such as a mobile phone the buffer can surely be 1 KB or more, which allows words of up to 512 characters.
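The buffer pair described above can be sketched as follows. This is a minimal illustration, not the original program: the names `PairBuffer`, `pb_init`, `pb_next`, and `pb_retract` are hypothetical, the half size of 8 bytes is deliberately tiny so that the wrap-around is easy to observe, and absolute offsets are used to decide which half to refill instead of the sentinel characters the Dragon Book describes.

```c
#include <assert.h>
#include <stdio.h>

#define HALF 8            /* half-buffer size (hypothetical; tiny for demonstration) */
#define FULL (2 * HALF)   /* total size of the buffer pair */

typedef struct {
    FILE  *src;           /* input stream */
    char   data[FULL];    /* the two halves, refilled alternately */
    size_t consumed;      /* absolute offset of the next character to hand out */
    size_t filled;        /* absolute offset just past the last character loaded */
    int    at_eof;        /* set once a read returns fewer than HALF bytes */
} PairBuffer;

void pb_init(PairBuffer *b, FILE *src) {
    assert(b != NULL && src != NULL);
    b->src = src;
    b->consumed = 0;
    b->filled = 0;
    b->at_eof = 0;
}

/* Return the next character, or EOF when the input is exhausted.
 * When everything loaded so far has been consumed, the half that follows
 * the current one is refilled; the other half is left intact. */
int pb_next(PairBuffer *b) {
    if (b->consumed == b->filled) {
        if (b->at_eof)
            return EOF;
        /* filled is a multiple of HALF here, so this is offset 0 or HALF */
        size_t n = fread(b->data + b->filled % FULL, 1, HALF, b->src);
        if (n < HALF)
            b->at_eof = 1;
        if (n == 0)
            return EOF;
        b->filled += n;
    }
    return (unsigned char)b->data[b->consumed++ % FULL];
}

/* Step back one character. This is safe for up to HALF characters,
 * because the previously loaded half has not been overwritten yet. */
void pb_retract(PairBuffer *b) {
    if (b->consumed > 0)
        b->consumed--;
}
```

Because only one half is overwritten at a time, a word that straddles the boundary between the halves stays readable, as long as it is no longer than `HALF` characters.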

State machine: The state machine is the central nervous system of lexical analysis; the entire analysis is driven by it. Analyzing a word is a sequence of state transitions. The machine always starts in the initial state (state 0) and moves from state to state according to the input (the characters read one at a time) until it reaches a final state, at which point a word has been found. There are some details to handle. For example, one final state may be an intermediate state on the way to another. Take < and <=: after reading the character '<' we have already recognized a less-than sign, but we cannot decide yet. If the next character is '=', the word is a less-than-or-equal sign. If it is anything else, the word really is a less-than sign, and we must roll the character pointer back by one and return the less-than sign. Other operators are handled the same way. Another issue is the lookup speed of the state machine. A state machine implemented as a plain linear table has to traverse the table to find the next state on every lookup, which is very inefficient. Since this lookup is performed for every character read, it is executed constantly during compilation, so its efficiency is crucial. The Dragon Book gives a better approach: build an index table (another array) for the state machine (itself an array), with the state number as the subscript and the base address of that state's entries in the state machine as the element. This improves lookup speed. A hash table could also be used, but a hash table wastes some space, and building one is more work than building an index table. Using a third-party function library (or class library) is, of course, another matter.
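The index table and the one-character rollback can be sketched together. This is a minimal illustration under stated assumptions, not the original code: the state numbering, the `ANY` wildcard, and the names `table`, `base`, `build_index`, `next_state`, and `scan_relop` are all hypothetical, the input is a plain string rather than the buffer module, and only the operators <, <=, >, and >= are covered.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STATES 7
#define NO_STATE   (-1)
#define END        (-1)   /* pseudo-character for end of input */
#define ANY        (-2)   /* wildcard: matches anything not listed earlier for the state */

typedef enum { TOK_NONE, TOK_LT, TOK_LE, TOK_GT, TOK_GE } TokenKind;
typedef struct { int from; int ch; int to; } Transition;
typedef struct { TokenKind kind; int retract; } Accept;

/* The state machine: transitions grouped by source state, in ascending order. */
static const Transition table[] = {
    {0, '<', 1}, {0, '>', 4},
    {1, '=', 2}, {1, ANY, 3},
    {4, '=', 5}, {4, ANY, 6},
};
#define NUM_TRANS ((int)(sizeof table / sizeof table[0]))

/* Final states; retract = 1 means the last character read belongs to the next word. */
static const Accept accepting[NUM_STATES] = {
    [2] = {TOK_LE, 0}, [3] = {TOK_LT, 1}, [5] = {TOK_GE, 0}, [6] = {TOK_GT, 1},
};

/* Index table: base[s] is the position of the first transition of state s,
 * and base[s + 1] is one past its last transition. */
static int base[NUM_STATES + 1];

void build_index(void) {
    int t = 0;
    for (int s = 0; s < NUM_STATES; s++) {
        base[s] = t;
        while (t < NUM_TRANS && table[t].from == s)
            t++;
    }
    base[NUM_STATES] = t;
}

/* Look at the transitions of one state only, instead of the whole table. */
int next_state(int state, int c) {
    assert(state >= 0 && state < NUM_STATES);
    for (int i = base[state]; i < base[state + 1]; i++)
        if (table[i].ch == c || table[i].ch == ANY)
            return table[i].to;
    return NO_STATE;
}

/* Scan one relational operator starting at text[*pos].
 * On success *pos is advanced past the operator; on failure it is unchanged. */
TokenKind scan_relop(const char *text, size_t *pos) {
    int state = 0;
    size_t p = *pos;
    for (;;) {
        int c = (text[p] != '\0') ? (unsigned char)text[p] : END;
        p++;
        state = next_state(state, c);
        if (state == NO_STATE)
            return TOK_NONE;
        if (accepting[state].kind != TOK_NONE) {
            if (accepting[state].retract)
                p--;                 /* roll the character pointer back by one */
            *pos = p;
            return accepting[state].kind;
        }
    }
}
```

`build_index` has to be called once before the first lookup. With the input "<x", the scanner reads 'x', lands in the retracting final state, and leaves the position pointing at 'x' again, so the next word starts there.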

Symbol table: Words such as keywords and variable names often appear many times in a program. Allocating storage for every occurrence makes no sense, because it wastes space. Instead we introduce a symbol table. Each time a word is found, we first look it up in the symbol table. If the word already exists, the position or index number of the existing entry is returned directly; otherwise the word is inserted. Because the symbol table is inserted into and searched so often, a hash table can be used to improve access speed.
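A lookup-or-insert operation of this kind might look like the sketch below. It is an illustration only: the names `SymTable`, `symtab_init`, `symtab_insert`, and `symtab_free`, the fixed capacities, and the choice of a chained hash table with a simple multiplicative string hash are all assumptions, not details taken from the original program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS    64    /* hypothetical capacities */
#define MAX_SYMBOLS 256

typedef struct {
    char *name;           /* heap copy of the word */
    int   next;           /* next entry in the same bucket, or -1 */
} Symbol;

typedef struct {
    Symbol syms[MAX_SYMBOLS];
    int    count;
    int    buckets[NBUCKETS];   /* index of the first entry per bucket, or -1 */
} SymTable;

void symtab_init(SymTable *t) {
    assert(t != NULL);
    t->count = 0;
    for (int i = 0; i < NBUCKETS; i++)
        t->buckets[i] = -1;
}

static unsigned hash_word(const char *s, size_t len) {
    unsigned h = 5381;
    while (len--)
        h = h * 33u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the index of the word, inserting it first if it is new.
 * Returns -1 if the table is full or memory runs out. */
int symtab_insert(SymTable *t, const char *s, size_t len) {
    unsigned b = hash_word(s, len);
    for (int i = t->buckets[b]; i != -1; i = t->syms[i].next)
        if (strlen(t->syms[i].name) == len && memcmp(t->syms[i].name, s, len) == 0)
            return i;                       /* already present: reuse the entry */
    if (t->count == MAX_SYMBOLS)
        return -1;
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, s, len);
    copy[len] = '\0';
    int idx = t->count++;
    t->syms[idx].name = copy;
    t->syms[idx].next = t->buckets[b];      /* push onto the front of the bucket chain */
    t->buckets[b] = idx;
    return idx;
}

void symtab_free(SymTable *t) {
    for (int i = 0; i < t->count; i++)
        free(t->syms[i].name);
    t->count = 0;
}
```

The word is passed as a pointer plus a length rather than as a NUL-terminated string, so it can be taken straight from the input buffer without copying it first.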

Word analyzer: This module is the top layer of the lexical analyzer. Through the nexttoken function it serves the layer above the lexical analyzer, the syntax analyzer. Each call starts from the initial state, reads one character at a time through the buffer module, and hands each character to the state machine to obtain the next state. It keeps reading characters and changing state until a word is found, then calls the insert method of the symbol table, which either inserts a new entry or returns a reference to the existing one. The word analyzer returns a single result: the position of the word in the symbol table.
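The control flow of such a function can be sketched in a self-contained way. To stay short, this illustration makes several simplifying assumptions that differ from the design described above: the input is a string instead of the buffer module, the state machine is a hard-coded switch covering only identifiers, numbers, and single-character words, and the symbol table is a linear-search stand-in called `intern`. Only the name `nexttoken` comes from the text; `Lexer`, `symbols`, and the state names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_SYMS 64
#define MAX_LEN  32

static char symbols[MAX_SYMS][MAX_LEN];   /* stand-in symbol table */
static int  nsyms = 0;

/* Return the index of the word, inserting it if it is new (linear search). */
static int intern(const char *s, size_t len) {
    for (int i = 0; i < nsyms; i++)
        if (strlen(symbols[i]) == len && memcmp(symbols[i], s, len) == 0)
            return i;
    if (nsyms == MAX_SYMS || len >= MAX_LEN)
        return -1;
    memcpy(symbols[nsyms], s, len);
    symbols[nsyms][len] = '\0';
    return nsyms++;
}

typedef struct { const char *text; size_t pos; } Lexer;

enum { S_START, S_IDENT, S_NUMBER, S_DONE };

/* Return the symbol-table index of the next word,
 * or -1 at end of input or if the table is full. */
int nexttoken(Lexer *lx) {
    assert(lx != NULL && lx->text != NULL);
    int state = S_START;                  /* every word starts from the initial state */
    size_t begin = lx->pos;
    while (state != S_DONE) {
        int c = (unsigned char)lx->text[lx->pos];   /* 0 at end of input */
        switch (state) {
        case S_START:
            if (c == '\0') return -1;
            if (isspace(c))                   { lx->pos++; begin = lx->pos; }
            else if (isalpha(c) || c == '_')  { lx->pos++; state = S_IDENT; }
            else if (isdigit(c))              { lx->pos++; state = S_NUMBER; }
            else                              { lx->pos++; state = S_DONE; }
            break;
        case S_IDENT:
            if (isalnum(c) || c == '_') lx->pos++;
            else state = S_DONE;          /* lookahead character is not consumed */
            break;
        case S_NUMBER:
            if (isdigit(c)) lx->pos++;
            else state = S_DONE;          /* lookahead character is not consumed */
            break;
        }
    }
    return intern(lx->text + begin, lx->pos - begin);
}
```

A parser would call `nexttoken` in a loop until it returns -1. For the input "x1 = x1 + 42", both occurrences of x1 yield the same index, because the second lookup finds the entry created by the first.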

The end of the term is approaching, all kinds of exams are coming up, and the pressure of the postgraduate entrance exam is building too, so the urge to write programs will have to be held back for now. When I have time, I will connect the bottom-up simple precedence syntax analyzer I wrote earlier (which had no lexical analysis at the time and assumed that each word was a single character) to this lexical analyzer.
