C# Lexical Analyzer (1): Introduction to Lexical Analysis


Series navigation

    1. (1) Introduction to Lexical Analysis
    2. (2) Input Buffering and Code Location
    3. (3) Regular Expressions
    4. (4) Constructing an NFA
    5. (5) Converting to a DFA
    6. (6) Constructing the Lexical Analyzer

Although the title of this article is lexical analysis, I should start with compiler principles. Many people have heard of compiler principles, though not necessarily in any detail.

Simply put, compiler principles study how source code (e.g., *.cs files) can be translated into a program (e.g., an *.exe file). Of course, some languages, such as JavaScript, are interpreted: their code is executed directly, without generating an executable program.

The compilation process is very complex and involves many steps, as shown in the diagram from Compilers: Principles, Techniques, and Tools (the Dragon Book):

Figure 1 The steps of a compiler (based on the diagram in the book)

Seven steps are shown here (the optimization steps are optional). The first four steps form the analysis part (also known as the front end): they break the source program into its constituent elements, impose a grammatical structure on those elements, and store information in the symbol table. The last three steps form the synthesis part (also known as the back end): they construct the desired target program from the intermediate representation and the information in the symbol table.

The benefit of dividing the compiler into so many steps is that each step becomes simpler, which makes the compiler easier to design and lets you take advantage of many existing tools. For example, the lexical analyzer can be generated with lex or flex, and the syntax analyzer with yacc or bison, so a syntax tree can be obtained with hardly any coding, and the front-end work is nearly done. The back end can likewise build on existing technology, such as a ready-made virtual machine (the CLR or the JVM, as long as the program is translated into the corresponding intermediate language).

This series of articles describes the first step of compilation: lexical analysis. Most of the algorithms and theory come from the Dragon Book; the rest is my own work, or was written with reference to the implementation of flex (flex here meaning the fast lexical analyzer generator, a well-known lexical analysis program, not Adobe Flex).

I will try to cover the full process of building a lexical analyzer, including some implementation details. At present, however, a lexical analyzer can only be defined in code, with its patterns given as regular expressions. To achieve what flex does, generating the lexical analyzer's source code directly from a lexical definition file, a great deal of work remains, and it will not happen in the short term.

As the first article in the series, this article will give a comprehensive overview of lexical analysis and introduce the technologies and general procedures used in it.

I. Introduction to Lexical Analysis

Lexical analysis, or scanning, is the first step of a compiler. The lexical analyzer reads the characters that make up the source program, organizes them into meaningful lexeme sequences, and produces a token for each lexeme as its output.

In simple terms, lexical analysis reads the source program (which can be regarded as one long string) and "cuts" it into small sections, each of which is a token with a specific meaning, such as a particular keyword or a number. The text in the source program that corresponds to a token is called its "lexeme".

Take a calculator as an example. The lexical analysis of the "source program" 12 + 34*9 proceeds as follows:

Figure 2 Lexical analysis of the calculator example

A string that is meaningless to the computer becomes, after lexical analysis, a somewhat meaningful token stream. Digit indicates that a token corresponds to a number, Operator indicates an operator, and the corresponding numbers and symbols (pink background) are the lexemes. At the same time, the lexical analyzer can filter out unneeded whitespace and comments in the program. This makes syntax analysis and the subsequent steps much easier.
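To make the token stream above concrete, here is a minimal hand-written scanner for the calculator example. This is only a sketch (in Java rather than the article's C#, and the TokenType names are my own); it cuts the input into Digit and Operator tokens and skips whitespace, just as the figure shows:

```java
import java.util.ArrayList;
import java.util.List;

public class CalcScanner {
    enum TokenType { DIGIT, OPERATOR }

    record Token(TokenType type, String lexeme) { }

    // Cuts the source string into Digit/Operator tokens, skipping whitespace.
    static List<Token> scan(String source) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace is filtered out by the lexical analyzer
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) {
                    i++; // consume the longest run of digits
                }
                tokens.add(new Token(TokenType.DIGIT, source.substring(start, i)));
            } else if (c == '+' || c == '*') {
                tokens.add(new Token(TokenType.OPERATOR, String.valueOf(c)));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the five tokens of the example: 12, +, 34, *, 9.
        for (Token t : scan("12 + 34*9")) {
            System.out.println(t.type() + " \"" + t.lexeme() + "\"");
        }
    }
}
```

Of course, a real lexical analyzer is not written by hand like this; as described below, it is generated from pattern definitions.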

In an actual program, an enumeration or a number indicates which type a token belongs to. My Token.cs is defined as follows:

 
```csharp
namespace Cyjb.Text {
	struct Token {
		// Symbol index of the token, indicating the token's type.
		int index;
		// Text of the token, i.e., the lexeme.
		string text;
		// Starting position of the token in the source.
		SourceLocation start;
		// End position of the token in the source.
		SourceLocation end;
		// Value of the token.
		object value;
	}
}
```

The index and text fields need no explanation; start and end locate the token in the source file (by character index, line number, and column number), and value exists only to conveniently pass values along.

II. How to Describe Lexemes

Now that we know lexical analysis can cut out the lexemes, how do we describe them? In other words, why are 12, + and 34 lexemes, while 1, 2 + 3 and 4 are not? This is what patterns are for.

A pattern describes the possible forms of a lexeme.

That is, once I define the Digit pattern as "a sequence of one or more digits" and the Operator pattern as "a single + or * character", the lexical analyzer knows that 12 is a lexeme and 2 + 3 is not.

Nowadays, patterns are generally described with regular expressions. These have the same form as the regular expressions of the System.Text.RegularExpressions.Regex class, but far fewer capabilities: they can only match strings, with no grouping, backreferences, or replacement. For example, the regular expression a+ represents a sequence of one or more 'a' characters. I will cover the details of regular expressions in a later article; here, only this limited matching ability is needed. For fully featured regular expressions, see System.Text.RegularExpressions.Regex.

The regular expressions mentioned in the rest of this series refer to these regular expressions that only match strings, so as not to confuse them with System.Text.RegularExpressions.Regex.
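As a concrete illustration of patterns as string-matching regular expressions (a sketch in Java rather than the article's C#; the Digit/Operator names follow the calculator example), note that only whole-string matching is used, none of the grouping or replacement features:

```java
import java.util.regex.Pattern;

public class Patterns {
    // Digit pattern: a sequence of one or more digit characters.
    static final Pattern DIGIT = Pattern.compile("[0-9]+");
    // Operator pattern: a single '+' or '*' character.
    static final Pattern OPERATOR = Pattern.compile("[+*]");

    // A string is a Digit lexeme only if the whole string matches the pattern.
    static boolean isDigit(String s) { return DIGIT.matcher(s).matches(); }
    static boolean isOperator(String s) { return OPERATOR.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isDigit("12"));   // true: a run of digits
        System.out.println(isDigit("2+3"));  // false: '+' is not a digit
        System.out.println(isOperator("+")); // true: a single operator character
    }
}
```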

III. How to Construct a Lexical Analyzer

With the lexemes described, the next question is how to construct a lexical analyzer from those descriptions. The general process is as follows:

Figure 3 The process of creating a lexical analyzer

Starting from the regular expressions that define the patterns, a transition table is obtained through NFA construction, DFA conversion, and DFA minimization. This transition table, combined with a fixed DFA simulator, forms the lexical analyzer. The analyzer continuously reads characters from the input buffer and uses the automaton to recognize lexemes and output tokens. It could be said that the essence of lexical analysis is obtaining this transition table.
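As a toy illustration of the end product (my own minimal sketch, not the implementation this series builds), here is a hard-coded transition table for the digit pattern [0-9]+, driven by a fixed simulator loop that would work unchanged with any other table:

```java
public class DfaSimulator {
    // Character classes: 0 = digit, 1 = anything else.
    static int charClass(char c) {
        return (c >= '0' && c <= '9') ? 0 : 1;
    }

    // Transition table for the pattern [0-9]+:
    // state 0 is the start state, state 1 means "inside a number",
    // -1 means no transition (the automaton rejects).
    static final int[][] TRANSITIONS = {
        // digit, other
        {   1,   -1 }, // state 0
        {   1,   -1 }, // state 1
    };
    static final boolean[] ACCEPTING = { false, true };

    // The fixed simulator: the same loop serves any transition table.
    static boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = TRANSITIONS[state][charClass(input.charAt(i))];
            if (state == -1) {
                return false;
            }
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        System.out.println(matches("1234")); // true
        System.out.println(matches(""));     // false: needs at least one digit
        System.out.println(matches("12a"));  // false: 'a' kills the automaton
    }
}
```

The work of the following articles is, in effect, computing such tables automatically from regular expressions instead of writing them by hand.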

With all that said, this concludes the brief introduction to lexical analysis. Starting from the next article, I will explain step by step how to implement a complete lexical analyzer. The relevant code can be found here, and some basic classes (such as the input buffer) are here.
