C# Lexical Analyzer (1): Introduction to Lexical Analysis

Series Navigation

    1. Introduction to lexical analysis
    2. Input buffering and code positioning
    3. Regular expressions
    4. Constructing the NFA
    5. Converting to a DFA
    6. Constructing the lexical analyzer
    7. Summary

Although the title of this article is lexical analysis, let me first say a few words about compiler principles. Many people have heard of compiler principles, though not necessarily understood them very well.

Simply put, compiler principles study how compilation works: how code (a *.cs file) is converted into a program the computer can execute (an *.exe file). Of course, some languages, such as JavaScript, are interpreted: their code is executed directly, without generating an executable program first.

The compilation process is very complex and involves many steps. Let's look directly at the figure from Compilers: Principles, Techniques, and Tools (the "Red Dragon book"):

Figure 1: The steps of the compiler (in fact, a figure I redrew by combining several figures from the book)

There are seven steps here (of which the optimization step is optional). The first four steps form the analysis part (also called the front end): they decompose the source program into its constituent elements, impose a grammatical structure on those elements, and finally store the collected information in the symbol table. The last three steps form the synthesis part (also called the back end): they construct the desired target program from the intermediate representation and the information in the symbol table.

The benefit of dividing the compiler into so many steps is that each step becomes simpler and the compiler becomes easier to design, and it also lets you leverage many existing tools: for example, a lexical analyzer can be generated with Lex or Flex, and a parser can be generated with YACC or Bison. With almost no coding work you can obtain a syntax tree, at which point the front end is nearly done. As for the back end, there are many existing technologies to use, such as ready-made virtual machines (the CLR or the JVM, as long as the program is translated into the appropriate intermediate language).

This series of articles is about the first step of compilation: lexical analysis. Most of the algorithms and theory come from Compilers; the rest I either worked out myself or adapted from the implementation of Flex (Flex here means the fast lexical analyzer generator, a well-known lexical analysis program, not Adobe's Flex).

I will try to describe the whole process of writing the lexical analyzer, including some implementation details. For now, the goal is only an object, built from regular expression definitions, that can perform lexical analysis; to generate lexical analyzer source code directly from a lexical definition file, as Flex does, there is still much work to do, and it will not be finished in the short term.

This article, the first in the series, gives an overall overview of lexical analysis, introducing the techniques used and the general process.

1. Introduction to Lexical Analysis

Lexical analysis, or scanning, is the first step of a compiler. The lexical analyzer reads the stream of characters that make up the source program, organizes them into a sequence of meaningful lexemes, and produces a lexical unit (token) as output for each lexeme.

In simple terms, lexical analysis reads the source program (which can be regarded as one very long string) and "cuts" it into small segments, each of which is a lexical unit (a token) with a specific meaning, for example representing a particular keyword or a number. The text in the source program that corresponds to a lexical unit is called its "lexeme".

Taking a calculator as an example, the lexical analysis of the "source program" 12+34*9 proceeds as follows:

Figure 2: The lexical analysis process for the calculator

The string, meaningless to the computer, has been turned by analysis into a token stream that does carry some meaning: Digit means the lexical unit corresponds to a number, Operator means it is an operator, and the corresponding digits and symbols (pink background) are the lexemes. At the same time, unnecessary whitespace and comments in the program can be filtered out by the lexical analyzer, which makes the subsequent parsing steps much easier.
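To make this "cutting" concrete, here is a minimal hand-written sketch that scans 12+34*9 and prints each lexical unit with its lexeme. The Digit/Operator names follow the figure; later articles build such an analyzer automatically from patterns rather than by hand:

using System;

class CutDemo
{
    static void Main()
    {
        string source = "12+34*9";
        int pos = 0;
        while (pos < source.Length)
        {
            char c = source[pos];
            if (char.IsWhiteSpace(c)) { pos++; continue; } // filtered out
            if (char.IsDigit(c))
            {
                // A run of digits forms one Digit lexeme.
                int start = pos;
                while (pos < source.Length && char.IsDigit(source[pos])) pos++;
                Console.WriteLine($"Digit: {source.Substring(start, pos - start)}");
            }
            else if (c == '+' || c == '*')
            {
                // A single + or * forms one Operator lexeme.
                Console.WriteLine($"Operator: {c}");
                pos++;
            }
            else
            {
                throw new InvalidOperationException($"Unexpected character '{c}'");
            }
        }
        // Output: Digit: 12, Operator: +, Digit: 34, Operator: *, Digit: 9
    }
}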

In an actual program, a lexical unit carries an enumeration or numeric identifier indicating what kind of lexical unit it is. My Token<T> class is defined as follows:

namespace Cyjb.Text {
    class Token<T> {
        // The identifier of the token, indicating its type.
        T Id;
        // The text of the token, i.e. its "lexeme".
        string Text;
        // The start position of the token in the source.
        SourceLocation Start;
        // The end position of the token in the source.
        SourceLocation End;
        // The value of the token.
        object Value;
    }
}

The Id and Text properties need little explanation. Start and End are used for locating the token in the source file (character index, plus line and column numbers), and Value exists only as a convenient way to pass values along.
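For reference, here is a rough sketch of what a SourceLocation might hold; the actual type lives in the base class library mentioned at the end of this article, and the field names here are only assumptions:

struct SourceLocation
{
    public int Index; // zero-based character index into the source text
    public int Line;  // line number
    public int Col;   // column number
}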

Update 2014-01-08: this Token<T> class was originally defined as a Token struct, with the token identifier represented by an int value. However, I believe an enumeration type works better: enumeration members are named, so every identifier is self-describing, and the compile-time check effectively prevents spelling mistakes.
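A tiny example of the difference (the Calc enum here is invented purely for illustration):

using System;

enum Calc { Digit, Operator }

class IdDemo
{
    static void Main()
    {
        // With an int identifier, any value compiles, including typos:
        int intId = 7;               // nothing flags this as wrong
        // With an enum identifier, only declared members are accepted,
        // and each has a readable name:
        Calc enumId = Calc.Operator; // Calc.Opertor would not compile
        Console.WriteLine($"{intId} vs {enumId}");
    }
}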

2. How to Describe Lexemes

Now that lexical analysis can separate out lexemes, how are the lexemes described? Put differently, why are 12, + and 34 lexemes, while 1, 2+3 and 4 are not? This is what a pattern is for.

A pattern describes the possible forms of the lexemes of a lexical unit.

That is, once I define the pattern of Digit as "a sequence of one or more digits" and the pattern of Operator as "a single + or * character", the lexical analyzer knows that 12 is a lexeme and 2+3 is not.

Nowadays, patterns are generally represented by regular expressions. The regular expressions used here have the same form as ordinary regular expressions (such as the System.Text.RegularExpressions.Regex class) but more limited capabilities: they can only match strings, with no support for grouping, backreferences, or replacement. As a simple example, the regular expression a+ represents "a sequence of one or more a characters". More details on regular expressions will come in a later article; within those limits, referring to System.Text.RegularExpressions.Regex is also possible.

The regular expressions mentioned in later articles of this series always refer to these match-only regular expressions; be careful not to confuse them with System.Text.RegularExpressions.Regex.
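Still, for illustration, the calculator's two patterns can be tried out with the familiar full-featured Regex class. This only demonstrates what the patterns match; it is not how the lexical analyzer in this series works:

using System;
using System.Text.RegularExpressions;

class PatternDemo
{
    static void Main()
    {
        // Digit: one or more digits; Operator: a single + or *.
        var pattern = new Regex("(?<digit>[0-9]+)|(?<op>[+*])");
        foreach (Match m in pattern.Matches("12+34*9"))
        {
            string kind = m.Groups["digit"].Success ? "Digit" : "Operator";
            Console.WriteLine($"{kind}: {m.Value}");
        }
        // Output: Digit: 12, Operator: +, Digit: 34, Operator: *, Digit: 9
    }
}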

3. How to Construct the Lexical Analyzer

With the description of lexemes in hand, we can construct a lexical analyzer from it. The approximate process is as follows:

Figure 3: Constructing the lexical analyzer

As the figure shows, starting from the regular expressions that define the patterns, a transition table is obtained through NFA construction, DFA conversion, and DFA minimization. This transition table, combined with a fixed DFA simulator, forms the lexical analyzer: it continually reads characters from the input buffer and uses the automaton to recognize lexemes and output them. It is fair to say that the essence of lexical analysis is how to obtain this transition table.
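As a sketch of the idea, here is a tiny table-driven DFA simulator for the calculator patterns Digit = [0-9]+ and Operator = [+*]. The states and the transition table are written by hand here; the following articles derive them automatically from the regular expressions:

using System;

class DfaDemo
{
    // Character classes: 0 = digit, 1 = '+' or '*', 2 = anything else.
    static int CharClass(char c) =>
        char.IsDigit(c) ? 0 : (c == '+' || c == '*') ? 1 : 2;

    // Transition[state, charClass] = next state, or -1 for "no move".
    // State 0: start; state 1: inside a number (accepts Digit);
    // state 2: just read an operator (accepts Operator).
    static readonly int[,] Transition =
    {
        {  1,  2, -1 }, // from state 0
        {  1, -1, -1 }, // from state 1: more digits extend the number
        { -1, -1, -1 }, // from state 2: an operator is a single character
    };

    static void Main()
    {
        string source = "12+34*9";
        int pos = 0;
        while (pos < source.Length)
        {
            int state = 0, start = pos;
            // Run the DFA as far as it will go (longest match); in this
            // tiny DFA every non-start state is accepting.
            while (pos < source.Length)
            {
                int next = Transition[state, CharClass(source[pos])];
                if (next < 0) break;
                state = next;
                pos++;
            }
            if (state == 0) { pos++; continue; } // no lexeme here, skip char
            string kind = state == 1 ? "Digit" : "Operator";
            Console.WriteLine($"{kind}: {source.Substring(start, pos - start)}");
        }
    }
}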

That concludes this brief introduction to lexical analysis; starting with the next article, we will implement a complete lexical analyzer step by step. The relevant code can be found here, and some base classes (such as the input buffer) are here.
