Although the title of this article is lexical analysis, let me first say a few words about compiler principles. Most people have probably heard of compiler principles, though not necessarily understood them in any depth.
Simply put, compiler principles study how compilation works: how to translate source code (a *.cs file) into a program the computer can execute (an *.exe file). Of course, some languages such as JavaScript are interpreted; their code is executed directly, without generating an executable program.
The compilation process is quite complex and involves many steps. The figure below, taken from Compilers: Principles, Techniques, and Tools (the Red Dragon Book), shows them directly:
Fig. 1 The phases of the compiler (based on a figure from the book)
There are seven phases here (the optimization phases are optional). The first four phases form the analysis part (also known as the front end): they decompose the source program into its constituent elements, impose a grammatical structure on those elements, and store the information in the symbol table. The last three phases form the synthesis part (also known as the back end): they construct the desired target program from the intermediate representation and the information in the symbol table.
The benefit of dividing the compiler into so many phases is that each phase becomes simpler, which makes the compiler easier to design and lets you take advantage of many existing tools. For example, the lexical analyzer can be generated with Lex or Flex, and the parser can be built with YACC or Bison; with almost no extra coding you can obtain a syntax tree, and the front-end work is nearly done. As for the back end, there are many existing technologies to draw on, such as off-the-shelf virtual machines (the CLR or Java, as long as the program is translated into the corresponding IL).
This series of articles covers the first phase of the compiler: lexical analysis. Most of the algorithms and theory come from Compilers: Principles, Techniques, and Tools; the rest I worked out myself or implemented with reference to flex (here flex means the fast lexical analyzer generator, a well-known lexical analysis program, not Adobe Flex).
I will try to walk through the whole process of writing the lexical analyzer, including some implementation details. Of course, at present it can only build an object capable of lexical analysis from regular-expression definitions; to reach the point where, like Flex, the lexical analyzer's source code is generated directly from a lexical definition file, there is still a great deal of work to do, and it cannot be finished in the short term.
This article, the first in the series, gives an overall overview of lexical analysis, introducing the techniques used and the general process.
1. Introduction to lexical analysis
Lexical analysis, or scanning, is the first phase of the compiler. The lexical analyzer reads the stream of characters that make up the source program, groups them into meaningful sequences called lexemes, and produces a lexical unit (token) for each lexeme as its output.
In simple terms, lexical analysis reads in the source program (which can be viewed as a very long string) and "cuts" it into small segments; each segment is a lexical unit (token) with a specific meaning, such as representing a particular keyword or a number. The text in the source program that corresponds to such a unit is called a lexeme.
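To make the two concepts concrete, a token can be represented as a kind paired with the lexeme text it was cut from. The following is only a minimal C# sketch; the names TokenKind and Token are illustrative and are not the actual types built later in this series.

```csharp
// Illustrative token kinds, just enough for the calculator example below.
enum TokenKind
{
    Number,   // a numeric literal such as 12
    Plus,     // '+'
    Multiply, // '*'
    EndOfFile
}

// A token pairs its kind with the lexeme (the matched source text).
struct Token
{
    public TokenKind Kind;
    public string Lexeme;

    public Token(TokenKind kind, string lexeme)
    {
        Kind = kind;
        Lexeme = lexeme;
    }

    public override string ToString() => $"{Kind}(\"{Lexeme}\")";
}
```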
Take a calculator as an example. The lexical analysis process for the "source program" 12+34*9 is as follows:
Fig. 2 The lexical analysis process for the expression
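As a rough illustration of what the figure shows, the hand-written loop below cuts the expression into Number, Plus, and Multiply tokens using the Token type sketched above. This is only a sketch of the idea, not the regular-expression-driven analyzer this series will actually build.

```csharp
using System;
using System.Collections.Generic;

static class CalculatorLexer
{
    // Scan the input left to right, emitting one token per lexeme.
    public static List<Token> Tokenize(string source)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];
            if (char.IsDigit(c))
            {
                // Consume a maximal run of digits as a single Number lexeme.
                int start = i;
                while (i < source.Length && char.IsDigit(source[i])) i++;
                tokens.Add(new Token(TokenKind.Number, source.Substring(start, i - start)));
            }
            else if (c == '+')
            {
                tokens.Add(new Token(TokenKind.Plus, "+"));
                i++;
            }
            else if (c == '*')
            {
                tokens.Add(new Token(TokenKind.Multiply, "*"));
                i++;
            }
            else
            {
                throw new FormatException($"Unexpected character '{c}' at position {i}");
            }
        }
        tokens.Add(new Token(TokenKind.EndOfFile, ""));
        return tokens;
    }
}
```

Calling CalculatorLexer.Tokenize("12+34*9") yields Number("12"), Plus("+"), Number("34"), Multiply("*"), Number("9"), followed by an end-of-file token, which matches the sequence of lexemes shown in the figure.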