[Compiler Principles] Chapter 3: Lexical Analysis

Last Update:2018-12-06 Source: Internet

Author: User

I. Functions of the lexical analyzer

Lexical analysis is the first stage of compilation. The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and output a sequence of tokens, one token for each lexeme.

The analysis part of a compiler is split into lexical analysis and syntax analysis. This separation simplifies compiler design, improves compiler efficiency, and enhances compiler portability.

1) Token (lexical unit): a token name together with an optional attribute value. Typical token names are keywords, operators ......

2) Pattern: a description of the form that the lexemes of a token may take. When the token is a keyword, the pattern is simply the character sequence of the keyword.

3) Lexeme (word element): a character sequence in the source program that matches the pattern of a token.

4) Lexical error: the analyzer identifies an incorrect word, reports it, and continues with the next word.
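
The four notions above (token, pattern, lexeme, lexical error) can be seen together in a very small scanner. The following is only a sketch: the token names, the patterns, and the recovery strategy (skip one character and continue) are assumptions made for illustration and do not come from the text.

```python
import re

# Hypothetical token names and patterns, chosen only for illustration.
# Keywords are listed before identifiers so that "if" is not read as an ID.
PATTERNS = [
    ("KEYWORD", re.compile(r"(?:if|else|while)\b")),
    ("ID",      re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("NUMBER",  re.compile(r"[0-9]+")),
    ("RELOP",   re.compile(r"<=|>=|<>|<|>|=")),
]
WHITESPACE = re.compile(r"\s+")


def tokenize(source):
    """Return (tokens, errors); each token is a (token name, lexeme) pair."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        ws = WHITESPACE.match(source, pos)
        if ws:                          # skip blanks between lexemes
            pos = ws.end()
            continue
        for name, pattern in PATTERNS:
            m = pattern.match(source, pos)
            if m:                       # the lexeme matches this token's pattern
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # Lexical error: no pattern matches here. Record the position
            # and the offending character, then continue with the next one.
            errors.append((pos, source[pos]))
            pos += 1
    return tokens, errors
```

Calling `tokenize("if x1 <= 42 $ else y")` yields six tokens and reports the stray `$` as a lexical error without stopping the scan.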

II. Input buffering

1) We often have to look ahead at least one character before we can determine that the current lexeme has ended.

2) A large source program contains a large number of characters, and reading them one at a time usually takes a lot of time. To reduce this cost, we use two buffers that are read alternately. (For details, see 73)
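
The idea of two alternately filled buffers can be sketched as follows. This is a simplified illustration, not the textbook scheme in full: the class name `TwoBufferReader`, the default buffer size, and the `loads` counter are assumptions, and lexeme-start tracking and sentinel characters are left out.

```python
import io


class TwoBufferReader:
    """Reads characters through two buffers of `size` characters that are
    refilled alternately, so one block read serves many character reads."""

    def __init__(self, stream, size=4096):
        self.stream = stream
        self.size = size
        self.buffers = [stream.read(size), ""]
        self.current = 0          # index of the buffer being scanned
        self.forward = 0          # position of the next character in it
        self.loads = 1            # number of block reads, for illustration

    def _switch(self):
        """Fill the other buffer and continue scanning there."""
        other = 1 - self.current
        self.buffers[other] = self.stream.read(self.size)
        self.loads += 1
        self.current = other
        self.forward = 0

    def next_char(self):
        """Return the next character, or '' at end of input."""
        buf = self.buffers[self.current]
        if self.forward == len(buf):
            if len(buf) < self.size:
                return ""         # a short buffer means the input has ended
            self._switch()        # the old buffer stays available meanwhile
            buf = self.buffers[self.current]
            if not buf:
                return ""
        ch = buf[self.forward]
        self.forward += 1
        return ch
```

Because the previously scanned buffer is kept while the other one is filled, a lexeme that straddles a buffer boundary is still fully in memory when its end is found.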

III. Specification of tokens

1) Can a lexeme use up the buffer? Long strings are the usual risk, so we normally write them as shorter pieces joined with "+".

2) Regular expression: letter_ ( letter_ | digit )* denotes the strings that start with a letter (or underscore) followed by zero or more letters or digits.

3) Regular expression examples
Regular expression        Language denoted
a | b                     {a, b}
(a | b)(a | b)            {aa, ab, ba, bb}
aa | ab | ba | bb         {aa, ab, ba, bb}
a*                        the set of all strings of zero or more a's
(a | b)*                  the set of all strings of a's and b's, including the empty string
Complex example
(00 | 11 | (01 | 10)(00 | 11)*(01 | 10))*    strings with an even number of 0s and an even number of 1s, for example 01001101000010000010111001

4) Extensions of regular expressions

1> One or more instances. The unary postfix operator + means "one or more instances": the regular expression r+ denotes the set of all strings formed from one or more strings of L(r). The operator + has the same precedence and associativity as the operator *. The algebraic laws r* = r+ | ε and r+ = rr* express the relationship between the two operators.
2> Zero or one instance. The unary postfix operator ? means "zero or one instance": r? is shorthand for r | ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, num → digit+ (. digit+)? (E (+ | -)? digit+)? describes unsigned numbers.
3> Character classes. [abc] (where a, b, and c are symbols of the alphabet) denotes the regular expression a | b | c. The abbreviated class [a-z] denotes the regular expression a | b | ... | z. Identifiers can be described by the regular expression [A-Za-z][A-Za-z0-9]*.
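
The regular expressions in this section can be tried out with Python's standard `re` module, whose syntax for |, *, +, ? and character classes agrees with the notation used here. The helper `matches` and the constant names below are assumptions of this sketch; `re.fullmatch` is used so that the whole string, not just a prefix, must belong to the language.

```python
import re


def matches(pattern, text):
    """True if the whole of `text` belongs to the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None


# Basic operators: union, concatenation, closure.
UNION = r"a|b"
PAIRS = r"(a|b)(a|b)"
CLOSURE = r"(a|b)*"

# The complex example: an even number of 0s and an even number of 1s.
EVEN = r"(00|11|(01|10)(00|11)*(01|10))*"

# Extensions: +, ? and character classes.
NUMBER = r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?"
IDENT = r"[A-Za-z][A-Za-z0-9]*"

assert all(matches(PAIRS, s) for s in ["aa", "ab", "ba", "bb"])
assert matches(EVEN, "01001101000010000010111001")
assert matches(NUMBER, "6.02E+23") and not matches(NUMBER, "1.")
assert matches(IDENT, "count1") and not matches(IDENT, "1count")

# The law r+ = rr* checked on a few sample strings.
for s in ["", "a", "aaa", "ab"]:
    assert matches(r"a+", s) == matches(r"aa*", s)
```

The pattern `NUMBER` rejects `1.` because the fraction part, when present, needs at least one digit after the point.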

IV. Recognition of tokens

Tokens are recognized with transition diagrams. Some states are accepting (final) states, indicating that a lexeme has been found.

1) Transition diagram for relational operators

2) Transition diagram for reserved words and identifiers

3) Transition diagram for unsigned numbers

4) Transition diagram for whitespace
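
A transition diagram can be turned into code by keeping a state variable and branching on the next character. The diagrams themselves are not reproduced in this text, so the sketch below assumes the relational operators <, <=, <>, =, > and >=; the state numbers and the attribute names (LT, LE, and so on) are likewise assumptions.

```python
def relop(text, pos=0):
    """Simulate a transition diagram for relational operators.

    Returns (attribute, next_pos) when an accepting state is reached,
    or None when no relational operator starts at `pos`.
    """
    state = 0
    while True:
        ch = text[pos] if pos < len(text) else ""   # "" marks end of input
        if state == 0:                # start state
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("EQ", pos + 1)
            elif ch == ">":
                state = 2
            else:
                return None
            pos += 1
        elif state == 1:              # seen "<"
            if ch == "=":
                return ("LE", pos + 1)
            if ch == ">":
                return ("NE", pos + 1)
            return ("LT", pos)        # accept without consuming the lookahead
        else:                         # state 2: seen ">"
            if ch == "=":
                return ("GE", pos + 1)
            return ("GT", pos)
```

The accepting states for < and > are reached only after looking at one more character; that character is not consumed, which corresponds to retracting the input pointer in the diagram.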

V. The lexical-analyzer generator Lex

1) Lex is a well-known tool in the UNIX environment. Its main function is to generate the C source code of a lexical analyzer from a description whose rules are written as regular expressions. The description file (*.l) is processed by Lex to produce the file lex.yy.c, which is then compiled by a C compiler into the lexical analyzer. Put simply, the task of the lexical analyzer is to convert the input characters into the corresponding tokens, which are easy to process in subsequent stages.

2) Using Lex to build a lexical analyzer

3) Conflict resolution in Lex

Always prefer the longest prefix of the input that matches a pattern. If the longest prefix matches several patterns, select the pattern listed first in the Lex program.
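
The two conflict-resolution rules can be imitated in a few lines of Python. This is not how Lex is implemented (Lex compiles the rules into an automaton); it is a sketch that tries every rule and compares match lengths. The rule names and patterns are hypothetical, and the patterns are simple enough that each one's own match is already its longest.

```python
import re

# Hypothetical rules in the order they would appear in a Lex program.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("LE", re.compile(r"<=")),
    ("LT", re.compile(r"<")),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at `pos`; earlier rules win ties."""
    best = None                         # (length, rule name, lexeme)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly greater: a later rule never replaces an equally long match.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), name, m.group())
    return None if best is None else (best[1], best[2])
```

With these rules, `next_token("if")` returns the IF token because both IF and ID match two characters and IF is listed first, while `next_token("iffy")` returns an ID because the longest prefix wins.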

VI. Finite automata

1) A finite automaton can be used to describe the process of recognizing patterns in an input string, so it can also be used to construct scanners. Finite automata and regular expressions are closely related.

2) Finite automata come in two kinds: deterministic (DFA) and nondeterministic (NFA). In a nondeterministic finite automaton, some state may have more than one transition on the same input symbol. Both deterministic and nondeterministic finite automata recognize exactly the regular languages, that is, the languages that can be described by regular expressions.

3) Components of an NFA

4) Transition tables

5) From the regular expression r = (a | b)*abb to an NFA
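
As an illustration of items 3) to 5), here is one possible NFA for (a | b)*abb written as a transition table, together with a simulation that tracks the set of states the automaton could be in. The state numbering and the names `NFA`, `START`, `ACCEPTING` and `nfa_accepts` are assumptions of this sketch; this particular NFA has no ε-transitions, so no ε-closure step is needed.

```python
# Transition table: (state, input symbol) -> set of next states.
# Missing entries mean that there is no transition.
NFA = {
    (0, "a"): {0, 1},   # two moves on 'a' from state 0: nondeterminism
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(text):
    """Return True if some path through the NFA ends in an accepting state."""
    current = {START}
    for symbol in text:
        current = {t for s in current for t in NFA.get((s, symbol), set())}
    return bool(current & ACCEPTING)
```

For the input `aabb` the set of possible states evolves as {0}, {0, 1}, {0, 1}, {0, 2}, {0, 3}, and the string is accepted because the final set contains state 3.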

This article is an English version of an article originally written in Chinese on aliyun.com.
