[Compiler Principles] Chapter 3: Lexical analysis

Source: Internet
Author: User

I. Functions of the lexical analyzer

Lexical analysis is the first stage of compilation. The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and output a sequence of tokens, one token for each lexeme.

The analysis part of a compiler is split into lexical analysis and syntax analysis. Separating the two simplifies compiler design, improves compiler efficiency, and enhances compiler portability.

1) Token (lexical unit): a token name and an optional attribute value. Examples: keywords, operators, ...

2) Pattern: a description of the form that the lexemes of a token may take. When the token is a keyword, the pattern is simply the character sequence of the keyword.

3) Lexeme (word element): a character sequence in the source program that matches the pattern of a token.

4) Lexical error: when no pattern matches the input, the analyzer reports the incorrect lexeme and continues with the next one.
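
The relationship between tokens, patterns, lexemes, and lexical errors can be illustrated with a small sketch. The token names and patterns below are assumptions chosen only for illustration, and the first pattern that matches wins, which is simpler than what a real lexical analyzer does.

```python
import re

# Hypothetical token names and patterns, for illustration only.
TOKEN_PATTERNS = [
    ("IF", re.compile(r"if\b")),          # keyword: the pattern is the keyword itself
    ("ID", re.compile(r"[A-Za-z_]\w*")),
    ("NUMBER", re.compile(r"\d+")),
    ("RELOP", re.compile(r"<=|>=|<>|<|>|=")),
    ("WS", re.compile(r"\s+")),           # whitespace is matched but not reported
]


def tokenize(source):
    """Return (tokens, errors); each token is a (name, lexeme) pair."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        for name, regex in TOKEN_PATTERNS:
            match = regex.match(source, pos)
            if match:
                if name != "WS":
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # Lexical error: no pattern matches here. Record the offending
            # character, skip it, and continue with the next lexeme.
            errors.append((pos, source[pos]))
            pos += 1
    return tokens, errors
```

For the input `if x1 <= 42 $ y`, the lexeme `x1` matches the pattern of the token `ID`, and the character `$` is recorded as a lexical error before scanning resumes.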

II. Input buffering

1) We often have to look ahead at least one character before we can determine where the current lexeme ends.

2) Large source programs contain a large number of characters, and processing them usually takes a lot of time. To reduce the overhead, we use two buffers that are reloaded alternately. (For details, see 73)
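
A minimal sketch of the two-buffer idea follows. The class name `TwoBufferReader`, its methods, and the default buffer size are assumptions made for illustration; the sketch does not use sentinel characters.

```python
import io


class TwoBufferReader:
    """Reads characters through two buffers that are reloaded alternately."""

    def __init__(self, stream, size=4096):
        self.stream = stream
        self.size = size
        self.buffers = [stream.read(size), ""]  # the second buffer is filled later
        self.current = 0   # index of the buffer being scanned
        self.forward = 0   # position of the next character in that buffer

    def _switch_if_exhausted(self):
        buf = self.buffers[self.current]
        if buf and self.forward == len(buf):
            # The current buffer is used up: reload the other one and switch.
            other = 1 - self.current
            self.buffers[other] = self.stream.read(self.size)
            self.current = other
            self.forward = 0

    def peek(self):
        """Look ahead one character without consuming it ('' at end of input)."""
        self._switch_if_exhausted()
        buf = self.buffers[self.current]
        return buf[self.forward] if self.forward < len(buf) else ""

    def advance(self):
        """Consume and return the next character ('' at end of input)."""
        ch = self.peek()
        if ch:
            self.forward += 1
        return ch
```

Usage: `TwoBufferReader(io.StringIO("abcde"), size=2)` yields the characters one by one while reading from the stream only two characters at a time. Because only one buffer is overwritten at each reload, the most recently scanned buffer is still available when the other one is filled.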

III. Specification of tokens

1) Can we run out of buffer space? Usually not, because long strings are normally written as several shorter pieces concatenated with "+".

2) Regular expression: letter_ (letter_ | digit)* denotes a letter or underscore followed by zero or more letters, underscores, or digits.

3) Regular expression examples
a | b                 {a, b}
(a | b)(a | b)        {aa, ab, ba, bb}
aa | ab | ba | bb     {aa, ab, ba, bb}
a*                    the set of all strings of a's, including the empty string
(a | b)*              the set of all strings of a's and b's, including the empty string
A more complex example:
(00 | 11 | (01 | 10)(00 | 11)*(01 | 10))*    matches, for example, 01001101000010000010111001
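
The examples can be checked mechanically. The sketch below assumes Python's `re` syntax as a stand-in for the textbook notation; the helper `language` is a hypothetical name that simply enumerates every string up to a given length and keeps those the expression matches in full.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len fully matched by `pattern`."""
    regex = re.compile(pattern)
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    ]


# Assumed reading of the complex example: strings containing an even
# number of 0s and an even number of 1s.
EVEN = r"(00|11|(01|10)(00|11)*(01|10))*"

print(language(r"a|b", "ab", 2))
print(language(r"(a|b)(a|b)", "ab", 2))
print(language(r"a*", "ab", 3))
print(bool(re.fullmatch(EVEN, "01001101000010000010111001")))
```

The enumeration is exponential in `max_len`, so it is only suitable for spot-checking small cases like these.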

4) Extensions of regular expressions

1> One or more instances: the unary postfix operator "+" means "one or more instances"; that is, the regular expression r+ denotes the set of all strings formed by concatenating one or more strings from L(r). The operators + and * have the same precedence and associativity. The algebraic laws r* = r+ | ε and r+ = rr* relate the two operators.
2> Zero or one instance: the unary postfix operator "?" means "zero or one instance"; r? is shorthand for r | ε. If r is a regular expression, then (r)? is a regular expression denoting the language L(r) ∪ {ε}. The definition num → digit+ (. digit+)? (E (+ | -)? digit+)? describes unsigned numbers.
3> Character classes: [abc] (where a, b, and c are symbols of the alphabet) denotes the regular expression a | b | c. The abbreviated class [a-z] denotes a | b | ... | z. The regular expression [A-Za-z][A-Za-z0-9]* describes identifiers.
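
The three extensions carry over almost directly to Python's `re` syntax, which is used here as an assumed stand-in for the textbook notation. The names `DIGIT`, `NUMBER`, and `IDENT` are chosen for this sketch.

```python
import re

DIGIT = r"[0-9]"                                   # character class
# digit+ (. digit+)? (E (+|-)? digit+)?
NUMBER = re.compile(rf"{DIGIT}+(\.{DIGIT}+)?(E[+-]?{DIGIT}+)?")
# [A-Za-z][A-Za-z0-9]*
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_number(text):
    """True if the whole text is an unsigned number."""
    return NUMBER.fullmatch(text) is not None


def is_identifier(text):
    """True if the whole text is an identifier."""
    return IDENT.fullmatch(text) is not None
```

With these definitions, `42`, `3.14`, and `6.02E23` are accepted as unsigned numbers, while `3.` and `.5` are rejected because each `digit+` requires at least one digit.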


IV. Recognition of tokens

In a transition diagram, some states are accepting (final) states; reaching one indicates that a lexeme has been found.

1) Transition diagram for relational operators
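
A transition diagram for relational operators can be read as a small state machine. The sketch below is one possible encoding; the state names and the attribute names (`LT`, `LE`, `NE`, `EQ`, `GT`, `GE`) are assumptions made for illustration.

```python
def get_relop(text, pos=0):
    """Return (attribute, next_pos) for a relational operator at pos, or None."""
    state = "START"
    while True:
        ch = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == "START":
            if ch == "<":
                state = "SEEN_LT"
            elif ch == ">":
                state = "SEEN_GT"
            elif ch == "=":
                return ("EQ", pos + 1)              # accepting state
            else:
                return None                         # not a relational operator
            pos += 1
        elif state == "SEEN_LT":
            if ch == "=":
                return ("LE", pos + 1)
            if ch == ">":
                return ("NE", pos + 1)
            return ("LT", pos)   # accept and retract: ch is not part of the lexeme
        elif state == "SEEN_GT":
            if ch == "=":
                return ("GE", pos + 1)
            return ("GT", pos)   # accept and retract
```

The two "retract" cases show why lookahead is needed: after reading `<`, the analyzer must see the next character before it can tell `<` from `<=` or `<>`, and that character is not consumed when it belongs to the next lexeme.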

2) Transition diagram for reserved words and identifiers
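
One common way to handle reserved words is to recognize them with the same diagram as identifiers and then look the lexeme up in a table of reserved words. The sketch below assumes that approach; the table contents and the function name are hypothetical.

```python
RESERVED = {"if", "then", "else"}   # hypothetical reserved-word table


def _is_letter(ch):
    return ch.isascii() and (ch.isalpha() or ch == "_")


def _is_letter_or_digit(ch):
    return ch.isascii() and (ch.isalnum() or ch == "_")


def get_id_or_keyword(text, pos=0):
    """Return (token_name, lexeme, next_pos), or None if no identifier starts at pos."""
    if pos >= len(text) or not _is_letter(text[pos]):
        return None
    end = pos + 1
    # Loop on letters and digits; stop (without consuming) at any other character.
    while end < len(text) and _is_letter_or_digit(text[end]):
        end += 1
    lexeme = text[pos:end]
    # A reserved word gets its own token name; everything else is an ID.
    name = lexeme.upper() if lexeme in RESERVED else "ID"
    return (name, lexeme, end)
```

Because the whole lexeme is scanned before the lookup, `ifx` is reported as an identifier rather than as the keyword `if` followed by `x`.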


3) Transition diagram for unsigned numbers


4) Transition diagram for whitespace



V. The lexical-analyzer generator Lex

1) Lex is a well-known tool in the UNIX environment. Its main function is to generate C source code for a lexical analyzer from rules written as regular expressions. The lexical analyzer is described in a *.l file. Lex compiles this file into a file named lex.yy.c, which the C compiler then compiles into the lexical analyzer. Put simply, the task of the lexical analyzer is to convert the input symbols into the corresponding tokens, which are easier to process in subsequent stages.

2) Using Lex to build a lexical analyzer


3) Lex conflict resolution

Always select the longest prefix. If the longest prefix matches multiple patterns, always select the pattern listed first in the Lex program.
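
The two rules can be sketched outside Lex as follows. This is not how Lex is implemented; it only mimics the selection rule by trying every pattern and comparing match lengths. The rule list is an assumption chosen so that both rules come into play.

```python
import re

# Hypothetical rules, in the order they would be listed in a Lex program.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("LE", re.compile(r"<=")),
    ("LT", re.compile(r"<")),
]


def next_token(text, pos=0):
    """Return (name, lexeme, next_pos) chosen by longest match, or None."""
    best = None
    for name, regex in RULES:
        match = regex.match(text, pos)
        # Strict ">" keeps the earlier rule when two matches have equal length.
        if match and (best is None or match.end() > best[2]):
            best = (name, match.group(), match.end())
    return best
```

For the input `iffy`, the longest prefix is the whole string, so the result is `ID` even though `IF` matches the first two characters. For the input `if`, both `IF` and `ID` match two characters, and `IF` wins because it is listed first.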

VI. Finite automata

1) A finite automaton can be used to describe the process of recognizing patterns in an input string, so it can also be used to construct scanners. Finite automata and regular expressions are closely related.

2) Finite automata are divided into two types: deterministic (DFA) and nondeterministic (NFA). In a nondeterministic finite automaton, a state may have more than one transition on the same input symbol. Both kinds recognize exactly the regular sets, that is, exactly the languages that can be described by regular expressions.

3) Components of an NFA


4) Transition table


5) From the regular expression r = (a | b)*abb to an NFA
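
One possible NFA for (a | b)*abb has four states and no ε-transitions; it is written below as a transition table and simulated by tracking the set of states the automaton could be in. The state numbering is an assumption, and this NFA is not necessarily the one a systematic construction would produce.

```python
# Transition table: state -> input symbol -> set of next states.
# State 0 has two transitions on "a", which is what makes this an NFA.
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(text):
    """True if some path through the NFA spells out text and ends in an accepting state."""
    current = {START}
    for ch in text:
        # Follow every possible transition on ch from every current state.
        current = {t for s in current for t in NFA[s].get(ch, set())}
    return bool(current & ACCEPTING)
```

For example, `nfa_accepts("aabb")` is true, while `nfa_accepts("abba")` is false because no path ends in state 3 after the final `a`.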



