Part 2: Lexical Analysis

Last Update:2016-05-30 Source: Internet

Author: User


Part 2, Lexical Analysis: compiler phases

Source program -> compiler -> target program
Compiler: front end, intermediate representation, back end
- Front end: lexical analyzer -> tokens -> syntax analyzer
- Intermediate representation: abstract syntax tree
- Back end: semantic analyzer
Lexical analyzer: a piece of program code whose main job is to turn a character stream into a token stream
Lexical analyzer:
- Character stream input: if (x > 5)
- Lexical analysis result: IF LPAREN IDENT(x) GT INT(5) RPAREN
- Lexical analysis steps:
  - Convert the character stream into a compiler-defined internal data structure that encodes the lexical units recognized
  - Read i, recognize it, and convert it into a data structure: k=IDENT; lexeme=i;
  - Read f, recognize it, and extend the data structure: k=IDENT; lexeme=if;
  - Read the space, enter the final state, look the lexeme up in the keyword table, and return the keyword: IF;
  - Read (, recognize it, and convert it into a data structure: k=LPAREN; lexeme=nil;
  - Read x, recognize it, and convert it into a data structure: k=IDENT; lexeme=x;
  - Read the space, enter the final state, look the lexeme up in the keyword table, find no keyword, and return the identifier: IDENT(x);
  - ......
Data structure definition for tokens

    enum kind { ... };
    struct token {
        enum kind k;
        char *lexeme;
    };
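
As a minimal sketch of how this structure might be used, the block below fills in the elided enumerators with the token kinds from the `if (x > 5)` example. The enumerator list and the helper `format_token` are assumptions made for illustration, not part of the original notes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Token kinds taken from the example output; the exact set is an assumption. */
enum kind { IF, LPAREN, IDENT, GT, INT, RPAREN };

struct token {
    enum kind k;
    char *lexeme;   /* NULL when the kind alone identifies the token */
};

/* Hypothetical helper: render a token as "KIND" or "KIND(lexeme)" into buf. */
static void format_token(struct token t, char *buf, size_t size)
{
    static const char *names[] = { "IF", "LPAREN", "IDENT", "GT", "INT", "RPAREN" };
    assert(buf != NULL && size > 0);
    if (t.lexeme != NULL)
        snprintf(buf, size, "%s(%s)", names[t.k], t.lexeme);
    else
        snprintf(buf, size, "%s", names[t.k]);
}
```

Calling `format_token` on a token with `k = IDENT` and `lexeme = "x"` writes `IDENT(x)` into the buffer, matching the notation used in the analysis result above.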

Implementing a lexical analyzer

Two approaches:
1. Hand-written: write the code that performs the conversion yourself
  - Relatively complex and error-prone
  - Still a very popular approach: GCC, LLVM, ...
2. Lexical analyzer generator: give it a declarative specification and it generates the lexical analyzer code automatically
  - Fast prototyping, little code
  - But the details are harder to control

Hand-written lexical analyzers: state transition diagrams

Recognizing the word symbols of a high-level language
State transition diagram algorithm

    token nextToken()
        c = getChar();
        switch (c)
            '<': c = getChar();
                 switch (c)
                     '=': return LE;
                     '>': return NE;
                     default: rollback(); return LT;
            '=': return EQ;
            '>': c = getChar();
                 switch (c)
                     ...
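
A runnable version of this diagram could look like the sketch below. It reads from a string through a pointer, so `getChar()` becomes a pointer increment and `rollback()` a decrement. The name `next_relop`, the `ERROR` kind, and the contents of the `>` branch (which the pseudocode elides) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum kind { LT, LE, NE, EQ, GT, GE, ERROR };

/* Scans one relational operator at *p and advances *p past it. */
static enum kind next_relop(const char **p)
{
    char c;
    assert(p != NULL && *p != NULL);
    c = *(*p)++;                       /* getChar() */
    switch (c) {
    case '<':
        c = *(*p)++;
        switch (c) {
        case '=': return LE;
        case '>': return NE;
        default:  (*p)--; return LT;   /* rollback(): un-read one character */
        }
    case '=':
        return EQ;
    case '>':                          /* assumed: mirrors the '<' branch */
        c = *(*p)++;
        if (c == '=')
            return GE;
        (*p)--;
        return GT;
    default:
        (*p)--;                        /* not an operator: consume nothing */
        return ERROR;
    }
}
```

The rollback is what lets `<` followed by an unrelated character return `LT` without swallowing that character.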

Transition diagram for identifiers

In the initial state, a letter or an underscore moves the automaton to state 1. In state 1, each further letter, digit, or underscore loops back to state 1 (the closure). Any other character moves it to the final state, which returns the token; that character is pushed back, and recognition restarts from the initial state for the next round
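
The identifier diagram can be sketched directly in C. The function name `scan_identifier` is hypothetical; it returns the length of the identifier instead of a token, which keeps the example small.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the identifier starting at s, or 0 if none starts there. */
static size_t scan_identifier(const char *s)
{
    size_t i;
    assert(s != NULL);
    /* Initial state: the first character must be a letter or an underscore. */
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    i = 1;
    /* State 1: loop on letters, digits, and underscores. */
    while (isalnum((unsigned char)s[i]) || s[i] == '_')
        i++;
    /* Final state: s[i] is left unconsumed, which plays the role of the rollback. */
    return i;
}
```

For the input `x1_y + 2` the function returns 4, leaving the space for the next round of recognition.
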
Keyword table algorithm
- Build a hash table H of all the keywords of the given language
- Treat keywords as identifiers at first: recognize both with the identifier transition diagram
- After recognition, look the lexeme up in table H to see whether it is a keyword
- With a well-constructed hash table H (a perfect hash), the lookup takes O(1) time
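
The lookup step might be sketched as follows. To stay short, this sketch swaps the perfect hash table for a sorted array searched with the standard `bsearch`, so the lookup is O(log n) rather than O(1). The keyword list, the token kinds, and the name `classify` are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum kind { IDENT, IF, INT, WHILE };   /* hypothetical token kinds */

struct keyword { const char *text; enum kind k; };

/* Must stay sorted by text, because bsearch requires a sorted array. */
static const struct keyword keywords[] = {
    { "if", IF }, { "int", INT }, { "while", WHILE },
};

static int compare_keyword(const void *key, const void *entry)
{
    return strcmp((const char *)key, ((const struct keyword *)entry)->text);
}

/* Called after the identifier diagram has accepted lexeme. */
static enum kind classify(const char *lexeme)
{
    const struct keyword *hit;
    assert(lexeme != NULL);
    hit = bsearch(lexeme, keywords, sizeof keywords / sizeof keywords[0],
                  sizeof keywords[0], compare_keyword);
    return hit ? hit->k : IDENT;
}
```

Because keywords are recognized as identifiers first, the transition diagram needs no extra states for `if` or `while`; only this table changes when the language gains a keyword.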

Automatically generated lexical analyzers: regular expressions

Regular expressions formalize the idea of the state transition diagram so that lexical analyzers can be generated automatically
For a given character set Σ = {c1, c2, c3, ..., cn}
Inductive definition: the first two cases are basic, the last three are inductive
- The empty string ε is a regular expression
- For any c ∈ Σ, c is a regular expression
- Alternation: M|N = {M, N}
- Concatenation: MN = {mn | m ∈ M, n ∈ N}
- Closure: M* = {ε, M, MM, MMM, ...}
The grammar of regular expressions, in the same order:

    e -> ε
       | c        (c ∈ Σ)
       | e | e
       | e e
       | e*

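Question: which regular expressions can be written for the character set Σ = {a, b}?

    1. ε
    2. a, b
    3. ε|ε, ε|a, ...
    4. εa, εb, ab, εε, a(ε|a), a(ε|b), ...
    5. (a(ε|a))*, the closure of ε, ...

Example: keywords, with Σ = ASCII

if: i ∈ Σ and f ∈ Σ, and one concatenation joins i and f, so the result is still a regular expression
int: i ∈ Σ, n ∈ Σ, t ∈ Σ, and two concatenations join them; by the inductive definition of regular expressions, the result is still a regular expression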

Example: Identifiers

Expressed as a regular expression, with (26+26+1) choices for the first character and (26+26+10+1) choices for each later character:
(a|b|...|z|A|B|...|Z|_)(a|b|...|z|A|B|...|Z|_|0|1|...|9)*
- Syntactic sugar: everything can be expressed with the basic regular expressions, but shorthand forms are easier to use
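
To illustrate the syntactic sugar, the sketch below writes the identifier expression with bracket ranges and checks strings against it using the POSIX `regcomp` and `regexec` functions. It assumes a POSIX system that provides `<regex.h>`; the function name `is_identifier` is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s is an identifier, 0 if not, -1 if the pattern fails to compile. */
static int is_identifier(const char *s)
{
    regex_t re;
    int status;
    assert(s != NULL);
    /* [A-Za-z_] is shorthand for a|b|...|z|A|B|...|Z|_ */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

The bracket ranges add no expressive power: each one expands to a plain alternation of single characters, exactly as in the long form above.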

Finite state automata (FA)

A more general form of the state transition diagram; there are two kinds, DFA and NFA
Input string -> FA -> {Yes, No}
M = (Σ, S, q0, F, δ): alphabet, state set, initial state, set of final states, transition function
DFA example: for any character, at most one state can be reached

Alphabet: {a, b}
State set: {0, 1, 2}
Initial state: 0
Final state set: {2}
Transition function:

{(q0,a)->q1,(q0,b)->(q1,a)->q2,(q1,b)->(q2,a)->q2,(q2,b)->q2, ... }

NFA example: for a given character, more than one state can be reached

- Alphabet: {a, b}
- State set: {0, 1}
- Initial state: 0
- Final state set: {1}
- Transition function:

... }

Implementation of the DFA

RE -> NFA

Automatic generation
- Declarative specification -> NFA -> DFA -> lexical analyzer
- RE -> (Thompson algorithm) -> NFA -> (subset construction algorithm) -> DFA -> (minimization algorithm) -> lexical analyzer code
Thompson algorithm
- Construct the basic REs directly (e -> ε; e -> c)
- Construct compound REs recursively (for a compound RE such as a(b|c)*, recurse down to the basic REs, which are constructed directly)
Example: a(b|c)*
Split from left to right
- Final result: M = ({a,b,c}, {0,...,9}, 0, {9}, δ)

NFA -> DFA

Subset construction algorithm

    q0 <- eps_closure(n0)                  // ε-closure of the NFA start state n0
    Q <- {q0}                              // the DFA states found so far
    workList <- {q0}
    while (workList != [])                 // while the worklist is not empty
        remove q from workList             // take one element off the worklist
        foreach (character c)              // loop over all 256 characters
            t <- eps_closure(delta(q, c))  // move on c, then take the closure of the result
            D[q, c] <- t                   // record the DFA edge, e.g. (q0, c) -> q1
            if (t not in Q)                // if the subset t is not yet in Q,
                add t to Q and workList    // add it to Q and to the worklist

This is a fixed-point algorithm, and it terminates: the elements of Q = {q0, q1, q2, q3, ...} are subsets of the NFA's n states, so there are at most 2^n of them
The worst-case time complexity is O(2^n)
The worst case is rare in practice, because not every subset actually appears
Algorithm steps:
1. From the start state q0, read any one character of Σ and collect the nodes that can be reached, then take the ε-closure of those nodes; the two parts together form the subset q1
2. From the elements of subset q1, read any character of Σ and collect the nodes that can be reached, then take the ε-closure of those nodes; the two parts together form the subset q2
3. Continue until no character of the alphabet leads to a new subset
4. Q = {q0, q1, q2, q3, ...}
  - ε-closure computation: depth-first; the closure includes the node itself

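    /* Depth-first search; time complexity O(N) */
    // global variable: a set, initially empty
    set closure = {};
    void eps_closure(x)
        closure += {x}                 // add x to the set
        foreach (y: x --ε--> y)        // every y reachable from x over an ε edge
            if (!visited(y))           // if y has not been visited, recurse into y
                eps_closure(y)
    // If there are several start nodes, compute the closure of each and take the union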
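
The ε-closure and the worklist loop can be combined into one runnable sketch. Sets of NFA states are stored as bitmasks. The NFA encoded below is one possible Thompson-style automaton for a(b|c)* with states 0 to 9; its exact edges are an assumption, since the original figure is not available, and names such as `move_and_close` and `dfa_next` are hypothetical.

```c
#include <assert.h>

#define NFA_STATES 10
#define ALPHABET 3            /* 'a', 'b', 'c' mapped to 0, 1, 2 */
#define MAX_DFA 32

/* A set of NFA states is a bitmask: bit i set means state i is in the set. */
typedef unsigned int set_t;
#define BIT(i) (1u << (i))

/* Assumed NFA for a(b|c)*: start state 0, accepting state 9. */
static const set_t eps[NFA_STATES] = {
    0, BIT(2), BIT(3) | BIT(9), BIT(4) | BIT(6), 0,
    BIT(8), 0, BIT(8), BIT(3) | BIT(9), 0
};
static const set_t step[NFA_STATES][ALPHABET] = {
    [0] = { BIT(1), 0, 0 },     /* 0 -a-> 1 */
    [4] = { 0, BIT(5), 0 },     /* 4 -b-> 5 */
    [6] = { 0, 0, BIT(7) },     /* 6 -c-> 7 */
};

/* Depth-first ε-closure: add x, then follow every unvisited ε edge. */
static void eps_closure(int x, set_t *closure)
{
    int y;
    *closure |= BIT(x);
    for (y = 0; y < NFA_STATES; y++)
        if ((eps[x] & BIT(y)) && !(*closure & BIT(y)))
            eps_closure(y, closure);
}

/* ε-closure of delta(q, c): move on c from every state in q, then close. */
static set_t move_and_close(set_t q, int c)
{
    set_t t = 0;
    int s, y;
    for (s = 0; s < NFA_STATES; s++) {
        if (!(q & BIT(s)))
            continue;
        for (y = 0; y < NFA_STATES; y++)
            if ((step[s][c] & BIT(y)) && !(t & BIT(y)))
                eps_closure(y, &t);
    }
    return t;
}

static set_t dfa_sets[MAX_DFA];          /* Q: the NFA subset behind each DFA state */
static int dfa_next[MAX_DFA][ALPHABET];  /* D: DFA transitions, -1 if none */

/* Returns the number of DFA states; state 0 is the DFA start state. */
static int subset_construction(void)
{
    int count = 1, done = 0, c, i;
    dfa_sets[0] = 0;
    eps_closure(0, &dfa_sets[0]);
    /* DFA states done..count-1 form the worklist. */
    while (done < count) {
        set_t q = dfa_sets[done];
        for (c = 0; c < ALPHABET; c++) {
            set_t t = move_and_close(q, c);
            dfa_next[done][c] = -1;
            if (t == 0)
                continue;                /* no move on c: leave the edge undefined */
            for (i = 0; i < count && dfa_sets[i] != t; i++)
                ;
            if (i == count) {            /* t is a new subset: add it to Q */
                assert(count < MAX_DFA);
                dfa_sets[count++] = t;
            }
            dfa_next[done][c] = i;
        }
        done++;
    }
    return count;
}
```

For this NFA the construction stops after four DFA states, far fewer than the 2^10 subsets that are possible in principle, which illustrates why the worst case is rare.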

DFA minimization: fewer edges and fewer states, so fewer resources are used

Algorithm: Hopcroft's algorithm, based on the idea of equivalence classes

    // S: a set of states; split: partition it
    split(S)
        foreach (character c)
            if (c can split S)
                split S into T1, T2, ..., Tk

    hopcroft()
        split all states into N, A     // two incompatible groups: non-accepting and accepting states
        while (set is still changing)
            split(S)
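
The splitting idea can be tried out with the sketch below. It is a plain partition-refinement loop in the spirit of `split`, not Hopcroft's optimized worklist version. The four-state DFA for a(b|c)* that it minimizes, and all names in it, are assumptions chosen for illustration.

```c
#include <assert.h>

#define STATES 4
#define ALPHABET 3   /* 'a', 'b', 'c' mapped to 0, 1, 2 */

/* Assumed DFA for a(b|c)*: start state 0, -1 means no transition. */
static const int next_state[STATES][ALPHABET] = {
    {  1, -1, -1 },
    { -1,  2,  3 },
    { -1,  2,  3 },
    { -1,  2,  3 },
};
static const int accepting[STATES] = { 0, 1, 1, 1 };

static int class_of[STATES];   /* equivalence class of each state */

/* Class reached from s on c under the current partition, or -1 if undefined. */
static int target_class(int s, int c)
{
    int t;
    assert(s >= 0 && s < STATES);
    t = next_state[s][c];
    return t < 0 ? -1 : class_of[t];
}

/* Two states stay together only if they share a class now and every
   character sends them to the same class. */
static int same_behavior(int s, int t)
{
    int c;
    if (class_of[s] != class_of[t])
        return 0;
    for (c = 0; c < ALPHABET; c++)
        if (target_class(s, c) != target_class(t, c))
            return 0;
    return 1;
}

/* Returns the number of states of the minimized DFA. */
static int minimize(void)
{
    int refined[STATES];
    int s, t, count = 0, previous;
    /* Initial split into N (non-accepting) and A (accepting). */
    for (s = 0; s < STATES; s++)
        class_of[s] = accepting[s];
    do {
        previous = count;
        count = 0;
        for (s = 0; s < STATES; s++) {
            /* Reuse the class of an earlier equivalent state, if there is one. */
            for (t = 0; t < s && !same_behavior(s, t); t++)
                ;
            refined[s] = (t < s) ? refined[t] : count++;
        }
        for (s = 0; s < STATES; s++)
            class_of[s] = refined[s];
    } while (count != previous);   /* stop once no class was split */
    return count;
}
```

On this DFA the three accepting states behave identically on every character, so they collapse into one class and the minimized automaton has two states.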

Example 1

Example 2

Code representation of the DFA

Transition table (adjacency matrix): indexed by state and character
Hash table
Lexical analysis driver code
Longest match
Jump table
Which one to choose depends on the actual time-space tradeoff
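
A transition table with a longest-match driver might be sketched as follows. The small DFA (identifiers and integers) is an assumption, as are all names; to keep the table compact, its columns are character classes rather than individual characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { START, IN_IDENT, IN_INT, NUM_STATES, DEAD = -1 };
enum { LETTER, DIGIT, OTHER, NUM_CLASSES };

/* Transition table: rows are states, columns are character classes. */
static const int table[NUM_STATES][NUM_CLASSES] = {
    /* START    */ { IN_IDENT, IN_INT,   DEAD },
    /* IN_IDENT */ { IN_IDENT, IN_IDENT, DEAD },
    /* IN_INT   */ { DEAD,     IN_INT,   DEAD },
};
static const int accepting[NUM_STATES] = { 0, 1, 1 };

static int char_class(unsigned char c)
{
    if (isalpha(c) || c == '_') return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;
}

/* Longest match: returns the length of the longest accepted prefix of s
   (0 if none) and stores the accepting state reached in *final_state. */
static size_t longest_match(const char *s, int *final_state)
{
    int state = START;
    size_t i = 0, best = 0;
    assert(s != NULL && final_state != NULL);
    *final_state = DEAD;
    while (s[i] != '\0') {
        state = table[state][char_class((unsigned char)s[i])];
        if (state == DEAD)
            break;
        i++;
        if (accepting[state]) {   /* remember the last accepting position */
            best = i;
            *final_state = state;
        }
    }
    return best;
}
```

The table costs memory proportional to states times columns but makes each step a single array lookup; a jump table or hash table trades some of that speed or simplicity for space.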


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
