Compiler phases, part 2: lexical analysis
- Source program -> compiler -> target program
- Compiler: front end, intermediate representation, back end
- Front end: lexical analyzer -> tokens -> syntax analyzer
- Intermediate representation: abstract syntax tree
- Back end: semantic analyzer
- Lexical analyzer: a program whose main job is to turn a character stream into a token stream
- Lexical analyzer example:
- Input character stream: if (x > 5)
- Lexical analysis result: IF LPAREN IDENT(x) GT INT(5) RPAREN
- Lexical analysis steps:
- Convert the character stream into an internal data structure, defined by the compiler, that encodes each recognized lexical unit
- Read i: recognize it and record it as k=IDENT; lexeme=i;
- Read f: recognize it and record it as k=IDENT; lexeme=f;
- Read the space: enter the accepting state, look up the keyword table, and return the keyword IF
- Read (: recognize it and record it as k=LPAREN; lexeme=nil;
- Read x: recognize it and record it as k=IDENT; lexeme=x;
- Read the space: enter the accepting state, look up the keyword table, and return the identifier IDENT(x)
- ......
- Definition of the token data structure
enum kind { ... };
struct token {
    enum kind k;    /* kind of lexical unit */
    char *lexeme;   /* matched source text, if any */
};
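As a minimal sketch, the token structure above could be filled in as follows for the input `if (x > 5)`. The enumerator list is an assumption taken from the analysis result shown earlier, and `kind_name` and `example_stream` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Token kinds for the running example; the exact set is an assumption. */
enum kind { IF, LPAREN, IDENT, GT, INT, RPAREN };

struct token {
    enum kind k;    /* kind of lexical unit */
    char *lexeme;   /* source text for IDENT and INT, NULL otherwise */
};

/* Hypothetical helper: printable name of a token kind. */
const char *kind_name(enum kind k) {
    static const char *names[] = {
        "IF", "LPAREN", "IDENT", "GT", "INT", "RPAREN"
    };
    assert((int)k >= 0 && (int)k <= (int)RPAREN);
    return names[k];
}

/* The token stream a lexer would be expected to produce for: if (x > 5) */
static struct token example_stream[] = {
    { IF, NULL }, { LPAREN, NULL }, { IDENT, "x" },
    { GT, NULL }, { INT, "5" },     { RPAREN, NULL },
};
```

Only tokens whose text matters later (identifiers and literals) carry a lexeme here; keywords and punctuation are fully described by their kind.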
Implementing a lexical analyzer
- Two approaches:
- Hand-coding: write the recognition code yourself
- Relatively complex and error-prone
- Still a very popular approach: GCC, LLVM, ...
- Lexer generator: takes a declarative specification as input and automatically generates the implementation code of the lexical analyzer
- Rapid prototyping, small amount of code
- But the details are harder to control
Hand-coded lexical analyzers: state transition diagrams
- Recognize the tokens of a high-level language
- State transition diagram algorithm
token nextToken()
    c = getChar();
    switch (c)
        case '<':
            c = getChar();
            switch (c)
                case '=': return LE;
                case '>': return NE;
                default:  rollback(); return LT;
        case '=': return EQ;
        case '>':
            c = getChar();
            switch (c) ...
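The pseudocode above can be made runnable roughly as follows. This is a sketch under assumptions: the input is a NUL-terminated string set through a hypothetical `set_input`, the `'>'` branch (elided above) is filled in with `GE` and `GT`, and `NONE` is a made-up kind for any character that does not start a relational operator.

```c
#include <assert.h>
#include <stddef.h>

enum relop { LT, LE, NE, EQ, GT, GE, NONE };

static const char *input = "";  /* hypothetical input buffer */
static size_t pos = 0;          /* index of the next unread character */
static int advanced = 0;        /* did the last getChar() consume a character? */

void set_input(const char *s) {
    assert(s != NULL);
    input = s;
    pos = 0;
    advanced = 0;
}

/* Read one character; at end of input return '\0' without advancing. */
static int getChar(void) {
    int c = (unsigned char)input[pos];
    advanced = (c != '\0');
    if (advanced) pos++;
    return c;
}

/* Push the most recently read character back onto the input. */
static void rollback(void) {
    if (advanced) { pos--; advanced = 0; }
}

enum relop nextToken(void) {
    int c = getChar();
    switch (c) {
    case '<':
        c = getChar();
        switch (c) {
        case '=': return LE;
        case '>': return NE;
        default:  rollback(); return LT;
        }
    case '=': return EQ;
    case '>':                       /* assumed branch: >= and > */
        c = getChar();
        switch (c) {
        case '=': return GE;
        default:  rollback(); return GT;
        }
    default: return NONE;
    }
}
```

The `rollback()` call matters whenever the lookahead character belongs to the next token: after reading `<` followed by a space, the space must stay in the input.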
- Identifier transition diagram
- From the initial state, a letter or an underscore leads to state 1. In state 1, any letter, digit, or underscore loops back to state 1 (the closure). Any other character leads to the accepting state: return the token kind and push that character back, so the next round of recognition starts again from the initial state
- Keyword table algorithm
- Build a hash table H of all keywords of the given language
- Treat every keyword as an identifier at first: recognize it with the identifier transition diagram
- After recognition, look the lexeme up in table H to see whether it is a keyword
- With a well-constructed hash table H (a perfect hash), the lookup completes in O(1) time
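The identifier diagram described above can be sketched as a two-state loop. The function name `scan_ident` and its convention of returning the identifier length (0 if no identifier starts at `s`) are assumptions made for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the length of the identifier starting at s, or 0 if none starts there. */
size_t scan_ident(const char *s) {
    assert(s != NULL);
    size_t i = 0;
    int state = 0;                       /* 0 = initial state, 1 = inside identifier */
    for (;;) {
        unsigned char c = (unsigned char)s[i];
        switch (state) {
        case 0:
            if (isalpha(c) || c == '_') { state = 1; i++; }
            else return 0;               /* not the start of an identifier */
            break;
        case 1:
            if (isalnum(c) || c == '_') i++;   /* closure: stay in state 1 */
            else return i;               /* accept; c is left unread (the rollback) */
            break;
        }
    }
}
```

Leaving the terminating character unread plays the role of the rollback step: the caller resumes scanning at `s + scan_ident(s)`.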
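A minimal sketch of the keyword table lookup, assuming a small keyword set and an ordinary open-addressing hash table with linear probing rather than a true perfect hash. The names `init_keywords`, `classify`, and `TABLE_SIZE`, and the keyword list itself, are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum kind { IDENT, IF, ELSE, WHILE, INT, RETURN };

#define TABLE_SIZE 16   /* power of two, larger than the number of keywords */

struct entry { const char *word; enum kind k; };
static struct entry table[TABLE_SIZE];

/* Simple string hash over the first n bytes of s, reduced to a table index. */
static unsigned hash(const char *s, size_t n) {
    unsigned h = 5381;
    for (size_t i = 0; i < n; i++) h = h * 33 + (unsigned char)s[i];
    return h & (TABLE_SIZE - 1);
}

static void insert(const char *word, enum kind k) {
    unsigned i = hash(word, strlen(word));
    while (table[i].word != NULL) i = (i + 1) & (TABLE_SIZE - 1);  /* linear probing */
    table[i].word = word;
    table[i].k = k;
}

/* Build the table H once, before lexing starts. */
void init_keywords(void) {
    for (int i = 0; i < TABLE_SIZE; i++) table[i].word = NULL;
    insert("if", IF);
    insert("else", ELSE);
    insert("while", WHILE);
    insert("int", INT);
    insert("return", RETURN);
}

/* Classify a lexeme of length n that was already recognized as an identifier. */
enum kind classify(const char *lexeme, size_t n) {
    assert(lexeme != NULL);
    unsigned i = hash(lexeme, n);
    while (table[i].word != NULL) {
        if (strlen(table[i].word) == n && memcmp(table[i].word, lexeme, n) == 0)
            return table[i].k;           /* found in H: it is a keyword */
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return IDENT;                        /* not in H: an ordinary identifier */
}
```

With a perfect hash the probing loop would disappear, since every keyword would map to its own slot; the table here only approximates that with a low load factor.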
Automatically generated lexical analyzers: regular expressions
- Formalize the concept of the state transition diagram so that lexical analyzers can be generated automatically
- Given a character set Σ = {c1, c2, c3, ..., cn}
- Inductive definition: the first two cases are base cases, the last three are inductive
- The empty string ε is a regular expression
- For any c ∈ Σ, c is a regular expression
- Alternation: M | N = {M, N}
- Concatenation: MN = {mn | m ∈ M, n ∈ N}
- Closure: M* = {ε, M, MM, MMM, ...}
- The grammar of regular expressions, in which the first two forms are base cases and the last three are inductive
e -> ε            (1)
   | c (c ∈ Σ)    (2)
   | e | e        (3)
   | e e          (4)
   | e*           (5)
- Question: which regular expressions can be written over the character set Σ = {a, b}?
1. ε
2. a, b
3. ε|ε, ε|a, ...
4. εa, εb, ab, εε, a(ε|a), a(ε|b), ...
5. (a(ε|a))*, ε*, ...
- Example: keywords, with Σ = ASCII
- if: i ∈ Σ and f ∈ Σ; i and f are joined by one concatenation, so the result is still a regular expression
- int: i ∈ Σ, n ∈ Σ, and t ∈ Σ; they are joined by two concatenations, so by the inductive definition the result is still a regular expression
- Example: identifiers, expressed as a regular expression: one of (26+26+1) characters, followed by any number of (26+26+10+1) characters
- (a|b|...|z|A|B|...|Z|_)(a|b|...|z|A|B|...|Z|_|0|1|...|9)*
- Syntactic sugar: everything can be expressed with the basic regular expressions, but the abbreviated forms are more convenient to use
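To illustrate the sugared form, the sketch below checks identifiers with the POSIX `regex.h` API, assuming a POSIX system. The bracket expressions `[A-Za-z_]` and `[A-Za-z_0-9]` are shorthand for the long alternations over letters, digits, and underscore. The function name `is_identifier` is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if s is exactly one identifier, 0 otherwise. */
int is_identifier(const char *s) {
    assert(s != NULL);
    regex_t re;
    /* ^...$ anchors the match to the whole string. */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z_0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                        /* pattern failed to compile */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Compiling the pattern on every call keeps the sketch short; a real lexer would compile once, or more likely turn the expression into a DFA ahead of time.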
Finite automata (FA)
- A more general form of state transition diagram, divided into DFA and NFA
- Input string -> FA -> {Yes, No}
- M = (Σ, S, q0, F, δ): alphabet, state set, initial state, set of accepting states, transition function
- DFA example: for any character, there is at most one state to move to
- Alphabet: {a, b}
- State set: {0, 1, 2}
- Initial state: 0
- Set of accepting states: {2}
- Transition function:
{(q0,a)->q1, (q0,b)->, (q1,a)->q2, (q1,b)->, (q2,a)->q2, (q2,b)->q2, ... }
- NFA example: for a given character, there may be more than one state to move to
- Alphabet: {a, b}
- State set: {0, 1}
- Initial state: 0
- Set of accepting states: {1}
- Transition function:
{ ... }
- Implementation of the DFA
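A DFA M = (Σ, S, q0, F, δ) can be run directly from a transition table. The sketch below uses the alphabet {a, b}, states {0, 1, 2}, initial state 0, and accepting set {2} from the DFA example. The transitions on b out of states 0 and 1 are assumptions (self-loops) chosen for illustration, so the automaton here accepts strings containing at least two a's.

```c
#include <assert.h>
#include <stddef.h>

enum { NSTATES = 3, NSYMS = 2 };

/* delta[state][symbol]; column 0 is 'a', column 1 is 'b'. */
static const int delta[NSTATES][NSYMS] = {
    /*        a  b */
    /* q0 */ { 1, 0 },   /* (q0,b) -> q0 is an assumption */
    /* q1 */ { 2, 1 },   /* (q1,b) -> q1 is an assumption */
    /* q2 */ { 2, 2 },
};

static const int accepting[NSTATES] = { 0, 0, 1 };   /* F = {2} */

/* Return 1 (Yes) if the DFA accepts s, 0 (No) otherwise. */
int dfa_accepts(const char *s) {
    assert(s != NULL);
    int q = 0;                                   /* initial state */
    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b') return 0;    /* character outside the alphabet */
        q = delta[q][*s - 'a'];                  /* exactly one successor per character */
    }
    return accepting[q];
}
```

Because δ yields exactly one next state, the loop needs no backtracking; an NFA would need a set of current states instead of the single `q`.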
RE -> NFA
- Automatic generation
- Declarative specification -> NFA -> DFA -> lexical analyzer
- RE -> (Thompson algorithm) -> NFA -> (subset construction algorithm) -> DFA -> (minimization algorithm) -> lexical analyzer code
- Thompson algorithm
- Basic REs (e -> ε; e -> c) are constructed directly
- Compound REs are constructed recursively (for example a(b|c)*: recurse down to the basic REs, which are constructed directly)
- Example: a(b|c)*
- Split from left to right
- Final result: M = ({a,b,c}, {0, ..., 9}, 0, {9}, δ)
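The recursive construction can be sketched with one small function per RE form. All names here (`frag`, `nfa_char`, `nfa_concat`, `nfa_alt`, `nfa_star`, `build_example`) are hypothetical, the label 0 stands for ε, and the state numbering depends on the order in which fragments are created, so it need not match any particular drawing of the NFA.

```c
#include <assert.h>

#define MAX_EDGES 64
#define EPS 0                   /* edge label 0 stands for ε */

struct edge { int from; int label; int to; };
struct frag { int start; int accept; };    /* an NFA fragment */

static struct edge edges[MAX_EDGES];
static int nedges = 0;
static int nstates = 0;

static int new_state(void) { return nstates++; }

static void add_edge(int from, int label, int to) {
    assert(nedges < MAX_EDGES);
    edges[nedges++] = (struct edge){ from, label, to };
}

/* Allocate a fresh start/accept pair. */
static struct frag new_frag(void) {
    struct frag f;
    f.start = new_state();
    f.accept = new_state();
    return f;
}

/* e -> c : built directly */
struct frag nfa_char(int c) {
    struct frag f = new_frag();
    add_edge(f.start, c, f.accept);
    return f;
}

/* e -> e1 e2 : link e1's accept state to e2's start state with ε */
struct frag nfa_concat(struct frag a, struct frag b) {
    add_edge(a.accept, EPS, b.start);
    return (struct frag){ a.start, b.accept };
}

/* e -> e1 | e2 : new start and accept states around both branches */
struct frag nfa_alt(struct frag a, struct frag b) {
    struct frag f = new_frag();
    add_edge(f.start, EPS, a.start);
    add_edge(f.start, EPS, b.start);
    add_edge(a.accept, EPS, f.accept);
    add_edge(b.accept, EPS, f.accept);
    return f;
}

/* e -> e1* : allow skipping e1 or repeating it */
struct frag nfa_star(struct frag a) {
    struct frag f = new_frag();
    add_edge(f.start, EPS, a.start);
    add_edge(f.start, EPS, f.accept);
    add_edge(a.accept, EPS, a.start);
    add_edge(a.accept, EPS, f.accept);
    return f;
}

/* Build the NFA for a(b|c)* from the inside out. */
struct frag build_example(void) {
    nedges = 0;
    nstates = 0;
    struct frag a = nfa_char('a');
    struct frag b = nfa_char('b');
    struct frag c = nfa_char('c');
    return nfa_concat(a, nfa_star(nfa_alt(b, c)));
}
```

Each basic RE and each alternation or closure adds two states, so a(b|c)* yields ten states in this sketch, which agrees with the state set {0, ..., 9} in the result above.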
NFA -> DFA
- Subset Construction algorithm
q0 <- eps_closure(n0)                    // ε-closure of the NFA start state n0
Q <- {q0}
worklist <- {q0}
while (worklist != [])                   // while the worklist is not empty
    remove q from worklist               // take one element off the worklist
    foreach (character c)                // loop over all 256 characters
        t <- eps_closure(delta(q, c))    // move on c, then take the ε-closure of the result
        D[q, c] <- t                     // record the transition, e.g. (q0, c) -> q1
        if (t not in Q)                  // if subset t is not yet in Q
            add t to Q and worklist      // add it to both
- This is a fixed-point algorithm and it always terminates: Q = {q0, q1, q2, q3, ...} is finite, because an NFA with N states has at most 2^N subsets
- The worst-case time complexity is O(2^N)
- This rarely happens in practice, because not every subset actually appears
- Algorithm steps:
- From the start state q0, read any character of Σ to reach a set of nodes, then take the ε-closure of those nodes; the two parts together form the subset q1
- From the elements of subset q1, read any character of Σ to reach a set of nodes, then take the ε-closure of those nodes; the two parts together form the subset q2
- Continue until no character of the alphabet produces a new subset
- Result: Q = {q0, q1, q2, q3, ...}
- ε-closure computation: depth-first search; the closure includes the node itself
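/* Depth-first search, time complexity O(N) */
// Global variable: a set, initially empty
set closure = {};
void eps_closure(x)
    closure += {x}                         // add x to the set
    foreach (y such that x --ε--> y)       // y is reachable from x via an ε edge
        if (!visited(y))                   // if y has not been visited, recurse into y
            eps_closure(y)
// If there are several starting nodes, compute the closure of each one and take the union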
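The worklist algorithm and the ε-closure can be combined into one small program. This is a sketch under assumptions: NFA state sets are bitmasks in an `unsigned`, the alphabet has two symbols (0 for a, 1 for b) instead of 256 characters, and the NFA is a hypothetical 4-state machine for (a|b)*ab with an ε edge from state 0 to state 1.

```c
#include <assert.h>

enum { NFA_STATES = 4, NSYMS = 2, MAX_DFA = 16 };

typedef unsigned set;    /* bit i set <=> NFA state i is in the set */

/* Hypothetical NFA: 0 -ε-> 1, 1 -a-> {1,2}, 1 -b-> {1}, 2 -b-> {3}. */
static const set eps[NFA_STATES] = { 1u << 1, 0, 0, 0 };
static const set move[NFA_STATES][NSYMS] = {
    { 0, 0 },
    { (1u << 1) | (1u << 2), 1u << 1 },
    { 0, 1u << 3 },
    { 0, 0 },
};

/* Depth-first ε-closure of state x, accumulated into *closure. */
static void eps_closure(int x, set *closure) {
    *closure |= 1u << x;                               /* includes x itself */
    for (int y = 0; y < NFA_STATES; y++)
        if (((eps[x] >> y) & 1u) && !((*closure >> y) & 1u))
            eps_closure(y, closure);                   /* y not visited yet */
}

/* ε-closure of a set: the union of the closures of its members. */
static set closure_of(set s) {
    set out = 0;
    for (int x = 0; x < NFA_STATES; x++)
        if (((s >> x) & 1u) && !((out >> x) & 1u))
            eps_closure(x, &out);
    return out;
}

static set Q[MAX_DFA];           /* DFA states, each a subset of NFA states */
static int D[MAX_DFA][NSYMS];    /* DFA transitions; -1 means no transition */
static int nQ;

/* Build Q and D; returns the number of DFA states. */
int subset_construction(void) {
    int worklist[MAX_DFA], top = 0;
    nQ = 0;
    Q[nQ] = closure_of(1u << 0);                       /* q0 = ε-closure of n0 */
    worklist[top++] = nQ++;
    while (top > 0) {
        int q = worklist[--top];
        for (int c = 0; c < NSYMS; c++) {
            set m = 0;
            for (int x = 0; x < NFA_STATES; x++)
                if ((Q[q] >> x) & 1u) m |= move[x][c]; /* delta(q, c) */
            set t = closure_of(m);
            if (t == 0) { D[q][c] = -1; continue; }
            int j = 0;
            while (j < nQ && Q[j] != t) j++;           /* is t already in Q? */
            if (j == nQ) {                             /* no: add to Q and worklist */
                assert(nQ < MAX_DFA);
                Q[nQ++] = t;
                worklist[top++] = j;
            }
            D[q][c] = j;
        }
    }
    return nQ;
}
```

For this NFA only four of the sixteen possible subsets are ever reached, which illustrates why the exponential worst case is uncommon in practice.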
DFA minimization: fewer edges and fewer states, so fewer resources are used
- Algorithm: Hopcroft's algorithm, based on the idea of equivalence classes
// S: a set of states; split: partition it
split(S)
    foreach (character c)
        if (c can split S)
            split S into T1, T2, ..., Tk

hopcroft()
    split all states into N, A    // two incompatible groups: non-accepting (N) and accepting (A) states
    while (set is still changing)
        split(S)
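The equivalence-class idea can be sketched with a simpler round-based partition refinement. This is not Hopcroft's worklist algorithm itself, only the same splitting idea applied to all states in each round. The DFA is a hypothetical 4-state machine for (a|b)*ab, and `minimize`, `cls`, and `same_signature` are illustrative names.

```c
#include <assert.h>

enum { NSTATES = 4, NSYMS = 2 };

/* Hypothetical DFA for (a|b)*ab; column 0 is 'a', column 1 is 'b'. */
static const int delta[NSTATES][NSYMS] = {
    { 1, 2 }, { 1, 3 }, { 1, 2 }, { 1, 2 },
};
static const int accepting[NSTATES] = { 0, 0, 0, 1 };

static int cls[NSTATES];    /* cls[s] = equivalence class of state s */

/* Two states stay together if they were in the same class and every
   character leads them into the same class. */
static int same_signature(int s, int t, const int *old) {
    if (old[s] != old[t]) return 0;
    for (int c = 0; c < NSYMS; c++)
        if (old[delta[s][c]] != old[delta[t][c]]) return 0;
    return 1;
}

/* Refine until the number of classes stops changing; returns that number. */
int minimize(void) {
    int old[NSTATES];
    int n = 0, prev;
    for (int s = 0; s < NSTATES; s++)
        cls[s] = accepting[s];               /* initial split into N and A */
    do {
        prev = n;
        for (int s = 0; s < NSTATES; s++) old[s] = cls[s];
        n = 0;
        for (int s = 0; s < NSTATES; s++) {
            int t = 0;
            while (t < s && !same_signature(s, t, old)) t++;
            cls[s] = (t < s) ? cls[t] : n++; /* join t's class or open a new one */
        }
    } while (n != prev);                     /* fixed point: no class was split */
    return n;
}
```

In this example states 0 and 2 behave identically on every character and end up in one class, so the four-state DFA shrinks to three states.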
Code representations of the DFA
- Transition table (adjacency matrix), indexed by state and character
- Hash table
- Lexical analysis driver code
- Longest match
- Jump table
- Which one to choose depends on the actual situation; it is a time-space tradeoff
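A transition table combined with a longest-match driver might look like the sketch below. The tiny DFA (integers and floats), the character classes, and the names `longest_match`, `accept_tok`, and `char_class` are all assumptions chosen to keep the table small.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { DEAD = -1 };
enum cls { DIGIT, DOT, OTHER, NCLS };
enum tok { T_NONE, T_INT, T_FLOAT };

/* Transition table: rows are states, columns are character classes. */
static const int delta[4][NCLS] = {
    /*              DIGIT  DOT   OTHER */
    /* 0 start */ { 1,     DEAD, DEAD },
    /* 1 int   */ { 1,     2,    DEAD },
    /* 2 "n."  */ { 3,     DEAD, DEAD },
    /* 3 float */ { 3,     DEAD, DEAD },
};

/* Token returned if the DFA stops in each state (T_NONE = not accepting). */
static const enum tok accept_tok[4] = { T_NONE, T_INT, T_NONE, T_FLOAT };

static enum cls char_class(unsigned char c) {
    if (isdigit(c)) return DIGIT;
    if (c == '.') return DOT;
    return OTHER;
}

/* Run the DFA as far as possible and remember the last accepting position.
   Returns the token kind and stores the lexeme length in *len. */
enum tok longest_match(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    int state = 0;
    enum tok last = T_NONE;
    size_t last_len = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        state = delta[state][char_class((unsigned char)s[i])];
        if (state == DEAD) break;            /* no further match is possible */
        if (accept_tok[state] != T_NONE) {   /* remember the longest match so far */
            last = accept_tok[state];
            last_len = i + 1;
        }
    }
    *len = last_len;                         /* roll back to the last accepting position */
    return last;
}
```

On the input `12.x` the driver reads past `12.` before failing, then falls back to the integer `12`; that fallback is the essence of the longest-match rule. Mapping characters to classes first keeps the table at 4 x 3 entries instead of 4 x 256, one instance of the time-space tradeoff mentioned above.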