Atitit. Developing your own compilers and interpreters (1): a summary of lexical analysis -------- attilax



1. Application scenario: DSLs greatly improve development efficiency
2. The process: lexical analysis (generate the token stream) >> parsing (generate the AST) >> interpretation
3. How to perform lexical analysis? The FSM (finite state machine)
4. Building the FSM with the State pattern (simple, easy to use; the recommended choice)
5. Code (building the FSM with the State pattern)
6. Lexical analysis concepts
6.1. Lexical analysis and tokens
6.2. Five types of tokens
7. Other lexical analysis methods
7.1. switch/case or if/else
7.2. State tables
7.3. Building the FSM from an NFA/DFA (the professional method; difficult)
7.3.1. Limitations of the DFA
8. References

1. Application scenario: DSLs greatly improve development efficiency

To greatly improve development efficiency you end up using a large number of DSLs, which means implementing your own compilers and interpreters.

String s = "@QueryAdptr(sqlwhere=\"clo1='@p'\", prop2=\"v2\") @Nofilt";

The task here is to parse annotations such as the one above. Nothing ready-made fits, so the parser has to be implemented by hand: annotations in Java source code can be read through the Java API, but annotations embedded in other text (HTML, for example) must be parsed by your own code.

Author: attilax (ayron), email: [email protected]

Reprint please indicate source: http://blog.csdn.net/attilax

2. The process: lexical analysis (generate the token stream) >> parsing (generate the AST) >> interpretation

3. How to perform lexical analysis? The FSM (finite state machine)

A: A very simple idea is to keep a current state while processing the input one character at a time. The state records what kind of token is being read (an identifier, a number, a space, and so on); when the state changes, the characters accumulated so far can be taken as one complete token.

In other words, the input must be examined one character at a time and truncated at the proper place to yield a token.

The core is to distinguish the character classes that make up the different symbols, to cut a symbol off as soon as the next character can no longer belong to it, and to form a token from what was cut.
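A minimal Java sketch of this idea (the class and method names are invented, not from the article): the character class of the current input acts as the state, and a state change emits the token accumulated so far. To keep the sketch short, identifiers are letters only and numbers are digits only.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the current character class is the state; a state change
// marks the end of the previous token. Names are illustrative.
public class SimpleLexer {

    enum State { START, IDENT, NUMBER, SPACE }

    static State classify(char c) {
        if (Character.isLetter(c)) return State.IDENT;
        if (Character.isDigit(c))  return State.NUMBER;
        return State.SPACE;
    }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.START;
        for (char c : input.toCharArray()) {
            State next = classify(c);
            // A state change is the boundary of the previous token.
            if (state != State.START && next != state && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            state = next;
            if (state != State.SPACE) current.append(c);  // spaces separate, are not kept
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

For example, `tokenize("ab 12")` yields `["ab", "12"]`: the letter-to-space transition cuts off `ab`, and the end of input cuts off `12`.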

4. Building the FSM with the State pattern (simple, easy to use; the recommended choice)

This approach implements the FSM directly in code using the State design pattern; it is simple, easy to use, and the recommended first choice.

Using this method, attilax implemented lexical analysis for the first time, and it took only one day.

5. Code (building the FSM with the State pattern)

public static List getTokenList() {
    String s = "@QueryAdptr(sqlwhere=\"clo1='@p'\", prop2=\"v2\") @Nofilt";
    s = "@qu(at1=\"v1\", at2=\"V2 ABC\", at3=\"v3\")";

    // Create the environment
    AnnoPaserContext context = new AnnoPaserContext();

    // Create the initial state and set it on the environment
    context.setState(new IniState());

    int n = 0;
    while (!(context.state instanceof FinishState)) {
        // System.out.println(n);

        // Ask the current state to process the input
        context.request(s);

        n++;
        if (n > 200)
            break;  // safety guard against an endless loop
    }

    for (Token tk : context.tokenList) {
        // if (tk.value.trim().length() > 0)
        System.out.println(tk.value + " ");
    }

    return (List) context.tokenList;
}
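The listing above depends on classes it never shows (AnnoPaserContext, IniState, FinishState, Token). The skeleton below is a guess at the State-pattern structure they imply: each state object handles one step of the input and installs its successor. Every body here is illustrative only; the toy IniState simply emits one character per step so the machine is runnable.

```java
import java.util.ArrayList;
import java.util.List;

// Guessed skeleton of the State pattern implied by the listing above.
public class StatePatternSketch {

    static class Token {
        final String value;
        Token(String value) { this.value = value; }
    }

    interface State {
        void handle(AnnoPaserContext ctx, String input);
    }

    static class AnnoPaserContext {
        State state;
        int pos = 0;
        final List<Token> tokenList = new ArrayList<>();

        void setState(State s) { state = s; }
        // Delegate one step of work to the current state.
        void request(String input) { state.handle(this, input); }
    }

    // Toy initial state: emits one character per step as a "token",
    // then switches to the finish state at end of input.
    static class IniState implements State {
        public void handle(AnnoPaserContext ctx, String input) {
            if (ctx.pos >= input.length()) {
                ctx.setState(new FinishState());
                return;
            }
            ctx.tokenList.add(new Token(String.valueOf(input.charAt(ctx.pos++))));
        }
    }

    static class FinishState implements State {
        public void handle(AnnoPaserContext ctx, String input) { /* terminal state */ }
    }

    // Drives the machine the same way the listing above does.
    public static List<String> run(String s) {
        AnnoPaserContext context = new AnnoPaserContext();
        context.setState(new IniState());
        int n = 0;
        while (!(context.state instanceof FinishState) && n++ < 200) {
            context.request(s);
        }
        List<String> values = new ArrayList<>();
        for (Token tk : context.tokenList) values.add(tk.value);
        return values;
    }
}
```

The design point of the pattern is that the transition logic lives inside each state class instead of in one big conditional, so adding a state does not touch the driver loop.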

6. Lexical Analysis Concepts

6.1. Lexical analysis and tokens

Lexical analysis, in computer science, is the process of converting a sequence of characters into a sequence of words (tokens). The program or function that performs lexical analysis is called a lexical analyzer (lexer for short), also known as a scanner. The lexer generally exists as a function that the parser calls.

A word here is a string: the smallest unit that makes up the source code. The process of producing words from an input character stream is called tokenization, and during it the lexer also classifies the words.

Lexers generally do not care about the relationships between words (that belongs to syntactic analysis). For example, a lexer can recognize a parenthesis as a word, but it does not guarantee that parentheses match.

Lexical analysis, or scanning, is the first step of a compiler. The lexer reads the stream of characters that make up the source program, organizes them into a sequence of meaningful units called lexemes, and produces a lexical unit (token) as output for each lexeme.

In simple terms, lexical analysis reads the source program (which can be viewed as one very long string) and "cuts" it into small segments, each a lexical unit (token) with a specific meaning, such as a particular keyword or a number. The text in the source program that corresponds to a token is called its "lexeme."

A token is a word of the program's "sentence," much like the output of word segmentation in natural language.

6.2. Five types of tokens

The following regular expressions define the five word types above:

    Type                          Regular expression    Example
    Keyword "string"              string                string
    Identifier (variable name)    [a-z][a-z0-9]*        str
    Equals sign                   =                     =
    String literal constant       "[^"]*"               "Hello World"
    Semicolon                     ;                     ;

The next question is: how do we use these regular expression rules to perform lexical analysis? Regular expressions help us state the rules for each word, but they cannot parse the input string by themselves. For that we introduce the concept of the finite automaton, which actually processes the input string.
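Before introducing automata, the table's rules can be applied directly with java.util.regex, trying each rule in priority order at the current position. This is a sketch with invented names, and it has a deliberate flaw: given `string1` it first reports the keyword `string`, the exact mislexing problem section 7.3 discusses, because it does not implement longest match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: lex with the regular expressions from the table above,
// trying rules in priority order (keyword before identifier).
public class RegexLexer {

    static final String[][] RULES = {
        { "KEYWORD",    "string" },
        { "IDENT",      "[a-z][a-z0-9]*" },
        { "EQUALS",     "=" },
        { "STRING_LIT", "\"[^\"]*\"" },
        { "SEMI",       ";" },
        { "WS",         "\\s+" },   // matched but not emitted
    };

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the current position.
                if (m.lookingAt()) {
                    if (!rule[0].equals("WS")) {
                        tokens.add(rule[0] + ":" + m.group());
                    }
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) throw new IllegalArgumentException(
                    "no rule matches at position " + pos);
        }
        return tokens;
    }
}
```

For `string str = "Hello World";` this produces KEYWORD:string, IDENT:str, EQUALS:=, STRING_LIT:"Hello World", SEMI:;.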

7. Other lexical analysis methods

7.1. switch/case or if/else

This is surely the most intuitive approach: a pile of conditionals testing the state. It can be made to work, and for simple, small state machines it is the best fit, but it is undeniably primitive, and a large state machine built this way is hard to maintain.

Even then, functions such as checkstatechange() and performstatechange() become inherently bloated in the face of complex state, and may be difficult to write at all.

For a long time, the switch statement was essentially the only way to implement a finite state machine; even complex software systems such as compilers were mostly implemented directly this way. But as state machines were applied more deeply and grew more complex, the method began to face severe tests: if a machine has very many states, or unusually complex transitions between them, a state machine built from bare switch statements becomes unmaintainable.
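A minimal illustration of the switch style (the example is invented): the entire machine is one loop with a switch on the current state. Here a two-state machine counts words; imagine the switch with dozens of states to see why the text calls it unmaintainable at scale.

```java
// Sketch: a whole FSM as one switch statement. Fine at this size,
// unwieldy once states and transitions multiply.
public class SwitchLexer {

    enum State { DEFAULT, IN_WORD }

    public static int countWords(String input) {
        int words = 0;
        State state = State.DEFAULT;
        for (char c : input.toCharArray()) {
            switch (state) {
                case DEFAULT:
                    if (Character.isLetter(c)) {
                        words++;                 // entering a new word
                        state = State.IN_WORD;
                    }
                    break;
                case IN_WORD:
                    if (!Character.isLetter(c)) {
                        state = State.DEFAULT;   // the word has ended
                    }
                    break;
            }
        }
        return words;
    }
}
```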

7.2. State tables

A state table encodes the FSM as data rather than control flow: the current state and the input character select a cell in a table, and the cell names the next state. The code then shrinks to a small driver loop that performs table lookups.

7.3. Building the FSM from an NFA/DFA (the professional method; difficult)

The DFA is in effect an advanced version of the state table.

The performance of a configurable lexical analyzer built with the DFA method is quite good.

In general, a high-performance DFA is implemented as a two-dimensional table: one dimension indexes the input character, the other the DFA state, and each cell holds the target state reached after that character is read in that state. A separate table records which states are the accepting (end) states of which rules.
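A sketch of such a transition table (the states and character classes are illustrative): this tiny DFA accepts identifiers of the form [a-z][a-z0-9]* from the table in section 6.2, with a dead state for rejected input.

```java
// Sketch: a DFA as a two-dimensional transition table plus an
// accepting-state table, as described above.
public class TableDfa {

    // character classes: 0 = lowercase letter, 1 = digit, 2 = anything else
    static int charClass(char c) {
        if (c >= 'a' && c <= 'z') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    // states: 0 = start, 1 = inside identifier, 2 = dead (reject)
    static final int[][] TRANSITIONS = {
        //         letter  digit  other
        /* 0 */ {     1,     2,     2 },
        /* 1 */ {     1,     1,     2 },
        /* 2 */ {     2,     2,     2 },
    };

    // which states are accepting (end) states of the identifier rule
    static final boolean[] ACCEPTING = { false, true, false };

    public static boolean isIdentifier(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            state = TRANSITIONS[state][charClass(c)];  // one table lookup per char
        }
        return !s.isEmpty() && ACCEPTING[state];
    }
}
```

Because the machine is pure data, a configurable lexer can load different tables without changing the driver loop, which is the appeal of this method.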

A deterministic finite automaton (English: deterministic finite automaton, DFA) is an automaton that performs state transitions. Given a current state of the automaton and a character from its alphabet, the transition function determines the next state (which may be the same as the previous state).

This model is called the finite automaton (finite automaton, FA), sometimes the finite state machine (FSM).

There is also the nondeterministic finite automaton (NFA).

Consider `string1 = null;`. The string1 in this code is an identifier naming a variable. If the lexical scanner reported the keyword string as soon as it had scanned the characters s-t-r-i-n-g, its logic would be wrong. If instead it waits until the DFA can no longer move (reaches a stuck state), it can report the longest possible word. Concretely: after scanning string, the DFA is not yet stuck; the next character is 1, which moves it from the state accepting the keyword string to the state accepting an identifier; the character after that is a space, and since string1 followed by a space is not a prefix of any legal word, the DFA gets stuck. The last accepting state before it got stuck was the identifier state, so the scanner reports the identifier string1. This is how longest match is achieved.
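The longest-match rule in the paragraph above can be sketched as follows (invented names; only the keyword string and identifiers are handled): run until the automaton gets stuck, then classify the longest lexeme seen.

```java
// Sketch: longest match ("maximal munch"). Scan until the identifier
// automaton is stuck, then decide keyword vs identifier on the WHOLE lexeme.
public class LongestMatch {

    // Returns "KEYWORD", "IDENT", or "NONE" for the longest word at the start of s.
    public static String classifyFirstWord(String s) {
        int lastAccept = -1;
        int i = 0;
        // identifier automaton: a letter, then letters or digits
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean ok = (i == 0) ? Character.isLetter(c)
                                  : Character.isLetterOrDigit(c);
            if (!ok) break;        // automaton is stuck: stop scanning
            i++;
            lastAccept = i;        // every prefix so far is a legal identifier
        }
        if (lastAccept <= 0) return "NONE";
        String lexeme = s.substring(0, lastAccept);
        // the keyword wins only if it equals the entire longest lexeme
        return lexeme.equals("string") ? "KEYWORD" : "IDENT";
    }
}
```

Given `string1 = null` this reports IDENT for string1; given `string s` it reports KEYWORD, matching the behavior described above.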

First, let us look at the lexical analysis of Minisharp. Words in the Minisharp language fall into the following five classes, by priority and type:

1. Keywords

2. Identifiers

3. Integer numeric constants

4. Various punctuation marks

5. Whitespace characters, line breaks, and comments

These must be recognized as the five major classes: reserved words (keywords), identifiers (variables), constants, operators, and delimiters.

We now need a tool that can describe token types. To represent a set of strings that share some common structure, we write rules that describe the set; every member of the set is then considered a token of that specific type.

Matching the input directly with regular expressions is not only laborious but also slow, so we need another kind of representation designed for machines. A later chapter gives an algorithm for converting a regular expression into this machine-readable form, which this chapter has described: the finite automaton.

From regular expressions to the ε-NFA.

7.3.1. Limitations of the DFA

The DFA is a practical computational model because there is an online algorithm that simulates a DFA on an input stream in linear time and constant space. Given two DFAs, there are efficient algorithms for finding a DFA that recognizes the union, intersection, or complement of the languages they recognize. There are also efficient algorithms for deciding whether a DFA accepts a given string, whether a DFA accepts all strings, and whether two DFAs recognize the same language, and for finding the minimal DFA for a given regular language.

On the other hand, DFAs are strictly limited in the languages they can recognize: many simple languages, including any problem that needs more than constant space to solve, cannot be recognized by a DFA. The classic example of a simple language no DFA recognizes is the bracket language, consisting of correctly paired parentheses such as (()()), and likewise the language of strings of the form a^n b^n: a finite number of a's followed by an equal number of b's. It can be proved that no DFA has enough states to recognize these languages (informally, because it would need at least 2n states, and n is not bounded by any constant).
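The parenthesis example in miniature (illustrative code): a checker for balanced parentheses needs a counter that can grow without bound, which is exactly the unbounded memory that a DFA's fixed set of states cannot provide.

```java
// Sketch: recognizing the bracket language with an unbounded counter.
// No fixed number of DFA states can replace the counter for all inputs.
public class ParenChecker {

    public static boolean balanced(String s) {
        int depth = 0;                       // unbounded memory a DFA lacks
        for (char c : s.toCharArray()) {
            if (c == '(') depth++;
            else if (c == ')' && --depth < 0) return false;  // closed too early
        }
        return depth == 0;                   // everything opened was closed
    }
}
```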

8. References

NFA_DFA algorithm - in pool's grocery store - CSDN blog
DIY Development Compiler (2): regular languages and regular expressions - assemble head - Blog Park (cnblogs)
Atitit. Finite state machine FSM, the State pattern - attilax's column - CSDN blog
Atitit. Lexical analysis and the implementation of tokens, attilax summary - attilax - CSDN blog
Atitit. Annotation parsing (1): lexical analysis, attilax summary - attilax's column - CSDN blog

