Lexical analysis is generally the first phase of a compiler, and it is conceptually simple: at its core it is a finite state machine.
Lexical analysis is the process of converting a source file into a stream of pre-defined tokens.
That token stream is then fed into the parser for syntax analysis; here we focus only on lexical analysis.
There are several ways to do lexical analysis:
- Use a generator tool such as Lex directly.
- Use lower-level regular expressions.
- Hand-write a state machine driven by state actions.
There is nothing wrong with using a generator tool to implement a real language, but it is hard to get good error messages out of one.
Tool-generated error handling is weak, you have to learn yet another rule set or grammar notation, and the generated code is not easy to optimize; on the other hand, a tool makes it very simple to get a lexer working.
In their early days, compilers such as GCC and Go used Lex to build their lexical analyzers, but a truly production-quality language usually ends up with a hand-written lexer rather than depending on generated code, so that it can apply its own specific modifications and optimizations.
One issue with regular expressions is efficiency. Perl, for example, has the most powerful regular expressions, but its regex engine is quite slow; Go sacrifices some regex features to keep performance from degrading too badly. Either way, the weakness of regular expressions on large amounts of text is obvious, and the states we need to handle here usually do not require the heavy machinery of a regex engine at all.
In fact, implementing a lexical analyzer is very simple, and the basic technique does not change: once you have written one, every later one follows the same pattern.
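As a rough illustration of the third approach above (this is not code from the Go source; the token kinds and helper names are invented for the sketch), a hand-written lexer is essentially a loop that switches on the current character:

```go
package main

import (
	"fmt"
	"unicode"
)

// Hypothetical token kinds for this sketch.
type kind int

const (
	kIdent kind = iota
	kNumber
	kOther
)

// next scans one token starting at src[i] and returns its kind,
// its literal text, and the offset just past the token.
func next(src string, i int) (kind, string, int) {
	start := i
	ch := rune(src[i])
	switch {
	case unicode.IsLetter(ch): // identifier: letters followed by letters/digits
		for i < len(src) && (unicode.IsLetter(rune(src[i])) || unicode.IsDigit(rune(src[i]))) {
			i++
		}
		return kIdent, src[start:i], i
	case unicode.IsDigit(ch): // number: digits only, for simplicity
		for i < len(src) && unicode.IsDigit(rune(src[i])) {
			i++
		}
		return kNumber, src[start:i], i
	default: // any other single character is its own token
		return kOther, src[start : i+1], i + 1
	}
}

func main() {
	src := "x1+42"
	for i := 0; i < len(src); {
		k, lit, j := next(src, i)
		fmt.Println(k, lit)
		i = j
	}
}
```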
Let's look at Go's implementation first. In the Go source tree, go/token/token.go defines the token type like this:
```go
// Token is the set of lexical tokens of the Go programming language.
type Token int
```
Token is actually an enumeration type, with one value for each kind of literal and symbol. Strictly speaking, this is only the token's kind, not the token itself.
```go
// The list of tokens.
const (
	// Special tokens
	ILLEGAL Token = iota
	EOF
	COMMENT

	literal_beg
	// Identifiers and basic type literals
	// (these tokens stand for classes of literals)
	IDENT  // main
	INT    // 12345
	FLOAT  // 123.45
	IMAG   // 123.45i
	CHAR   // 'a'
	STRING // "abc"
	// omitted
)
```
It enumerates all the token types that can be encountered.
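For example, token.Lookup maps an identifier string to its keyword token if it is a keyword, and the Token type has helper methods to classify a value. This is a small illustrative snippet, not code from the article:

```go
package main

import (
	"fmt"
	"go/token"
)

func main() {
	// Lookup returns the keyword token for a keyword string,
	// and IDENT for everything else.
	fmt.Println(token.Lookup("return")) // return
	fmt.Println(token.Lookup("main"))   // IDENT

	// Tokens can be classified by kind.
	fmt.Println(token.INT.IsLiteral())  // true
	fmt.Println(token.ADD.IsOperator()) // true
	fmt.Println(token.FUNC.IsKeyword()) // true
}
```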
go/token/position.go contains the definitions related to token positions.
```go
// -----------------------------------------------------------------------------
// Positions

// Position describes an arbitrary source position
// including the file, line, and column location.
// A Position is valid if the line number is > 0.
//
type Position struct {
	Filename string // filename, if any
	Offset   int    // offset, starting at 0
	Line     int    // line number, starting at 1
	Column   int    // column number, starting at 1 (byte count)
}
```
Position is a straightforward way to mark a location in a file. More interesting is Pos, defined as type Pos int, which is a compact representation of a position. Let's look at how Pos and Position are converted into each other.
A FileSet can be thought of as a large array that stores the bytes of all files in order. Each File occupies an interval [base, base+size] of that array, where base is the position of the file's first byte in the large array and size is the length of the file. A Pos within a file is simply an offset inside the [base, base+size] interval.
In this way a Pos can be compressed into a single integer that identifies a location in a file; when the full Position object is needed, the FileSet converts the Pos back into it.
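A minimal sketch of this conversion (the file names and sizes here are arbitrary):

```go
package main

import (
	"fmt"
	"go/token"
)

func main() {
	fset := token.NewFileSet()
	// Each added file gets its own [base, base+size] interval in the set.
	fset.AddFile("a.go", fset.Base(), 100)
	f := fset.AddFile("b.go", fset.Base(), 200)

	// A Pos is just an integer: the file's base plus a byte offset.
	p := f.Pos(10)
	fmt.Println(int(p))           // compact integer form
	fmt.Println(fset.Position(p)) // expanded form: filename, line and column
}
```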
go/token/serialize.go handles serialization of the FileSet and is skipped here.
So the whole go/token package is just definitions of tokens and conversions between position representations; the lexical analysis itself lives in go/scanner.
The main body of Scan is shown below: it is essentially a state machine written as a switch statement.
For example, when the scanned character is a letter, it is scanned as an identifier; when it is a digit, the scanner keeps consuming digits, possibly a decimal point, and then more digits, until no digit follows.
Scan returns the scanned token, the compressed representation of its position, and the literal string, so a source file is turned into a stream of tokens. That is the process of tokenization, i.e. lexical analysis.
```go
func (s *Scanner) Scan() (pos token.Pos, tok token.Token, lit string) {
scanAgain:
	s.skipWhitespace()

	// current token start
	pos = s.file.Pos(s.offset)

	// determine token value
	insertSemi := false
	switch ch := s.ch; {
	/* the character starts an identifier: scan an identifier */
	case isLetter(ch):
		lit = s.scanIdentifier()
		if len(lit) > 1 {
			// keywords are longer than one letter - avoid lookup otherwise
			tok = token.Lookup(lit)
			switch tok {
			case token.IDENT, token.BREAK, token.CONTINUE, token.FALLTHROUGH, token.RETURN:
				insertSemi = true
			}
		} else {
			insertSemi = true
			tok = token.IDENT
		}
	/* the character starts a digit: scan a number */
	case '0' <= ch && ch <= '9':
		insertSemi = true
		tok, lit = s.scanNumber(false)
	default:
		// ... (rest omitted)
	}
	// ...
}
```
Take a look at the results of the example.
```go
func ExampleScanner_Scan() {
	// src is the input that we want to tokenize.
	// This is the source "file" that needs to be tokenized.
	src := []byte("cos(x) + 1i*sin(x) // Euler")

	// Initialize the scanner.
	var s scanner.Scanner
	fset := token.NewFileSet()                      // positions are relative to fset
	file := fset.AddFile("", fset.Base(), len(src)) // register input "file" in the file set
	s.Init(file, src, nil /* no error handler */, scanner.ScanComments)

	// Repeated calls to Scan yield the token sequence found in the input.
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
	}

	// Scanning repeatedly produces the results below;
	// this is exactly what lexical analysis does.
	// output:
	// 1:1	IDENT	"cos"
	// 1:4	(	""
	// 1:5	IDENT	"x"
	// 1:6	)	""
	// 1:8	+	""
	// 1:10	IMAG	"1i"
	// 1:12	*	""
	// 1:13	IDENT	"sin"
	// 1:16	(	""
	// 1:17	IDENT	"x"
	// 1:18	)	""
	// 1:20	;	"\n"
	// 1:20	COMMENT	"// Euler"
}
```
I implemented a lexer in my character-diagram tool for data structures [1]; it lets me describe a drawing with a simple syntax and then paste the result into comments to help explain the code.
The main difference is that I deliver tokens through a channel to add some concurrency, whereas Go's own tokenizer is serial. The difference is not that significant in practice; Rob Pike gave a related talk about this design [2], which is used in Go's template package.
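The sketch below shows only the channel-based delivery, not my tool's actual code or the template package's lexer: the lexer runs in its own goroutine and emits tokens on a channel while the consumer reads them. The tokenization itself is deliberately trivial (splitting on spaces); the point is the design.

```go
package main

import (
	"fmt"
	"strings"
)

// item is a scanned token sent over the channel; the field is
// made up for this sketch.
type item struct {
	lit string
}

// lex runs in its own goroutine and emits tokens as it finds them,
// so the consumer can start working before scanning has finished.
func lex(src string) <-chan item {
	out := make(chan item)
	go func() {
		defer close(out)
		for _, word := range strings.Fields(src) {
			out <- item{lit: word}
		}
	}()
	return out
}

func main() {
	for it := range lex("cos ( x ) + 1i * sin ( x )") {
		fmt.Println(it.lit)
	}
}
```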
- [1] https://github.com/ggaaooppeenngg/cpic/blob/master/lex.go
- [2] http://cuddle.googlecode.com/hg/talk/lex.html#landing-slide