[Translate] Compiler (7)-Scan

Source: Internet
Author: User

Here's the original.
———— Translation Divider Line ————

Compiler (7)-Scan

Part I: Introduction
Part II: Compilation, Translation and Interpretation
Part III: Compiler Design Overview
Part IV: Language Design Overview
Part V: Calc 1 Language Specification
Part VI: Tokens

Now we can finally start working on the scanner.

Lexical analysis

So, where do we start?

This was the hardest part for me. Scanning looks like it should be simple, but I quickly got lost in the details. There are many ways to implement a scanner, and I will show only one of them. Rob Pike gave a talk on another cool approach: Lexical Scanning in Go.

The basic principle is that the scanner reads the source code from top to bottom and from left to right until it reaches the end. Each time it finds an element it is looking for, it reports the lexeme (the matched string) together with a token that tells the parser what the element is and where it was found.
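As a rough sketch of what "reporting" could look like, the scanner can hand the parser three things per element: a token kind, a position, and the matched text. The names Token and Item and the token kinds listed here are assumptions made for illustration; the real set comes from the language specification.

```go
package main

import "fmt"

// Token identifies the kind of element the scanner found.
// These kinds are hypothetical examples.
type Token int

const (
	EOF Token = iota
	ILLEGAL
	INTEGER
	LPAREN
	RPAREN
	ADD
)

var names = [...]string{"EOF", "ILLEGAL", "INTEGER", "LPAREN", "RPAREN", "ADD"}

func (t Token) String() string { return names[t] }

// Item is what the scanner reports for one element.
type Item struct {
	Tok Token  // what it is
	Pos int    // where it was found (byte offset)
	Lit string // the matched text (lexeme)
}

func (it Item) String() string {
	return fmt.Sprintf("%d:%s %q", it.Pos, it.Tok, it.Lit)
}

func main() {
	// What a scanner might report for the source "(+ 12".
	items := []Item{{LPAREN, 0, "("}, {ADD, 1, "+"}, {INTEGER, 3, "12"}}
	for _, it := range items {
		fmt.Println(it)
	}
}
```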

Finite state machine

I am not going to go deep into the details of finite state machines (finite automata) or the related theory. You should explore that material yourself. Coursera has a course on compiler design that covers these topics and might help. The ideas are important, but you do not need to know every detail (though I encourage you to learn them).

The basic idea is that our scanner can only ever be in, and report, a finite number of states. Each state is identified by a token, and the scanner can only return tokens from the limited set that has been defined in advance. In that sense our scanner has a finite number of states, which makes it a finite state machine. Understanding automata helps with understanding regular expressions, and with deciding whether an individual character should be accepted or rejected during scanning.
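To make accepting and rejecting characters concrete, here is a minimal sketch of a finite automaton that recognizes unsigned integers. It is not part of the Calc scanner; the state names and the step and acceptsInteger functions are made up for this example.

```go
package main

import "fmt"

// state is one state of a tiny finite automaton that recognizes
// unsigned integers such as "0" or "1234".
type state int

const (
	start   state = iota // nothing read yet
	inDigit              // one or more digits read (accepting state)
	reject               // a non-digit was seen (dead state)
)

// step returns the next state for the current state and input character.
func step(s state, ch rune) state {
	if s == reject {
		return reject
	}
	if ch >= '0' && ch <= '9' {
		return inDigit
	}
	return reject
}

// acceptsInteger runs the automaton over the whole input and reports
// whether it ends in the accepting state.
func acceptsInteger(input string) bool {
	s := start
	for _, ch := range input {
		s = step(s, ch)
	}
	return s == inDigit
}

func main() {
	for _, in := range []string{"42", "4x2", ""} {
		fmt.Printf("%q accepted: %v\n", in, acceptsInteger(in))
	}
}
```

Each character either keeps the machine in an accepting state or sends it to the dead state, which is the same decision a scanner makes one character at a time.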

This will all become clear soon.

Oops

I want to point out some of the mistakes I made when I first tried to write a scanner. It is a bad idea to write any part of a compiler before you have a reliable definition of the language. If the core design of the language keeps changing, you will have to rewrite the compiler constantly. And I do mean constantly. I had to rewrite my first interpreter several times, and every revision of the language was painful. It was a total waste of time.

(Translator's note: my application to join the Yin language group was rejected, mostly because I wrote "looking for the specification" in the application. The author and Master Wang take completely different approaches to designing and implementing languages and compilers. I think there is no right or wrong here, only choices.)

This process eventually made me realize what a bad decision that was. Ideas that seemed great at first often turned out to be poor ones, but making no decision at all would have been disastrous in the end. Many times while designing the language I asked myself, "What on earth were you thinking? That's stupid!" Hindsight is a wonderful thing.

The scanner

The scanner is quite simple. Let's start with a simple object that keeps track of a few things: the character currently being scanned, its offset from the beginning of the file, the read offset, the source code being scanned, and a pointer to the details of the file itself.

The first step in scanning is to initialize the scanner with Init. Nothing special happens here apart from a call to the next method to load the first character, which I think of as "loading the ammunition."
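A minimal sketch of such an object and its initializer might look like the following. The field names ch, offset and roffset follow the text; the File type, its AddLine method and the file name are hypothetical stand-ins, and next is reduced to the bare minimum here.

```go
package main

import "fmt"

// File is a hypothetical stand-in for the file details the scanner points to.
// Here it only remembers the offsets at which lines begin.
type File struct {
	Name  string
	Lines []int
}

// AddLine records the offset where a line starts.
func (f *File) AddLine(offset int) { f.Lines = append(f.Lines, offset) }

// Scanner holds the scanning state described above.
type Scanner struct {
	ch      rune   // character currently being looked at
	offset  int    // offset of ch from the start of the file
	roffset int    // read offset: where the next character starts
	src     string // source code being scanned
	file    *File  // details of the file itself
}

// Init resets the scanner and loads the first character.
func (s *Scanner) Init(file *File, src string) {
	s.file = file
	s.src = src
	s.offset, s.roffset = 0, 0
	s.file.AddLine(0) // the first line starts at offset 0
	s.next()          // load the first character
}

// next loads one byte, or zero to signal end of file.
func (s *Scanner) next() {
	s.ch = 0
	if s.roffset < len(s.src) {
		s.offset = s.roffset
		s.ch = rune(s.src[s.offset])
		s.roffset++
	}
}

func main() {
	var s Scanner
	s.Init(&File{Name: "example.calc"}, "(+ 1 2)")
	fmt.Printf("ch=%q offset=%d roffset=%d\n", s.ch, s.offset, s.roffset)
}
```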

The next method is more interesting. It first sets the current character to zero, which signals the end of the file. If the read offset (roffset) is less than the length of the source, the offset (offset) is set to the read offset. If a newline is found, its location is recorded in the file object. Note that although newlines are discarded, their locations are still recorded. Finally, the current character is updated and the read offset is incremented.
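The steps of next can be sketched as follows. This is an illustration under assumptions, not the Calc source: the lines slice stands in for the file object, and one byte is treated as one character.

```go
package main

import "fmt"

// Scanner is trimmed to the fields next needs. The lines slice is a
// stand-in (an assumption) for the file object that records newlines.
type Scanner struct {
	ch      rune
	offset  int
	roffset int
	src     string
	lines   []int
}

// next advances the scanner by one character.
func (s *Scanner) next() {
	s.ch = 0 // zero means end of file unless a character is loaded below
	if s.roffset < len(s.src) {
		s.offset = s.roffset
		if s.src[s.offset] == '\n' {
			// The newline is discarded later as whitespace,
			// but its position is still recorded.
			s.lines = append(s.lines, s.offset)
		}
		s.ch = rune(s.src[s.offset]) // one byte per character for now
		s.roffset++
	}
}

func main() {
	s := &Scanner{src: "1\n2"}
	for s.next(); s.ch != 0; s.next() {
		fmt.Printf("ch=%q offset=%d\n", s.ch, s.offset)
	}
	fmt.Println("newlines at:", s.lines)
}
```

Note that at the end of the file the offset is left on the last character, which matters later when slicing out a number that ends the file.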

Read offset and Unicode

How should the read offset be incremented? Unicode makes this tricky: a character may occupy one or more bytes, so you cannot simply add one to the offset each time. The DecodeRune function of the utf8 package returns the character along with the number of bytes it occupies. With that, the read offset always marks the starting position of the next character to be read.
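A small sketch of advancing a read offset by character width is shown here. It uses utf8.DecodeRuneInString, the string variant of DecodeRune from the same package; the runeStarts helper is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts returns the byte offset at which each character of src begins.
func runeStarts(src string) []int {
	var starts []int
	roffset := 0
	for roffset < len(src) {
		// width is the number of bytes the character occupies.
		_, width := utf8.DecodeRuneInString(src[roffset:])
		starts = append(starts, roffset)
		roffset += width // now marks the start of the next character
	}
	return starts
}

func main() {
	// "a" is 1 byte, "\u00e9" (e with acute accent) is 2, "\u4e16" is 3.
	fmt.Println(runeStarts("a\u00e9\u4e16")) // prints [0 1 3]
}
```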

The current scanner is not Unicode-friendly, but I have already worked these functions in so that adding Unicode support later takes less effort. The IsDigit and IsSpace functions of the unicode package are used as well.

Scanning

This is the core function of the scanner. The Scan method first skips whitespace. skipWhitespace simply advances the scanner one character at a time until it reaches the first non-whitespace character. I use unicode.IsSpace to do this.
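Skipping whitespace could be sketched like this. The Scanner type is trimmed to what the example needs and is an assumption, as is the byte-at-a-time next.

```go
package main

import (
	"fmt"
	"unicode"
)

// Scanner is trimmed to the fields this example needs.
type Scanner struct {
	ch      rune
	offset  int
	roffset int
	src     string
}

// next loads the following character, or zero at end of file.
func (s *Scanner) next() {
	s.ch = 0
	if s.roffset < len(s.src) {
		s.offset = s.roffset
		s.ch = rune(s.src[s.offset])
		s.roffset++
	}
}

// skipWhitespace advances until the current character is not whitespace.
func (s *Scanner) skipWhitespace() {
	// unicode.IsSpace(0) is false, so the loop also stops at end of file.
	for unicode.IsSpace(s.ch) {
		s.next()
	}
}

func main() {
	s := &Scanner{src: " \t\n 42"}
	s.next()
	s.skipWhitespace()
	fmt.Printf("stopped at %q, offset %d\n", s.ch, s.offset)
}
```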

Next, it looks for multi-character elements. For now, that just means numbers. Then it looks for single-character elements, and finally a catch-all reports illegal characters or the end of the file.

Each branch ends with a call to next to advance the scanner, and the result of the scan is returned.

We also need to keep the language specification at hand. It tells us exactly what needs to be scanned. If you need to, jump back to Part V to find it.

Integer

If we encounter a digit, we use scanNumber to scan the longer string of digits.

I chose to use the unicode.IsDigit function, but you could just as easily write a simple implementation of your own. For example, return s.ch >= '0' && s.ch <= '9' would be enough. scanNumber keeps advancing the scanner until it encounters the first non-digit. The if statement after the loop handles the case where the number is at the end of the file.
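Here is one way scanNumber could look, as a sketch under assumptions. It returns only the digits as a string, and it relies on a next that leaves the offset on the last character at end of file, which is why the if statement after the loop is needed.

```go
package main

import (
	"fmt"
	"unicode"
)

// Scanner is trimmed to the fields this example needs.
type Scanner struct {
	ch      rune
	offset  int
	roffset int
	src     string
}

// next loads the following character, or zero at end of file.
// At end of file the offset stays on the last character.
func (s *Scanner) next() {
	s.ch = 0
	if s.roffset < len(s.src) {
		s.offset = s.roffset
		s.ch = rune(s.src[s.offset])
		s.roffset++
	}
}

// scanNumber consumes consecutive digits and returns them as a string.
// It expects the current character to be a digit.
func (s *Scanner) scanNumber() string {
	start := s.offset
	for unicode.IsDigit(s.ch) {
		s.next()
	}
	end := s.offset
	if s.ch == 0 {
		// The number ran to the end of the file, so offset still points
		// at the last digit. Move one past it to include that digit.
		end++
	}
	return s.src[start:end]
}

func main() {
	for _, src := range []string{"123)", "123"} {
		s := &Scanner{src: src}
		s.next()
		fmt.Printf("%q -> %q\n", src, s.scanNumber())
	}
}
```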

In later versions of Calc, or in your own variant, I would extend this function to handle numbers in other formats, such as floating-point or hexadecimal.

More scans

If no number is found, the scanner goes on to check for single-character elements. Everything except the semicolon is easy to understand.

Everything from a semicolon to the end of the line is a comment. Comments are simply discarded in the current scanner implementation, but it would be easy to keep them. For example, the Go scanner passes any comments it finds on to the parser, which is what makes it possible to generate the Go documentation that everyone loves!
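The steps described in this part, from skipping whitespace to discarding comments, could be combined roughly as follows. This is a sketch and not the Calc source: the token set (parentheses and the operators + - * / %) and the skipComment helper are assumptions, and comments are dropped rather than returned.

```go
package main

import (
	"fmt"
	"unicode"
)

type Token int

const (
	EOF Token = iota
	ILLEGAL
	INTEGER
	LPAREN
	RPAREN
	ADD
	SUB
	MUL
	QUO
	REM
)

type Scanner struct {
	ch      rune
	offset  int
	roffset int
	src     string
}

func (s *Scanner) Init(src string) {
	s.src, s.offset, s.roffset = src, 0, 0
	s.next()
}

func (s *Scanner) next() {
	s.ch = 0
	if s.roffset < len(s.src) {
		s.offset = s.roffset
		s.ch = rune(s.src[s.offset])
		s.roffset++
	}
}

// skipComment discards everything up to the end of the line or file.
func (s *Scanner) skipComment() {
	for s.ch != '\n' && s.ch != 0 {
		s.next()
	}
}

// Scan returns the next token, its offset, and its literal text.
func (s *Scanner) Scan() (tok Token, pos int, lit string) {
	// Skip whitespace and comments until something meaningful shows up.
	for unicode.IsSpace(s.ch) || s.ch == ';' {
		if s.ch == ';' {
			s.skipComment()
		} else {
			s.next()
		}
	}
	pos = s.offset
	// Multi-character elements: only integers for now.
	if unicode.IsDigit(s.ch) {
		for unicode.IsDigit(s.ch) {
			s.next()
		}
		end := s.offset
		if s.ch == 0 {
			end++ // the number ran to the end of the file
		}
		return INTEGER, pos, s.src[pos:end]
	}
	// Single-character elements, then the catch-all.
	lit = string(s.ch)
	switch s.ch {
	case 0:
		return EOF, len(s.src), ""
	case '(':
		tok = LPAREN
	case ')':
		tok = RPAREN
	case '+':
		tok = ADD
	case '-':
		tok = SUB
	case '*':
		tok = MUL
	case '/':
		tok = QUO
	case '%':
		tok = REM
	default:
		tok = ILLEGAL
	}
	s.next()
	return tok, pos, lit
}

func main() {
	var s Scanner
	s.Init("(+ 1 23) ; add two numbers\n")
	for {
		tok, pos, lit := s.Scan()
		fmt.Printf("pos=%d tok=%d lit=%q\n", pos, tok, lit)
		if tok == EOF {
			break
		}
	}
}
```

To keep comments instead of discarding them, the scanner could return a dedicated comment token with the comment text as its literal.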

Summary

That's all there is to the scanner. Pretty straightforward.

Onward to the parser!
