Compiler (5): Language specification
Part I: Introduction
Part II: Compilation, Translation and Interpretation
Part III: Compiler Design Overview
Part IV: Language Design Overview
At last, we get to the design specification for Calc!
Designing the language
I want to keep the language as simple as possible. I call this language Calc. Obviously, it is meant for calculators. Clever, right? Okay, well, moving on!
I also want a single basic type. I decided on something as clever as the name of the language and called it Integer. I know, very clever. Your admiration will go down in the annals of history. For the sake of simplicity, I decided to avoid floating-point numbers, as well as binary, octal, hexadecimal, and scientific notation. I'll leave adding other notations as an exercise for you.
We also need to represent end-of-file, comments, and mathematical operators.
Before we can parse and execute our language, there is one more basic thing to settle: operator precedence and grouping. For better or worse, I decided not to wrestle with operator precedence this time. Instead, I chose prefix notation and Lisp-style expressions. That is, the operator comes before its operands, and the whole expression is enclosed in parentheses. Adding two and three looks like this:
Example 1 (simple expression):
(+ 2 3)
Result: 5
There are two benefits: 1) operator precedence is explicit, and 2) an operator can take any number of operands.
Example 2 (operator precedence):
(* (+ 3 1) (/ 4 2))
Result: 8
There is no ambiguity about the order in which the operations are performed. The addition and the division are each computed before the multiplication. No arithmetic precedence rules (BEDMAS) needed!
Example 3 (Multi-value):
(* 1 2 3 4 5)
Result: 120
In this case, we process the operands of the expression from left to right. The expression can also be written in the following equivalent forms:
(* (* (* (* 1 2) 3) 4) 5) == ((((1*2)*3)*4)*5) == 1*2*3*4*5
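The left-to-right handling described above is a left fold. Here is a minimal sketch in Python (used purely for illustration); the helper name fold_left is my own and not part of Calc.

```python
from functools import reduce
import operator


def fold_left(op, operands):
    """Apply a binary operator across the operands from left to right."""
    return reduce(op, operands)


# (* 1 2 3 4 5) is evaluated as ((((1*2)*3)*4)*5)
print(fold_left(operator.mul, [1, 2, 3, 4, 5]))  # 120

# Order matters for non-commutative operators: (- 10 3 2) is (10-3)-2
print(fold_left(operator.sub, [10, 3, 2]))  # 5
```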
Therefore, we need to define tokens for the opening and closing parentheses and for each mathematical operator we want to use.
Finally, we should have comments. We will only add single-line comments, but multi-line comments should not be difficult to add. Let's use a semicolon to start a comment.
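Since Calc has no strings, a semicolon can only ever start a comment, so removing one is simple. A minimal sketch, assuming a comment runs from the semicolon to the end of the line (the function name strip_comment is hypothetical):

```python
def strip_comment(line: str) -> str:
    """Drop everything from the first semicolon to the end of the line."""
    code, _sep, _comment = line.partition(";")
    return code.rstrip()


print(strip_comment("(+ 2 3) ; adds two and three"))  # (+ 2 3)
```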
The tokens we need
- Opening parenthesis
- Closing parenthesis
- Plus
- Minus sign
- Multiplication sign
- Division sign (quotient)
- Modulus (remainder)
- Integer
- Semicolon (comment)
That's all we need before starting on lexical analysis. Attentive readers will have noticed that I did not define tokens for any keywords, built-in functions, or variables. That's because our simple language doesn't have any of these things.
The goal of this series is not to teach you how to design a full-fledged, fully functional programming language. The core purpose is simply to provide the basic techniques needed to write a first, working compiler.
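One way to write the token list down in code is an enumeration plus a lookup table for the single-character tokens. This is only a sketch: the names Token and SINGLE_CHAR_TOKENS, and the member names, are my own choices. EOF is included because we said earlier that end-of-file needs to be represented.

```python
from enum import Enum, auto


class Token(Enum):
    LPAREN = auto()   # (
    RPAREN = auto()   # )
    ADD = auto()      # +
    SUB = auto()      # -
    MUL = auto()      # *
    QUO = auto()      # /
    REM = auto()      # %
    INTEGER = auto()  # one or more digits
    COMMENT = auto()  # ; up to the end of the line
    EOF = auto()      # end of the input


# Tokens that can be recognised from a single character.
SINGLE_CHAR_TOKENS = {
    "(": Token.LPAREN,
    ")": Token.RPAREN,
    "+": Token.ADD,
    "-": Token.SUB,
    "*": Token.MUL,
    "/": Token.QUO,
    "%": Token.REM,
    ";": Token.COMMENT,
}

print(SINGLE_CHAR_TOKENS["%"].name)  # REM
```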
In the next section of the series, I'll describe how to add these features.
Whitespace
The next thing to decide is how to deal with whitespace. In languages such as Lisp or Go, whitespace is often ignored. I say "often" because it is not ignored entirely. Whitespace inside strings and comments is preserved, and it is also used to separate elements such as numbers. Still, the fact that it is mostly ignored shows that it carries little weight in the language's syntax or semantics.
By the way, whitespace characters are usually spaces, tabs, and line breaks. On Windows, you should also include the carriage return. Basically, any non-alphanumeric character that can be typed on a keyboard but is not displayed directly counts as whitespace.
Of course, your language may choose to make whitespace significant. In Python, for example, it matters a great deal because it is used to delimit statements and blocks. Calc does not do this.
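Ignoring whitespace usually comes down to stepping over it before reading the next token. A minimal sketch, assuming the input is held in a string and a position index (the names WHITESPACE and skip_whitespace are hypothetical):

```python
# Space, tab, line feed and carriage return, as described above.
WHITESPACE = " \t\n\r"


def skip_whitespace(src: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(src) and src[pos] in WHITESPACE:
        pos += 1
    return pos


print(skip_whitespace("  \t\n42", 0))  # 4
```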
Grammar
We have listed all the elements that can appear in the code. Now we need to give them meaning.
Because we want to use the language to express mathematical equations, expressions get most of our attention. There are no statements, though (a bad joke; I hope someone gets it).
Each expression must be enclosed in parentheses, start with an operator, and take two or more operands. For example:
"(" operator operand1 operand2 [... [operandN]] ")"
I put the parentheses in quotation marks to indicate that they are literal text: they are required elements, not rules to be expanded further. The operator and the first two operands are also required. The square brackets indicate that additional operands are optional.
What can an operator be? It can only be one of the following symbols: + - * / %
So what about the operands? Well, things are starting to get interesting now, but I think we can handle them. An operand can be a number or an expression. This allows us to perform complex computations through the nesting of expressions.
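Because an operand can itself be an expression, the natural data structure is a tree. Here is a rough sketch of one possible representation; the class name Expr, its fields, and the depth helper are my own choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Expr:
    op: str                    # one of + - * / %
    operands: List["Operand"]  # two or more operands


# An operand is either a plain integer or a nested expression.
Operand = Union[int, Expr]


def depth(node: Operand) -> int:
    """Nesting depth: 0 for an integer, 1 for a flat expression, and so on."""
    if isinstance(node, int):
        return 0
    return 1 + max(depth(o) for o in node.operands)


# (* (+ 3 1) (/ 4 2))
tree = Expr("*", [Expr("+", [3, 1]), Expr("/", [4, 2])])
print(depth(tree))  # 2
```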
I prefer to express the grammar in a different form. It is a way of defining things called a context-free grammar. The particular notation I use differs from traditional Backus-Naur Form (BNF). With this notation, our language can be expressed as:
file       ::= expr
expr       ::= "(" operator whitespace expr (whitespace expr)+ ")" | integer
operator   ::= [+-*/%]
whitespace ::= [ \t\n\r]+
integer    ::= digit+
digit      ::= [0-9]
I'm not an expert on context-free grammars, so I hope I haven't made a mistake. As you can see, the grammar turns the lists above into something quite concise, which makes our parser easier to design.
Postscript
Our language is very simple, but having a grammar makes our expectations much clearer. It also makes the scanner and parser much easier to write.
Take the definition of an integer: it tells us at a glance what to scan for. We need to scan one or more digits from zero to nine. The same is true for whitespace: one or more spaces, tabs, or line breaks.
We also know exactly what to parse. A program is made up of one expression. Each expression is wrapped in parentheses, and the element right after the opening parenthesis must be an operator. An operand can be an expression or a number. At least two operands are required.
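To show how directly those definitions turn into a scanner, here is a small sketch built on Python's re module. It is not the scanner this series builds; the names TOKEN_SPEC and scan are hypothetical, and as a simplification it throws whitespace and comments away instead of checking that operands are separated by whitespace.

```python
import re

# Each pattern mirrors a rule from the grammar (plus comments and parentheses).
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \t\n\r]+"),
    ("COMMENT", r";[^\n]*"),
    ("INTEGER", r"[0-9]+"),
    ("OPERATOR", r"[+\-*/%]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(src):
    """Return a list of (kind, text) pairs, ending with an EOF marker."""
    tokens = []
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at offset {pos}")
        if m.lastgroup not in ("WHITESPACE", "COMMENT"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


print(scan("(+ 2 3) ; add"))
```

Each token is reported with its kind and its text, so a later stage can tell the integer 2 from the operator + without looking at the characters again.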
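Those rules are enough to sketch a tiny recursive-descent parser and evaluator. Treat this as an illustration under my own assumptions rather than the implementation from this series: the function names are hypothetical, whitespace is discarded instead of enforced, and division uses Python's floor division because the text above does not say how Calc rounds.

```python
import re
from functools import reduce

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,  # assumption: integer (floor) division
    "%": lambda a, b: a % b,
}


def tokenize(src):
    src = re.sub(r";[^\n]*", "", src)  # drop comments
    # Known tokens first; \S catches any stray character so the parser can reject it.
    return re.findall(r"[()+\-*/%]|[0-9]+|\S", src)


def parse_expr(tokens, pos):
    """Parse one expr starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if re.fullmatch(r"[0-9]+", tok):
        return int(tok), pos + 1
    if tok != "(":
        raise SyntaxError(f"expected '(' or an integer, got {tok!r}")
    pos += 1
    if pos >= len(tokens) or tokens[pos] not in OPS:
        raise SyntaxError("expected an operator after '('")
    op = tokens[pos]
    pos += 1
    operands = []
    while pos < len(tokens) and tokens[pos] != ")":
        node, pos = parse_expr(tokens, pos)
        operands.append(node)
    if pos >= len(tokens):
        raise SyntaxError("missing ')'")
    if len(operands) < 2:
        raise SyntaxError("an expression needs at least two operands")
    return (op, operands), pos + 1


def parse(src):
    tokens = tokenize(src)
    tree, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("unexpected tokens after the expression")
    return tree


def evaluate(node):
    if isinstance(node, int):
        return node
    op, operands = node
    # Fold the operands from left to right.
    return reduce(OPS[op], (evaluate(o) for o in operands))


print(evaluate(parse("(+ 2 3)")))              # 5
print(evaluate(parse("(* (+ 3 1) (/ 4 2))")))  # 8
print(evaluate(parse("(* 1 2 3 4 5)")))        # 120
```

The parser mirrors the grammar: parse_expr handles the two alternatives of expr, and the operand loop plus the length check stand in for the "two or more operands" rule.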