Preface
In day-to-day development we need to write a syntax analyzer from time to time. A syntax analyzer is not necessarily the front end of a language compiler. It may also be a reader/writer for a configuration file format of our own design, or a DSL (domain-specific language) that simplifies our development. We could use XML instead, but XML is noisy, so writing our own syntax analyzer is often worthwhile. This is especially true for files that are modified frequently: XML can add to our burden unless we develop an editor specifically for them.
This article follows the example of an arithmetic calculator with function support (Documentation\Samples\ExpressionCalculator\ExpressionCalculator.sln) to show how the tools provided by Vczh Library++ greatly simplify the development of a syntax analyzer, and ends with an example that compiles. The example is an old classic, but a calculator for the four basic arithmetic operations covers most of the problems encountered when developing a syntax analyzer, so it is still a good one.
This example can be found in the source code of Vczh Library++.
Defining the grammar
To work purposefully, we first need to define what an arithmetic expression with functions is. Ordinary arithmetic expressions are familiar: they consist of addition, subtraction, multiplication, division, parentheses, and numbers. We can also support the negative sign: -a, which is shorthand for (0 - a). So which functions are supported? Here we only consider single-parameter functions, such as the trigonometric, logarithmic, and exponential functions. For example, the following is an arithmetic expression with functions that fits this definition:
sin(1 + 2) + cos(3 * -4)
Vczh Library++ looks at strings from the perspective of grammar, so we can restate the definition above as grammar rules. A grammar describes a subset of all strings: it tells us which strings meet the requirements and which do not. However, a practical grammar always has some limitations. For example, a context-free grammar cannot describe strings of the form a...ab...bc...c in which the three letters occur the same number of times. Fortunately, in most cases we do not need to deal with such difficult problems, so we can use simple rules of the following form:
RULE = EXPRESSION
RULE is the name of the rule, and EXPRESSION is its definition. A grammar consists of one or more rules. Once all rules are listed, each rule name stands for a set of strings. In most cases you also need to designate one rule as the main entry that represents the entire grammar.
For example, let's determine whether a string is an unsigned integer. An unsigned integer can only contain digits, so we first need a rule for "digit character". Using "|" to mean "or", the following rule says that DIGIT is '0' or '1' or ... or '9':
DIGIT = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
An unsigned integer is then "one or more digit characters":
INTEGER = DIGIT | INTEGER DIGIT
That is, an unsigned integer is either a single digit character, or a valid unsigned integer followed by a digit character: appending a digit to an unsigned integer still gives an unsigned integer.
Now let's test it. For example, "1" is an unsigned integer, and the path for analyzing "1" with INTEGER is:
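The two rules can be turned almost literally into a recognizer. The following is a minimal sketch in plain C++, not the Vczh Library++ API; the function names IsDigit and IsInteger are hypothetical and chosen only for illustration.

```cpp
#include <string>

// Hypothetical helper, not part of Vczh Library++.
// DIGIT = '0' | '1' | ... | '9'
bool IsDigit(char c)
{
    return c >= '0' && c <= '9';
}

// Hypothetical helper, not part of Vczh Library++.
// INTEGER = DIGIT | INTEGER DIGIT
// A string matches INTEGER when it is a single DIGIT, or when everything
// except its last character matches INTEGER and the last character is a DIGIT.
bool IsInteger(const std::string& s)
{
    if (s.empty()) return false;                    // neither alternative matches the empty string
    if (s.size() == 1) return IsDigit(s[0]);        // INTEGER = DIGIT
    return IsInteger(s.substr(0, s.size() - 1))     // INTEGER = INTEGER DIGIT
        && IsDigit(s.back());
}
```

With this sketch, IsInteger("123") peels off '3', then '2', and finally accepts "1" through the INTEGER = DIGIT alternative, while a string such as "12a" is rejected because its last character is not a DIGIT.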
INTEGER
= DIGIT (INTEGER = DIGIT)
= '1' (DIGIT = '1')
The string "123" should also be an unsigned integer. It consists of three digits, so its path differs slightly from the single-character case. Here both alternatives of INTEGER are used to simulate a loop:
INTEGER
= INTEGER DIGIT (INTEGER = INTEGER DIGIT)
= INTEGER DIGIT DIGIT (INTEGER = INTEGER DIGIT)
= DIGIT DIGIT DIGIT (INTEGER = DIGIT)
= '1' DIGIT DIGIT (DIGIT = '1')
= '1' '2' DIGIT (DIGIT = '2')
= '1' '2' '3' (DIGIT = '3')
When analyzing "123" with INTEGER, we use the rules INTEGER = DIGIT and INTEGER = INTEGER DIGIT to replace one INTEGER with exactly three DIGITs, and then replace the DIGITs with the characters '1', '2', and '3'. This shows that "123" satisfies the definition of INTEGER, so "123" is an unsigned integer.
The replacement process is not unique. We can also turn INTEGER into "123" in a different order:
INTEGER
= INTEGER DIGIT (INTEGER = INTEGER DIGIT)
= INTEGER '3' (DIGIT = '3')
= INTEGER DIGIT '3' (INTEGER = INTEGER DIGIT)
= INTEGER '2' '3' (DIGIT = '2')
= DIGIT '2' '3' (INTEGER = DIGIT)
= '1' '2' '3' (DIGIT = '1')
This is a property of grammars: the order of replacement does not affect the result.
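The two replacement orders for "123" can be replayed mechanically. The sketch below is plain C++ and does not use Vczh Library++; the names Form, Apply, DeriveExpandingIntegerFirst, and DeriveRightmostFirst are hypothetical. Each call to Apply performs one rule application on a sequence of symbols.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A sequence of rule names and characters, e.g. { "INTEGER", "DIGIT" }.
// Hypothetical type for illustration only.
using Form = std::vector<std::string>;

// One rule application: replace the symbol at position `index` with `replacement`.
Form Apply(Form form, std::size_t index, const Form& replacement)
{
    form.erase(form.begin() + index);
    form.insert(form.begin() + index, replacement.begin(), replacement.end());
    return form;
}

// First order: expand every INTEGER first, then the DIGITs from left to right.
Form DeriveExpandingIntegerFirst()
{
    Form f = { "INTEGER" };
    f = Apply(f, 0, { "INTEGER", "DIGIT" });   // INTEGER = INTEGER DIGIT
    f = Apply(f, 0, { "INTEGER", "DIGIT" });   // INTEGER = INTEGER DIGIT
    f = Apply(f, 0, { "DIGIT" });              // INTEGER = DIGIT
    f = Apply(f, 0, { "'1'" });                // DIGIT = '1'
    f = Apply(f, 1, { "'2'" });                // DIGIT = '2'
    f = Apply(f, 2, { "'3'" });                // DIGIT = '3'
    return f;
}

// Second order: always replace the rightmost remaining rule name.
Form DeriveRightmostFirst()
{
    Form f = { "INTEGER" };
    f = Apply(f, 0, { "INTEGER", "DIGIT" });   // INTEGER = INTEGER DIGIT
    f = Apply(f, 1, { "'3'" });                // DIGIT = '3'
    f = Apply(f, 0, { "INTEGER", "DIGIT" });   // INTEGER = INTEGER DIGIT
    f = Apply(f, 1, { "'2'" });                // DIGIT = '2'
    f = Apply(f, 0, { "DIGIT" });              // INTEGER = DIGIT
    f = Apply(f, 0, { "'1'" });                // DIGIT = '1'
    return f;
}
```

Both functions apply the same six rules in a different order and end with the same three symbols '1' '2' '3', which can be checked by comparing their return values with ==.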
Now let's take the example one step further. How can grammar rules describe a comma-separated list of unsigned integers? Such a list can be a single integer, such as "123", or several integers separated by commas. This is also a kind of repetition, but it differs from the repetition in INTEGER by an extra comma. From this description, a comma-separated list of unsigned integers has two cases: the first is a single integer, and the second is a complete list followed by a comma and an integer. Then things become simple. If we use LIST to name such a list, we can describe it with the same technique:
LIST = INTEGER | LIST ',' INTEGER
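As with INTEGER, the LIST rule can be read directly as a recursive check. This is a minimal sketch in plain C++ rather than Vczh Library++ code; MatchesInteger and MatchesList are hypothetical names, and the sample input "1,23,456" mentioned afterwards is an assumption made for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Vczh Library++.
// INTEGER = DIGIT | INTEGER DIGIT, checked here with a loop:
// a non-empty string in which every character is a digit.
bool MatchesInteger(const std::string& s)
{
    if (s.empty()) return false;
    for (char c : s)
    {
        if (c < '0' || c > '9') return false;
    }
    return true;
}

// Hypothetical helper, not part of Vczh Library++.
// LIST = INTEGER | LIST ',' INTEGER
bool MatchesList(const std::string& s)
{
    if (MatchesInteger(s)) return true;             // LIST = INTEGER
    // An INTEGER contains no comma, so the last comma must be the one
    // that separates the leading LIST from the trailing INTEGER.
    std::size_t comma = s.rfind(',');
    if (comma == std::string::npos) return false;
    return MatchesList(s.substr(0, comma))          // LIST = LIST ',' INTEGER
        && MatchesInteger(s.substr(comma + 1));
}
```

For an input such as "1,23,456", the check splits at the last comma, verifies that "456" is an INTEGER, and recurses on "1,23". Inputs with an empty item, such as "1,,2" or "1,", are rejected because the empty string is not an INTEGER.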
Analyzing a list of numbers with LIST is similar to analyzing an unsigned integer with INTEGER. Due to space limitations, we show only one way of processing a list of three integers with LIST here:
LIST
= LIST ',' INTEGER (LIST = LIST ',' INTEGER)
= LIST ',' INTEGER ',' INTEGER (LIST = LIST ',' INTEGER)
= INTEGER ',' INTEGER ',' INTEGER (LIST = INTEGER)
= DIGIT ',' INTEGER ',' INTEGER (INTEGER = DIGIT)
= '1' ',' INTEGER