Definitions and usage of BNF and EBNF

Source: Internet
Author: User

I have recently been developing a compiler and interpreter.

At first I had no idea about the basic concept of BNF grammar, so I looked it up myself. I am posting what I found here. It is easy to understand and learn.

Reposted from http://blog.chinaunix.net/u/3176/showart_404211.html

 

 

Definitions and usage of BNF and EBNF (thanks to the translator: Sunnybill)

By: Lars Marius Garshol
For more information, see: http://www.garshol.priv.no/download/text/bnf.html
(This is a translation of the above author's article. The original article is copyrighted by the author. Sunnybill)
Meaning and usage of BNF and EBNF
Introduction
About this article
What is BNF?
Working Principle
Basic Principles
One instance
EBNF and Its Usage
An EBNF syntax instance
Use of BNF and EBNF
General Usage
How to use the formal syntax
Analysis
The simplest method
Top-down resolution (LL)
One LL analysis instance
One LL conversion instance
Slightly difficult method
Bottom-up parsing (LR)
LL or LR?
More information
Appendix
Thank you

Introduction
About this article
This is a short text explaining BNF. It is based on message <Wkwwagbizn.fsf@ifi.uio.no>, posted to comp.text.sgml on 16 June 1998, and is a bit rough. If anything is unclear, contact the author, who will try to explain it as well as possible.
The article goes deeper and deeper as it proceeds, but do not let that worry you. If you do not want that much detail, simply read the parts that interest you and find the answer to your question.

What is BNF?
Backus-Naur notation (BNF, or Backus-Naur Form) is a formal mathematical way to describe a language. It was developed by John Backus (and possibly Peter Naur as well) to describe the syntax of the Algol 60 programming language.
It was originally developed by John Backus, building on earlier work by the mathematician Emil Post. Peter Naur adopted it for Algol 60 and made some improvements. Naur called BNF Backus Normal Form, while others call it Backus-Naur Form.
BNF is used to define the grammar of a language formally, so that its rules are unambiguous. In fact, BNF is so precise, and there is so much mathematical theory around these grammars, that a parser for a language can be constructed mechanically from its BNF grammar. (Some grammars cannot be handled this way, but they can usually be converted by hand into a form that can.)
Programs that do this are called compiler-compilers. The best known is YACC, but there are many others.

Working Principle
Basic Principles
BNF is rather like a mathematical game: you start with a symbol (called the start symbol, conventionally named S in examples) and are given rules for what you may replace that symbol with. The language defined by a BNF grammar is simply the set of all strings you can produce by following these rules. The rules are called production rules and have the following format:
symbol := alternative1 | alternative2 ...
A production rule states that the symbol to the left of := must be replaced by one of the alternatives on the right. The alternatives are separated by "|" (sometimes "::=" is used instead of ":=", but the meaning is the same). An alternative usually consists of both symbols and terminals. Terminals are pieces of the final string that are not symbols. They are called terminals because there are no production rules for them: they terminate the production process (symbols are often called non-terminals).
Another variation of BNF grammars puts terminals in quotation marks to distinguish them from symbols. Some BNF grammars show explicitly where whitespace is allowed by having a symbol for it, while others leave this for the reader to infer.
BNF has one special symbol, "@", which means that the symbol can be removed. If you replace a symbol with @, you simply remove the symbol. This is useful because it is sometimes difficult to end the replacement process without this trick.
So the language described by a grammar is the set of all strings that can be produced with its production rules. If a string cannot be produced by these rules in any way, it is not allowed in the language.

One instance
The following is an example of a BNF grammar:
S := '-' FN | FN
FN := DL | DL '.' DL
DL := D | D DL
D := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
All the symbols here are abbreviations: S is the start symbol, FN produces a fractional number, DL is a digit list, and D is a digit.
The valid sentences in the language described by this grammar are numbers, possibly fractional and possibly negative. To produce a number, start with the start symbol S:
S
Then replace S with one of its alternatives. Here we choose not to put a "-" in front of the number, so we use the plain FN alternative and replace S with FN:
FN
Next, replace FN with one of its alternatives. We want a fractional number, so we choose the alternative that produces two digit lists with a "." between them. After that, each line below replaces one symbol with one of its alternatives:
DL . DL
D . DL
3 . DL
3 . D DL
3 . D D
3 . 1 D
3 . 1 4
Here we have produced the fractional number 3.14. Producing -5 is left as an exercise. To make sure you understand the grammar, also study it until you see why 3..14 cannot be produced with the rules above.
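The derivation above can be replayed mechanically. The following Python sketch is only an illustration under stated assumptions: the dictionary layout, the derive helper and the STEPS_314 list are names invented here, not part of the original article. Each step names a symbol and the index of the alternative to use, and the leftmost occurrence of that symbol is replaced.

```python
# The example grammar: each symbol maps to its list of alternatives.
# Anything that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "S": [["-", "FN"], ["FN"]],
    "FN": [["DL"], ["DL", ".", "DL"]],
    "DL": [["D"], ["D", "DL"]],
    "D": [[d] for d in "0123456789"],
}


def derive(steps, start="S"):
    """Apply (symbol, alternative_index) steps and return every intermediate form."""
    form = [start]
    forms = [list(form)]
    for symbol, alt in steps:
        i = form.index(symbol)  # leftmost occurrence of that symbol
        form[i:i + 1] = GRAMMAR[symbol][alt]
        forms.append(list(form))
    return forms


# Hypothetical step list reproducing the hand derivation of 3.14.
STEPS_314 = [
    ("S", 1),   # S  := FN
    ("FN", 1),  # FN := DL '.' DL
    ("DL", 0),  # DL := D        (the first DL)
    ("D", 3),   # D  := '3'
    ("DL", 1),  # DL := D DL
    ("DL", 0),  # DL := D
    ("D", 1),   # D  := '1'
    ("D", 4),   # D  := '4'
]

for form in derive(STEPS_314):
    print(" ".join(form))
```

The printed lines start at S and FN and then follow the same forms as the hand derivation, ending in 3 . 1 4. A step that names a symbol not present in the current form raises ValueError, which is one way to see that a proposed derivation is not allowed.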

EBNF and Its Usage
In DL we had to use recursion (DL can produce new DLs) to express that there can be any number of Ds. This is a little awkward and makes the BNF harder to read. Extended BNF (EBNF) solves this problem by introducing the following operators:
• ? : the symbol to the left of the operator (or the group of symbols in parentheses) is optional (it can appear zero times or once).
• * : the symbol can be repeated any number of times, including zero.
• + : the symbol can appear one or more times.

An EBNF syntax instance
The example above can be written in EBNF as:
S := '-'? D+ ('.' D+)?
D := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Tip: EBNF is no more powerful than BNF in terms of the languages it can define, only more convenient. Anything written in EBNF can be converted into BNF.
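The three EBNF operators have the same meaning as the ?, * and + operators of regular expressions. Because this particular grammar has no nested recursion, it can be transcribed almost character for character into a Python regular expression. This is a sketch for this example only, and most programming-language grammars cannot be expressed as a single regular expression. The names NUMBER and is_number are assumptions made for the illustration.

```python
import re

# S := '-'? D+ ('.' D+)?    where D is a single decimal digit
NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")


def is_number(text):
    """Return True if the whole string is a sentence of the example grammar."""
    return NUMBER.fullmatch(text) is not None


for candidate in ["3.14", "-5", "3..14", "3.", ".5"]:
    print(repr(candidate), is_number(candidate))
```

Only the first two candidates are accepted. The string 3..14 is rejected for the same reason it cannot be derived by hand: the optional group allows a single "." and requires at least one digit after it.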

Use of BNF and EBNF
General Usage
Most programming language standards use some variant of EBNF to define the grammar of the language. This has two advantages: there can be no dispute about what the syntax of the language is, and it becomes much easier to write compilers, because the parser of the compiler can be generated automatically with a compiler-compiler such as YACC.
EBNF is also used in many other standards, for example to define protocol formats, data formats, and markup languages such as XML and SGML. (HTML is not defined by a grammar. It is defined by an SGML DTD, which is a kind of higher-level grammar.)
A collection of BNF grammars can be found at the BNF web club (http://cuiwww.unige.ch/db-research/Enseignement/analyseinfo/BNFweb.html).

How to use the formal syntax
Now we know what BNF and EBNF are and what they are used for, but not yet why they are useful or how to take advantage of them.
The most obvious use of a formal grammar has already been mentioned: once a formal grammar has been written for a language, the language is completely defined. There can be no disagreement about what is allowed in the language and what is not. This is very useful, because a syntax description in ordinary prose is not only longer but also open to different interpretations.
Another advantage is that formal grammars are mathematical objects and can be understood by computers. In fact, many programs take an (E)BNF grammar as input and automatically produce the code of a parser for that grammar. This is a common way to build compilers: a so-called compiler-compiler takes the grammar as input and generates parser code in some programming language.
Of course, compilers check much more than grammar, and they also generate code. None of this is described in an (E)BNF grammar. Compiler-compilers therefore usually have a special syntax for attaching code snippets (called actions) to the different alternatives in the grammar.
The best-known compiler-compiler is YACC (Yet Another Compiler Compiler), which generates C code. Others exist for C++, Java, Python, and many other languages.

Analysis
The simplest method
Top-down resolution (LL)
The simplest way to parse input according to a grammar is LL parsing (top-down parsing). It works like this: for each alternative, find out which terminals it can start with. This set is called the initial set (also known as the start set or FIRST set).
Then, when parsing, you begin with the start symbol and compare the initial sets of its alternatives with the first terminal of the input to determine which alternative was used. This works only if no two alternatives of the same symbol have initial sets that contain the same terminal. Otherwise the first terminal cannot tell you which alternative to choose.
LL grammars are usually classified by a number, such as LL(1) or LL(0). The number in parentheses is the maximum number of terminals that must be examined at once to choose the right alternative at any point in the grammar. So with LL(0) you never need to look at any terminal; you can always choose the right alternative. That is only possible when every symbol has exactly one alternative, and in that case the language contains only one string. In other words, LL(0) grammars are not interesting.
The most common and useful kind of LL grammar is LL(1), where you can always choose the right alternative by looking at the first terminal of the input. With LL(2) you need to look at two terminals, and so on.

One LL analysis instance
To illustrate this, let us analyse the initial sets of the example grammar. For the symbol D this is easy: every alternative has a single digit as its initial set (the digit it produces), so the symbol D has the set of all ten digits as its initial set. This means the grammar is at best LL(1), because we need to look at one terminal to choose the right alternative.
DL is more troublesome. Both alternatives start with D, so both have the same initial set. This means you cannot choose the right alternative by looking at the first terminal of the input. However, we can get around this by cheating: if the second terminal is not a digit, the first alternative must have been used; if both are digits, the second alternative was used. In other words, the grammar is at best LL(2).
This is actually a simplification. The alternatives of DL alone do not tell us which terminals are allowed after the first terminal when the alternative that is just D is used; for that we need to know which terminals are allowed after a DL. This set of terminals is called the follow set of the symbol. Here it consists of "." and the end of the input.
The FN symbol is worse, because both alternatives have all the digits as their initial set. Looking at the second terminal does not help: we need to see the first terminal after the last digit of the digit list (DL), and we do not know how many digits there are until we have read them all. Since there is no limit on the number of digits, this is not an LL(k) grammar for any k.
Perhaps surprisingly, the S symbol is easy. The first alternative has "-" as its initial set and the second has all the digits. So when parsing starts at the S symbol, you look at the input to decide which alternative was used: if the first terminal is "-", it was the first alternative; otherwise it was the second. Only the alternatives of FN and DL cause problems.
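The analysis above can be sketched in Python. The code below computes the initial sets by repeating a simple pass until nothing changes, and then reports which symbols have two alternatives whose initial sets overlap. The function names and the dictionary representation are assumptions made for this illustration. The check looks at initial sets only; a complete LL(1) test would also consult follow sets for alternatives that can produce nothing.

```python
EPS = "@"  # the article's symbol for "nothing"

GRAMMAR = {
    "S": [["-", "FN"], ["FN"]],
    "FN": [["DL"], ["DL", ".", "DL"]],
    "DL": [["D"], ["D", "DL"]],
    "D": [[d] for d in "0123456789"],
}


def first_of_sequence(seq, first):
    """Terminals that can start seq, given the initial sets computed so far."""
    out = set()
    for item in seq:
        if item == EPS:
            continue
        if item not in first:          # not a grammar symbol, so a terminal
            out.add(item)
            return out
        out |= first[item] - {EPS}
        if EPS not in first[item]:     # this symbol cannot vanish, stop here
            return out
    out.add(EPS)                       # every item in seq can vanish
    return out


def initial_sets(grammar):
    """Grow the initial set of every symbol until a fixed point is reached."""
    first = {symbol: set() for symbol in grammar}
    changed = True
    while changed:
        changed = False
        for symbol, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_sequence(alt, first) - first[symbol]
                if new:
                    first[symbol] |= new
                    changed = True
    return first


def clashing_symbols(grammar):
    """Symbols with two alternatives whose initial sets share a terminal."""
    first = initial_sets(grammar)
    clashes = []
    for symbol, alternatives in grammar.items():
        sets = [first_of_sequence(alt, first) for alt in alternatives]
        if any(a & b for i, a in enumerate(sets) for b in sets[i + 1:]):
            clashes.append(symbol)
    return clashes


print(clashing_symbols(GRAMMAR))  # ['FN', 'DL']
```

The result agrees with the hand analysis: S and D can be decided from one terminal, while FN and DL cannot.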

One LL conversion instance
There is no need to despair, however. Most grammars that are not LL(k) can easily be converted into LL(1) grammars. In this case we need to change two symbols: FN and DL.
The problem with FN is that both alternatives start with DL, but the second continues with a "." and another DL after the initial DL. This is easy to solve: we give FN a single alternative consisting of a DL followed by an FP (fractional part), where FP is either empty or a "." followed by a DL. As follows:
FN := DL FP
FP := @ | '.' DL
Now FN is no longer a problem, because it has only one alternative. FP is no problem either, because its two alternatives are chosen on different terminals: the end of the input and ".", respectively.
DL is a tougher nut to crack. The main problem is the recursion, compounded by the fact that a DL must produce at least one D. The solution is to give DL a single alternative, a D followed by a DR (the rest of the digits). DR then has two alternatives: D DR (more digits) or @ (no more digits). The first alternative has all the digits as its initial set; the second is chosen on "." or the end of the input, so this solves the problem.
This is the complete grammar after conversion to LL(1):
S := '-' FN | FN
FN := DL FP
FP := @ | '.' DL
DL := D DR
DR := D DR | @
D := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
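An LL(1) grammar like this one maps directly onto a recursive-descent parser: one function per symbol, each choosing its alternative by looking at a single terminal. The sketch below is a minimal recognizer written for illustration. The class and method names are assumptions, it only answers yes or no without building a tree, and because DR is implemented recursively a very long digit string could hit Python's recursion limit.

```python
DIGITS = set("0123456789")


class ParseError(Exception):
    pass


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        """The next terminal, or None at the end of the input."""
        return self.text[self.pos] if self.pos < len(self.text) else None

    def s(self):                      # S := '-' FN | FN
        if self.peek() == "-":
            self.pos += 1
        self.fn()
        if self.peek() is not None:
            raise ParseError(f"unexpected {self.peek()!r} at position {self.pos}")

    def fn(self):                     # FN := DL FP
        self.dl()
        self.fp()

    def fp(self):                     # FP := @ | '.' DL
        if self.peek() == ".":
            self.pos += 1
            self.dl()
        # otherwise FP := @ and nothing is consumed

    def dl(self):                     # DL := D DR
        self.d()
        self.dr()

    def dr(self):                     # DR := D DR | @
        if self.peek() in DIGITS:
            self.d()
            self.dr()

    def d(self):                      # D := '0' | ... | '9'
        if self.peek() not in DIGITS:
            raise ParseError(f"expected a digit at position {self.pos}")
        self.pos += 1


def accepts(text):
    """Return True if text is a sentence of the LL(1) grammar."""
    try:
        Parser(text).s()
        return True
    except ParseError:
        return False


print(accepts("3.14"), accepts("-5"), accepts("3..14"))  # True True False
```

Every decision in the parser is a single peek at the next terminal, which is exactly what the LL(1) property guarantees to be sufficient.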

Slightly difficult method
Bottom-up parsing (LR)

A slightly harder method is shift-reduce parsing, also called bottom-up parsing. This technique collects input until it finds a sequence that can be reduced to a symbol. This may sound difficult, so here is an example. We will parse "3.14" and see how it was produced from the grammar. We start by reading "3" from the input:
3
Then we check whether it can be reduced to the symbol that produced it. It can: it was produced from the symbol D, so we replace 3 with D. We then note that D could have been produced from DL, so we replace D with DL. (The choice is ambiguous here: we could always go on and reduce to FN, which would be wrong. For simplicity we skip such wrong steps; a grammar without this ambiguity would not allow them.) Then we read "." from the input and try to reduce it, but fail.
D
DL
DL .
This cannot be reduced to anything else, so we read the next input "1" and reduce it to D. Then we read the next input "4", which can also be reduced to D, then to DL, and then the "D DL" sequence can be reduced further to DL.
DL .
DL . 1
DL . D
DL . D 4
DL . D D
DL . D DL
DL . DL
The grammar shows that FN can produce exactly the "DL . DL" sequence, so we reduce it to FN. FN is produced from S, so FN can be reduced to S, and the parse is complete.
DL . DL
FN
S
You may have noticed that we could often choose between reducing immediately and waiting for more symbols before doing a different reduction. The shift-reduce algorithm has several more elaborate variants, which in order of increasing complexity and power are LR(0), SLR, LALR, and LR(1). LR(1) usually needs impractically large parsing tables, so LALR is the most commonly used algorithm, while LR(0) and SLR are not powerful enough for most programming languages.
LALR and LR(1) are too complicated to discuss here, but you now have the basic idea.
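The walk-through above can be sketched as a small hand-written shift-reduce recognizer. This is not a table-driven LR or LALR parser: the decision of when to reduce is coded by hand for this one grammar, using one terminal of lookahead to avoid the wrong reductions the text skips over. All names are assumptions made for the illustration. Terminals are stored on the stack in quotes so they can never be mistaken for symbol names.

```python
DIGITS = set("0123456789")


def q(ch):
    """Quote a terminal so it cannot be confused with a symbol name."""
    return f"'{ch}'"


QUOTED_DIGITS = {q(d) for d in DIGITS}


def try_reduce(stack, lookahead):
    """Apply one reduction to the top of the stack; return True if one applied."""
    at_end = lookahead is None
    top = stack[-1] if stack else None
    if top in QUOTED_DIGITS:                                   # D := digit
        stack[-1] = "D"
    elif top == "D" and lookahead not in DIGITS:               # DL := D
        stack[-1] = "DL"
    elif stack[-2:] == ["D", "DL"]:                            # DL := D DL
        stack[-2:] = ["DL"]
    elif at_end and stack[-3:] == ["DL", q("."), "DL"]:        # FN := DL '.' DL
        stack[-3:] = ["FN"]
    elif at_end and top == "DL" and stack[-2:-1] != [q(".")]:  # FN := DL
        stack[-1] = "FN"
    elif at_end and stack in (["FN"], [q("-"), "FN"]):         # S := FN | '-' FN
        stack[:] = ["S"]
    else:
        return False
    return True


def shift_reduce(text):
    """Return (accepted, trace); trace lists the stack after every step."""
    stack, trace, pos = [], [], 0
    while True:
        lookahead = text[pos] if pos < len(text) else None
        if try_reduce(stack, lookahead):
            pass                                # a reduction was applied
        elif lookahead is not None:
            stack.append(q(lookahead))          # shift the next terminal
            pos += 1
        else:
            break                               # nothing left to do
        trace.append(" ".join(stack))
    return stack == ["S"], trace


accepted, trace = shift_reduce("3.14")
print(accepted)
for line in trace:
    print(line)
```

For "3.14" the trace passes through the same stack contents as the hand-worked example, from '3' through DL '.' DL and FN to S. Real LR-family generators derive these shift and reduce decisions from the grammar automatically instead of relying on hand-written conditions.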

LL or LR?
This question has already been answered by someone else, so his news message is quoted here:
I hope this does not start a war. First: Frank, if you see this, don't shoot me (my boss is Frank DeRemer, the creator of LALR parsing ...).
(I borrowed this summary from Fischer & LeBlanc's "Crafting a Compiler")
Simplicity -- LL
Generality -- LALR
Actions -- LL
Error repair -- LL
Table sizes -- LL
Parsing speed -- comparable (me: and tool-dependent)
Simplicity -- LL wins
============
An LL parser works much more simply. And if you have to debug a parser, looking at a recursive-descent parser (a common way to write an LL parser) is much easier than looking at the tables of an LALR parser.
Generality -- LALR wins
============
For ease of specification, LALR wins easily. The big difference between LL and LALR is that in an LL grammar you must left-factor the rules and eliminate left recursion.
Left factoring is necessary because an LL parser must select an alternative based on a fixed number of input tokens.
Left recursion is a problem because a lookahead token of a rule is always among the lookahead tokens of that same rule, so the rule recurses forever.
To see how to convert LALR grammars to LL grammars, refer to the following link:
http://www.jguru.com/thetick/articles/lalrtoll.html
Many languages already have LALR grammars available, so you would have to translate them. If no grammar is available for the language, writing an LL grammar from scratch is not really any harder.
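The following sketch is not part of the quoted message. It illustrates the standard textbook rewrite for immediate left recursion, in which A := A x | y becomes A := y A' and A' := x A' | @. The function name, its parameters, and the left-recursive digit-list rule used as input are hypothetical and chosen only for illustration. Indirect left recursion and left factoring are not handled.

```python
EPS = "@"  # the article's symbol for "nothing"


def remove_left_recursion(symbol, alternatives, new_symbol):
    """Rewrite  A := A x | y  as  A := y A'  and  A' := x A' | @."""
    recursive = [alt[1:] for alt in alternatives if alt[:1] == [symbol]]
    other = [alt for alt in alternatives if alt[:1] != [symbol]]
    if not recursive:
        return {symbol: alternatives}  # nothing to do
    return {
        symbol: [[s for s in alt if s != EPS] + [new_symbol] for alt in other],
        new_symbol: [tail + [new_symbol] for tail in recursive] + [[EPS]],
    }


# Hypothetical left-recursive digit list:  DL := DL D | D
print(remove_left_recursion("DL", [["DL", "D"], ["D"]], "DR"))
# {'DL': [['D', 'DR']], 'DR': [['D', 'DR'], ['@']]}
```

The result, DL := D DR with DR := D DR | @, has the same shape as the DL and DR rules of the LL(1) grammar shown earlier in the article.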
Actions -- LL wins
========
In an LL parser, actions can be placed anywhere without introducing a conflict.
Error repair -- LL wins
================
An LL parser has much richer context information, which helps a great deal in repairing errors, not to mention reporting them.
Table sizes -- LL
==============
Assuming you write a table-driven LL parser, its tables are only about half the size. (To be fair, there are many ways to optimize LALR tables and make them smaller.)
Parsing speed -- comparable (me: and tool-dependent)
-- Scott Stanchfield in article <33C1BDB9.FC6D86D3@scruz.net> on comp.lang.java.softwaretools, Mon, 07 Jul 1997.

More information
John Aycock has developed a very good and easy-to-use parsing framework in Python called SPARK, which is described in a highly readable paper.
The definitive work on compilers and parsing is 'the Dragon Book', Compilers: Principles, Techniques, and Tools, by Aho, Sethi and Ullman. Be aware that it is a rather advanced and mathematical book.
A free online alternative that looks rather good: http://www.cs.vu.nl/~Dick/ptapg.html
Another EBNF tutorial, by Frank Boumphrey (http://www.hypermedic.com/style/xml/ebnf.htm).
An article about parsing in Common Lisp (http://home.pipeline.com/~Hbaker1/Prag-Parse.html) presents a simple, efficient, and convenient parsing framework. The approach is similar to that of compiler-compilers, but it relies on the very powerful macro system of Common Lisp.
Standard syntaxes for writing such grammars are defined in RFC 2234 (ABNF, http://www.ietf.org/rfc/rfc2234.txt) and in the ISO 14977 standard (EBNF).

Appendix
Thank you
Thanks:
• Jelks Cabaniss, for encouraging me to turn the news article into a web article, and for providing very useful criticism of the article once it appeared in web form.
• C. M. Sperberg-McQueen, for extra historical information about the name of BNF.
• Scott Stanchfield, for writing the great comparison of LALR and LL. I have asked for permission to quote this, but have received no reply, unfortunately.
• James Huddleston, for correcting me on John Backus's name.

Original article address: http://hi.baidu.com/sunnybill/blog/item/cc409226ab3e19148b82a1cb.html
