Chapter 4 Syntax Analysis
This chapter is devoted to parsing methods that are typically used in compilers.
We first present the basic concepts, then techniques suitable for hand implementation, and finally algorithms that have been used in automated tools. Since programs may contain syntactic errors, we discuss extensions of the parsing methods for recovery from common errors.
By design, every programming language has precise rules that prescribe the syntactic structure of well-formed programs. In C, for example, a program is made up of functions, a function out of declarations and statements, a statement out of expressions, and so on. The syntax of programming language constructs can be specified by context-free grammars or BNF (Backus-Naur Form) notation, introduced in Section 2.2. Grammars offer significant benefits for both language designers and compiler writers.
- A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language.
- From certain classes of grammars, we can construct automatically an efficient parser that determines the syntactic structure of a source program.
- As a side benefit, the parser-construction process can reveal syntactic ambiguities and trouble spots that might have slipped through the initial design phase of a language.
- The structure imparted to a language by a properly designed grammar is useful for translating source programs into correct object code and for detecting errors.
- A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform new tasks. These new constructs can be integrated more easily into an implementation that follows the grammatical structure of the language.
4.1 Introduction
In this section, we examine how the parser fits into a typical compiler. We then look at typical grammars for arithmetic expressions. Grammars for expressions suffice for illustrating the essence of parsing, since parsing techniques for expressions carry over to most programming constructs. This section ends with a discussion of error handling, since the parser must respond gracefully to finding that its input cannot be generated by its grammar.
4.1.1 The Role of the Parser
In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Fig. 4.1, and verifies that the string of token names can be generated by the grammar for the source language. We expect the parser to report syntax errors in an intelligible fashion and to recover from commonly occurring errors to continue processing the remainder of the program.
Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing. In fact, the parse tree need not be constructed explicitly, since checking and translation actions can be interspersed with parsing, as we shall see. Thus, the parser and the rest of the front end could well be implemented by a single module.
Figure 4.1: Position of parser in compiler model
There are three general types of parsers for grammars: universal, top-down, and bottom-up. Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar (see the bibliographic notes). These general methods are, however, too inefficient to use in production compilers.
The methods commonly used in compilers can be classified as being either top-down or bottom-up. As implied by their names, top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root. In either case, the input to the parser is scanned from left to right, one symbol at a time.
The most efficient top-down and bottom-up methods work only for subclasses of grammars, but several of these classes, particularly LL and LR grammars, are expressive enough to describe most of the syntactic constructs in modern programming languages. Parsers implemented by hand often use LL grammars; for example, the predictive-parsing approach of Section 2.4.2 works for LL grammars. Parsers for the larger class of LR grammars are usually constructed using automated tools.
In this chapter, we assume that the output of the parser is some representation of the parse tree for the stream of tokens that comes from the lexical analyzer. In practice, there are a number of tasks that might be conducted during parsing, such as collecting information about various tokens into the symbol table, performing type checking and other kinds of semantic analysis, and generating intermediate code. We have lumped all of these activities into the "rest of the front end" box in Fig. 4.1. These activities are covered in detail in subsequent chapters.
4.1.2 Representative Grammars
Some of the grammars that will be examined in this chapter are presented here for ease of reference. Constructs that begin with keywords like while or int are relatively easy to parse, because the keyword guides the choice of the grammar production that must be applied to match the input. We therefore concentrate on expressions, which present more of a challenge, because of the associativity and precedence of operators.
Associativity and precedence are captured in the following grammar, which is similar to ones used in Chapter 2 for describing expressions, terms, and factors. E represents expressions consisting of terms separated by + signs, T represents terms consisting of factors separated by * signs, and F represents factors that can be either parenthesized expressions or identifiers:
    E → E + T | T
    T → T * F | F
    F → ( E ) | id        (4.1)
Expression grammar (4.1) belongs to the class of LR grammars that are suitable for bottom-up parsing. This grammar can be adapted to handle additional operators and additional levels of precedence. However, it cannot be used for top-down parsing because it is left recursive.
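To see concretely why left recursion defeats top-down parsing, consider what a naive recursive-descent procedure for the rule E → E + T would do: it calls itself before consuming any input, so it can never make progress. The following Python sketch (names and token conventions are ours, for illustration only) demonstrates the infinite descent:

```python
import sys

def parse_E(tokens, pos=0):
    # Naive recursive-descent for the left-recursive rule E -> E + T | T.
    # The procedure for E immediately calls itself without consuming a
    # token, so the recursion never terminates.
    new_pos = parse_E(tokens, pos)   # left-recursive call: infinite descent
    # ... the "+ T" part would be matched here, but we never get this far
    return new_pos

# Demonstration: the call overflows the stack instead of parsing.
sys.setrecursionlimit(1000)
try:
    parse_E(["id", "+", "id"])
except RecursionError:
    print("left recursion caused infinite descent")
```

This is exactly the problem that the non-left-recursive variant below, Grammar (4.2), is designed to avoid.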
The following non-left-recursive variant of the expression grammar (4.1) will be used for top-down parsing:
    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id       (4.2)
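Because Grammar (4.2) is non-left-recursive, each nonterminal can become a procedure that decides which production to apply by looking at the next input token. A minimal Python sketch of such a predictive recursive-descent parser (the token names "id", "+", "*", "(", ")" and the end marker "$" are our conventions for this illustration):

```python
def parse(tokens):
    # Predictive recursive-descent parser for Grammar (4.2):
    # one procedure per nonterminal; the lookahead token selects the
    # production. Returns True if the token list is a valid expression.
    toks = list(tokens) + ["$"]      # "$" marks end of input
    pos = 0

    def peek():
        return toks[pos]

    def match(t):
        nonlocal pos
        if toks[pos] == t:
            pos += 1
        else:
            raise SyntaxError(f"expected {t!r}, found {toks[pos]!r}")

    def E():            # E -> T E'
        T(); Eprime()

    def Eprime():       # E' -> + T E' | epsilon
        if peek() == "+":
            match("+"); T(); Eprime()
        # otherwise take the epsilon production: match nothing

    def T():            # T -> F T'
        F(); Tprime()

    def Tprime():       # T' -> * F T' | epsilon
        if peek() == "*":
            match("*"); F(); Tprime()

    def F():            # F -> ( E ) | id
        if peek() == "(":
            match("("); E(); match(")")
        else:
            match("id")

    E()
    match("$")          # ensure all input was consumed
    return True
```

For example, `parse(["id", "+", "id", "*", "id"])` succeeds, while `parse(["id", "+"])` raises a `SyntaxError` as soon as the missing operand is detected.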
The following grammar treats + and * alike, so it is useful for illustrating techniques for handling ambiguities during parsing:
    E → E + E | E * E | ( E ) | id        (4.3)
Here, E represents expressions of all types. Grammar (4.3) permits more than one parse tree for expressions like a + b * c.
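The ambiguity can be made concrete by counting distinct parse trees. The following Python sketch (our own dynamic-programming formulation, not from the text) counts the parses of a token string under Grammar (4.3); for id + id * id it finds two, one grouping as (id + id) * id and one as id + (id * id):

```python
from functools import lru_cache

def count_parses(tokens):
    # Count distinct parse trees for a token string under the ambiguous
    # Grammar (4.3): E -> E + E | E * E | ( E ) | id
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways to derive toks[i:j] from E.
        n = 0
        if j - i == 1 and toks[i] == "id":        # E -> id
            n += 1
        if j - i >= 3 and toks[i] == "(" and toks[j - 1] == ")":
            n += count(i + 1, j - 1)              # E -> ( E )
        for k in range(i + 1, j - 1):             # E -> E + E | E * E
            if toks[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(toks))

print(count_parses(["id", "+", "id", "*", "id"]))   # prints 2
```

An unambiguous grammar such as (4.1) would yield exactly one tree for the same input.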
4.1.3 Syntax Error Handling
The remainder of this section considers the nature of syntactic errors and general strategies for error recovery. These strategies, called panic-mode and phrase-level recovery, are discussed in more detail in connection with specific parsing methods.
If a compiler had to process only correct programs, its design and implementation would be simplified greatly. However, a compiler is expected to assist the programmer in locating and tracking down errors that inevitably creep into programs, despite the programmer's best efforts. Strikingly, few languages have been designed with error handling in mind, even though errors are so commonplace. Our civilization would be radically different if spoken languages had the same requirements for syntactic accuracy as computer languages. Most programming language specifications do not describe how a compiler should respond to errors; error handling is left to the compiler designer. Planning the error handling right from the start can both simplify the structure of a compiler and improve its handling of errors.
Common programming errors can occur at many different levels.
- Lexical errors include misspellings of identifiers, keywords, or operators (e.g., the use of an identifier elipseSize instead of ellipseSize) and missing quotes around text intended as a string.
- Syntactic errors include misplaced semicolons or extra or missing braces; that is, "{" or "}". As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code).
- Semantic errors include type mismatches between operators and operands, e.g., the return of a value in a Java method with result type void.
- Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==. The program containing = may well be well formed; however, it may not reflect the programmer's intent.
The precision of parsing methods allows syntactic errors to be detected very efficiently. Several parsing methods, such as the LL and LR methods, detect an error as soon as possible; that is, when the stream of tokens from the lexical analyzer cannot be parsed further according to the grammar for the language.
More precisely, they have the viable-prefix property, meaning that they detect that an error has occurred as soon as they see a prefix of the input that cannot be completed to form a string in the language.
Another reason for emphasizing error recovery during parsing is that many errors appear syntactic, whatever their cause, and are exposed when parsing cannot continue. A few semantic errors, such as type mismatches, can also be detected efficiently; however, accurate detection of semantic and logical errors at compile time is in general a difficult task.
The error handler in a parser has goals that are simple to state and challenging to realize:
- Report the presence of errors clearly and accurately.
- Recover from each error quickly enough to detect subsequent errors.
- Add minimal overhead to the processing of correct programs.
Fortunately, common errors are simple ones, and a relatively straightforward error-handling mechanism often suffices.
How should an error handler report the presence of an error? At the very least, it must report the place in the source program where an error is detected, because there is a good chance that the actual error occurred within the previous few tokens. A common strategy is to print the offending line with a pointer to the position at which an error is detected.
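The "offending line with a pointer" strategy is straightforward to implement. A small Python sketch (the function name and message format are our own):

```python
def format_error(source_line, col, message):
    # Format a diagnostic in the common style: the message, the offending
    # source line, and a caret pointing at the column where the error was
    # detected (0-based column index).
    return f"syntax error: {message}\n{source_line}\n{' ' * col}^"

# Example: an unexpected semicolon detected at column 7.
print(format_error("x = y +;", 7, "unexpected ';'"))
```

Since the actual mistake may lie a few tokens earlier, production compilers often underline a whole token range rather than a single column, but the principle is the same.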
4.1.4 Error-recovery Strategies
Once an error is detected, how should the parser recover? Although no strategy has proven itself universally acceptable, a few methods have broad applicability. The simplest approach is for the parser to quit with an informative error message when it detects the first error. Additional errors are often uncovered if the parser can restore itself to a state where processing of the input can continue with reasonable hopes that the further processing will provide meaningful diagnostic information. If errors pile up, it is better for the compiler to give up after exceeding some error limit than to produce an annoying avalanche of "spurious" errors.
The balance of this section is devoted to the following recovery strategies: panic-mode, phrase-level, error-productions, and global-correction.
Panic-mode Recovery
With this method, on discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as semicolon or }, whose role in the source program is clear and unambiguous. The compiler designer must select the synchronizing tokens appropriate for the source language. While panic-mode correction often skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity, and, unlike some methods to be considered later, is guaranteed not to go into an infinite loop.
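The skipping step itself is tiny; a Python sketch (function name and the particular synchronizing set are illustrative choices, not prescribed by the text):

```python
def panic_mode_recover(tokens, pos, sync=(";", "}")):
    # Panic-mode recovery: starting at the error position, discard input
    # symbols one at a time until a synchronizing token is reached (or the
    # input is exhausted). Parsing resumes at the returned position.
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1
    return pos
```

Because the position only ever moves forward, this step cannot loop forever, which is the property claimed above.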
Phrase-level Recovery
On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon.
The choice of the local correction is left to the compiler designer. Of course, we must be careful to choose replacements that do not lead to infinite loops, as would be the case, for example, if we always inserted something on the input ahead of the current input symbol.
Phrase-level replacement has been used in several error-repairing compilers, as it can correct any input string. Its major drawback is the difficulty it has in coping with situations in which the actual error occurred before the point of detection.
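A phrase-level corrector can be sketched as a function that edits the token stream at the error position. The two heuristics below (comma-for-semicolon replacement, semicolon insertion) are illustrative examples of the local corrections mentioned above, not a prescribed algorithm:

```python
def phrase_level_correct(tokens, pos):
    # Phrase-level recovery sketch: apply a local correction to the input
    # at the error position so parsing can continue. Returns the repaired
    # token list and a note describing the correction.
    if pos < len(tokens) and tokens[pos] == ",":
        # A comma where a statement terminator was expected: replace it.
        return tokens[:pos] + [";"] + tokens[pos + 1:], "replaced ',' with ';'"
    # Otherwise assume a semicolon is missing and insert one.
    return tokens[:pos] + [";"] + tokens[pos:], "inserted ';'"
```

Note that the insertion case adds a symbol *at* the error position and then resumes past it; a corrector that repeatedly inserted tokens ahead of the same unconsumed symbol would be exactly the infinite-loop hazard warned about above.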
Error Productions
By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. A parser constructed from a grammar augmented by these error productions detects the anticipated errors when an error production is used during parsing. The parser can then generate appropriate error diagnostics about the erroneous construct that has been recognized in the input.
Global Correction
Ideally, we would like a compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y, such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.
Do note that a closest correct program may not be what the programmer had in mind. Nevertheless, the notion of least-cost correction provides a yardstick for evaluating error-recovery techniques, and has been used for finding optimal replacement strings for phrase-level recovery.
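The "number of insertions, deletions, and changes of tokens" is simply an edit distance over token strings. The yardstick can be computed with standard dynamic programming; a Python sketch (our own formulation, shown only to make the cost measure concrete, independent of any grammar):

```python
def token_edit_distance(x, y):
    # Minimum number of single-token insertions, deletions, and changes
    # needed to transform token sequence x into token sequence y
    # (classic Levenshtein dynamic program).
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete x[i-1]
                          d[i][j - 1] + 1,        # insert y[j-1]
                          d[i - 1][j - 1] + cost) # change (or keep)
    return d[m][n]
```

Global correction additionally requires that y be derivable from the grammar G, which is what makes the full algorithms so much more expensive than this distance computation alone.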