Compiler Design - Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic concepts used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence because of the limitations of regular expressions. Regular expressions cannot check balancing tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.
Context-free grammars are a superset of regular grammars: every regular grammar is also context-free, but there exist languages that a context-free grammar can describe and a regular grammar cannot. CFG is a helpful tool for describing the syntax of programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce the terminology used in parsing technology.
A context-free grammar has four components:
- A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
- A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.
- A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
- A start symbol (S). One of the non-terminals is designated as the start symbol, from which every derivation begins.
Example
We take the language of palindromes, which cannot be described by means of a regular expression. That is, L = { w | w = w^R }, where w^R is the reverse of w, is not a regular language. But it can be described by means of a CFG, as illustrated below:
    G = ( V, Σ, P, S )
where:
    V = { Q, Z, N }
    Σ = { 0, 1 }
    P = { Q → Z | Q → N | Q → ℇ | Z → 0Q0 | N → 1Q1 }
    S = Q
This grammar describes palindromes of even length, such as 1001 and 11100111. Adding the productions Q → 0 and Q → 1 extends it to odd-length palindromes, such as 00100, 1010101, and 11111.
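As a minimal sketch, the palindrome grammar can be turned directly into a recursive recognizer. The function name `derives_palindrome` is hypothetical, and the code assumes the grammar Q → 0Q0 | 1Q1 | 0 | 1 | ℇ, that is, the productions for Z and N folded into Q, with the single-symbol productions Q → 0 and Q → 1 included so that odd-length palindromes are covered.

```python
def derives_palindrome(s: str) -> bool:
    """Return True if Q can derive s, assuming Q -> 0Q0 | 1Q1 | 0 | 1 | epsilon."""
    if len(s) <= 1:
        # Base productions: Q -> epsilon, Q -> 0, Q -> 1
        return s in ("", "0", "1")
    # Recursive productions Q -> 0Q0 and Q -> 1Q1:
    # the outer symbols must be the same terminal, the inside must derive from Q.
    if s[0] == s[-1] and s[0] in "01":
        return derives_palindrome(s[1:-1])
    return False


for word in ["1001", "11100111", "00100", "10"]:
    print(word, derives_palindrome(word))
```

Each recursive call mirrors one application of a production, which is exactly the nesting that a regular expression cannot track.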
Syntax Analyzers
A syntax analyzer or parser takes its input from a lexical analyzer in the form of a token stream. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.
In this way, the parser accomplishes two tasks: parsing the code while looking for errors, and generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies, which we will learn later in this chapter.
Derivation
A derivation is a sequence of production rule applications that transforms the start symbol into the input string. During parsing, we make two decisions for each sentential form:
- Deciding which non-terminal is to be replaced.
- Deciding the production rule by which the non-terminal will be replaced.
Left-most Derivation
If the left-most non-terminal of the sentential form is replaced at every step, it is called a left-most derivation. A sentential form derived by a left-most derivation is called a left-sentential form.
Right-most Derivation
If the right-most non-terminal of the sentential form is replaced at every step, it is known as a right-most derivation. A sentential form derived by a right-most derivation is called a right-sentential form.
Example
Production rules:
    E → E + E
    E → E * E
    E → id
Input string: id + id * id
The left-most derivation is:
    E ⇒ E * E
      ⇒ E + E * E
      ⇒ id + E * E
      ⇒ id + id * E
      ⇒ id + id * id
Notice that the left-most non-terminal is always processed first.
The right-most derivation is:
    E ⇒ E + E
      ⇒ E + E * E
      ⇒ E + E * id
      ⇒ E + id * id
      ⇒ id + id * id
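The two derivations can be replayed mechanically. The following is a small sketch, not a parser: the helper `derive` is a hypothetical name, it is handed the right-hand sides to apply in order, and it only decides which non-terminal (left-most or right-most) gets replaced at each step. It does not check that each right-hand side is a valid production.

```python
def derive(start, rhs_sequence, nonterminals, leftmost=True):
    """Replay a derivation, replacing the left-most or right-most non-terminal each step."""
    form = [start]
    history = [" ".join(form)]
    for rhs in rhs_sequence:
        # Positions of all non-terminals in the current sentential form
        positions = [i for i, sym in enumerate(form) if sym in nonterminals]
        i = positions[0] if leftmost else positions[-1]
        form[i:i + 1] = rhs          # replace one non-terminal with the right-hand side
        history.append(" ".join(form))
    return history


# Right-hand sides applied in the left-most derivation of id + id * id
LEFT = [["E", "*", "E"], ["E", "+", "E"], ["id"], ["id"], ["id"]]
# Right-hand sides applied in the right-most derivation
RIGHT = [["E", "+", "E"], ["E", "*", "E"], ["id"], ["id"], ["id"]]

for line in derive("E", LEFT, {"E"}, leftmost=True):
    print(line)
print()
for line in derive("E", RIGHT, {"E"}, leftmost=False):
    print(line)
```

Both runs end in the same string, id + id * id, but pass through different sentential forms.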
Parse Tree
A parse tree is a graphical depiction of a derivation. It makes it convenient to see how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this with the example from the last topic.
We take the left-most derivation of a + b * c, whose token stream is id + id * id.
The left-most derivation is:
    E ⇒ E * E
      ⇒ E + E * E
      ⇒ id + E * E
      ⇒ id + id * E
      ⇒ id + id * id
The parse tree grows with each step of the derivation:
Step 1: E ⇒ E * E
Step 2: E ⇒ E + E * E
Step 3: E ⇒ id + E * E
Step 4: E ⇒ id + id * E
Step 5: E ⇒ id + id * id
In a parse tree:
- All leaf nodes are terminals.
- All interior nodes are non-terminals.
- An in-order traversal gives the original input string.
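The properties of a parse tree can be checked on a small hand-built tree. This is an illustrative sketch: the `Node` class and the helpers `leaf`, `E`, and `leaves` are hypothetical names, and the tree is the one that corresponds to the left-most derivation of id + id * id with the rules E → E + E, E → E * E, and E → id.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)


def leaf(symbol):
    """A terminal: a node without children."""
    return Node(symbol)


def E(*children):
    """An interior node labelled with the non-terminal E."""
    return Node("E", list(children))


def leaves(node):
    """Collect the leaf symbols from left to right."""
    if not node.children:
        return [node.symbol]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


# Root: E -> E * E, whose left child expands with E -> E + E
tree = E(
    E(E(leaf("id")), leaf("+"), E(leaf("id"))),
    leaf("*"),
    E(leaf("id")),
)

print(" ".join(leaves(tree)))  # id + id * id
```

Reading the leaves from left to right reproduces the input string, while every interior node carries a non-terminal.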
Types of Parsing
Syntax analyzers follow production rules defined by means of a context-free grammar. The way the production rules are implemented (derivation) divides parsing into two types: top-down parsing and bottom-up parsing.
Top-down Parsing
When the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input, it is called top-down parsing.
- Recursive descent parsing: It is a common form of top-down parsing. It is called recursive because it uses recursive procedures to process the input. Recursive descent parsing may require backtracking.
- Backtracking: If one derivation of a production fails, the syntax analyzer restarts the process using a different rule of the same production. This technique may process the input string more than once to determine the right production.
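A minimal sketch of recursive descent with backtracking is shown below. The grammar (S → rXd | rZd, X → oa | ea, Z → ai) is a hypothetical example chosen because two alternatives of S start with the same terminal, and the function names are illustrative. Python generators are used so that a failed alternative simply yields nothing and the next one is tried.

```python
# Hypothetical grammar: every key is a non-terminal, every other symbol is a terminal.
GRAMMAR = {
    "S": [["r", "X", "d"], ["r", "Z", "d"]],
    "X": [["o", "a"], ["e", "a"]],
    "Z": [["a", "i"]],
}


def match(symbol, text, pos):
    """Yield every position where `symbol` can end if it starts at `pos`."""
    if symbol not in GRAMMAR:                 # terminal: compare with the input
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for alternative in GRAMMAR[symbol]:       # non-terminal: try each rule in turn
        yield from match_sequence(alternative, text, pos)


def match_sequence(symbols, text, pos):
    """Match a list of symbols one after another, backtracking on failure."""
    if not symbols:
        yield pos
        return
    for middle in match(symbols[0], text, pos):
        yield from match_sequence(symbols[1:], text, middle)


def accepts(text):
    """True if the start symbol S derives the whole text."""
    return any(end == len(text) for end in match("S", text, 0))


print(accepts("read"))  # True
print(accepts("raid"))  # True, found only after the first alternative of S fails
print(accepts("rod"))   # False
```

For "raid", the parser first tries S → rXd, fails inside X, and then re-reads the same input with S → rZd. This repeated scanning is the cost of backtracking. Note that this scheme would loop forever on a left-recursive grammar.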
Bottom-up Parsing
As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the parse tree up to the start symbol.
Example:
Input string: a + b * c
Production rules:
    S → E
    E → E + T
    E → E * T
    E → T
    T → id
Let us start bottom-up parsing:
    a + b * c
Read the input and check if any production matches with the input:
    a + b * c
    T + b * c
    E + b * c
    E + T * c
    E * c
    E * T
    E
    S
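The reduction sequence above can be reproduced with a greedy shift-reduce loop. This is only a sketch under stated assumptions: it is not a table-driven LR parser, the rule order in `RULES` is chosen by hand so that longer handles are tried first, the reduction S → E is applied only once the input is exhausted, and the identifiers a, b, and c are assumed to arrive from the lexer as the token id. The name `shift_reduce` is hypothetical.

```python
# (left side, right side); longer right sides are listed first so they win.
RULES = [
    ("E", ["E", "+", "T"]),
    ("E", ["E", "*", "T"]),
    ("E", ["T"]),
    ("T", ["id"]),
]


def shift_reduce(tokens):
    """Return (accepted, trace) for a greedy shift-reduce pass over tokens."""
    stack, rest, trace = [], list(tokens), []
    while True:
        for lhs, rhs in RULES:
            if stack[-len(rhs):] == rhs:      # handle found on top of the stack
                stack[-len(rhs):] = [lhs]     # reduce
                break
        else:
            if not rest:                      # nothing to reduce, nothing to shift
                break
            stack.append(rest.pop(0))         # shift
            continue
        trace.append(" ".join(stack + rest))
    if stack == ["E"]:                        # final reduction S -> E
        stack = ["S"]
        trace.append("S")
    return stack == ["S"], trace


ok, steps = shift_reduce(["id", "+", "id", "*", "id"])
print(ok)
for step in steps:
    print(step)
```

With id in place of a, b, and c, the printed steps follow the same order of reductions as the listing above.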
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than one left-most or more than one right-most derivation) for at least one string.
Example
    E → E + E
    E → E - E
    E → id
For the string id + id - id, the above grammar generates two parse trees: one groups the string as (id + id) - id and the other as id + (id - id).
A language is said to be inherently ambiguous if every grammar that generates it is ambiguous. Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by rewriting the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.
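To make the ambiguity concrete, the sketch below counts the parse trees that the grammar E → E + E | E - E | id allows for a token list. The function name `count_parse_trees` is hypothetical, and the ASCII character - stands for the subtraction operator. The count is obtained by trying every operator as the root of the tree and multiplying the counts of the two sides.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Number of parse trees for tokens under E -> E + E | E - E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways E can derive tokens[i:j]
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        for k in range(i + 1, j - 1):         # k is the position of the root operator
            if tokens[k] in ("+", "-"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["id", "+", "id", "-", "id"]))  # 2
```

A result greater than 1 for any string shows that the grammar is ambiguous. The reverse does not hold in general, since no finite set of test strings can prove that a grammar is unambiguous.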
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is decided by the associativity of those operators. If the operation is left-associative, the operand will be taken by the left operator; if the operation is right-associative, the right operator will take the operand.
Example
Operations such as addition, multiplication, subtraction, and division are left-associative. If the expression contains:
    id op id op id
it will be evaluated as:
    (id op id) op id
For example, (id + id) + id
Operations like exponentiation are right-associative, i.e., the order of evaluation in the same expression will be:
    id op (id op id)
For example, id ^ (id ^ id)
Precedence
If two different operators share a common operand, the precedence of the operators decides which will take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily removed. As in the previous example, mathematically * (multiplication) has precedence over + (addition), so the expression 2+3*4 will always be interpreted as:
    2 + (3 * 4)
These methods decrease the chances of ambiguity in a language or its grammar.
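One common way to apply associativity and precedence constraints in a hand-written parser is precedence climbing. The sketch below is an assumption-laden illustration rather than a prescribed method: the operator table `OPS`, its numeric levels, and the function name `parenthesize` are all hypothetical, and the function returns a fully parenthesized string so that the chosen grouping is visible.

```python
# operator -> (precedence level, associativity); higher level binds tighter
OPS = {
    "+": (1, "left"),
    "-": (1, "left"),
    "*": (2, "left"),
    "/": (2, "left"),
    "^": (3, "right"),
}


def parenthesize(tokens):
    """Group a flat token list according to OPS and return it with explicit parentheses."""
    pos = 0

    def parse(min_prec):
        nonlocal pos
        left = tokens[pos]                    # an operand
        pos += 1
        while pos < len(tokens) and tokens[pos] in OPS and OPS[tokens[pos]][0] >= min_prec:
            op = tokens[pos]
            prec, assoc = OPS[op]
            pos += 1
            # Left-associative: the right side may only contain tighter operators.
            # Right-associative: the right side may contain the same operator again.
            right = parse(prec + 1 if assoc == "left" else prec)
            left = f"({left} {op} {right})"
        return left

    return parse(1)


print(parenthesize("2 + 3 * 4".split()))   # (2 + (3 * 4))
print(parenthesize("a - b - c".split()))   # ((a - b) - c)
print(parenthesize("a ^ b ^ c".split()))   # (a ^ (b ^ c))
```

The only difference between the left- and right-associative cases is the minimum precedence passed to the recursive call.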
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as the left-most symbol. A left-recursive grammar is a problematic situation for top-down parsers. Top-down parsers start parsing from the start symbol, which is itself a non-terminal. So, when the parser encounters the same non-terminal again at the left-most position of its derivation, without having consumed any input, it cannot judge when to stop expanding the left non-terminal and it goes into an infinite loop.
Example:
    (1) A → Aα | β
    (2) S → Aα | β
        A → Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents a string of terminals and/or non-terminals.
(2) is an example of indirect left recursion.
A top-down parser will first parse A, which in turn will yield a string beginning with A itself, and the parser may go into a loop forever.
Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
    A → Aα | β
is converted into the following productions
    A → βA’
    A’ → αA’ | ε
This does not impact the strings derived from the grammar, but it removes immediate left recursion.
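The transformation of A → Aα | β into A → βA’ and A’ → αA’ | ε can be sketched as a small function. The name `remove_immediate_left_recursion` and the data layout are assumptions made for illustration: each right-hand side is a list of symbols, and the empty list stands for ε.

```python
def remove_immediate_left_recursion(nt, productions):
    """Rewrite nt -> nt alpha | beta as nt -> beta nt' and nt' -> alpha nt' | epsilon."""
    # alpha parts: what follows the left-recursive occurrence of nt
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    # beta parts: alternatives that do not start with nt
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: productions}              # nothing to do
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],   # [] is epsilon
    }


# A -> A a | b
print(remove_immediate_left_recursion("A", [["A", "a"], ["b"]]))
# {'A': [['b', "A'"]], "A'": [['a', "A'"], []]}
```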
The second method is to use the following algorithm, which should eliminate all direct and indirect left recursion.
Algorithm
    START
    Arrange non-terminals in some order like A1, A2, A3, …, An
    for each i from 1 to n
    {
        for each j from 1 to i-1
        {
            replace each production of form Ai → Aj𝜸
            with Ai → δ1𝜸 | δ2𝜸 | δ3𝜸 | … | δk𝜸
            where Aj → δ1 | δ2 | … | δk are current Aj productions
        }
        eliminate immediate left recursion among the Ai productions
    }
    END
Example
The production set
    S → Aα | β
    A → Sd
after applying the above algorithm, should become
    S → Aα | β
    A → Aαd | βd
and then, remove immediate left recursion using the first technique.
    A → βdA’
    A’ → αdA’ | ε
Now none of the productions has either direct or indirect left recursion.
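The general algorithm can be sketched in a few lines. This is an illustrative implementation under assumptions: the function name `eliminate_left_recursion` is hypothetical, the grammar is a dictionary from non-terminal to a list of right-hand sides (lists of symbols, with the empty list as ε), and the input grammar is assumed to have no cycles and no ε-productions. The lowercase terminals a, b, and d stand in for α, β, and d from the example.

```python
def eliminate_left_recursion(grammar, order):
    """Remove direct and indirect left recursion, processing non-terminals in `order`."""
    g = {nt: [list(rhs) for rhs in rhss] for nt, rhss in grammar.items()}
    for i, ai in enumerate(order):
        # Substitute earlier non-terminals that appear in left-most position
        for aj in order[:i]:
            expanded = []
            for rhs in g[ai]:
                if rhs and rhs[0] == aj:
                    expanded.extend(delta + rhs[1:] for delta in g[aj])
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        # Eliminate immediate left recursion among the ai productions
        alphas = [rhs[1:] for rhs in g[ai] if rhs and rhs[0] == ai]
        if alphas:
            new_nt = ai + "'"
            betas = [rhs for rhs in g[ai] if not rhs or rhs[0] != ai]
            g[ai] = [beta + [new_nt] for beta in betas]
            g[new_nt] = [alpha + [new_nt] for alpha in alphas] + [[]]
    return g


# S -> A a | b,  A -> S d
grammar = {"S": [["A", "a"], ["b"]], "A": [["S", "d"]]}
for nt, rhss in eliminate_left_recursion(grammar, ["S", "A"]).items():
    print(nt, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhss))
```

With the order S, A, the result keeps S → A a | b and rewrites A into A → b d A' and A' → a d A' | ε.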
Left Factoring
If more than one production rule of a grammar has a common prefix string, then the top-down parser cannot make a choice as to which of the productions it should take to parse the string in hand.
Example
If a top-down parser encounters a production like
    A → αβ | α𝜸 | …
then it cannot determine which production to follow to parse the string, as both productions start with the same terminal (or non-terminal). To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.
Example
The above productions can be written as
    A → αA’
    A’ → β | 𝜸 | …
Now the parser has only one production per prefix, which makes it easier to take decisions.
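A single pass of left factoring can be sketched as follows. The names `common_prefix` and `left_factor` are hypothetical, right-hand sides are lists of symbols, and the sketch factors only once per group of alternatives that share a first symbol. A complete implementation would repeat the pass on the newly created non-terminals until no common prefixes remain.

```python
def common_prefix(sequences):
    """Longest prefix shared by all the given symbol lists."""
    prefix = []
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


def left_factor(nt, productions):
    """Factor out the common prefix of alternatives that start with the same symbol."""
    groups = {}
    for rhs in productions:
        groups.setdefault(rhs[0] if rhs else None, []).append(rhs)

    result = {nt: []}
    primes = 0
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nt].extend(group)          # no conflict for this first symbol
            continue
        primes += 1
        new_nt = nt + "'" * primes            # A', A'', ... one per factored group
        prefix = common_prefix(group)
        result[nt].append(prefix + [new_nt])
        result[new_nt] = [rhs[len(prefix):] for rhs in group]
    return result


# A -> a b | a c | d
print(left_factor("A", [["a", "b"], ["a", "c"], ["d"]]))
# {'A': [['a', "A'"], ['d']], "A'": [['b'], ['c']]}
```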
First and Follow Sets
An important part of parser table construction is to create FIRST and FOLLOW sets. These sets tell the parser which terminals can appear at a given position in a derivation. They are used to fill the parsing table, where the entry T[A, t] = α means that the non-terminal A is replaced using the production A → α when the next input terminal is t.
First Set
This set is created to know which terminal symbols can be derived in the first position by a non-terminal. For example,
    α → t β
That is, α derives t (terminal) in the very first position. So, t ∈ FIRST(α).
Algorithm for calculating First set
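Look at the definition of the FIRST(α) set:
- If α is a terminal, then FIRST(α) = { α }.
- If α is a non-terminal and α → ℇ is a production, then ℇ is in FIRST(α).
- If α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n is a production, then every terminal t in FIRST(𝜸1) is in FIRST(α). A terminal t in FIRST(𝜸i) is also in FIRST(α) if each of 𝜸1 … 𝜸i-1 can derive ℇ. If all of 𝜸1 … 𝜸n can derive ℇ, then ℇ is in FIRST(α).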
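The rules for FIRST can be applied repeatedly until no set changes. The sketch below assumes a hypothetical grammar (E → T E', E' → + T E' | ε, T → id) and a hypothetical function name `first_sets`. Right-hand sides are lists of symbols, the empty list stands for ε, and any symbol that is not a key of the grammar is treated as a terminal.

```python
EPS = "ε"


def first_sets(grammar):
    """Compute FIRST for every non-terminal by iterating to a fixed point."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[nt])
                for sym in rhs:
                    sym_first = first[sym] if sym in grammar else {sym}
                    first[nt] |= sym_first - {EPS}
                    if EPS not in sym_first:
                        break                 # this symbol cannot vanish, stop here
                else:
                    first[nt].add(EPS)        # every symbol can vanish, or rhs is empty
                if len(first[nt]) != before:
                    changed = True
    return first


GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["id"]],
}

for nt, symbols in first_sets(GRAMMAR).items():
    print(nt, sorted(symbols))
```

For this grammar, FIRST(E) and FIRST(T) contain only id, while FIRST(E') contains + and ε.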
Follow Set
Likewise, we calculate which terminal symbols can immediately follow a non-terminal α in production rules. We do not consider what the non-terminal can generate; instead, we see what the next terminal symbol would be after the strings derived from the non-terminal.
Algorithm for calculating Follow set:
- If α is the start symbol, then $ is in FOLLOW(α).
- If α is a non-terminal and has a production α → AB, then everything in FIRST(B) except ℇ is in FOLLOW(A).
- If α is a non-terminal and has a production α → AB, then everything in FOLLOW(α) is in FOLLOW(B); if B can derive ℇ, then everything in FOLLOW(α) is also in FOLLOW(A).
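The FOLLOW rules can be sketched in the same fixed-point style. Here the function `follow_sets` is a hypothetical name, and to keep the block short the FIRST sets are passed in as a hand-written dictionary for the hypothetical grammar E → T E', E' → + T E' | ε, T → id. Each right-hand side is scanned from right to left while tracking the set of terminals that can come after the current position.

```python
EPS = "ε"


def follow_sets(grammar, start, first):
    """Compute FOLLOW for every non-terminal, given precomputed FIRST sets."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                    # end marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                trailer = set(follow[lhs])    # what can follow the end of this rhs
                for sym in reversed(rhs):
                    if sym in grammar:        # non-terminal
                        if not trailer <= follow[sym]:
                            follow[sym] |= trailer
                            changed = True
                        if EPS in first[sym]:
                            trailer = trailer | (first[sym] - {EPS})
                        else:
                            trailer = first[sym] - {EPS}
                    else:                     # terminal
                        trailer = {sym}
    return follow


GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["id"]],
}
# Assumed FIRST sets for the grammar above, written out by hand
FIRST = {"E": {"id"}, "E'": {"+", EPS}, "T": {"id"}}

for nt, symbols in follow_sets(GRAMMAR, "E", FIRST).items():
    print(nt, sorted(symbols))
```

Because E' can derive ε, everything that follows E also follows T, which is why FOLLOW(T) contains both + and $.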
Error-recovery Strategies
A parser should be able to detect and report any error in the program. It is expected that when an error is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The parser is mostly expected to check for errors, but errors may be encountered at various stages of the compilation process. A program may have the following kinds of errors at various stages:
- Lexical : name of some identifier typed incorrectly
- Syntactical : missing semicolon or unbalanced parenthesis
- Semantical : incompatible value assignment
- Logical : code not reachable, infinite loop
Panic mode
When a parser encounters an error anywhere in a statement, it ignores the rest of the statement by skipping the input from the erroneous token up to a delimiter, such as a semicolon. This is the easiest way of error recovery, and it also prevents the parser from developing infinite loops.
Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the statement's input allows the parser to parse ahead. For example, inserting a missing semicolon, replacing a comma with a semicolon, etc. Parser designers have to be careful here because one wrong correction may lead to an infinite loop.
Error productions
Some common errors that may occur in the code are known to the compiler designers. The designers can augment the grammar with productions that generate these erroneous constructs, so that the parser recognizes them when the errors are encountered.
Global correction
The parser considers the program in hand as a whole, tries to figure out what the program is intended to do, and tries to find the closest match for it that is error-free. When an erroneous input (statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may allow the parser to make minimal changes in the source code, but due to the complexity (time and space) of this strategy, it has not been implemented in practice yet.
Abstract Syntax Trees
Parse tree representations are not easy for the compiler to process, as they contain more details than actually needed. An abstract syntax tree (AST) represents the same structure with those details removed.
ASTs are important data structures in a compiler and carry the least unnecessary information. ASTs are more compact than a parse tree and can be easily used by a compiler.
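The difference between the two trees can be illustrated with a small conversion. This is a sketch under assumptions: the parse tree is written by hand as nested tuples for a hypothetical grammar (E → E + T | T, T → T * F | F, F → id), and the function `to_ast` is an illustrative name. It drops nodes that have a single child and makes each operator the root of its subtree.

```python
def to_ast(node):
    """Collapse a parse tree given as (symbol, child, ...) tuples into an AST."""
    if isinstance(node, str):
        return node                           # a terminal stays as it is
    _, *children = node
    if len(children) == 1:
        return to_ast(children[0])            # single child: skip the wrapper node
    left, op, right = children                # binary production: operator becomes the node
    return (op, to_ast(left), to_ast(right))


# Parse tree of id + id * id under the hypothetical grammar above
parse_tree = (
    "E",
    ("E", ("T", ("F", "id"))),
    "+",
    ("T", ("T", ("F", "id")), "*", ("F", "id")),
)

print(to_ast(parse_tree))  # ('+', 'id', ('*', 'id', 'id'))
```

The resulting tree keeps only the operators and operands, while the chains of E, T, and F nodes that existed purely because of the grammar are gone.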
Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. A syntax analyzer has the following drawbacks:
- it cannot determine if a token is valid,
- it cannot determine if a token is declared before it is used,
- it cannot determine if a token is initialized before it is used,
- it cannot determine if an operation performed on a token type is valid or not.