Next: Semantic Processing Up: C73-Compiler Techniques Previous: Scanning

Parsing

Purpose is the recognition of valid sentences (phrases) in the (programming) language.

Two classes, based on which non-terminal is expanded next:

  1. top-down (leftmost derivation)
  2. bottom-up (rightmost derivation)
but the order of derivation is a property of the parser, not of the grammar. For example, given the grammar:
E = Prefix, '(', E, ')'
  | 'V', Tail;
Prefix = 'F' | lambda;
Tail = '+', E | lambda;
Consider the following derivations of the expression F(V+V):


leftmost derivation (top-down) | rightmost derivation (bottom-up)
-----------------------------------------------------------------
E -> Prefix(E)                 | E -> Prefix(E)
  -> F(E)                      |   -> Prefix(V Tail)
  -> F(V Tail)                 |   -> Prefix(V+E)
  -> F(V+E)                    |   -> Prefix(V+V Tail)
  -> F(V+V Tail)               |   -> Prefix(V+V)
  -> F(V+V)                    |   -> F(V+V)

A rightmost derivation passes through right sentential forms, and a leftmost derivation through left sentential forms. Of course, the (intermediate) derivation trees are different. The syntax for phrases in a programming language is defined by a context-free grammar (CFG)-as with a program, there can be errors in the grammar:

ambiguity
for example E = E, '-', E | 'Id'; with input Id-Id-Id the derivation will depend on the parsing technique-note that the presence or absence of ambiguity cannot be established computationally (it is undecidable). The trees are:
    -                 -
   / \               / \
  E   -     and     -   E
     / \           / \
    E   E         E   E
wrongness
the grammar may simply define the wrong language-use test sets to establish approximate validity

Parser has to determine validity of input and construct derivation tree.

LL(k) Parsers

First L is left-to-right scanning of the input, second L is leftmost derivation, with k tokens of lookahead. An LL(k) parser requires a unique prediction for each k tokens of lookahead. LL(1) parsers are the only practical class. In some cases, non-LL(1) grammars can be converted to LL(1). The two problems are:

  1. common prefixes, resolved by factoring out (aka left factoring)-for example:

    S = 'if', E, 'then', E, 'fi'
      | 'if', E, 'then', E, 'else', E, 'fi';
    (is this LL(5)?!?!) becomes
    S = 'if', E, 'then', E, IFTAIL;
    IFTAIL = 'fi'
           | 'else', E, 'fi';

  2. left recursion, resolved by adding a new non-terminal and extra productions-for example:

    E = E, '+', T | T;
    T = P | T, '*', P;
    becomes
    E = T, ETAIL;
    ETAIL = '+', T, ETAIL | lambda;
    T = P, TTAIL;
    TTAIL = '*', P, TTAIL | lambda;

Although many programming languages are not LL(1), most can be rewritten to be. There exist LL(1) parser generators, but the ad-hoc technique-recursive descent-is more common.

Recursive Descent

Popular ad-hoc implementation technique. The parsing routine is selected by examining the next token from the lexer. Applicable to the restricted subset of CFGs called LL(1) grammars. In summary: one routine per non-terminal, each choosing among its productions on the next token, with recursion mirroring the grammar.
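As an illustration, here is a minimal sketch (not from the handouts) of a recursive-descent recogniser in C for the left-recursion-free expression grammar above, assuming P is a single 'v' token or a parenthesised expression:

```c
/* One routine per non-terminal; the while loops encode the
   tail productions ETAIL and TTAIL.  Single-character tokens. */

static const char *tok;                  /* next unconsumed character */

static int expr(void);

static int match(char c) {               /* consume c if it is next */
    if (*tok == c) { tok++; return 1; }
    return 0;
}

static int primary(void) {               /* P = 'v' | '(', E, ')' */
    if (match('v')) return 1;
    if (match('(')) return expr() && match(')');
    return 0;                            /* prediction failed */
}

static int term(void) {                  /* T = P, TTAIL */
    if (!primary()) return 0;
    while (match('*'))                   /* TTAIL = '*', P, TTAIL | lambda */
        if (!primary()) return 0;
    return 1;
}

static int expr(void) {                  /* E = T, ETAIL */
    if (!term()) return 0;
    while (match('+'))                   /* ETAIL = '+', T, ETAIL | lambda */
        if (!term()) return 0;
    return 1;
}

int parse(const char *input) {           /* 1 if input is a valid E */
    tok = input;
    return expr() && *tok == '\0';
}
```

Note how eliminating left recursion turns into iteration: the recursive tail productions become while loops.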

The "dangling else"

A classical problem that LL(1) parsers cannot handle. It originated with Algol 60, but is still present in C. If the else part of a conditional is optional, to which if does an else belong? For example:

if E1 then if E2 then S1 else S2

            if                            if
           /  \                          /  \
          E1  then                      E1  then
             /                             /    \
            if            or              if    else
           /  \                          /  \      \
          E2  then                      E2  then   S2
             /    \                           \
            S1    else                        S1
                    \
                    S2
Neither LL(1)-because it cannot look far enough ahead-nor even LL(k)-because we do not know how far ahead we may need to look-can resolve this. We have to use ad-hoc techniques.
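In recursive descent the usual ad-hoc fix is to match an else greedily, so it binds to the nearest unmatched if. A minimal C sketch, with hypothetical single-character tokens ('i' = if, 'c' = condition, 't' = then, 'e' = else, 's' = simple statement):

```c
static const char *tk;                   /* next unconsumed token */

/* S = 'i', C, 't', S, ['e', S] | 's'    (else part optional) */
static int stmt(void) {
    if (*tk == 's') { tk++; return 1; }  /* simple statement */
    if (*tk == 'i') {
        tk++;
        if (*tk++ != 'c') return 0;      /* condition */
        if (*tk++ != 't') return 0;      /* 'then' */
        if (!stmt()) return 0;
        if (*tk == 'e') {                /* greedy: bind the 'else' */
            tk++;                        /* to THIS (nearest) 'if'  */
            return stmt();
        }
        return 1;
    }
    return 0;
}

int parse_stmt(const char *s) {          /* 1 if s is a valid statement */
    tk = s;
    return stmt() && *tk == '\0';
}
```

On input "ictictses" (if C then if C then S else S), the inner call consumes the 'e', giving the first of the two trees above.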

Is it a parsing problem or a language design problem? A neater solution is to add end if (or equivalent) to the language. Lesson: language design must consider the parsing problems it induces. This is not an issue for bottom-up parsers, see later.

Summary

LR(k) Parsers

L is left-to-right scanning, R is rightmost derivation (in reverse), with k tokens of lookahead. Invented in 1965 by Knuth. Too large to construct by hand in practice, so LR parsing remained of theoretical interest only for a number of years-it became practical in the late 1970s because of LR parser generators (YACC). The key to bottom-up parsing is deciding when to replace the matched rhs of a production (the handle) with its lhs non-terminal. LR parsers are also called shift-reduce parsers because of their actions: shift on to the parse stack (tokens and non-terminals), or reduce by replacing the top n symbols of the stack with the production's lhs.

LR parsers are based on PDAs (FSA + stack), but that is just an implementation technique. Machine is coded as a pair of tables: action (shift, reduce, accept, error) and goto.
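To make the driver loop concrete, here is a hypothetical C sketch (not the Fischer and LeBlanc example from the handout): a hand-built SLR(1) action/goto pair for the tiny grammar S' = S, '$'; S = '(', S, ')' | lambda; which recognises balanced parentheses.

```c
/* Production 1 is S = '(', S, ')' (rhs length 3);
   production 2 is S = lambda (rhs length 0). */

enum { SHIFT, REDUCE, ACCEPT, ERROR };

struct act { int kind; int arg; };       /* arg: next state or production */

/* token classes: 0 = '(', 1 = ')', 2 = '$' (end of input) */
static int tokclass(char c) { return c == '(' ? 0 : c == ')' ? 1 : 2; }

static const struct act action[5][3] = {
    /* state 0 */ {{SHIFT, 2}, {REDUCE, 2}, {REDUCE, 2}},
    /* state 1 */ {{ERROR, 0}, {ERROR, 0},  {ACCEPT, 0}},
    /* state 2 */ {{SHIFT, 2}, {REDUCE, 2}, {REDUCE, 2}},
    /* state 3 */ {{ERROR, 0}, {SHIFT, 4},  {ERROR, 0}},
    /* state 4 */ {{ERROR, 0}, {REDUCE, 1}, {REDUCE, 1}},
};

/* goto table: the only non-terminal is S (-1 = impossible) */
static const int goto_S[5] = { 1, -1, 3, -1, -1 };

static const int arity[3] = { 0, 3, 0 }; /* rhs lengths of productions */

int lr_parse(const char *input) {        /* 1 if input is balanced parens */
    int stack[64], top = 0;              /* parse stack holds states */
    const char *p = input;
    stack[0] = 0;                        /* start state */
    for (;;) {
        struct act a = action[stack[top]][tokclass(*p)];
        if (a.kind == SHIFT) {           /* push state, consume token */
            stack[++top] = a.arg;
            p++;
        } else if (a.kind == REDUCE) {   /* pop rhs, push goto on lhs */
            int s;
            top -= arity[a.arg];
            s = goto_S[stack[top]];
            if (s < 0) return 0;
            stack[++top] = s;
        } else {
            return a.kind == ACCEPT;     /* accept or syntax error */
        }
    }
}
```

Note that a reduce pops as many states as the production's arity, which is why the arities matter for the reduce operation.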

Notes for three sheets of handout and slides: The example is taken verbatim from Fischer and LeBlanc. $ is the "end of input" symbol. The action table is like a function with the signature state x token -> shift | reduce | accept | error. The goto table is like a function with the signature state x symbol -> state. The arities of the respective productions are 4, 3, 5 and 0-this is important for the reduce operation.

LR encompasses a large family of parsing techniques: LR(1), which are generally impractical, SLR(1), simplified LR, and LALR(1), LookAhead LR. The latter two differ in how they are derived from the CFSM.

Ambiguity in the grammar is reflected in two kinds of conflict in the parser action tables:

  1. shift-reduce-conventionally resolved by always shifting
  2. reduce-reduce-conventionally resolved by choosing the first production

Constructing LR Parsers

Building the action and goto tables for LR(0):

  1. configuration-where is dot in production
  2. closure of configuration sets-successor states based on next tokens
  3. characteristic finite state machine (CFSM)
  4. action table-part implementation of CFSM
  5. goto table-other part
In detail:
configuration
state of parse-described by the position of the dot in a production, for example: A = X1, X2, • X3, X4
configuration sets
set of configurations consistent with the input matched so far (common prefixes)
closure
set of all configurations that could apply in the state-compare with the lambda closure in NFA construction-for example:

Given

S = E, '$';
E = E, '+', T | T;
T = ID | '(', E, ')';
then the closure of S = • E, '$' is

S = • E, '$'
E = • E, '+', T
E = • T
T = • ID
T = • '(', E, ')'

successor states
given a symbol X and a configuration set S, compute the configurations that have X as the next symbol, with the dot advanced over it
CFSM
computed from configuration sets and successor states to give CFSM states and transitions
goto table
representation of CFSM states and transitions in tabular form
action table
mapping from configuration sets to parser actions (shift, reduce, accept, error):
reduce-i
if production i has been recognised (dot is at right end)
shift
if dot precedes a token
accept
if dot is at the end of the top production (end of input)
error
otherwise

More than one option is the source of shift-reduce or reduce-reduce conflicts
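The closure and successor-state computations are mechanical. A hypothetical C sketch for the example grammar above (with ID written as the single token 'v', and upper-case characters standing for non-terminals):

```c
/* Items are (production, dot position) pairs. */

struct prod { char lhs; const char *rhs; };

static const struct prod G[] = {
    { 'S', "E$"  },                      /* 0: S = E, '$'      */
    { 'E', "E+T" },                      /* 1: E = E, '+', T   */
    { 'E', "T"   },                      /* 2: E = T           */
    { 'T', "v"   },                      /* 3: T = ID          */
    { 'T', "(E)" },                      /* 4: T = '(', E, ')' */
};
enum { NPROD = 5 };

struct item { int prod, dot; };

static int has(const struct item *set, int n, int p, int d) {
    int i;
    for (i = 0; i < n; i++)
        if (set[i].prod == p && set[i].dot == d) return 1;
    return 0;
}

/* closure: for every item with a non-terminal after the dot, add that
   non-terminal's productions with the dot at the left end; returns size */
int closure(struct item *set, int n) {
    int i, j;
    for (i = 0; i < n; i++) {            /* set grows while we scan it */
        char x = G[set[i].prod].rhs[set[i].dot];
        if (x >= 'A' && x <= 'Z')
            for (j = 0; j < NPROD; j++)
                if (G[j].lhs == x && !has(set, n, j, 0)) {
                    set[n].prod = j;
                    set[n].dot = 0;
                    n++;
                }
    }
    return n;
}

/* successor: items of set with x after the dot, dot advanced one place
   (the result should then itself be closed); returns its size */
int successor(struct item *out, const struct item *set, int n, char x) {
    int i, m = 0;
    for (i = 0; i < n; i++)
        if (G[set[i].prod].rhs[set[i].dot] == x) {
            out[m].prod = set[i].prod;
            out[m].dot = set[i].dot + 1;
            m++;
        }
    return m;
}
```

Starting from the single item S = • E, '$', closure yields exactly the five-item set listed earlier.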

LR(0) grammars are not easy to write for programming languages, however it is provable that if a grammar is LR(k) then an equivalent (large, unreadable) LR(0) grammar exists.

Practical Variants of LR

LR(1)
As for LR(0) with the addition of lookahead tokens to the configurations: A = X1, X2, • X3, X4, z

where z is the possible lookahead after the whole rhs has been matched. Bigger action and goto tables than LR(0).

SLR(1)
Almost as powerful as LR(1) but with more compact action and goto tables-instead of adding lookahead at the configuration stage (which is how the states get multiplied), it is added at the CFSM stage by computing follow sets. The limitation is that follow sets are too coarse an approximation of the lookahead and can lead to unnecessary conflicts. SLR(1) adds to LR(0).
LALR(1)
Space efficiency of SLR(1) but handles a larger class of grammars-though still smaller than full LR(1). In contrast to SLR(1), LALR(1) is built by starting with a full LR(1) parser and then merging all the states that differ only in their lookahead components. LALR(1) subtracts from LR(1).

Thus, we have a spectrum of (LR) parsing techniques: LR(0), SLR(1), LALR(1), LR(1).

Parser Generators

Unix's YACC (Yet Another Compiler Compiler) is a widely used parser generator-it generates LALR(1). YACC input is in three sections separated by %%:

  1. declarations
  2. rules and corresponding semantic functions
  3. additional routines
and the user must supply a scanner-note the lex naming convention (yylex).

FEEL s-expression example notes: The header file lex-global.h is a set of #defines generated by an option to YACC which associates the token codes (YACC-determined) with the user-declared token names (user-determined); it is used by the lexer to classify the token which has been scanned. Like lex, input is in three parts, separated by %%: the first is for declarations, the second for rules and the third for additional routines. The declarations can be C code, or YACC directives. This example contains four examples of directives: %start identifies the "top" production; %union declares a union type; %token declares the type of the named tokens; %type declares the result type of the named rules.

A rule has a left hand side and a right hand side, separated by :. The right hand side is a sequence of terminals and non-terminals followed by an action enclosed in braces; this may be followed by an alternation symbol | and another sequence of terminals and non-terminals and an action, and so on. The rule is terminated by a semicolon. Within an action, the elements parsed may be referred to as $1 ... $n, while $$ refers to the result value for the rule's left hand side. Notice also the access to the value of the token scanned by lex, for example pptok.int_val. The special non-terminal error is used to control error recovery within the rules; when "matched" it causes all tokens up to the terminal following it in the rule to be discarded. The external entry point for a YACC program is yyparse.

Exercise: write a YACC parser to recognise sentences of a number of open parentheses followed by the same number of close parentheses.

Return to the "dangling else". Controlled ambiguity in LR parsers can be very useful. The construct can be expressed in a LALR(1) grammar, but this is tedious since it incurs much repetition:

S = 'if', E, 'then', S;
S = 'if', E, 'then', RS, 'else', S;
where RS is a restricted set of productions which do not end with an if statement. But writing the productions as:
S = 'if', E, 'then', S, 'else', S;
S = 'if', E, 'then', S;
and letting the shift-reduce conflict be resolved by shifting has an equivalent effect-YACC shifts in shift-reduce conflicts and chooses the earlier production in reduce-reduce conflicts. This works because we want the if-then-else production to match in preference to if-then, so that the else matches the closest if; hence shift rather than reduce. The biggest payoff from ambiguity is in handling operator precedence in arithmetic expressions.

Summary

LALR(1) is the most powerful practical LR parsing technique; LR(1) parsers are usually impractical. LR parsing supports prompt error recognition (on the shift of a token), is deterministic, and is efficient: O(n) space, since each symbol is shifted once and the number of symbols is proportional to the size of the input, and O(n) time, which follows from the space argument.

Which is better, LL(1) or LALR(1)? Answer always depends on circumstances, but here are some criteria to help:

Feature               LL(1)                          LALR(1)
------------------------------------------------------------
simplicity            wins                           states, lookaheads, configurations
generality            LL(1) is a subset of LR(1)     wins
error recovery        wins (predictions)             only knows input so far
table size            wins                           clearly!
software engineering  mostly manual                  parser generation

LL(1) grammars are uncommon in practice and redrafting is not easy. Availability of resources should guide choice.




masjap@
Fri Oct 21 18:42:03 BST 1994