Purpose is the recognition of valid sentences (phrases) in the (programming) language.
Two classes based on which non-terminal is examined:
E = Prefix, '(', E, ')' | 'V', tail; Prefix = F | lambda; Tail = '+', E | lambda;Consider the following expression derivations:
leftmost derivation (top-down) | rightmost derivation (bottom-up) ----------------------------------------------------------------- F(V+V) -> Prefix(E) | F(V+V) -> Prefix(E) -> F(E) | -> Prefix(V Tail) -> F(V Tail) | -> Prefix(V+E) -> F(V+E) | -> Prefix(V+V Tail) -> F(V+V Tail) | -> Prefix(V+V) -> F(V+V) | -> F(V+V)
and rightmost derivation gives a right sentential form and leftmost, left sentential form. Of course, the (intermediate) derivation trees are different. The syntax for phrases in a programming language is defined by a context-free grammar (CFG)-as with a program, there can be errors in the grammar:
+ + / \ / \ E + and + E / \ / \ E E E E
First L is left to right, second L is leftmost derivation, with k tokens of lookahead. An LL(k) parser requires a unique prediction for each k tokens of lookahead. LL(1) are the only practical class. In some cases, non LL(1) grammars can be converted to to LL(1). The two problems are:
S = 'if', E, 'then', E, 'fi' | 'if', E, 'then', E, 'else', E, 'fi';(is this LL(5)?!?!) becomes
S = 'if', E, 'then', E, IFTAIL; IFTAIL = 'fi'; | 'else', E, 'fi';
E = E, '+', T | T; T = P | T, '*', P;becomes
E = T, ETAIL; ETAIL = '+', T, ETAIL | lambda; T = P, TTAIL; TTAIL = '*', P, TTAIL | lambda;
Popular ad-hoc implementation technique. Parsing routine is selected by examining the next token from the lexer. Applicable to the restricted subset of CFG called LL(1) grammars. Recursive descent is summarised as:
A classical problem that LL(1) parsers cannot handle-origin with Algol60, but still present in C. If the else part of a condition is optional, to which if does an else belong? For example:
if E then if E then S else S if if / \ / \ E1 then E1 then / / \ if or if else / \ / \ \ E2 then E2 then S2 / \ \ S1 else S1 \ S1Neither LL(1)-because cannot look far enough ahead-nor even LL(k)-because do not know how far ahead may need to look-can resolve this. Have to use ad-hoc techniques.
Is it a parsing problem or a language design problem? A neater solution is to add end if (or equivalent) to language. Lesson: language design must consider the parsing problems induced. This is not an issue for bottom-up parsers, see later.
L is left to right, R is rightmost derivation, with k tokens of lookahead. Invented in 1965 by Knuth. Too large to construct by hand in practice so remained of theoretical interest only for a number of years-became practical in in late 1970's because of LR parser generators (YACC). Key to bottom-up parsing is deciding when to replace the rhs (matched tokens) with the lhs (non-terminal, also called a handle) of a production. LR also called shift-reduce parsers because of action: shift on to parse stack (tokens and handles), reduce by replacing top n tokens/handles of stack with a new handle.
LR parsers are based on PDAs (FSA + stack), but that is just an implementation technique. Machine is coded as a pair of tables: action (shift, reduce, accept, error) and goto.
Notes for three sheets of handout and slides: The example is taken verbatim from Fischer and LeBlanc. $ is the ``end of input'' symbol. The action table is like a function with the signature state x token -> shift | reduce | accept | error. The goto table like a function with the signature state x token -> state. The arity of the respective productions is 4, 3, 5 and 0-this is important for the reduce operation.
LR encompasses a large family of parsing techniques: LR(1), which are generally impractical, SLR(1), simplified LR, and LALR(1), LookAhead LR. The latter two differ in how they are derived from the CFSM.
Ambiguity in the grammar is reflected in two kinds of conflict in the parser action tables:
Building the action and goto tables for LR(0):
Given
S = E, '$'; E = E, '+', T | T; T = ID | '(', E, ')';then the closure of S = bullet E is
S = bullet E, '$' E = bullet E, '+', T E = bullet T T = bullet ID T = bullet (E)
where z is the possible lookahead after the whole rhs has been matched. Bigger action and goto tables than LR(0).
Unix's YACC (Yet Another Compiler Compiler) is a widely used parser generator-it generates LALR(1). YACC input is in three sections separated by %%:
FEEL s-expression example notes: The header file lex-global.h is a set of #defines generated by an option to YACC which associates the token codes (YACC determined) with the user-declared token name (user determined) and is used by the lexer to classify the token which has been scanned. Like lex, input is in three parts, separated by %%: the first is for declaratiions, the second for rules and the third for additional routines. The declarations can be C code, or YACC directives. This example contains four examples of directives: %start identifies the ``top'' production; %union declares a union type; %token declares the type of the named tokens; %type declares the result type of the named rules. A rule has a left hadn side and a right hand side, which are separated by :. The right hand side is a sequence of terminals and non-terminals followed by an action, enclosed in braces and this may be followed by an alternation symbol | and another sequence of terminals and non-terminals and action, etc.. The rule is terminated by a semicolon. Within an action, the elements parsed may be referred to by $1 ... $n, while $$ refers to the top of the parse stack. Notice also the access to the value of the token scanned by lex, for example pptok.int_val. The special non-terminal error is used to control error recovery within the rules and when ``matched'' causes all tokens up to the following terminal to be discarded. The external entry point for a YACC program is yyparse. Exercise, write a YACC parser to recognise sentences of a number of open parentheses followed by the same number of close parentheses.
Return to the "dangling else". Controlled ambiguity in LR parsers can be very useful. Can express in a LALR(1) grammar but it is tedious since incurs much repetition:
S = 'if', E, 'then', S; S = 'if', E, 'then', RS, 'else', S;where RS is a restricted set of productions which do not end with an if statement. But writing the productions as:
S = 'if', E, 'then', S, 'else', S; S = 'if', E, 'then', S;and letting the shift-reduce conflict be resolved by shifting has an equivalent effect-YACC shifts in shift-reduce conflicts and chooses the earlier production in reduce-reduce conflict. This works because we want the if-then-else to match before the if-then since that causes the else to match the closest if, hence shift rather than reduce. Biggest payoff from ambiguity is in handling operator precedence in arithmetic expressions.
LALR(1) is the most powerful practical LR parsing technique. LR(1) are usually impractical. LR supports prompt error recognition (on shift of token), deterministic, efficient - O(n) space, since each symbol is shifted once and number of symbols is proportional to size of input and O(n) time, which follows from space argument.
Which is better, LL(1) or LALR(1)? Answer always depends on circumstances, but here are some criteria to help:
Feature LL(1) LALR(1) ------------------------------------------------------------ simplicity wins states, lookaheads, configurations generality LL(1) is a subset of LR(1) wins error recovery wins (predictions) only knows input so far table size wins clearly! software engineering mostly manual parser generation
LL(1) grammars are uncommon in practice and redrafting is not easy. Availability of resources should guide choice.