Next: Parsing Up: C73-Compiler Techniques Previous: Compiler Structure

Scanning

Character input is turned into tokens. Recognition of tokens is carried out by scanner/lexer. Definition of a token? by rules in a (regular) grammar, which can then be recognized by a FSA-revise these concepts.

Grammar notation (based on BS 6154):

Terminals are enclosed in single quotes, eg. 'if', 'for'.
Non-terminals are either symbols or a sequence of symbols separated by spaces, eg. letter, for statement.
Greek is used for arbitrary sequences of terminals and non-terminals, except that lambda denotes the empty sequence.
Optional elements are enclosed in brackets, [...].
Alternation is denoted by vertical bar, |, concatenation by comma, , and repetition (the Kleene closure) by enclosing in braces, {...}.
Left hand side is separated from the right hand side by equals, = and a production is terminated by a semicolon, ;.

Classification of Grammars

Chomsky's classification from least to most complex is:

type 3

regular grammars, where rules (productions) are constrained to be of the form: A = 'a'; or A = 'a', B;

Use limited to identifier, string, number syntax in programming languages.

type 2

context-free grammars, where rules are constrained to be of the form: A = alpha;

Use limited to statements, blocks and programs in programming languages-more complex parsing technique.

type 1

context-sensitive grammars, where rules are constrained to be of the form: alpha, A, beta = alpha, gamma, beta;

where gamma is non-null. A is transformed by this rule only when it occurs between alpha and beta. More complex than is necessary for most programming languages (except Algol68), but too restricted for natural language-use in NL front-ends to DB's.

type 0

free grammars, where the productions are unconstrained: alpha = beta;

Parsing them is an unsolved problem although can be done for subsets.

Only interested in types 2 and 3. Some examples of token syntax.

character = '0' | ... | 'a' | ... | 'z' | ... | 'A' | ... | 'Z' | ...;
string = '"', {character - '"'}, '"';

except this is not regular, so rewrite as follows:

string = '"', string aux;
string aux = '"' | {character - '"'}, '"';

Discuss handout and slides of C token syntax, drawing attention to identifier, integer and string literals.

Finite State Automata

The theory of recognizing regular grammars-is S a member of some finite set of strings? Transition diagram based on next character.

Definition of FSA: start state (S0), set of states (S1...Sn), set of transitions (T1...Tk), final state (A). Two kinds: DFA, only one next state for each input symbol and no empty transitions, NFA otherwise. DFA easy to program. Can convert NFA to DFA. String grammar can be handled by a DFA. This leads to a three phase algorithm:

Translate the regular expression into a NFA.
Translate the NFA into a DFA.
Translate the DFA into a program.

Let us consider an example:

S = ['^'], {'.' | alphanumeric}, ['$'];

which defines a simple grammar for a subset of regular expressions such as are handled by ed, sed, grep, etc.. NFA's are caused by use of closures in the grammar. Derive diagram. Conversion to DFA. (Four sheets of handout). Manual construction is tedious, but relatively automatic.

Scanner Generators

Lex is Unix's scanner generator-constructs a C program from the specification of the token grammar. Do man lex to find out something about it. Lex input is in three parts separated by %%:

character classes and auxiliary regular expressions
regular expressions and associated actions (enclosed in braces)
action routines (ordinary C code)

Repeat string example with lex input and output. The (simplest?) input to generate a recognizer for the string syntax given earlier is

/* empty part 1 */
%%
["].*["]	printf("%s",yytext);
[\n]	;
.	;
%%
/* empty part 3 */

We try this out as follows:

ss1 $ lex string1.lex
ss1 $ cc lex.yy.c -ll
ss1 $ a.out << EOF
> foo
> 123
> "abc"
> 45
> bar
> "z"
> EOF
"abc""z"

The lex output is too obscure to be worth discussing in detail: part is a chunk of program (which never changes), the rest is tables of numbers which encode the transitions for the FSA. However, this program has a bug: try putting two strings on the same line.

A similar program is:

%%
["]    { char c = input();
         while (c != '"') {
           putchar(c);
           c = input();
         };
       }
[\n]	;
.	;

This avoids the above bug, but even better is

ss1 $ lex << EOF
> %%
> ["][^"]*["] printf("%s",yytext);
> [\n]    ;
> .       ;
> %%
> EOF
ss1 $ cc lex.yy.c -ll
ss1 $ a.out
"abc"45"z"
"abc""z"ss1 $

Lex digit slide:

First four are character classes.
Down to %% are patterns.
After %% are rules.
There are no actions.
Suffix * denotes zero or more.
Prefix ? denotes an optional item.
Pattern . matches any character (e.g. .*).
| denotes alternation.
$ matches end of line.
Literals are delimited by [ and ].
Pattern names are delimited by {and }.
Characters matched are stored in char* yytext.

Exercise: write a lex grammar for roman numeral input giving decimal output.

FORTRAN Lexing

The relevance of this section may be passing since few of you will have seen FORTRAN and fewer still are likely to have to use it in the future. Nevertheless, this is worth discussing from a historical viewpoint-and the problem still remains. FORTRAN tokenisation is problematic-not a regular grammar. Consider the statements:

DO 10 I = 1,10

DO 10 I = 1.10

requires multi-character lookahead. Discuss handout of Sale's algorithm for FORTRAN-66 and FORTRAN-77 statement classification (Computer Journal 1971, see also CJ 1991 34/4 Algorithm 127 for F77 version)-three phases:

Removal of statements not characterized by keywords (would confuse phase 2 otherwise), namely comments and assignments
Recognition by initial keyword.
Distinction between keyword variants (ad-hoc), for example:
```
IF(X.EQ.1) expression
IF(expression) lab1, lab2, lab3
```
also GOTO and assigned GOTO.

The modifications to extend it to FORTRAN 77 are:

Additional comment character (*). Handling of character constants and of substrings.
Optional comma in DO after label allows classification by phase 2 if present or by phase 3 if not.
Cannot differentiate GOTO and assigned GOTO by presence of comma, since this is optional, so use last character scanned. IF statements are complicated by block IF-scan for THEN at the end of a line:
```
IF (exp) THEN

IF (exp) THEN=3
```
These constrain the input language to have target labels start on the same line as the GOTO and a logical IF controlling a variable THEN must have at least one more non-blank character in the initial line.

Summary

Token syntax easily specified by regular grammars
RG's can be recognised by FSA's
Scanner generators reduce chances of errors-SE approach
ad-hoc necessary in some circumstances.

Next: Parsing Up: C73-Compiler Techniques Previous: Compiler Structure

masjap@
Fri Oct 21 18:42:03 BST 1994