Character input is turned into tokens. Recognition of tokens is carried out by the scanner (lexer). How is a token defined? By rules in a (regular) grammar, which can then be recognized by an FSA. Revise these concepts.
Grammar notation (based on BS 6154):
Chomsky's classification from least to most complex is:
Type 3 (regular): use limited to identifier, string and number syntax in programming languages.
Type 2 (context-free): use limited to statements, blocks and programs in programming languages; requires a more complex parsing technique.
Type 1 (context-sensitive): rules of the form αAβ → αγβ, where γ is non-null; A is rewritten by this rule only when it occurs between α and β. More complex than is necessary for most programming languages (except Algol 68), but too restricted for natural language; used in NL front-ends to DBs.
Type 0 (unrestricted): parsing them in general is an unsolved problem, although it can be done for subsets.
Only interested in types 2 and 3. Some examples of token syntax.
    character = '0' | ... | 'a' | ... | 'z' | ... | 'A' | ... | 'Z' | ... ;
    string = '"', {character - '"'}, '"' ;

except this is not written as a regular (right-linear) grammar, so rewrite as follows:
    string = '"', string aux ;
    string aux = '"' | (character - '"'), string aux ;

Discuss handout and slides of C token syntax, drawing attention to identifier, integer and string literals.
The theory of recognizing regular grammars: is the string S a member of the set of strings the grammar defines? Transition diagram based on the next character.
Definition of an FSA: a start state (S0), a set of states (S1 ... Sn), a set of transitions (T1 ... Tk), and a final state (A). Two kinds: a DFA has only one next state for each input symbol and no empty transitions; anything else is an NFA. A DFA is easy to program, and an NFA can always be converted to an equivalent DFA. The string grammar can be handled by a DFA. This leads to a three-phase algorithm:
    S = ['^'], {'.' | alphanumeric}, ['$'] ;

which defines a simple grammar for a subset of regular expressions such as are handled by ed, sed, grep, etc. NFAs are caused by the use of closures in the grammar. Derive the diagram; convert to a DFA (four sheets of handout). Manual construction is tedious, but relatively mechanical.
Lex is Unix's scanner generator: it constructs a C program from the specification of the token grammar. Do man lex to find out something about it. Lex input is in three parts separated by %%:
    /* empty part 1 */
    %%
    ["].*["]    printf("%s",yytext);
    [\n]        ;
    .           ;
    %%
    /* empty part 3 */
We try this out as follows:
    ss1 $ lex string1.lex
    ss1 $ cc lex.yy.c -ll
    ss1 $ a.out << EOF
    > foo
    > 123
    > "abc"
    > 45
    > bar
    > "z"
    > EOF
    "abc""z"

The lex output is too obscure to be worth discussing in detail: part is a chunk of program (which never changes), the rest is tables of numbers which encode the transitions for the FSA. However, this program has a bug: try putting two strings on the same line.
A similar program is:
    %%
    ["]     {
                char c = input();
                while (c != '"') {
                    putchar(c);
                    c = input();
                }
            }
    [\n]    ;
    .       ;

This avoids the above bug, but even better is
    ss1 $ lex << EOF
    > %%
    > ["][^"]*["] printf("%s",yytext);
    > [\n] ;
    > . ;
    > %%
    > EOF
    ss1 $ cc lex.yy.c -ll
    ss1 $ a.out
    "abc"45"z"
    "abc""z"
    ss1 $
Lex digit slide:
Exercise: write a lex grammar for roman numeral input giving decimal output.
The relevance of this section may be passing, since few of you will have seen FORTRAN and fewer still are likely to have to use it in the future. Nevertheless, it is worth discussing from a historical viewpoint, and the problem still remains: FORTRAN tokenisation is problematic because FORTRAN does not have a regular token grammar. Consider the statements:
    DO 10 I = 1,10
    DO 10 I = 1.10

Since blanks are not significant in FORTRAN, the first is a DO loop but the second is an assignment to the variable DO10I, so classifying the statement requires multi-character lookahead. Discuss handout of Sale's algorithm for FORTRAN-66 and FORTRAN-77 statement classification (Computer Journal 1971; see also CJ 1991 34/4, Algorithm 127, for the F77 version), which has three phases:
    IF(X.EQ.1) expression
    IF(expression) lab1, lab2, lab3

also GOTO and assigned GOTO.
    IF (exp) THEN
    IF (exp) THEN=3

These constrain the input language: target labels must start on the same line as the GOTO, and a logical IF controlling a variable named THEN must have at least one more non-blank character on the initial line.