Basic Code Generation

Intermediate code or native code: a qualitative decision. Arguments for the former:

  1. an abstract machine is better suited to the language (higher level) and easier to generate code for; it also separates out the machine dependencies.
  2. the nitty-gritty of temporaries and register allocation is kept away from the semantic routines (simpler code with a lower incidence of bugs!); machine dependencies are confined to the CG routines.
  3. machine-independent optimisations can be applied to the intermediate code, so they are portable too. This allows complex optimisation facilities to migrate between compilers; in addition, optimisation is simpler because intermediate forms are idealised (more regular and abstract than real hardware).
and for the latter:
  1. fewer passes in the compiler overall, hence faster;
  2. the possibility of one-pass compilation, suitable in certain circumstances (load-and-go).

Intermediate Representations

Communication medium between front-end and back-end: the machine-dependency boundary. In theory the idealised machine could be 0, 1, 2 or 3 address; in practice it is usually 0 or 3, occasionally 2:

0 address
stack manipulation, hidden state registers
1 address
multiple stacks, e.g. Warren's Abstract Machine
2 address
triples: (operator src1/dst src2), but double usage of one of the sources can be troublesome
3 address
quadruples (or quads): (operator src1 src2 dst), very flexible and easy to use

IR choice can be influenced by the target architecture: quads do not map very well onto stack architectures, nor does 0-address code onto register architectures.
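
To make the contrast concrete, a minimal C sketch of a quad representation (the encoding and names are illustrative, not from the handout): the statement a := b + c*d becomes two quads, and the comment shows the same statement as 0-address stack code.

#include <stdio.h>

/* One quad: (operator, src1, src2, dst).  The 0-address form of
 * a := b + c*d would instead be: PUSH b; PUSH c; PUSH d; MUL; ADD; POP a */
typedef struct {
    const char *op, *src1, *src2, *dst;   /* names or temporaries */
} Quad;

static const Quad example[] = {
    { "MUL", "c", "d",  "t1" },           /* t1 := c * d  */
    { "ADD", "b", "t1", "a"  },           /* a  := b + t1 */
};

int main(void)
{
    for (int i = 0; i < 2; i++)
        printf("(%s %s %s %s)\n", example[i].op, example[i].src1,
               example[i].src2, example[i].dst);
    return 0;
}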

Control expressions

See handout for examples from Fischer and LeBlanc of the annotated if:

if
E1              code for E1
then            JUMP-FALSE else-label    ;; #if_test
S1              code for S1
       ---->    JUMP end-if-label        ;; #gen_jump
else            LABEL else-label         ;; #gen_else_label
S2              code for S2
end if          LABEL end-if-label       ;; #start_if, #gen_out_label

and of the annotated for:

for
I1
in
E1..E2          code for E1
                STORE I1                 ; I1 := lower bound
                code for E2
                STORE upperbound
       ---->    LE I1 upperbound         ; skip the loop if the range is empty
                JUMP-FALSE end-for-label
                LABEL begin-for-label    ;; #init_loop
loop
S1              code for S1
end loop        LT I1 upperbound         ; iterations remaining?
                JUMP-FALSE end-for-label
                ADD I1 1 I1
                JUMP begin-for-label
                LABEL end-for-label      ;; #finish_loop

Similar, obvious expansions hold for the other syntactic elements of a programming language. Remark on case density: jump tables versus chained tests (see the sketch below).
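
A sketch of the density heuristic for the case remark above; the 50% threshold is an illustrative assumption, not a fixed rule:

#include <stdio.h>

/* Lowering a case statement: with a dense label set a jump table gives
 * one indexed branch; with a sparse set chained tests are cheaper. */
static int use_jump_table(long min_label, long max_label, long num_cases)
{
    long range = max_label - min_label + 1;
    return 2 * num_cases >= range;         /* i.e. density >= 0.5 */
}

int main(void)
{
    printf("%d\n", use_jump_table(1, 10, 8));    /* dense:  1 (table)  */
    printf("%d\n", use_jump_table(1, 1000, 5));  /* sparse: 0 (tests)  */
    return 0;
}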

Function Calling

Semantics

Most languages provide call-by-value (although Bliss and FORTRAN are exceptions) and for most programming that is sufficient. Call-by-need is essential to the efficiency of lazy functional languages.

value
Value of actual parameter is computed and copied to the formal parameter. Changing the value in the callee does not affect the caller.
reference
Address of value of actual parameter is computed and copied to the formal parameter. Changes made in the callee block will subsequently affect the caller. Problem of callee assigning to constants.
value-result
Parameters are passed to the callee as for value, but on exit the values of the formals are copied back to the caller, thus changing the values in the calling block. The purpose is to avoid the inefficiency of dereferencing reference parameters.
result
As for value-result except no values are passed in (consider Ada's out parameter class).
name
Historical interest only. Introduced by Algol60. Each expression in the procedure call which corresponds to a name parameter is compiled into a nullary procedure (thunk), closed in the calling environment. Each reference to the name parameter in the callee causes the thunk to be called and the expression re-evaluated. In consequence, if the callee changes a variable on which the thunk depends then the value of the name parameter will also change (Jensen's device). Costly to implement (see the thunk sketch after this list).
need
Only evaluate the argument expression if the callee references it: like name, except that the value of the first evaluation is remembered and returned thereafter. Attractive because only necessary work is done, but the overhead can be high; radical implementation techniques are needed (strictness analysis to turn need into value). It changes the termination properties of the program: although a particular argument expression might cause an error, if it is not referenced no error will occur. Thus a value and a need implementation can give different results for the same program.
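
A minimal sketch of call-by-name thunks and Jensen's device, in C for concreteness; C has no closures, so the environment is passed explicitly, and all names here are illustrative:

#include <stdio.h>

/* Call-by-name: each argument expression becomes a thunk, a nullary
 * procedure closed over the caller's environment. */
typedef struct { int *i; int *a; } Env;

static int thunk_ai(Env *e) { return e->a[*e->i]; }   /* the expression a[i] */

/* Jensen's device: the thunk is re-evaluated on every use, and it
 * depends on i, so assigning to i changes what the parameter denotes. */
static int sum(Env *e, int lo, int hi, int (*expr)(Env *))
{
    int s = 0;
    for (*e->i = lo; *e->i <= hi; (*e->i)++)
        s += expr(e);                  /* each call re-evaluates a[i] */
    return s;
}

int main(void)
{
    int i, a[4] = { 1, 2, 3, 4 };
    Env e = { &i, a };
    printf("%d\n", sum(&e, 0, 3, thunk_ai));   /* prints 10 */
    return 0;
}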

Implementation

By implication, the only semantic models we are supporting are by value and by reference. The protocol chosen has a high impact on run-time efficiency. Pass parameters on the stack or in registers? This trades memory traffic against debuggability; is it obviated by caching? Caller saves or callee saves? Three choices of what to save:

  1. save registers currently in use by caller (few?)
  2. save registers that might be used by callee (many?)
  3. all!

Each set may be saved by the callee or by the caller, giving six options:

                              saved by callee   saved by caller
  caller's in-use registers          1                 2
  callee's used registers            3                 4
  all registers                      5                 6

Options 1, 3 and 5 are attractive because they avoid replicating the saving code: it appears only in the callee (once) rather than in the callers (many). Options 1 and 2 are attractive because they save fewer registers. However, it is not an open-and-shut case for 1: callee save/restore makes non-local exits hard because the restoration information is distributed, whereas with caller save/restore they are relatively straightforward (consider setjmp/longjmp and jmp_buf).

  1. How can the callee know which of the caller's registers to save? The caller must provide some kind of bit-map (hardware support is needed for realistic speed), and it may be faster to save everything than to work it out.
  2. Simple and straightforward; also attractive since the number of registers in use is usually small. Probably the most widely used in practice.
  3. The compiler knows the registers used by the callee at the end of compiling the callee. It can either generate a special piece of prologue and epilogue code or (as on the VAX) a bit-map stored at the beginning of the callee and interpreted by the call instruction (slow!).
  4. How can the caller know what the callee will use? The bit-map technique again.
  5. Callee save all is easy if there are few live registers, but the hardware trend is against this approach, and it is likely to mean that unnecessary values are saved, which can compromise garbage collection.
  6. Caller save all is also easy.

Tail-call Optimisation

An important optimisation in functional languages, and in those where the programmer cannot take the address of a local variable (since that prevents the optimisation). Easiest in the caller-saves model. In summary: the callee's parameters are prepared; the caller's frame is discarded; control is transferred to the callee. The consequence is a significant saving in total stack depth, which matters greatly in languages that encourage recursive solutions. The effect is to transform recursive functions consuming unbounded resources (amounts of stack) into loops consuming finite resources. Hard in C: it requires sophisticated analysis to verify safety (or no parameters and no local variables!).
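
A sketch of the transformation (function names are illustrative): a tail-recursive sum and the loop that a tail-call-optimising compiler effectively turns it into.

#include <stdio.h>

/* Tail-recursive sum of 1..n: the recursive call is the last action,
 * so the caller's frame can be reused rather than stacked. */
static long sum_rec(long n, long acc)
{
    if (n == 0) return acc;
    return sum_rec(n - 1, acc + n);    /* tail call: nothing left to do */
}

/* What tail-call optimisation effectively produces: update the
 * parameters and jump back to the start, in constant stack space. */
static long sum_loop(long n, long acc)
{
    for (;;) {
        if (n == 0) return acc;
        acc = acc + n;
        n   = n - 1;
    }
}

int main(void)
{
    printf("%ld %ld\n", sum_rec(100000, 0), sum_loop(100000, 0));
    return 0;
}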

Expressions and Sequences

We have covered some control expressions and function calling, so this is all that is left.

CG from Trees

Source code is turned into an (attributed) tree, and a classical IR is a linearisation of that tree. We can CG directly from the tree by a post-order traversal: this produces valid code but not necessarily the best code; we want to minimise spilling. For example, (A-B)+((C+D)+(E*F)) evaluated left to right requires 1 register for A-B and 2 for the rest, a total of three; but if we calculate the sub-expressions in the reverse order, at most 2 registers are required.

An algorithm to minimise the number of registers required for a tree: label the nodes with their Sethi-Ullman numbers, working from the bottom up using combination rules for the interior nodes (see the sketch after this list).

leaf as left operand
1: load the leaf value into a register
leaf as right operand
0: use a register-memory op to fetch the value and combine it with the left subtree
interior node
If L(l)=L(r) then L(l)+1, else max(L(l),L(r)); evaluate the more complex operand (the one with the higher number) first.

Results can be improved further with commuting information, but associativity cannot help because of overflow/accuracy side-effects.
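
A minimal C sketch of the labelling pass (Node, mk and su_label are illustrative names); it labels the example tree (A-B)+((C+D)+(E*F)) from above and reports that two registers suffice.

#include <stdio.h>
#include <stdlib.h>

/* Expression-tree node for the labelling pass described above. */
typedef struct Node {
    char op;                    /* operator, or variable name at a leaf */
    struct Node *l, *r;
    int label;                  /* Sethi-Ullman number */
} Node;

static Node *mk(char op, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    n->op = op; n->l = l; n->r = r; n->label = 0;
    return n;
}

/* Bottom-up labelling: a leaf used as a left operand needs a load (1),
 * as a right operand it can be a memory operand (0); interior nodes
 * combine as L(l)+1 if the subtree labels are equal, else their max. */
static int su_label(Node *n, int is_left)
{
    if (!n->l && !n->r)
        return n->label = is_left ? 1 : 0;
    int ll = su_label(n->l, 1);
    int lr = su_label(n->r, 0);
    return n->label = (ll == lr) ? ll + 1 : (ll > lr ? ll : lr);
}

int main(void)
{
    /* (A-B) + ((C+D) + (E*F)) from the example above */
    Node *t = mk('+', mk('-', mk('A',0,0), mk('B',0,0)),
                      mk('+', mk('+', mk('C',0,0), mk('D',0,0)),
                              mk('*', mk('E',0,0), mk('F',0,0))));
    printf("%d\n", su_label(t, 1));    /* prints 2: two registers suffice */
    return 0;
}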

Local Optimisation

Relatively easy to perform: it ignores control flow and can be integrated with CG. It starts with the construction of basic blocks. A basic block is a linear code sequence that begins at a label (or just after a conditional branch) and ends either just before the next label or with a branch of any kind, as in the sketch below.
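
A sketch of leader marking under these rules, assuming a toy opcode set:

#include <stdio.h>

typedef enum { OP_OTHER, OP_LABEL, OP_JUMP, OP_JUMP_FALSE } Op;

/* Mark basic-block leaders: the first instruction, every label, and
 * every instruction following a branch begins a block; each block then
 * runs up to (but not including) the next leader. */
static void mark_leaders(const Op *code, int n, int *leader)
{
    for (int i = 0; i < n; i++)
        leader[i] = (i == 0) || code[i] == OP_LABEL ||
                    code[i-1] == OP_JUMP || code[i-1] == OP_JUMP_FALSE;
}

int main(void)
{
    Op code[] = { OP_OTHER, OP_JUMP_FALSE, OP_OTHER,
                  OP_JUMP, OP_LABEL, OP_OTHER };
    int leader[6];
    mark_leaders(code, 6, leader);
    for (int i = 0; i < 6; i++)
        printf("%d", leader[i]);       /* prints 101010 */
    printf("\n");
    return 0;
}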

The aim of optimisation is to preserve the meaning of the program (not always possible: unsafe optimisations), avoid redundant computation (for example, address calculation), minimise load/store operations and prevent the destruction of useful values.

Particularly important is common sub-expression elimination: use value numbering (Cocke and Schwartz) to perform a symbolic execution of the IR in which each new value is numbered; redundant computations can then be detected by comparing value numbers (if a particular combination of v_i and v_j recurs then the corresponding value is reused; implemented with hash tables). It can also be used for register allocation/spilling (see later).

CG from DAGS

This approach is applicable to basic blocks rather than to single expressions. We create a DAG (a tree except that nodes may have in-degree greater than 1) by merging the tree nodes of a basic block to reflect sharing and hence resolve CSEs (see the sharing sketch below). Thus (A+B)+(A+B) is

                       +
                      / \
                      \ /
                       +
                      / \
                     A   B
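
A minimal sketch of DAG construction by node sharing (hash-consing, here with a linear scan standing in for the hash table; names illustrative):

#include <stdio.h>

/* Build a DAG rather than a tree: before allocating a node, search for
 * an existing node with the same operator and children, so repeated
 * sub-expressions are shared. */
typedef struct Node { char op; struct Node *l, *r; } Node;

static Node pool[256];
static int  npool;

static Node *mk(char op, Node *l, Node *r)
{
    for (int i = 0; i < npool; i++)
        if (pool[i].op == op && pool[i].l == l && pool[i].r == r)
            return &pool[i];            /* shared: CSE resolved */
    pool[npool] = (Node){ op, l, r };
    return &pool[npool++];
}

int main(void)
{
    Node *a = mk('A', 0, 0), *b = mk('B', 0, 0);
    Node *s = mk('+', a, b);
    Node *root = mk('+', s, mk('+', a, b));  /* second A+B is the same node */
    printf("%s\n", root->l == root->r ? "shared" : "distinct");  /* shared */
    return 0;
}
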
We now look at a detailed example from Fischer and LeBlanc. There is a practical constraint on what can be achieved, due to complexity: if a value is shared, the technique aims to keep it in a register until all references have been satisfied, but even assuming an infinite number of registers, generating optimal code is exponential in the size of the DAG. The difficulty stems from left operands usually being destroyed, while sharing requires copying before use as a left operand. The aim, therefore, is to order evaluation so as to minimise the number of copies.

Heuristic Scheduling Algorithm:

Register allocation: try to use the same virtual register all the way down the left branch (use + destroy) to minimise copying. This succeeds as long as the operator is the last use of the operand (i.e. has the lowest schedule number).

  1. choose lowest schedule numbered node.

  2. allocate next virtual register.

  3. go left.

  4. if the node's lowest numbered parent is where we came from, use the current virtual register and goto 3; otherwise, select the lowest numbered unallocated node and goto 2.

Mapping virtual to physical registers: define the span of a virtual register as the range of schedule numbers of the nodes that use it. Two virtual registers cannot be implemented by the same physical register if their spans overlap. In the example, V1 has a span of 1..6, V2 of 1..3 and V3 of 4..5; therefore V1 maps to R1, and V2 and V3 both map to R2.
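
A sketch of the span-based mapping, assuming virtual registers are considered in order of increasing span start and that enough physical registers exist; it reproduces the V1..V3 example above.

#include <stdio.h>

typedef struct { int lo, hi; } Span;   /* schedule numbers using the reg */

/* First-fit mapping: give each virtual register the first physical
 * register whose current occupant's span has already ended. */
static void map_registers(const Span *v, int n, int *phys)
{
    int busy_until[16];
    for (int r = 0; r < 16; r++) busy_until[r] = -1;
    for (int i = 0; i < n; i++) {
        int r = 0;
        while (busy_until[r] >= v[i].lo)   /* spans overlap: try next reg */
            r++;
        busy_until[r] = v[i].hi;
        phys[i] = r;
    }
}

int main(void)
{
    Span v[3] = { {1,6}, {1,3}, {4,5} };   /* V1, V2, V3 from the text */
    int phys[3];
    map_registers(v, 3, phys);
    for (int i = 0; i < 3; i++)
        printf("V%d -> R%d\n", i + 1, phys[i] + 1);  /* R1, R2, R2 */
    return 0;
}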

Output values: that is, the updates that must be made at the end of the basic block to make succeeding blocks correct. The code must update certain variables (G, C, A) at the end. However, we must not store too soon, that is, before all uses are complete; thus we delay storing until a register is about to be destroyed. The final code sequence is:

ADD A,B,R1
MUL C,R1,R2
ADD R1,R2,G
SUB E,F,R2
STO R1,C
MUL D,R1,R1
ADD R1,R2,A

The heuristics can be improved by using commuting information, or by completely different approaches. Aliasing complicates matters significantly.

Value Numbering

This section is based on p559 of Fischer and LeBlanc, but the technique goes back to Cocke and Schwartz in 1970.

The idea is to avoid redundant computation by looking for a previous computation of the same value within the basic block. If one is found, the generation of code is suppressed. The example here ignores (a) aliasing of array elements, (b) reference parameters and (c) pointers. If two operands have not changed since the result was defined, then the previous value of the temporary can be used. To handle addresses, one number is assigned for the location and another for the contents. The code fragment in question is:

A(I,J) = A(I,J) + B + C
A(I,J+1) = A(I,J) + B + D
A(I,J) = A(I,J) + B
See the table on the overhead: tuples marked with R are redundant, bringing the length of the code sequence down from 21 to 11. Aliasing makes the process more complicated because it is very difficult to ensure that all access paths are taken into account. Note that a more careful analysis would not mark tuple 20 as redundant, because the update of A(I,J+1) kills all values of A(I,J): although we know they are not aliases, recognising (in general) that J <> J+n is beyond what value numbering is capable of.
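
A toy sketch of the value-numbering table (a linear scan standing in for the hash table, and ignoring the kill problem just described; names illustrative):

#include <stdio.h>

/* Each distinct (op, left-vn, right-vn) triple gets one value number;
 * seeing the same triple again means the computation is redundant and
 * code generation can be suppressed. */
typedef struct { char op; int l, r; } Key;

static Key table[256];
static int nvals;

static int value_number(char op, int l, int r, int *reused)
{
    for (int i = 0; i < nvals; i++)
        if (table[i].op == op && table[i].l == l && table[i].r == r) {
            *reused = 1;
            return i;
        }
    *reused = 0;
    table[nvals] = (Key){ op, l, r };
    return nvals++;
}

int main(void)
{
    int reused;
    int vA = value_number('v', 'A', 0, &reused);  /* leaf: variable A */
    int vB = value_number('v', 'B', 0, &reused);  /* leaf: variable B */
    int t1 = value_number('+', vA, vB, &reused);
    printf("A+B -> v%d reused=%d\n", t1, reused); /* v2 reused=0 */
    int t2 = value_number('+', vA, vB, &reused);
    printf("A+B -> v%d reused=%d\n", t2, reused); /* v2 reused=1 */
    return 0;
}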

