File: readme
Author: Edward A. Green
Description:  readme file of Compiler Design (66.648)
              Project 3: generation of executable assembler code.

FILES

README FILES
readme		this file
readme1		the readme file for project1
readme2		the readme file for project2

PROGRAM CODE
block.c		c code for handling program blocks
code.c		c code for handling quads
connect.c	c code for drawing the basic block connection table
constant.c      c code for storing temporary variable types, etc.
gen_code.c	main program file for generating program code
labels.c	c code for handling program labels, jmps, calls, etc.
leads.c		c code for calculating leading lines of basic blocks
memory.c	c code for calculating memory requirements, symbolic addresses
mod.c		c code for handling the MOD structure (modified/used data)
next_use.c	c code for calculating next-uses
optimize.c	c code for doing constant propagation and removing useless var.
register.c	c code for handling register allocation
symbols.c	c code for storing the symbol table
util.c		general c routines.
proj3.c		main program file.
proj3.h		include file for project.
proj1.yac	revised project 1 syntactic analyzer.
proj1.lex	revised project 1 lexical analyzer.

COMMAND FILES
makefile	make file for generating two executable files, phase1 & phase2
comp		a Unix shell routine for compiling a minipascal program

MINIPASCAL TEST PROGRAMS
test1.p		the GCD program from the text (recursive)
test2.p		a program showing parameter passing is done in correct sequence
test3.p		testing passing array elements and real->integer coercion
test4.p		a program testing a[a[i]]
test5.p		a program testing passing whole arrays
test6.p		a program that does an insert sort of array data (<=20 items)
test7.p		a program that recursively merge sorts array data (<=20 items)
test8.p		a program for testing all arithmetic operations on integers.
test9.p		a program for testing mixed-mode math.

OPERATION

The system can be made and run successfully on a SUN system with Lex, Yacc,
cc, and gcc (Turing, for example).  Running the make file generates two 
executable files, phase1 and phase2.  Phase1 takes a mini-Pascal program 
(Aho et al., pp. 746-748) on standard input and generates a quad program 
with a symbol table.  Phase2 generates assembler code that will run on the
Sequent B8 computer.  Included in the distribution is a shell script which
can run both phase1 and phase2 with one command:

% comp myfile		// assumes myfile.p is the source code

This command generates two files, myfile.q (the quad file from phase 1) and
myfile.s, the assembler code for the b8.  After phase 1 runs, the quad file
is dumped to stdout; any syntax errors will show up in the quad file, and
the quad file can be used to help pinpoint the location of syntax errors.

CHANGES TO PHASES 1 AND 2 (FROM PROJECT 2)

A few bugs revealed during the development of project 3 were fixed.  These
are the only changes to the first two phases (up through code optimization).  
The optimizations from phase 2 are performed before the code is generated.

CODE GENERATION

The data structure developed for the optimization was used for generating the
code.  The development revolved around 3 new program files:

gen_code.c   the central code generation routines.
memory.c     the routines for calculating the memory usage required by units.
register.c   the routines for managing the usage of registers and temporaries.


MEMORY ALLOCATION

The gen_code routine is called from main() for generating the code for
the entire program.  At the start of each program block, the memory 
requirements for program variables and temporaries are calculated.  The general
memory management strategy is to use registers as much as possible for temps,
with caller-saved registers when subroutine calls are encountered.  The 
registers r0, r1, f0, and f1 are used for "intra-instruction" working storage; 
they are not reserved between quad instructions.  With these quad instructions,
as soon as a temp is used, it is dead, and thus its register may be freed.  
In all candor, it is obvious by looking at the assembler code generated that a 
lot of register-to-register moves could be eliminated by more intelligent code 
generation.
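
The temp-register policy described above can be sketched as follows.  This is
only an illustration, not the code in register.c: the pool r2..r7, the
reg_owner table, and the function names are all assumptions made for the
example.  The scratch registers r0 and r1 are never handed out, and a temp's
register is released as soon as the temp is used.

```c
#include <assert.h>

/* Hypothetical pool: r2..r7 may hold temps; r0 and r1 are scratch
   registers and are never allocated.  The pool size is an assumption. */
#define FIRST_TEMP_REG 2
#define NUM_REGS       8
#define NO_REG         (-1)

static int reg_owner[NUM_REGS];   /* temp number held, or NO_REG if free */

/* Mark every register as free (done at the start of a program block). */
void regs_init(void)
{
    for (int r = 0; r < NUM_REGS; r++)
        reg_owner[r] = NO_REG;
}

/* Give the temp the first free register.  Returns NO_REG when the pool
   is full, in which case the caller would have to spill to memory. */
int reg_alloc(int temp)
{
    assert(temp >= 0);
    for (int r = FIRST_TEMP_REG; r < NUM_REGS; r++) {
        if (reg_owner[r] == NO_REG) {
            reg_owner[r] = temp;
            return r;
        }
    }
    return NO_REG;
}

/* Use the temp as an operand: return its register and free it at once,
   because a temp is dead after its first use.  Returns NO_REG if the
   temp is not in a register. */
int reg_use(int temp)
{
    for (int r = FIRST_TEMP_REG; r < NUM_REGS; r++) {
        if (reg_owner[r] == temp) {
            reg_owner[r] = NO_REG;
            return r;
        }
    }
    return NO_REG;
}
```

With this scheme a register freed by reg_use() is immediately available to
the next reg_alloc() call.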

The b8 has a very easy-to-use assembler language that facilitates activation 
record creation and destruction for subroutine calls.  The memory calculation
for each subroutine is used when generating the ENTER command (which allocates
the subroutine's memory), and for the ret command, which deallocates the 
memory from the stack at the end of the subroutine.

In each subroutine's symbol table, the location assignment for each variable is 
made during the memory calculation.  These addresses are relative to the b8's
frame pointer.  The passed parameters are also assigned in the memory
calculation routine.
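
A minimal sketch of such a memory calculation is shown below.  The Symbol
structure, the function name, and the frame layout are assumptions made for
illustration: locals are placed at negative offsets from the frame pointer,
and parameters at positive offsets starting at 8, one 4-byte slot each.  The
layout actually used by memory.c may differ, and alignment is ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical symbol table entry. */
typedef struct {
    const char *name;
    int size;       /* storage needed, in bytes */
    int is_param;   /* nonzero for a passed parameter */
    int offset;     /* filled in: offset relative to the frame pointer */
} Symbol;

#define FIRST_PARAM_OFFSET 8   /* assumed: saved fp and return address below */
#define PARAM_SLOT         4   /* assumed: each parameter is one 4-byte slot */

/* Assign a frame-pointer-relative offset to every symbol and return the
   number of bytes of local storage the subroutine must allocate. */
int assign_frame_offsets(Symbol *syms, size_t n)
{
    int local_bytes = 0;
    int next_param = FIRST_PARAM_OFFSET;

    for (size_t i = 0; i < n; i++) {
        assert(syms[i].size > 0);
        if (syms[i].is_param) {
            syms[i].offset = next_param;      /* above the frame pointer */
            next_param += PARAM_SLOT;
        } else {
            local_bytes += syms[i].size;      /* grow the local area down */
            syms[i].offset = -local_bytes;
        }
    }
    return local_bytes;
}
```

The returned byte count is the value that would be supplied when the
subroutine's memory is allocated on entry.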

Since the language is static and subroutines are not nested, the main program's
variables are stored as global variables, with ".comm" assembler directives.
This makes the address of global variables a little easier to resolve.
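
As an illustration, a global could be emitted with a helper like the one
below.  The function name, the leading underscore on the symbol (by analogy
with "_main"), and the exact operand syntax of the directive are assumptions;
the b8 assembler may expect a different form.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a ".comm" directive reserving size_bytes of global storage for
   the named main-program variable.  Returns the value of snprintf(). */
int format_comm(char *buf, size_t buflen, const char *name, int size_bytes)
{
    assert(strlen(name) > 0);
    assert(size_bytes > 0);
    return snprintf(buf, buflen, "\t.comm\t_%s,%d\n", name, size_bytes);
}
```

For example, a 4-byte integer named count would produce the line
".comm _count,4" (with tab separators).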

TRANSLATING CODE

Once the start of a subroutine is written, the quads of code are analyzed.
As noted above, the registers r0, r1, f0, f1 are used for each quad's workspace.
All code is generated by routines in gen_code.c (with the exception of
register spilling, which is done by routines in register.c).
There are several general groups of instructions:

calls and jumps
returns
pushes and pulls
arithmetic instructions
moves
array instructions

The code generation is relatively straightforward, with a few exceptions:
1) The statement label for the main program is changed from "start" in the
   quad code to "_main:", which conforms to the gcc main program unit.
2) When calling "read" or "write", the type of the last push is tested, and
   an additional push of a "0" (integer) or "1" (real) is made so that the
   read or write routine can use the correct format.  When the "call exit"
   statement is encountered, it is ignored (the return from the main program
   is generated at the end of the code).
3) Upon entry to a subroutine, it is assumed that all registers are free.
   Therefore, all registers must be spilled to the caller's local memory
   prior to the call.  Moreover, if a temp is to be passed, the temp
   must also be in memory and its address pushed.  Therefore, registers
   are spilled before the first push prior to a subroutine call.  This
   code is generated in a subroutine in the "register.c" file.
4) Return values from functions are "standard types", that is, either an
   integer or a real value.  These values are returned in r0 or f0, 
   respectively.
5) Arrays present an interesting problem in terms of returning an address
   or its data.  For example, in the statement:
      read (a[i]);
   the address of a[i] is needed for passing data to the element.  But in
      j:=a[i];
   the data in a[i] should be sent to j.  But both of these instructions
   have the same quad opcode:
      []   a   i   t1
      PUSH         t1
      CALL         read
   and
      []   a   i   j
   This was resolved by flagging every temp created by the [] opcode with an
   'a' and every other temp with a 'd', and by placing the address of the
   array element in the temp's location.  When a temp is used, this flag is
   checked; if the flag is 'a', the temp is loaded into either r0 or r1, and
   the symbolic address used is then either '0(r0)' or '0(r1)' respectively,
   that is, an indirect address.  If the assignment is not to a temp, the data
   is transferred.
6) Array elements are referenced by loading the index into r0, and adjusting
   the index in r0 by subtracting the start index of the array.  The above
   policies allow array expressions like a[a[i]] to work correctly.
7) Real <-> integer coercion: If a real value is assigned to an integer, the
   fraction is truncated.  Mixed-mode calculations are allowed, and
   intermediate results are kept (as much as possible) as real values.  The
   instructions generated depend on the data types.  C programs were run to
   see how the b8 handled mixed-mode expressions.
8) The read and write routines were coded by hand and are included in the 
   generated code.
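
Points 5 and 6 above can be sketched as follows.  This is an illustration
under stated assumptions rather than the code in gen_code.c: the Temp
structure, the function names, and the element size parameter are
hypothetical.  Only the 'a'/'d' flag, the subtraction of the array's start
index, and the '0(r0)' / '0(r1)' operand form come from the description.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for a temp. */
typedef struct {
    char kind;             /* 'a' = holds an element address, 'd' = data */
    const char *location;  /* where the temp lives, e.g. "r3" */
} Temp;

/* Byte offset of element `index` in an array whose first index is `low`:
   the index is adjusted by the start index, then scaled (scaling by an
   element size is an assumption). */
long element_offset(long index, long low, long elem_size)
{
    return (index - low) * elem_size;
}

/* Write the symbolic operand for a temp into `out`.  For an 'a' temp the
   operand is indirect through the scratch register ("r0" or "r1"), and the
   function returns 1 to signal that the temp must first be loaded into
   that register.  For a 'd' temp the location is used directly and the
   function returns 0. */
int temp_operand(const Temp *t, const char *scratch, char *out, size_t outlen)
{
    assert(t->kind == 'a' || t->kind == 'd');
    assert(strlen(scratch) > 0);

    if (t->kind == 'a') {
        snprintf(out, outlen, "0(%s)", scratch);
        return 1;
    }
    snprintf(out, outlen, "%s", t->location);
    return 0;
}
```

Because an 'a' temp always yields an indirect operand, the inner a[i] of an
expression such as a[a[i]] can be read through its address before being used
as the index of the outer reference.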