[Oberon] ALU 2015 and 2018

Sat May 14 01:39:43 CEST 2022

Hi Paul, Jörg, Wojtek,

I now have a working simulation of RISC5 (in NW's original form)
and can prove that this is in fact a 2-stage pipeline.

1. You can do the experiment yourself, if you clone
https://github.com/hgeisse/THM-Oberon and go to subdirectory
fpga-RISC5/v0. You will need the Icarus Verilog compiler/simulator
http://iverilog.icarus.com and the GTK Wave Viewer
http://gtkwave.sourceforge.net . The Makefile compiles and simulates
the RISC5 system with a PROM that contains a short test program (see
below). If you don't want to do that, I have taken a screenshot
(1.png) showing the simulation result, which I discuss below.

2. The test program is read from the PROM. The timing is slightly
different from SRAM (note the .clk(~clk) inversion of the clock in
the instantiation of PROM.v in RISC5Top.v, which I copied without
change from the original). This has no effect on the result, however:
the experiment can be repeated with the program coming from SRAM, and
the outcome is the same.

3. Here is the test program, contained in prom.mem:

40000011	//		MOV	R0,0x00000011
41000022	//		MOV	R1,0x00000022
42000044	//		MOV	R2,0x00000044
0306FF01	//		IOR	R3,R0,R1
0416FF02	//		IOR	R4,R1,R2
E7FFFFFF	//	stop:	B	stop

The hex value in the first column is the actual content of the memory
word, the comment following is the instruction as assembler mnemonic.
The important instructions are the logical ORs, which fetch two operands
from the registers and write the result back to another register.
We are interested in the answers to two questions:
a) How many instructions are executed per clock cycle, i.e., what is the
instruction throughput of the machine?
b) How many cycles does a single instruction need to be fetched, decoded,
and executed, and to have its result written to the destination register,
i.e., what is the instruction latency of the machine?
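
If you want to check the hex words in prom.mem against the mnemonics without
running the simulator, here is a minimal Python sketch. It is not a full RISC5
disassembler: it only knows the opcodes used in the test program, and the field
layout (a, b, op, c, im and the branch offset) is my reading of the RISC5
instruction formats, so treat it as an assumption. The function name decode is
just for illustration.

```python
# Minimal decoder for the six words of the test program (a sketch only).
# Assumed field layout:
#   bits 31..30 = 00: register form,  a = bits 27..24, b = 23..20, op = 19..16, c = 3..0
#   bits 31..30 = 01: immediate form, a, b, op as above, im = bits 15..0
#   bits 31..30 = 11: branch, signed word offset in bits 23..0
OPS = {0: "MOV", 6: "IOR"}  # only the opcodes used by the test program


def decode(word):
    fmt = (word >> 30) & 0x3
    a = (word >> 24) & 0xF    # destination register
    b = (word >> 20) & 0xF    # first source register
    op = (word >> 16) & 0xF   # operation code
    if fmt == 0:
        c = word & 0xF        # second source register
        return f"{OPS[op]} R{a},R{b},R{c}"
    if fmt == 1:
        im = word & 0xFFFF    # 16-bit immediate operand
        return f"{OPS[op]} R{a},0x{im:08X}"
    if fmt == 3:
        off = word & 0xFFFFFF
        if off & 0x800000:    # sign-extend the 24-bit offset
            off -= 1 << 24
        return f"B {off:+d}"
    raise ValueError("memory instructions are not handled in this sketch")


for w in (0x40000011, 0x41000022, 0x42000044, 0x0306FF01, 0x0416FF02, 0xE7FFFFFF):
    print(f"{w:08X}  {decode(w)}")
```

The branch is printed with its raw word offset (-1), which, relative to the
already incremented program counter, points back at the branch itself.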

4. Now please take a look at 1.png. Just a short reminder of how to interpret
such pictures of a synchronous automaton: If you see a positive clock
edge (0 -> 1) and a signal changing state at "the same instant", it means
that the clock edge loaded new values into the state elements, and a fraction
of the clock cycle time later, other signals follow. The new state of the
dependent signals will be recognized by the state elements only at the next
clock edge - not at the clock edge which caused the change.

Here is a short explanation of the signals shown:
clk      the main clock, 25 MHz (cycle time 40 ns)
rst      reset signal, active-low
pc       the current value of the program counter
adr      the address of the next instruction to be fetched
codebus  instructions are transferred on the code bus
IR       the current value of the instruction register
B        first ALU operand
C1       second ALU operand
aluRes   ALU result
wr       write enable for the register file
clk      the main clock again, as a reference for when exactly data is written
din      data to be written to a register on the next clock edge
rno0     register number of the destination register

(If you do the experiment yourself, you can of course look into *all* signals
of the machine - I just selected a few that I deemed to be important.)

You see that after reset is released (at 1960 ns), one instruction arrives
with each clock cycle (becoming stable with the falling clock edge, due to
the clock inversion already discussed above). Results are written into the
registers at the clock rate, one per cycle:
time (ns) value          register
2000      0x00000011     0
2040      0x00000022     1
2080      0x00000044     2
2120      0x00000033     3
2160      0x00000066     4

So question a) is answered: the throughput is 1 instruction per clock cycle.
Now going on to b)... :-)
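
The throughput figure can be recomputed from the table with a few lines of
Python. This is only arithmetic on the five write times read off the waveform;
the variable names are mine.

```python
# Register write times in ns, copied from the table above.
write_times_ns = [2000, 2040, 2080, 2120, 2160]
CYCLE_NS = 40  # 25 MHz main clock

# Distance between consecutive register writes.
gaps_ns = [t1 - t0 for t0, t1 in zip(write_times_ns, write_times_ns[1:])]

# Instructions completed per clock cycle over the observed window.
elapsed_cycles = (write_times_ns[-1] - write_times_ns[0]) / CYCLE_NS
throughput = (len(write_times_ns) - 1) / elapsed_cycles

print(gaps_ns, throughput)
```

Every gap is exactly one cycle time, so no bubbles appear between the five
register writes.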

We are interested in the time it takes for a single instruction to be fetched
and executed, and to have its result written back to a register. I pick the first IOR
instruction at (relative) address 0x000C (this instruction is located at byte
address 0xFFE00C, word address 0x3FF803). The instruction cycle begins with
the CPU sending out the address 0xFFE00C at 2040 ns. The PROM answers with the
instruction 0x0306FF01, which is latched into IR at 2080 ns. Both operands
(0x00000011, 0x00000022) are fetched and or'ed together, yielding the result
0x00000033, which is then written to the destination register 3 at 2120 ns.
This ends the instruction cycle, which comprises 2 clock cycles (80 ns).
So the answer to question b) is "the latency is two clock cycles".
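
The numbers in this walk-through are easy to double-check. The sketch below
only redoes the arithmetic; PROM_BASE is not a value I read from the sources
but simply the difference between the byte address and the relative address
given above.

```python
CYCLE_NS = 40

# Address arithmetic for the first IOR instruction.
PROM_BASE = 0xFFE00C - 0x000C   # derived from the two addresses in the text
byte_addr = PROM_BASE + 0x000C
word_addr = byte_addr >> 2      # 4 bytes per instruction word

# Event times in ns, read off the waveform for this instruction.
t_adr = 2040   # CPU sends out the address
t_ir = 2080    # instruction is latched into IR
t_wr = 2120    # result is written to R3

latency_cycles = (t_wr - t_adr) // CYCLE_NS
result = 0x00000011 | 0x00000022   # the two operands or'ed together

print(hex(word_addr), latency_cycles, hex(result))
```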

This is exactly the behavior of a two-stage pipeline. The two stages doing work
in parallel are the execute stage for an instruction and the fetch stage for
the next one.
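
To illustrate the overlap, here is a toy Python model of such a two-stage
machine. It is not derived from the Verilog sources: it assumes that the first
instruction is latched into IR at the clock edge at 1960 ns, handles only the
MOV-immediate and IOR-register forms used in the test program, and treats the
final branch as "stay on the last word" without modelling any branch timing.
All names in it are hypothetical.

```python
CYCLE_NS = 40
FIRST_EDGE_NS = 1960  # assumption: first instruction enters IR at this edge

PROM = [0x40000011, 0x41000022, 0x42000044,
        0x0306FF01, 0x0416FF02, 0xE7FFFFFF]


def execute(ir, regs):
    """Return (register, value) for MOV-immediate or IOR-register, else None."""
    fmt = ir >> 30
    a, b, op = (ir >> 24) & 0xF, (ir >> 20) & 0xF, (ir >> 16) & 0xF
    if fmt == 1 and op == 0:              # MOV Ra, im
        return a, ir & 0xFFFF
    if fmt == 0 and op == 6:              # IOR Ra, Rb, Rc
        return a, regs[b] | regs[ir & 0xF]
    return None                           # branch: no register write


def simulate(n_edges):
    regs = [0] * 16
    ir = None      # nothing in the execute stage yet
    pc = 0
    writes = []
    for k in range(n_edges):
        t = FIRST_EDGE_NS + k * CYCLE_NS
        # Execute stage: the instruction currently in IR writes its result.
        if ir is not None:
            wb = execute(ir, regs)
            if wb is not None:
                reg, value = wb
                regs[reg] = value
                writes.append((t, value, reg))
        # Fetch stage, at the same edge: the next instruction enters IR.
        ir = PROM[pc]
        pc = min(pc + 1, len(PROM) - 1)   # crude stand-in for "B stop"
    return writes


for t, value, reg in simulate(7):
    print(f"{t}      0x{value:08X}     {reg}")
```

Under these assumptions the model reproduces the five rows of the table above:
each instruction spends one cycle being fetched and one being executed, while a
new instruction enters IR at every clock edge.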

Best regards,
Hellwig