[Oberon] ALU 2015 and 2018
hellwig.geisse at mni.thm.de
Sat May 14 01:39:43 CEST 2022
Hi Paul, Jörg, Wojtek,
I now have a working simulation of RISC5 (in NW's original form)
and can prove that this is in fact a 2-stage pipeline.
1. You can do the experiment yourself, if you clone
https://github.com/hgeisse/THM-Oberon and go to subdirectory
fpga-RISC5/v0. You will need the Icarus Verilog compiler/simulator
http://iverilog.icarus.com and the GTK Wave Viewer
http://gtkwave.sourceforge.net . The Makefile compiles and simulates
the RISC5 system with a PROM that contains a short test program (see
below). If you don't want to do that, I have taken a screen shot
(1.png) showing the simulation result, which I discuss below.
2. The test program is read from the PROM. The timing is slightly
different from SRAM (note the .clk(~clk) inversion of the clock in
the instantiation of PROM.v in RISC5Top.v, which I copied without
change from the original). This makes no difference, however: the
experiment can be repeated with the program coming from SRAM without
any change in the result.
3. Here is the test program, contained in prom.mem:
40000011 // MOV R0,0x00000011
41000022 // MOV R1,0x00000022
42000044 // MOV R2,0x00000044
0306FF01 // IOR R3,R0,R1
0416FF02 // IOR R4,R1,R2
E7FFFFFF // stop: B stop
The hex value in the first column is the actual content of the memory
word, the comment following is the instruction as assembler mnemonic.
The important instructions are the logical ORs, which fetch two operands
from the registers and write the result back to another register.
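
To check the mnemonics against the hex words, here is a minimal decoder sketch in Python. It covers only the register-instruction and branch formats that occur in prom.mem. The field layout is my assumption, based on the published RISC5 instruction formats, and is not taken from the THM-Oberon sources. The names decode, OPS and PROM are purely illustrative.

```python
# Decoder sketch for the six words in prom.mem (assumed RISC5 field layout).
OPS = ["MOV", "LSL", "ASR", "ROR", "AND", "ANN", "IOR", "XOR",
       "ADD", "SUB", "MUL", "DIV", "FAD", "FSB", "FML", "FDV"]

def decode(word):
    """Return a readable form of one 32-bit instruction word."""
    fmt = word >> 30                   # top two bits select the format
    if fmt in (0, 1):                  # register instructions
        a = (word >> 24) & 0xF         # destination register
        b = (word >> 20) & 0xF         # first source register
        op = OPS[(word >> 16) & 0xF]
        if fmt == 0:
            src = f"R{word & 0xF}"     # second operand is register c
        else:
            src = f"0x{word & 0xFFFF:08X}"  # 16-bit immediate, zero-extended
        if op == "MOV":
            return f"MOV R{a},{src}"
        return f"{op} R{a},R{b},{src}"
    if fmt == 3 and (word >> 29) & 1:  # branch with 24-bit signed offset
        cond = (word >> 24) & 0xF
        off = word & 0xFFFFFF
        if off & 0x800000:             # sign-extend the offset
            off -= 1 << 24
        return f"B cond={cond} offset={off}"
    raise ValueError("format not covered by this sketch")

PROM = [0x40000011, 0x41000022, 0x42000044,
        0x0306FF01, 0x0416FF02, 0xE7FFFFFF]

for word in PROM:
    print(f"{word:08X}  {decode(word)}")
```

The last word decodes to condition 7 with an offset of -1. Assuming the offset is relative to the following instruction, this is a branch to itself, which matches the "stop: B stop" comment.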
We are interested in the answers to two questions:
a) How many instructions are executed per clock cycle, i.e., what is the
instruction throughput of the machine?
b) How many cycles does a single instruction need to be fetched, decoded,
executed, and have its result written to the destination register, i.e.,
what is the instruction latency of the machine?
4. Now please take a look at 1.png. Just a short reminder of how to interpret
such pictures, showing a synchronous automaton: If you see a positive clock
edge (0 -> 1) and a signal changing state at "the same instant", it means
that the clock edge loaded new values into the state elements, and a fraction
of the clock cycle time later, other signals follow. The new state of the
dependent signals will be recognized by the state elements only at the next
clock edge - not at the clock edge which caused the change.
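
The reading rule above can be illustrated with a toy model in Python. It is only a sketch of two registers in series and has nothing to do with the actual RISC5 Verilog. The names clock_edge, regs and trace are made up for this example.

```python
def clock_edge(regs, d_in):
    """One rising clock edge for a chain of two registers.

    Both registers sample at the same instant, so r2 captures the value
    that r1 held *before* the edge, not the value r1 takes at this edge.
    """
    r1, r2 = regs
    return (d_in, r1)

regs = (0, 0)
trace = []
for d_in in (7, 7, 7):
    regs = clock_edge(regs, d_in)
    trace.append(regs)

print(trace)
```

After the first edge only r1 holds the new value. r2 follows one edge later, which is the one-cycle delay you see between dependent signals in the waveform.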
Here is a short explanation of the signals shown:
clk      the main clock, 25 MHz (cycle time 40 ns)
rst      reset signal, active-low
pc       the current value of the program counter
adr      the address of the next instruction to be fetched
codebus  instructions are transferred on the code bus
IR       the current value of the instruction register
B        first ALU operand
C1       second ALU operand
aluRes   ALU result
wr       write enable for the register file
clk      the main clock again, just for reference when exactly data is written
din      data to be written to a register on the next clock edge
rno0     register number of the destination register
(If you do the experiment yourself, you can of course look into *all* signals
of the machine - I just selected a few that I deemed to be important.)
You see that after reset is released (at 1960 ns), one instruction arrives in
each clock cycle (becoming stable at the falling clock edge, due to the clock
inversion already discussed above). Results are written into the registers at
the clock rate:
time (ns)   value        register
2000        0x00000011   0
2040        0x00000022   1
2080        0x00000044   2
2120        0x00000033   3
2160        0x00000066   4
So question a) is answered: the throughput is 1 instruction per clock cycle.
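
The arithmetic behind this answer can be checked with a few lines of Python. This is only a sketch: the write times and values are copied from the table above, and the names CYCLE_NS, writes, gaps and ior_ok are chosen for this example.

```python
CYCLE_NS = 40  # 25 MHz clock, as stated in the signal list

# (time in ns, value written, destination register), copied from the table
writes = [
    (2000, 0x11, 0),
    (2040, 0x22, 1),
    (2080, 0x44, 2),
    (2120, 0x33, 3),
    (2160, 0x66, 4),
]

# Spacing between consecutive register writes, in clock cycles.
gaps = [(t1 - t0) // CYCLE_NS
        for (t0, _, _), (t1, _, _) in zip(writes, writes[1:])]

# The two IOR results should equal the OR of the constants loaded before.
regs = {r: v for _, v, r in writes}
ior_ok = (regs[3] == (regs[0] | regs[1])) and (regs[4] == (regs[1] | regs[2]))

print(gaps, ior_ok)
```

Every gap is one clock cycle, so one result is written per cycle. The values written to registers 3 and 4 are indeed the ORs of the constants loaded before.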
Now going on to b)... :-)
We are interested in the time it takes for a single instruction to be fetched,
executed, and its result written back to a register. I pick the first IOR
instruction at (relative) address 0x000C (this instruction is located at byte
address 0xFFE00C, word address 0x3FF803). The instruction cycle begins with
the CPU sending out the address 0xFFE00C at 2040 ns. The PROM answers with the
instruction 0x0306FF01, which is latched into IR at 2080 ns. Both operands
(0x00000011, 0x00000022) are fetched and or'ed together, yielding the result
0x00000033, which is then written to the destination register 3 at 2120 ns.
This ends the instruction cycle, which comprises 2 clock cycles (80 ns).
So the answer to question b) is "the latency is two clock cycles".
This is exactly the behavior of a two-stage pipeline. The two stages doing work
in parallel are the execute stage for an instruction and the fetch stage for
the next one.
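
The latency figure and the overlap of the two stages can be written down as a small Python sketch. The three timestamps are the ones quoted above for the first IOR. The schedule function is a hypothetical model that simply assumes every instruction follows the same timing, shifted by one cycle per instruction. For the earlier instructions the fetch windows are therefore extrapolated, and only the resulting write times are compared with the table.

```python
CYCLE_NS = 40
FIRST_IOR_ADDRESS_OUT_NS = 2040   # address 0xFFE00C sent out
FIRST_IOR_IR_LATCH_NS = 2080      # 0x0306FF01 latched into IR
FIRST_IOR_WRITE_NS = 2120         # 0x00000033 written to register 3

latency_cycles = (FIRST_IOR_WRITE_NS - FIRST_IOR_ADDRESS_OUT_NS) // CYCLE_NS

def schedule(index):
    """Fetch and execute windows (start, end) in ns for instruction `index`.

    Assumes the timing observed for the first IOR (index 3) holds for
    every instruction, shifted by one clock cycle per instruction.
    """
    fetch_start = FIRST_IOR_ADDRESS_OUT_NS + (index - 3) * CYCLE_NS
    fetch = (fetch_start, fetch_start + CYCLE_NS)
    execute = (fetch[1], fetch[1] + CYCLE_NS)
    return fetch, execute

# Execute window of the first IOR versus fetch window of the second IOR.
overlap = schedule(3)[1] == schedule(4)[0]

# Write times predicted by the model for the five register instructions.
write_times = [schedule(i)[1][1] for i in range(5)]

print(latency_cycles, overlap, write_times)
```

Under this assumption the model reproduces the five write times from the table, with each instruction taking two cycles while a new one completes every cycle.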