[Oberon] ALU 2015 and 2018

Hellwig Geisse hellwig.geisse at mni.thm.de
Wed May 11 15:44:47 CEST 2022


Hi Paul,

On Mi, 2022-05-11 at 13:23 +0100, Paul Reed wrote:
> 
> > do we agree that "normal" instructions (not LD/ST, not MUL/DIV)
> > have a latency of two clock cycles
> Thanks for your efforts to clarify, but no; I think you could argue it's 
> one-cycle latency though.

I don't think so. How long does it take for the very first instruction
to be fetched and executed? That is the latency (the calculation is of
course valid for every instruction, not only for the very first one).
The address is put on the address bus, the memory answers asynchronously
(as you noted), and the instruction is clocked into the IR. The operands
are fetched (from the registers, asynchronously again), the result is
computed, and then clocked into the destination register. Total: two
clock cycles.
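As a minimal sketch of that count (a toy Python model, not the actual Verilog; the names pc, ir, regs and the decoded tuple are illustrative, not taken from the RISC5 sources), the two clock edges look like this:

```python
# Toy model of one instruction's latency on a two-stage machine:
# two clocked state updates, with all work between edges combinational.
# Names (pc, ir, regs) are made up for illustration.

def run_one_instruction(memory, regs, pc):
    # Clock edge 1: the instruction word, addressed by pc and answered
    # asynchronously by the memory, is latched into the IR.
    ir = memory[pc]

    # Between edges: operands are read from the register file and the
    # result is computed -- all combinational, no extra cycle.
    op, dst, src_a, src_b = ir            # a made-up "decoded" form
    result = regs[src_a] + regs[src_b] if op == "ADD" else None

    # Clock edge 2: the result is latched into the destination register.
    regs[dst] = result
    return 2                              # total latency: two clock cycles

regs = {0: 0, 1: 5, 2: 7}
memory = {0: ("ADD", 0, 1, 2)}
cycles = run_one_instruction(memory, regs, 0)
print(cycles, regs[0])                    # 2 12
```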

I suspect that you want to leave the fetch cycle out of the calculation,
but no instruction can be executed without being fetched first. Overlapping
the execution of one instruction with the fetch of the next one plays no
role in computing the latency - this is exactly the difference between
latency and throughput.
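The distinction can be put in numbers (a back-of-the-envelope Python sketch; the instruction count is made up):

```python
# Latency vs. throughput for a two-stage fetch/execute overlap.
# Each instruction still takes 2 cycles from fetch to writeback (latency),
# but because the fetch of instruction i+1 overlaps the execution of
# instruction i, n instructions finish in n + 1 cycles, not 2 * n.

LATENCY = 2  # cycles per instruction: fetch + execute

def total_cycles(n, overlapped):
    if overlapped:
        return n + (LATENCY - 1)  # pipeline fill, then one result per cycle
    return n * LATENCY            # strictly sequential, no overlap

n = 1000
print(total_cycles(n, overlapped=False))  # 2000
print(total_cycles(n, overlapped=True))   # 1001
# Throughput approaches 1 instruction/cycle; the latency stays 2 cycles.
```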

> Without wanting to sound like Humpty Dumpty I think we need to be very 
> careful what we mean. :)
> 

Of course, I agree. But I try to use established vocabulary and
definitions. :-)

> I'm not sure you can use standard definitions when applied to RISC5, 
> because of its asynchronous nature, but in answer to

Hold a moment... RISC5 is a synchronous automaton, period. Of course,
it has combinational circuits between its state elements, but the state
elements themselves are all clocked with the same clock and the same
clock edge. This is the definition of a synchronous circuit.

That the memory system is unusual with respect to clocking is of no
concern here: it is simply one of the combinational circuits between
state elements. (This is one of the main obstacles in porting the
design, as you also noted.)

> > 
> > What, then, is your exact definition of a two-stage pipeline?
> I'll have a go:
> 
> Where the next instruction has been fetched and is held in the CPU, 
> being decoded, in parallel with the execution phase of the current 
> instruction; and the next instruction after that (which may need to be 
> discarded) is being fetched.
> 

This is normally called a three-stage pipeline: three instructions are
concurrently "in flight". The throughput is three times the reciprocal
of the latency (assuming there is no stall).
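In that nomenclature the relation is simple enough to state as a one-liner (a generic sketch for a k-stage pipeline with one cycle per stage, not specific to RISC5):

```python
# For a k-stage pipeline clocked one cycle per stage (and no stalls):
# latency = k cycles, peak throughput = 1 instruction/cycle,
# i.e. throughput = k * (1 / latency), with k instructions "in flight".

def pipeline_stats(k):
    latency = k                       # cycles from fetch to completion
    throughput = k * (1.0 / latency)  # "k times the reciprocal of the latency"
    in_flight = k                     # instructions concurrently in the pipe
    return latency, throughput, in_flight

print(pipeline_stats(3))  # (3, 1.0, 3)
```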
 
> It may also be why Wojtek is looking for parallelism which simply
> isn't there.

This is my point: a bit of parallelism *is* there! Fetching the next
and executing the current instruction are done concurrently.

So it may simply be that we are counting pipe stages differently, but
I claim the usual nomenclature is on my side... ;-) See for example
https://sites.google.com/site/processorv1/instruction-pipelining/two-stage-instruction-pipeline
http://www.columbia.edu/~yc3096/cad/ee477Final.pdf
https://inst.eecs.berkeley.edu/~cs250/fa11/handouts/lab2-riscv-v2-vcs-dc.pdf

Best regards,
Hellwig
