[Oberon] ALU 2015 and 2018
Jörg
joerg.straube at iaeth.ch
Wed May 11 19:23:15 CEST 2022
Woytek
I agree that this old Verilog code
aluRes =
MOV ? C1 :
LSL ? t3 :
ASR ? s3 :
ROR ? s3 :
AND ? B & C1 :
ANN ? B & ~C1 :
IOR ? B | C1 :
XOR ? B ^ C1 :
ADD ? B + C1 + (u & C) :
SUB ? B - C1 - (u & C) :
MUL ? product[31:0] :
DIV ? quotient : 0;
looks nicer than this new one
aluRes =
~op[3] ?
(~op[2] ?
(~op[1] ?
(~op[0] ?
(q ? // MOV
(~u ? {{16{v}}, imm} : {imm, 16'b0}) :
(~u ? C0 : (~v ? H : {N, Z, C, OV, 20'b0, 8'h53}))) :
lshout) : // LSL
rshout) : // ASR, ROR
(~op[1] ?
(~op[0] ? B & C1 : B & ~C1) : // AND, ANN
(~op[0] ? B | C1 : B ^ C1))) : // IOR, XOR
(~op[2] ?
(~op[1] ?
(~op[0] ? B + C1 + (u&C) : B - C1 - (u&C)) : // ADD, SUB
(~op[0] ? product[31:0] : quotient)) : // MUL, DIV
0);
The new one has a shorter maximum latency! E.g., in the old code the result for DIV passes through 11 multiplexers; in the new one, passing through 4 multiplexers is enough for all instructions.
Sequential vs. binary search.
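The difference can be sketched on a small scale (a hypothetical example, not from RISC5.v; the select signals and operands are made up):

```verilog
// Hypothetical 4-way selection written both ways.
// Priority chain: selecting d passes through 3 multiplexers in series.
assign y_chain = selA ? a : selB ? b : selC ? c : d;

// Balanced decode of a 2-bit opcode: every alternative passes through
// exactly 2 multiplexers (log2 of the number of alternatives).
assign y_tree = op[1] ? (op[0] ? d : c)
                      : (op[0] ? b : a);
```

With 16 alternatives, the chain has a worst-case depth of 15 multiplexers, while the balanced tree needs only log2(16) = 4.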
Br Jörg
> Am 11.05.2022 um 17:52 schrieb Hellwig Geisse <hellwig.geisse at mni.thm.de>:
>
> Hi Jörg,
>
>> On Mi, 2022-05-11 at 16:27 +0200, Jörg wrote:
>>
>> My understanding of the RISC5 architecture is that the "handshake" of CPU and memory is done via
>> "adr" and "codebus".
>> During one cycle, IR is stable, as it is held in a register.
>> When the cycle starts, decoding IR takes some combinational delay until the "adr" lines are
>> stable (typically the first half of the cycle). Once stable, they are fed (through RISC5Top) to
>> the memory, and because the SRAM is fast enough, "codebus" (output from memory and input to RISC5)
>> is stable BEFORE the next cycle starts and clocks the next IR in.
>> So, decoding and fetching happen in the same cycle - I mean for "normal" instructions, not LD/ST.
>>
>
> yes and no - you discuss the address-forming part correctly, but
> forget the data path for "normal" arithmetic instructions (Fig 16.8
> of PO.Computer.pdf has an arrow pointing from IR to "decode" in
> the data path). So what you essentially say is: in every clock cycle
> a new instruction is fetched. I agree completely; the throughput is
> one instruction per clock cycle. But this is not the latency - the
> data path uses up another clock cycle to compute the result and write
> it into the destination register.
>
> I stand by my statement that we have a two-stage pipeline for "normal"
> arithmetic instructions. Things get much fuzzier when we turn our attention
> to branches. Indeed, after writing my last mail in the discussion with Paul,
> a nagging question remained: why isn't there a pipeline flush when branching
> occurs? A proper two-stage pipeline would have fetched an instruction which
> should not be executed; it must be nulled in the pipeline ("flush" the pipe).
> A quick look into the sources reveals why (and you describe it in your
> statements above): the current instruction is allowed to modify the "next"
> address of the control unit. The pipeline is effectively shortened by one
> stage. This trick is normally abhorred in pipeline design, as it lengthens
> the cycle time noticeably (the address for the next instruction is changed
> only near the end of the cycle, so the time needed for decoding
> the branch instruction adds to the memory access time for the next instruction:
> sum instead of maximum). And this price is paid for every instruction - not
> only for branches.
>
> I think these findings (an instruction-dependent number of pipeline stages)
> explain the different standpoints in the foregoing discussion very well.
>
> All: Thanks for your "food for thought"!
>
> Best regards,
> Hellwig
> --
> Oberon at lists.inf.ethz.ch mailing list for ETH Oberon and related systems
> https://lists.inf.ethz.ch/mailman/listinfo/oberon