[Oberon] ALU 2015 and 2018

Jörg joerg.straube at iaeth.ch
Wed May 11 19:23:15 CEST 2022


Woytek

I agree that this old Verilog code
aluRes =
  MOV ? C1 :
  LSL ? t3 :
  ASR ? s3 :
  ROR ? s3 :
  AND ? B & C1 :
  ANN ? B & ~C1 :
  IOR ? B | C1 :
  XOR ? B ^ C1 :
  ADD ? B + C1 + (u & C) :
  SUB ? B - C1 - (u & C) :
  MUL ? product[31:0] :
  DIV ? quotient : 0;

looks nicer than this new one
aluRes =
  ~op[3] ?
    (~op[2] ?
      (~op[1] ?
        (~op[0] ?
          (q ?  // MOV
            (~u ? {{16{v}}, imm} : {imm, 16'b0}) :
            (~u ? C0 : (~v ? H : {N, Z, C, OV, 20'b0, 8'h53}))) :
          lshout) :  // LSL
        rshout) :  // ASR, ROR
      (~op[1] ?
        (~op[0] ? B & C1 : B & ~C1) :  // AND, ANN
        (~op[0] ? B | C1 : B ^ C1))) : // IOR, XOR
    (~op[2] ?
      (~op[1] ?
        (~op[0] ? B + C1 + (u&C) : B - C1 - (u&C)) :  // ADD, SUB
        (~op[0] ? product[31:0] : quotient)) :  // MUL, DIV
      0);
The new one has a shorter worst-case latency! E.g., in the old code, decoding DIV passes through 11 multiplexers; in the new one, 4 multiplexers suffice for every instruction.
Sequential vs. binary search.
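The "sequential vs. binary search" point can be sketched with a small Python model (illustrative only; the mux counts are my reading of the two snippets above, with ASR and ROR counted as one level since both select s3):

```python
# Illustrative latency model (plain Python, not Verilog): count how many
# 2:1-mux levels a result passes through before reaching aluRes.
# The counts are assumptions read off the two code snippets above;
# ASR and ROR both select s3, so they are treated as one mux level.

# Selection terms of the old sequential priority chain, in order.
CHAIN = ["MOV", "LSL", "ASR/ROR", "AND", "ANN",
         "IOR", "XOR", "ADD", "SUB", "MUL", "DIV"]

def chain_depth(term):
    # A result chosen by term N must pass through muxes 1..N of the chain.
    return CHAIN.index(term) + 1

def tree_depth():
    # Binary decode on op[3:0]: one mux per opcode bit examined,
    # i.e. at most 4 levels for any instruction (the MOV sub-cases
    # on q/u/v add a few more levels, ignored here).
    return 4

worst_chain = max(chain_depth(t) for t in CHAIN)  # DIV, at the chain's end
worst_tree = tree_depth()
print(f"priority chain, worst case (DIV): {worst_chain} mux levels")
print(f"binary decode,  worst case:       {worst_tree} mux levels")
```

With the chain, the worst-case depth grows linearly in the number of selection terms; with the binary decode it grows only with the number of opcode bits.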
Br Jörg

> Am 11.05.2022 um 17:52 schrieb Hellwig Geisse <hellwig.geisse at mni.thm.de>:
> 
> Hi Jörg,
> 
>> On Mi, 2022-05-11 at 16:27 +0200, Jörg wrote:
>> 
>> My understanding of the RISC5 architecture is that the „handshake“ of CPU and memory is done via
>> „adr“ and „codebus“
>> During one cycle IR is stable as it is in a register.
>> When the cycle starts, the decoding of IR needs some combinational delay to get the „adr“ lines
>> stable (typically during the first half of the cycle). Once stable, they are fed (through RISC5Top)
>> to the memory, and because the SRAM is fast enough, „codebus“ (the output from memory and input to
>> RISC5) is stable BEFORE the next cycle starts and clocks the next IR in.
>> So, decoding and fetching are in the same cycle. I mean for „normal“ instructions, not LD/ST.
>> 
> 
> yes and no - you discuss the address-forming part correctly, but
> forgot the data path for "normal" arithmetic instructions (Fig 16.8
> of PO.Computer.pdf has an arrow pointing from IR to "decode" in
> the data path). So what you essentially say is: in every clock cycle
> a new instruction is fetched. I agree completely; the throughput is
> one instruction per clock cycle. But this is not the latency - the
> data path uses up another clock cycle to compute the result and write
> it into the destination register.
> 
> I stand by my statement that we have a two-stage pipeline for "normal"
> arithmetic instructions. Things get much fuzzier when we turn our attention
> to branches. Indeed, after writing my last mail in the discussion with Paul,
> a nagging question remained: why isn't there a pipeline flush when branching
> occurs? A proper two-stage pipeline would have fetched an instruction which
> should not be executed; it must be nulled in the pipeline ("flush" the pipe).
> A quick look into the sources reveals why (and you describe it in your
> statements above): the current instruction is allowed to modify the "next"
> address of the control unit. The pipeline is effectively shortened by one
> stage. This trick is normally abhorred in pipeline design, as it lengthens
> the cycle time noticeably (already almost at the end of the cycle the address
> for the next instruction is changed, so it adds the time needed for decoding
> the branch instruction to the memory access time for the next instruction:
> sum instead of maximum). And this price is paid for every instruction - not
> only for branches.
> 
> I think these findings (an instruction-dependent number of pipeline stages)
> explain the different standpoints in the foregoing discussion very well.
> 
> All: Thanks for your "food for thought"!
> 
> Best regards,
> Hellwig
> --
> Oberon at lists.inf.ethz.ch mailing list for ETH Oberon and related systems
> https://lists.inf.ethz.ch/mailman/listinfo/oberon
