[Oberon] Fast version of Oberon RISC5 for Pepino

Tue Mar 1 01:16:57 CET 2016

On 2/29/2016 4:00 PM, Magnus Karlsson wrote:

Sorry, I meant 32 clocks for the multiply, not 16 clocks.

Magnus

> The current RISC5 Verilog code does not take advantage of the fact that 
> both Spartan3 and Spartan6 have hardware multipliers.  Instead it does 
> the multiply with 32 additions, which causes the CPU to stall 
> for 32 clocks on each multiplication.
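
For readers who have not seen a sequential multiplier, here is a minimal sketch of the general shift-and-add idea, unsigned case only. It is not the RISC5 source: the module name shift_add_mult, the step counter S and its encoding are made up for illustration, and the exact stall count of the real code may differ by a clock.

```verilog
// Hypothetical unsigned shift-and-add multiplier (illustration only).
module shift_add_mult (
  input clk, run,
  output stall,
  input [31:0] x, y,
  output [63:0] z);

reg [5:0] S;    // step counter: 0 = load, 1..32 = add/shift, 33 = done
reg [63:0] P;   // high half: partial sum, low half: unused bits of x

// 33-bit sum so the carry out of the 32-bit add is not lost
wire [32:0] sum = {1'b0, P[63:32]} + (P[0] ? {1'b0, y} : 33'd0);

assign z = P;
assign stall = run & (S != 6'd33);   // hold the CPU until all steps are done

always @(posedge clk) begin
  if (!run) S <= 6'd0;
  else if (S == 6'd0) begin
    P <= {32'b0, x};                 // load the multiplier into the low half
    S <= 6'd1;
  end else if (S != 6'd33) begin
    P <= {sum, P[31:1]};             // add y if bit 0 is set, then shift right
    S <= S + 6'd1;
  end
end

endmodule
```

Each of the 32 add/shift steps takes one clock, which is where the long stall comes from.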
>
> As a test I rewrote the multiplier code so that the built-in hardware 
> multipliers are used instead, and with this code the CPU is only 
> stalled for 1 clock instead of 32 clocks.  The code uses the 
> Verilog-2001 syntax to specify a signed and an unsigned multiplier, and 
> the u input signal determines whether the unsigned or the signed result 
> is used.  This makes the code, in my opinion, easier to understand than 
> the current adder-based code.
>
> Here is the code:
>
> `timescale 1ns / 1ps   // MK 29.2.2016
>
> module Multiplier(
>   input clk, run, u,
>   output stall,
>   input [31:0] x, y,
>   output [63:0] z);
>
> wire [63:0] z_signed, z_unsigned;
> reg [63:0] P;
> reg S;
>
> assign z = P;
> assign stall = run & ~S;
>
> mult_signed ms (.x(x), .y(y), .z(z_signed));
> mult_unsigned mu (.x(x), .y(y), .z(z_unsigned));
>
> always @ (posedge(clk)) begin
>   P <= u ? z_unsigned : z_signed;
>   S <= run;
> end
>
> endmodule
>
> module mult_signed (
>   input signed [31:0] x,
>   input signed [31:0] y,
>   output signed [63:0] z);
>
> assign z = x * y;
>
> endmodule
>
> module mult_unsigned (
>   input [31:0] x,
>   input [31:0] y,
>   output [63:0] z);
>
> assign z = x * y;
>
> endmodule
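
The signed/unsigned distinction can be checked in a simulator with a few lines. This is a small sketch, not part of the Pepino project; the module name signed_vs_unsigned_demo is made up, and it assumes a Verilog-2001 simulator.

```verilog
// Hypothetical demo: same bit patterns, signed versus unsigned product.
module signed_vs_unsigned_demo;
  reg [31:0] x, y;

  // The 64-bit target makes Verilog extend the operands to 64 bits before
  // multiplying: sign extension when both operands are signed, zero
  // extension otherwise.  That is why a plain x * y yields the full product.
  wire signed [63:0] z_signed   = $signed(x) * $signed(y);
  wire        [63:0] z_unsigned = x * y;

  initial begin
    x = 32'hFFFF_FFFF;   // -1 as signed, 4294967295 as unsigned
    y = 32'd2;
    #1;
    // z_signed is -2          (64'hFFFF_FFFF_FFFF_FFFE)
    // z_unsigned is 8589934590 (64'h0000_0001_FFFF_FFFE)
    $display("signed:   %h", z_signed);
    $display("unsigned: %h", z_unsigned);
    $finish;
  end
endmodule
```

The upper 32 bits are the only part that differs between the two results, so the choice only matters when the high word of the product is used.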
>
>
> This version of the code has been successfully tested on the Pepino board.
>
> Cheers,
> Magnus
>
>
>
> On 2/26/2016 9:59 AM, Magnus Karlsson wrote:
>> One outcome of the discussion about the Oberon RISC5 Verilog code is 
>> that I took a closer look at the clock limits for the project and 
>> found that the RISC5 CPU itself can be clocked at up to about 66 
>> MHz, but the external SRAM path is too slow for that speed (the read 
>> path is the problem).  The asynchronous nature of the SRAM interface 
>> makes it hard to constrain the ISE compiler to work hard on this path.
>>
>> I traced the SRAM read data path and found that it takes about 10 
>> ns from the SRAM data input pins to the Z register bit (this is the 
>> longest path).  The address output path is about 5 ns, and with a 10 
>> ns SRAM access time the fastest system clock cycle should be around 
>> 25 ns (5 + 10 + 10), i.e. about 40 MHz.
>>
>> To test this I created a version of the code that runs the CPU at 
>> 37.5 MHz (26.67 ns) instead of 25 MHz, i.e. the CPU runs at 1/2 
>> the video clock rate instead of 1/3, and it seems to run fine on 
>> both the LX9 and the LX25 versions of Pepino.  All timing constants 
>> (UART Rx, UART Tx, SPI and millisecond timer) have been changed to 
>> reflect the 50% faster clock rate.
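
To illustrate why such constants have to follow the clock, here is a minimal sketch of a millisecond tick generator whose limit is derived from the clock frequency. It is not taken from the Pepino project: the module name ms_tick, the CLK_HZ parameter and the rst input are assumptions made for the example.

```verilog
// Hypothetical millisecond tick generator (illustration only).
module ms_tick #(parameter CLK_HZ = 37_500_000) (
  input clk, rst,
  output tick);

// Clocks per millisecond, minus one because the count starts at 0:
// 37499 at 37.5 MHz, 24999 at 25 MHz.  Both fit in 16 bits.
localparam LIMIT = CLK_HZ / 1000 - 1;

reg [15:0] cnt;

assign tick = (cnt == LIMIT);   // high for one clock every millisecond

always @(posedge clk)
  cnt <= (rst | tick) ? 16'd0 : cnt + 16'd1;

endmodule
```

With a hard-coded limit meant for 25 MHz, the same counter would tick every 0.67 ms at 37.5 MHz, so every time-derived constant needs the same 1.5x scaling.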
>>
>> If anyone wants to try it, the project (including bit files) is 
>> available here:
>> https://github.com/Saanlima/Pepino/tree/master/Projects/RISC5Verilog_Pepino_fast 
>>
>>
>> Cheers,
>> Magnus
>> -- 
>> Oberon at lists.inf.ethz.ch mailing list for ETH Oberon and related systems
>> https://lists.inf.ethz.ch/mailman/listinfo/oberon
>>
>
