[Oberon] RISC5 implementation issues.

Thu Feb 18 23:58:57 CET 2016

So I have been thinking about this some more and decided to update the 
design to address the concerns raised by Walter and Wojtek.

Just to recap, Walter's concern is that the clocks are generated using 
flip-flops and use logic fabric interconnect instead of dedicated 
clocking elements and pathways, and that all clocks should be generated 
by a DCM module instead (DCM = Digital Clock Manager).  Wojtek's concern 
is that there are unspecified timing relations between the 25MHz and the 
75MHz clock domains.

Both concerns are valid and in my opinion the correct way to fix both 
issues is to make the design completely synchronous.  This means that 
all clocked elements in the design (like flip-flops, memories etc.) 
should be clocked with a single clock signal, which in this case is the 
75MHz clock.  The CPU and I/O subsystem, which were previously clocked 
by a separate 25MHz clock, are now also clocked by the 75MHz clock but 
are only enabled on every third clock cycle.  This means that the body 
of every "always @ (posedge clk)" block is now wrapped in "if (enable) 
...", where "enable" is a signal that is true on every third clock 
cycle.  The asynchronous SRAM interface has also been changed so that 
the write signal is asserted during the middle third of the three-clock 
CPU cycle.
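
A minimal sketch of this pattern is shown here for illustration.  It is 
not taken from the Pepino sources: the module name, the signal names 
(phase, wr, sram_we_n, count) and the choice of which phase counts as 
the start of the CPU cycle are all assumptions.

```verilog
// Hypothetical sketch: one 75 MHz clock, CPU logic enabled every third cycle.
// All names are illustrative and not taken from the Pepino project.
module cpu_enable_sketch (
  input  wire       clk,        // 75 MHz clock, assumed to come from the DCM
  input  wire       rst,        // synchronous reset, active high
  input  wire       wr,         // CPU requests a write during this CPU cycle
  output wire       enable,     // high on one of every three clk cycles
  output reg        sram_we_n,  // active-low SRAM write strobe
  output reg  [7:0] count       // stand-in for any CPU register
);
  reg [1:0] phase;              // counts 0, 1, 2, 0, 1, 2, ...

  always @(posedge clk)
    if (rst) phase <= 2'd0;
    else     phase <= (phase == 2'd2) ? 2'd0 : phase + 2'd1;

  // CPU registers update on the clock edge that ends phase 2.
  assign enable = (phase == 2'd2);

  // Same clock as everything else; the enable does the dividing by three.
  always @(posedge clk)
    if (rst)         count <= 8'd0;
    else if (enable) count <= count + 8'd1;

  // Registered write strobe: low only during phase 1, the middle third,
  // so address and data are stable for one clk before and after the write.
  always @(posedge clk)
    if (rst) sram_we_n <= 1'b1;
    else     sram_we_n <= !(wr && (phase == 2'd0));
endmodule
```

Because sram_we_n comes straight from a flip-flop clocked by the same 
75MHz clock, its timing relative to the address and data outputs is 
fixed by the clock period rather than by routing delays.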

While the Verilog changes to do this are very straightforward, one 
complication is that the Xilinx ISE tool used to create the bit file 
for the FPGA does not understand that the CPU and I/O subsystem are 
only enabled on every third clock.  It will try to make the CPU run at 
75MHz and will fail, since this is too fast for the FPGA.  The solution 
is to tell the tool that all paths within the CPU and I/O subsystem can 
take three clocks to complete (these are called multi-cycle paths).  
With the multi-cycle paths added to the .ucf file the design compiles 
with no timing violations.
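
As a rough illustration, a multi-cycle constraint of this kind can be 
written in UCF along the following lines.  This is a sketch only: the 
net name "enable" and the group name "cpu_ffs" are hypothetical, and 
the actual constraints in the Pepino project may be organized 
differently.

```
# Hypothetical names; adjust to the nets in the actual design.
# Group the flip-flops reached from the clock-enable net "enable".
NET "enable" TNM_NET = FFS "cpu_ffs";

# Paths between those flip-flops get three 75 MHz periods
# (3 x 13.333 ns = 40 ns) instead of one.
TIMESPEC "TS_cpu_multi" = FROM "cpu_ffs" TO "cpu_ffs" 40 ns;
```

Paths that start or end outside the group are still checked against 
the ordinary 75MHz period constraint, so anything that really does 
change on every clock needs to stay out of the group.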

With those changes the 75MHz clock is now generated by a DCM and the 
unspecified timing relations that Wojtek brought up are now gone since 
everything is clocked with a single clock.  The modified design has 
been tested on Pepino and seems to run fine.

The complete ISE project with those changes is available at the Pepino 
GitHub repository: 
https://github.com/Saanlima/Pepino/tree/master/Projects/RISC5Verilog_Pepino

Any comments or critique are welcome.

Cheers,
Magnus

On 2/17/2016 2:16 PM, Walter Gallegos wrote:
> Hi Magnus,
>
> You are welcome to continue with FPGA specific topics by private 
> e-mail if you want.
>
> Regards
> Walter
>
> On 2016-02-17 at 18:30, Magnus Karlsson wrote:
>> Hi Walter,
>>
>> Since this is really Paul's design, I guess it would be more 
>> appropriate to discuss it with him, I was just trying to explain why 
>> it looks like it does.
>>
>> Cheers,
>> Magnus
>>
>>
>> On 2/17/2016 1:15 PM, Walter Gallegos wrote:
>>> Magnus,
>>>
>>> Some of the messages were delayed, so I will continue from here to 
>>> avoid overloading the list.
>>>
>>> If I understand you correctly, you justify an uncontrolled delay 
>>> because it simplifies the SRAM handling.
>>> Sorry, but that is like using the old circuit with an AND gate and 
>>> an inverter to generate a pulse. If you need a delayed signal you 
>>> should use the DCM 90°, 180° or 270° clock outputs and keep 
>>> everything under control. I don't think you need a state machine in 
>>> this case.
>>>
>>> About the ISE warnings, be careful: the absence of warnings does not 
>>> mean good methodology.
>>>
>>> About the Xilinx docs: really, I don't remember. In my training 
>>> courses, first as a Xilinx ATP and now as an independent consultant, 
>>> I cover this problem. Having an uncontrolled delay in a clock opens 
>>> a big door to random problems. An FPGA design must be synchronous at 
>>> all times; no exceptions.
>>>
>>> Regards,
>>> Walter
>>>
>>>
>>> On 2016-02-17 at 14:41, Magnus Karlsson wrote:
>>>> Walter,
>>>>
>>>> I agree with you that the "pure" way of doing this is as you 
>>>> stated, with a DCM to directly generate both clk and pclk. So how 
>>>> come Paul didn't do that?  It's not like he doesn't know how to use 
>>>> the DCM, after all the current code generates pclk from clk using a 
>>>> DCM, and there would probably be less code to do it like you 
>>>> suggest.  No, the reason for this is very subtle and is easy to 
>>>> miss if you just take a quick look at the code: it has to do with 
>>>> the asynchronous SRAM interface.
>>>>
>>>> One of the most critical aspects of using SRAM is to control the 
>>>> write signal - ideally the write signal should be asserted after 
>>>> all other control signals (like address, data, byte-enable, read, 
>>>> oe) are valid, and should be de-asserted well before any of the 
>>>> other control signals go invalid, to avoid spurious writes. 
>>>> However, this is not that easy to do in a synchronous system where 
>>>> all signals change at the clock edge.  The most common way to do 
>>>> this is to have a state machine that is clocked at, say, 4x the CPU 
>>>> clock so that you can divide the SRAM access cycle into several 
>>>> phases and assert the write signal on some of those phases.
>>>>
>>>> However, this is not the way Paul chose to do it; instead he 
>>>> chose to do a less "pure" clock generation by generating clk from 
>>>> a flip-flop rather than from a DCM.  By doing so, he actually 
>>>> generates an early version of the clock signal called clk that is 
>>>> leading the global clock signal clk_BUFG by the delay of the BUFG 
>>>> buffer.  Since this early version of the clock signal is generated 
>>>> like any other logic signal, he could use this signal to gate the 
>>>> write signal to the SRAM such that the write signal will be 
>>>> de-asserted well before the other control signals (clocked by 
>>>> clk_BUFG) change, thus avoiding the need to have a state machine 
>>>> controlling the write signal.  The price for this is that the clock 
>>>> signal is now generated in a less "pure" way, but still a valid way 
>>>> as long as you know what you are doing. The BUFG clock driver can 
>>>> be driven from a PLL, a DCM or from the logic fabric.  The first 
>>>> two are speed optimized paths going directly from the PLL or DCM to 
>>>> the BUFG and can be clocked at a much higher rate, while the 
>>>> logic fabric path is limited by the maximum clock rate of the logic 
>>>> fabric.  However, at the clock rate we use (25 MHz) this is not an 
>>>> issue.  When you do this there are no warnings generated by ISE 
>>>> that this is not a good idea, and I have not read anywhere in the 
>>>> Xilinx clocking resource guide that you should avoid doing this. 
>>>> Basically, the BUFG clock driver is designed to do this, the tool 
>>>> will allow you to do it and at the clock rate we use it has no 
>>>> performance implications.  As I see it, this is another place where 
>>>> the goal of simplification has driven the implementation of the 
>>>> system at the expense of a slightly less "pure" clock generation.
>>>>
>>>> Just my 2c
>>>>
>>>> Magnus