The clock is the most important and most special signal in the whole circuit. Most devices in the system act on the transition edges of the clock, which requires the delay variation of the clock signal to be very small; otherwise the state of the sequential logic may be wrong. It is therefore important to understand which factors determine the system clock in an FPGA design and to minimize the clock delay so that the design is stable.
1.1 Setup time and hold time
Setup time (Tsu) is the time for which the data must be stable before the clock edge arrives. If the setup time is not met, the data cannot be reliably clocked into the flip-flop on the rising edge of the clock. Hold time (Th) is the time for which the data must remain stable after the clock edge. If the hold time is not met, the data likewise cannot be reliably clocked into the flip-flop. Setup and hold time are illustrated in Figure 1.
Figure 1 Setup time and hold time
A module in an FPGA design usually contains both combinational logic and sequential logic. To ensure that data at the boundaries between these kinds of logic is handled reliably, it is important to have a clear understanding of setup time and hold time. The following question explores the two concepts.
Figure 2 A basic model of a synchronous design
Figure 2 shows a basic model of a synchronous design driven by a single clock. In the figure, Tco is the clock-to-output delay of the flip-flop, Tdelay is the combinational logic delay, Tsetup is the setup time of the flip-flop, and Tpd is the clock delay. Suppose the maximum setup time of the first flip-flop D1 is T1max and its minimum is T1min, and the maximum delay of the combinational logic is T2max and its minimum is T2min. What conditions must the setup time T3 and the hold time T4 of the second flip-flop D2 satisfy? Equivalently, given T3 and T4, what is the shortest clock period (the highest clock frequency) that can be allowed? This question must be considered in every design. Only by answering it can we be sure that the delay of the combinational logic meets the requirements.
The analysis below uses timing diagrams. Let the input of the first flip-flop be D1 and its output Q1, and the input of the second flip-flop be D2 and its output Q2.
Both flip-flops sample on the rising edge of the clock. For ease of analysis we consider two cases. In the first case the clock delay Tpd is assumed to be zero. This is the usual situation in FPGA design, where a single system clock is generally fed in through a global clock pin, so the internal clock delay is negligible. In this case the hold time does not need to be considered: every data value is held for a full clock cycle and also incurs routing delay, so the clock delay is far smaller than the data delay and the hold time is met automatically. The focus is on the setup time. If the setup time of D2 is met, the timing is as shown in Figure 3.
From the figure we can see that if:
T - Tco - Tdelay > T3
that is: Tdelay < T - Tco - T3
then the setup time requirement is met, where T is the clock period. In this case the second flip-flop can reliably capture D2 on the second rising edge of the clock, as shown in Figure 3.
Figure 3 Timing Diagram
If the delay of the combinational logic is too large, so that
T - Tco - Tdelay < T3
the setup time of the second flip-flop is not met. On the second rising edge of the clock it captures an indeterminate state, as shown in Figure 4, and the circuit does not work correctly.
Figure 4 Combinational logic delay too large, setup time not met
From this we can derive
T - Tco - T2max >= T3
which is the setup time requirement for D2.
The timing diagrams also show that the setup and hold times of D2 are independent of the setup and hold times of D1. They depend only on the combinational logic in front of D2 and on the clock-to-output delay of D1. This is an important conclusion: the delays do not accumulate from stage to stage.
In the second case the clock has a delay, and both the hold time and the setup time must be considered. Clock delay typically arises when an asynchronous clocking scheme is used. Such a scheme makes data synchronization hard to guarantee, so it is rarely used in real designs. In this case, if both the setup time and the hold time are met, the output timing is as shown in Figure 5.
Figure 5 Clock has a delay but timing is met
Figure 5 shows that the clock delay Tpd relaxes the setup time, so the setup time of D2 must satisfy:
Tpd + T - Tco - T2max >= T3
Because the setup margin and the hold margin together add up to one fixed clock period, if the clock is delayed and the data delay is small, the setup margin grows and the hold margin shrinks accordingly. If it shrinks until the hold time requirement of D2 is no longer met, the correct data cannot be captured, as shown in Figure 6. That is, when
T - (Tpd + T - Tco - T2min) < T4
the requirement is violated. The hold time of D2 must therefore satisfy:
T - (Tpd + T - Tco - T2min) >= T4, that is, Tco + T2min - Tpd >= T4
From this formula we can see that if Tpd = 0, that is, the clock delay is zero, Tco + T2min >= T4 is still required. In practice, however, T2 (the routing delay) is much larger than the flip-flop hold time T4, so this relation holds without special attention.
Figure 6 Clock has a delay and the hold time is not met
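As a worked illustration of the two constraints for the delayed-clock case, the check below plugs in numbers that are purely assumed for this example (they are not taken from the article or from any datasheet): T = 10 ns, Tco = 1 ns, T2max = 6 ns, T2min = 0.5 ns, T3 = 0.5 ns, T4 = 0.3 ns, and a clock delay Tpd = 1.5 ns.

```latex
\begin{align*}
\text{Setup:}\quad & T_{pd} + T - T_{co} - T_{2max} = 1.5 + 10 - 1 - 6 = 4.5\ \text{ns} \ \ge\ T_3 = 0.5\ \text{ns} \quad \text{(met)} \\
\text{Hold:}\quad & T_{co} + T_{2min} - T_{pd} = 1 + 0.5 - 1.5 = 0\ \text{ns} \ <\ T_4 = 0.3\ \text{ns} \quad \text{(violated)} \\
\text{Largest tolerable clock delay:}\quad & T_{pd} \le T_{co} + T_{2min} - T_4 = 1 + 0.5 - 0.3 = 1.2\ \text{ns}
\end{align*}
```

With these assumed values the clock delay helps the setup check but breaks the hold check, which is the situation of Figure 6.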
To sum up: if the clock delay can be ignored, only the setup time needs attention; if the clock delay must be considered, the hold time needs more attention. Next we look at how to raise the clock frequency of a synchronous system in an FPGA design.
1.2 How to raise the operating clock frequency of a synchronous system
From the analysis above, the setup time requirement T3 for D2 in a synchronous system is:
T - Tco - T2max >= T3
It follows that T >= T3 + Tco + T2max, where T3 is the setup time Tsetup of D2 and T2 is the delay of the combinational logic. In a design, T3 and Tco are fixed values determined by the device, and only T2 is under the designer's control. T2 should therefore be made as small as possible to raise the system clock frequency. The following methods can be used to reduce T2.
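To make the relation T >= T3 + Tco + T2max concrete, here is a small worked example. The numbers are assumptions chosen only for illustration (T3 = 0.5 ns, Tco = 1 ns, and T2max first 6 ns, then 3.5 ns); real values come from the device datasheet and the timing report.

```latex
\begin{align*}
T_{min} &= T_3 + T_{co} + T_{2max} = 0.5 + 1 + 6 = 7.5\ \text{ns}
  &&\Rightarrow\ f_{max} = 1 / 7.5\ \text{ns} \approx 133\ \text{MHz} \\
T_{min} &= 0.5 + 1 + 3.5 = 5\ \text{ns}
  &&\Rightarrow\ f_{max} = 1 / 5\ \text{ns} = 200\ \text{MHz} \quad (T_{2max}\ \text{reduced to}\ 3.5\ \text{ns})
\end{align*}
```

Under these assumptions, shortening only the combinational delay raises the achievable clock frequency from about 133 MHz to 200 MHz.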
1.2.1 Reduce delay by changing the routing
Taking Altera devices as an example, the Timing Closure Floorplan view in Quartus shows many blocks, which can be divided by row and by column. Each block represents one LAB, and each LAB contains 8 or 10 LEs. Their routing delays rank as follows: within the same LAB (fastest) < same column or same row < different row and different column. By giving the synthesizer suitable constraints, related logic can be placed closer together during place and route, which reduces the routing delay. The constraints must be reasonable: a margin of about 5% is generally appropriate. For example, if the circuit is to run at 100 MHz, constraining it to about 105 MHz (100 MHz plus the 5% margin) is enough. Over-constraining gives poor results and greatly increases the overall compile time.
1.2.2 Reduce delay by splitting the combinational logic
A synchronous circuit generally has more than one register stage (see Figure 8). For the circuit to work reliably, the clock period must accommodate the longest delay, so shortening the longest delay path raises the operating frequency of the circuit. As shown in Figure 7, a large block of combinational logic can be split into smaller blocks with a flip-flop inserted between them, which raises the operating frequency. This is the basic principle of pipelining.
In the upper part of Figure 8, the clock frequency is limited by the delay of the second, larger block of combinational logic. Distributing the combinational logic evenly between the stages avoids an excessive delay between any two flip-flops and removes the speed bottleneck.
Figure 7 Splitting combinational logic
Figure 8 Moving combinational logic between stages
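The idea of Figure 7 can be sketched in Verilog as follows. This is a minimal illustration, not the article's own circuit: the module names sum4_flat and sum4_pipelined, the port names, and the widths are all hypothetical, and Verilog-2001 port syntax is assumed. The first module places two levels of adders between register stages; the second inserts a register after the first adder level, so each register-to-register path contains only one adder.

```verilog
// Unpipelined: two adder levels sit between the input registers
// (assumed to exist upstream) and the output register.
module sum4_flat (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [17:0] sum
);
    always @(posedge clk)
        sum <= (a + b) + (c + d);
endmodule

// Pipelined: a register stage is inserted after the first adder level.
module sum4_pipelined (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [17:0] sum
);
    reg [16:0] ab, cd;         // intermediate pipeline registers

    always @(posedge clk) begin
        ab  <= a + b;          // stage 1: two partial sums in parallel
        cd  <= c + d;
        sum <= ab + cd;        // stage 2: final sum of the registered partials
    end
endmodule
```

The pipelined version produces the same result one clock cycle later; the extra latency is the price paid for the shorter critical path.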
How should the combinational logic be split in a design? Good solutions come with practice, but a few sound design ideas are worth mastering. The FPGAs considered here are mostly based on 4-input LUTs. If an output depends on more than four inputs, several LUTs have to be cascaded, which adds another level of combinational delay. To reduce the combinational logic, keep the number of input conditions as small as possible so that fewer LUTs are cascaded, which reduces the delay caused by the combinational logic.
Pipelining, as the term is commonly used, means cutting a large block of combinational logic by inserting one or more levels of D flip-flops, so that there is less combinational logic between registers and the operating frequency rises. For example, a 32-bit counter has a long carry chain, which inevitably lowers the operating frequency. It can be split into smaller counters, for example a 4-bit counter and an 8-bit counter, with the 8-bit counter advanced each time the 4-bit counter reaches 15. Cutting the counter in this way raises the operating frequency.
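A minimal sketch of the split counter described above, assuming a 12-bit total width (a 4-bit low stage plus an 8-bit high stage) and hypothetical names (split_counter, lo, hi, lo_wrap). The carry from the low stage is decoded one cycle early and registered, so the high stage sees a single registered enable bit instead of a long carry chain.

```verilog
module split_counter (
    input  wire        clk,
    input  wire        rst,      // synchronous, active high
    output wire [11:0] count
);
    reg [3:0] lo;        // low 4-bit stage
    reg [7:0] hi;        // high 8-bit stage
    reg       lo_wrap;   // registered flag: high during the cycle in which lo == 15

    always @(posedge clk) begin
        if (rst) begin
            lo      <= 4'd0;
            hi      <= 8'd0;
            lo_wrap <= 1'b0;
        end else begin
            lo      <= lo + 4'd1;        // wraps from 15 to 0
            lo_wrap <= (lo == 4'd14);    // decode early so the flag lines up with lo == 15
            if (lo_wrap)
                hi <= hi + 8'd1;         // advance the high stage as lo wraps to 0
        end
    end

    assign count = {hi, lo};
endmodule
```

Because lo_wrap is asserted exactly while lo is 15, the high stage increments on the same edge on which the low stage returns to 0, so count behaves like an ordinary 12-bit counter.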
In a state machine, large counters should likewise be moved out of the state machine. A counter usually has more than four bits, and if it is used together with other conditions as a state transition condition it inevitably increases LUT cascading and therefore the combinational logic. Take a 6-bit counter as an example. Originally the state was to change when the counter reached 111100. With the counter moved outside the state machine, an enable signal is generated when the counter reaches 111011, and this enable triggers the state transition, which reduces the combinational logic.
A state machine generally consists of three parts: the output logic, the logic that determines the next state, and the register that stores the current state. The three parts are built from different kinds of logic. The output logic usually contains both combinational and sequential logic, the next-state logic is combinational, and the current state is usually held in sequential logic. The relationship between the three parts is shown in Figure 9.
Figure 9 Composition of a State Machine
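Returning to the 6-bit counter example above, the following sketch shows one way to move the counter out of the state machine. The module name, the start and busy ports, and the two-state machine are hypothetical and exist only to make the example self-contained. The enable cnt_done is registered when the counter equals 111011, so it is high during the cycle in which the counter equals 111100, and the state machine tests a single bit instead of six.

```verilog
module fsm_with_external_counter (
    input  wire clk,
    input  wire rst,     // synchronous, active high
    input  wire start,   // hypothetical request to enter the RUN state
    output reg  busy     // hypothetical status output
);
    localparam IDLE = 1'b0,
               RUN  = 1'b1;

    reg       state;
    reg [5:0] cnt;
    reg       cnt_done;  // registered enable, high while cnt == 6'b111100

    // Counter kept outside the state machine
    always @(posedge clk) begin
        if (rst || state != RUN) begin
            cnt      <= 6'd0;
            cnt_done <= 1'b0;
        end else begin
            cnt      <= cnt + 6'd1;
            cnt_done <= (cnt == 6'b111011);  // decode one count early
        end
    end

    // State machine: the transition depends on one registered bit
    always @(posedge clk) begin
        if (rst) begin
            state <= IDLE;
            busy  <= 1'b0;
        end else begin
            case (state)
                IDLE: if (start) begin
                          state <= RUN;
                          busy  <= 1'b1;
                      end
                RUN:  if (cnt_done) begin
                          state <= IDLE;
                          busy  <= 1'b0;
                      end
                default: state <= IDLE;
            endcase
        end
    end
endmodule
```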
A state machine is generally written in three sections corresponding to these three parts. The following is a good way to write a state machine:
/*-----------------------------------------------------
 This is an FSM demo program
 Design name: arbiter
 File name:   arbiter2.v
-----------------------------------------------------*/
module arbiter2 (
    clock,   // clock
    reset,   // active high, synchronous reset
    req_0,   // request 0
    req_1,   // request 1
    gnt_0,   // grant 0
    gnt_1    // grant 1
);
// ------------- Input ports -----------------------------
input clock;
input reset;
input req_0;
input req_1;
// ------------- Output ports ----------------------------
output gnt_0;
output gnt_1;
// ------------- Input ports data type -------------------
wire clock;
wire reset;
wire req_0;
wire req_1;
// ------------- Output ports data type ------------------
reg gnt_0;
reg gnt_1;
// ------------- Internal constants ----------------------
parameter SIZE = 3;
parameter IDLE = 3'b001,   // one-hot state encoding
          GNT0 = 3'b010,
          GNT1 = 3'b100;
// ------------- Internal variables ----------------------
reg  [SIZE-1:0] state;       // sequential part of the FSM
wire [SIZE-1:0] next_state;  // combinational part of the FSM
// ---------- Code starts here ---------------------------
// Next-state logic. The current state is passed in as an argument so that
// next_state is re-evaluated whenever state changes.
assign next_state = fsm_function(state, req_0, req_1);
function [SIZE-1:0] fsm_function;
    input [SIZE-1:0] state;
    input req_0;
    input req_1;
    case (state)
        IDLE: if (req_0 == 1'b1)
                  fsm_function = GNT0;
              else if (req_1 == 1'b1)
                  fsm_function = GNT1;
              else
                  fsm_function = IDLE;
        GNT0: if (req_0 == 1'b1)
                  fsm_function = GNT0;
              else
                  fsm_function = IDLE;
        GNT1: if (req_1 == 1'b1)
                  fsm_function = GNT1;
              else
                  fsm_function = IDLE;
        default: fsm_function = IDLE;
    endcase
endfunction
// ---------- State register -----------------------------
always @(posedge clock)
begin
    if (reset == 1'b1)
        state <= IDLE;
    else
        state <= next_state;
end
// ---------- Output logic -------------------------------
always @(posedge clock)
begin
    if (reset == 1'b1) begin
        gnt_0 <= #1 1'b0;
        gnt_1 <= #1 1'b0;
    end
    else begin
        case (state)
            IDLE: begin
                gnt_0 <= #1 1'b0;
                gnt_1 <= #1 1'b0;
            end
            GNT0: begin
                gnt_0 <= #1 1'b1;
                gnt_1 <= #1 1'b0;
            end
            GNT1: begin
                gnt_0 <= #1 1'b0;
                gnt_1 <= #1 1'b1;
            end
            default: begin
                gnt_0 <= #1 1'b0;
                gnt_1 <= #1 1'b0;
            end
        endcase
    end
end // end of output logic block
endmodule
A state machine is usually written in three sections like this, which avoids one large block of combinational logic.
Everything above covers cases in which the combinational logic can be cut by pipelining. In some cases, though, the combinational logic is hard to cut. What can be done then?
A state machine is one such case: pipeline registers cannot be inserted into the state decoding logic. If a design contains a state machine with dozens of states, its state decoding logic is huge and is very likely to be the critical path of the design. What can be done? The idea is the same as before: reduce the combinational logic. Analyze the outputs of the states, regroup them, and redefine them as a set of small state machines; then decode the input (with a case statement) and trigger the corresponding small state machine. In this way one large state machine is implemented as several small ones. In the ATA-6 specification (a hard disk standard) there are about 20 input commands, and each command corresponds to many states. Implementing this as one large state machine (with states nested inside states) would be unthinkable. Instead, a case statement can decode the command and trigger the corresponding state machine, and the module can then run at a fairly high frequency.
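The decomposition described above might be sketched as follows. Everything in this example is hypothetical (the module name cmd_dispatch, the two command codes, the busy outputs, and the number of states per command); it is not derived from the ATA-6 specification and only shows the structure: one case statement decodes the command into start pulses, and each command has its own small state machine.

```verilog
module cmd_dispatch (
    input  wire       clk,
    input  wire       rst,        // synchronous, active high
    input  wire       cmd_valid,  // one-cycle strobe: cmd holds a new command
    input  wire [1:0] cmd,        // hypothetical command code
    output reg        rd_busy,
    output reg        wr_busy
);
    // Hypothetical command encodings, chosen only for this sketch
    localparam CMD_READ  = 2'd1,
               CMD_WRITE = 2'd2;

    reg start_rd, start_wr;

    // Command decode: one case statement picks the small FSM to start
    always @(posedge clk) begin
        start_rd <= 1'b0;
        start_wr <= 1'b0;
        if (!rst && cmd_valid) begin
            case (cmd)
                CMD_READ:  start_rd <= 1'b1;
                CMD_WRITE: start_wr <= 1'b1;
                default: ;  // unknown command: start nothing
            endcase
        end
    end

    // Small FSM for the read command (three states)
    localparam R_IDLE = 2'd0, R_STEP1 = 2'd1, R_STEP2 = 2'd2;
    reg [1:0] rd_state;
    always @(posedge clk) begin
        if (rst) begin
            rd_state <= R_IDLE;
            rd_busy  <= 1'b0;
        end else begin
            case (rd_state)
                R_IDLE:  if (start_rd) begin
                             rd_state <= R_STEP1;
                             rd_busy  <= 1'b1;
                         end
                R_STEP1: rd_state <= R_STEP2;
                R_STEP2: begin
                             rd_state <= R_IDLE;
                             rd_busy  <= 1'b0;
                         end
                default: rd_state <= R_IDLE;
            endcase
        end
    end

    // Small FSM for the write command (two states)
    localparam W_IDLE = 1'b0, W_STEP1 = 1'b1;
    reg wr_state;
    always @(posedge clk) begin
        if (rst) begin
            wr_state <= W_IDLE;
            wr_busy  <= 1'b0;
        end else begin
            case (wr_state)
                W_IDLE:  if (start_wr) begin
                             wr_state <= W_STEP1;
                             wr_busy  <= 1'b1;
                         end
                W_STEP1: begin
                             wr_state <= W_IDLE;
                             wr_busy  <= 1'b0;
                         end
                default: wr_state <= W_IDLE;
            endcase
        end
    end
endmodule
```

Each small state machine's next-state logic depends only on its own few state bits and one registered start bit, rather than on the full command and one large state vector.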
Conclusion: the essence of raising the operating frequency is to reduce the register-to-register delay. The most effective way is to avoid large blocks of combinational logic, that is, to keep each condition within four inputs as far as possible and so reduce the number of cascaded LUTs. The operating frequency can be raised by adding constraints, pipelining, and splitting state machines.
Pay attention to the following points when designing clocks in an FPGA:
1. Try to use only one clock per module, where a module means a Verilog module or a VHDL entity. In a multi-clock design with clock domain crossings, it is best to dedicate a module to isolating the clock domains. This lets the synthesizer produce better results.
2. Do not use gated clocks unless the design is a low-power one, because gating makes the design less robust. Where a gated clock is really needed, the gating signal should first be registered on a clock edge (normally the falling edge, so that it changes only while the clock is low) and only then ANDed with the clock.
3. Do not use a signal produced by dividing the clock with a counter as the clock of other modules; use a clock enable instead. Otherwise clocks end up scattered all over the design, which hurts reliability and greatly complicates static timing analysis.
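Point 3 can be illustrated with a short sketch. The names (clock_enable_example, div_cnt, ce, din, dout) and the divide-by-4 ratio are assumptions made for the example. Instead of clocking the data register from a divided clock, every flip-flop stays on clk and the slow register is updated only when the enable pulse ce is high.

```verilog
module clock_enable_example (
    input  wire       clk,
    input  wire       rst,     // synchronous, active high
    input  wire [7:0] din,
    output reg  [7:0] dout
);
    reg [1:0] div_cnt;
    reg       ce;      // clock enable: high for one clk cycle out of every four

    // Generate the enable pulse from a small counter
    always @(posedge clk) begin
        if (rst) begin
            div_cnt <= 2'd0;
            ce      <= 1'b0;
        end else begin
            div_cnt <= div_cnt + 2'd1;
            ce      <= (div_cnt == 2'd3);
        end
    end

    // The "slow" register still uses clk; it simply updates only when ce is high
    always @(posedge clk) begin
        if (rst)
            dout <= 8'd0;
        else if (ce)
            dout <= din;
    end
endmodule
```

Because the whole design remains in a single clock domain, static timing analysis sees only one clock.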
1.4 Synchronization between different clock domains
When two modules in a design run on two different clocks, their interface operates asynchronously, and the two modules must be synchronized to ensure that data is handled correctly.
Different clock domains usually have the following two situations:
1. The two clocks have different frequencies;
2. The two clocks have the same frequency, but they are two independent clocks whose phases are unrelated.
The two situations are illustrated in Figures 10 and 11:
Figure 10 The two clocks have completely different frequencies
Figure 11 The two clocks have the same frequency but unrelated phases
Data crossing between two clock domains is synchronized by different methods depending on its bit width.
1. Single-bit synchronization, where every pulse sent is at least one clock cycle wide
This kind of synchronization is mainly used for control signals. The usual method is to register the signal twice with the receiving module's clock, using two flip-flops in the receiving module, as shown in Figure 12. Note the following points about this kind of synchronizer.
Figure 12 A synchronizer design
(1) The synchronizer in Figure 12 is what is called a "one-bit synchronizer". It can only synchronize a single asynchronous signal, and the signal must be wider than one period of the receiving clock; otherwise the asynchronous signal may not be captured at all.
(2) Why can the synchronizer in Figure 12 only be used for a single asynchronous signal? (a) When two or more asynchronous signals (control or address) enter the local clock domain to control its circuits, and each is synchronized separately with the circuit of Figure 12, routing or other delays may introduce skew between them. After this skew passes through the synchronizers of Figure 12 into the local clock domain, it can turn into a large skew or a race, causing errors in the local circuits.
As shown in Figure 13:
Figure 13 Error when synchronizing multiple control signals
(b) If an asynchronous data bus needs to enter the local clock domain, the circuit of Figure 12 cannot be used either. The data changes at arbitrary times, and the widths of its 0 and 1 periods bear no relation to the local clock period, so the circuit of Figure 12 may not capture the correct data.
(3) Note that the second flip-flop does not prevent metastability from occurring. What this circuit does is stop metastability from propagating: if the first flip-flop goes metastable (which is always possible), the metastable state is not passed on to the circuits after the second flip-flop.
(4) When the first flip-flop goes metastable, it needs a recovery time to settle and leave the metastable state. As long as this recovery time plus the setup time of the second flip-flop (more precisely, minus the clock skew) is less than or equal to the clock period, the second flip-flop samples a stable, definite value and metastability does not propagate. This condition is easy to meet: the two flip-flops should be placed as close together as possible, with no combinational logic between them and little clock skew.
(5) FF2 samples the output of FF1, so whatever FF1 outputs, FF2 outputs one clock period later. Note: metastability is so called because once FF1 enters it, the level to which its output settles is unpredictable and may be right or wrong. So although this method stops metastability from propagating, it does not guarantee that the data after the two flip-flops is correct. Such a circuit therefore produces a certain number of wrong levels and is suitable only for the few places that are not sensitive to errors. For error-sensitive circuits, a dual-port RAM or a FIFO can be used.
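The two-flip-flop synchronizer discussed in points (1) to (5) can be sketched as follows. The module and port names are hypothetical. No reset is included, so in simulation the output is unknown for the first two clock cycles.

```verilog
module sync_2ff (
    input  wire clk,       // receiving-domain clock
    input  wire async_in,  // single-bit signal from the other clock domain
    output wire sync_out   // async_in synchronized to clk
);
    reg ff1, ff2;

    always @(posedge clk) begin
        ff1 <= async_in;   // first stage: may go metastable
        ff2 <= ff1;        // second stage: samples ff1 after it has had a cycle to settle
    end

    assign sync_out = ff2;
endmodule
```

There is deliberately no logic between ff1 and ff2, which leaves the first stage the whole clock period (less the setup time of ff2) to recover.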
2. Synchronizer for input pulses that may be narrower than one clock period
For case 2, the feedback circuit of Figure 14 is normally used. It works as follows. When the input is high, the clear of the first flip-flop FF1 (an active-high clear) is not asserted, so the output also goes high, which is correct. When the input is low, FF1 is forcibly cleared and the output becomes zero. This ensures that the output is correct.
Figure 14 Synchronizer for input pulses that may be narrower than one clock period
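Since Figure 14 is not reproduced here, the following is only a sketch of one common feedback circuit of this kind, and it is an assumption that it matches the figure. All names are hypothetical. FF1 is clocked by the input pulse itself and captures a constant 1, FF2 and FF3 synchronize that level to clk, and FF1 is cleared once the output has gone high and the input is low again.

```verilog
module pulse_catcher (
    input  wire clk,       // receiving-domain clock
    input  wire rst,       // active high; clears the whole circuit
    input  wire async_in,  // pulse that may be narrower than one clk period
    output wire sync_out
);
    reg ff1, ff2, ff3;

    // Feedback clear: asserted after the pulse has reached the output
    // and the input has returned low (or during reset)
    wire clr = rst | (~async_in & ff3);

    // FF1 is clocked by the input pulse and captures a constant 1
    always @(posedge async_in or posedge clr) begin
        if (clr)
            ff1 <= 1'b0;
        else
            ff1 <= 1'b1;
    end

    // FF2 and FF3 form an ordinary two-flip-flop synchronizer
    always @(posedge clk) begin
        if (rst) begin
            ff2 <= 1'b0;
            ff3 <= 1'b0;
        end else begin
            ff2 <= ff1;
            ff3 <= ff2;
        end
    end

    assign sync_out = ff3;
endmodule
```

In this sketch a narrow input pulse is stretched by FF1 until it has been seen by the clk domain, so sync_out goes high for about two clk cycles; while async_in stays high, sync_out stays high as well.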
For details on how to handle multiple control signals, see the detailed analysis in "Synthesis Design Techniques for Multi-Clock Systems" on www.fpga.com.cn.