Introduction to Digital IC II (Simple algorithm architecture)

Source: Internet
Author: User
Tags: arithmetic

The core idea of this section is: first have a clear circuit framework in your head, then express it concisely in Verilog. Because digital circuits are so stable, circuit design can be carried out in a form that looks like software design, but it is essentially different from software design. The book "Verilog HDL Advanced Digital Design" has an example in its part on algorithms and architectures for digital signal processing: a halftone pixel image processor. I strongly recommend this book to everyone. If you have read it but not thoroughly, I hope you will read it again and again. It is the best digital design textbook I have seen so far, although I have not read that many.

The part on algorithms and architectures for digital signal processing is especially important, because it is the bridge to a higher level (algorithm and architecture). Do not be frightened by the name "halftone image processor" and think you must first understand advanced image processing concepts. You do not need to care what a pixel is. What matters is how the author translates an algorithm described with nested loops into circuits after trading off performance against area.

If you have not read the book, my talk of halftones and algorithm architectures probably sounds confusing. Do not worry: this paragraph only wants to express how important the book is. Below I use a simple example to explain the goal to reach and the material I recommend for it.

Before and while writing Verilog code (design code, not verification code), and even after simulation or synthesis, a digital designer should always have a clear circuit framework in mind. To make this statement concrete, I have deliberately simplified the halftone pixel image processor example by 70% or 80% and describe it below, so that we can look at what is key in digital design. As we analyze it, you will come to see that digital design is a tradeoff between performance, area, and power consumption, so you have to be clear about what your goal is. Along the way I will recommend books that can help you get there faster.

First, the algorithm designer gives us the following code (written in C), and we take it from there as digital designers:
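A minimal C sketch of such a nested loop, reconstructed from the o12 example discussed later (o12 = 1*o11 + 2*i01 + 3*i02). The array name K follows the text; the function name fill_k, the constant N, the long element type, and the row/column layout are assumptions for illustration, not the original listing:

```c
#include <assert.h>
#include <stddef.h>

#define N 3  /* the outputs form an N x N array; the example uses 3 (assumed name) */

/*
 * K has (N+1) x (N+1) entries. Row 0 and column 0 hold the inputs
 * (i00, i01, ..., i10, ...); rows and columns 1..N receive the outputs.
 * The weights 1, 2, 3 are taken from the o12 example in the text.
 */
void fill_k(long K[N + 1][N + 1])
{
    assert(K != NULL);
    for (int r = 1; r <= N; r++) {
        for (int c = 1; c <= N; c++) {
            K[r][c] = 1 * K[r][c - 1]       /* left neighbour       */
                    + 2 * K[r - 1][c - 1]   /* upper-left neighbour */
                    + 3 * K[r - 1][c];      /* upper neighbour      */
        }
    }
}
```

With this layout, K[0][1] plays the role of i01 and K[1][2] the role of o12.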

  

All of the variables in the algorithm above are named K, but we can distinguish inputs from outputs and name them i and o respectively, as follows:

Note that names starting with i are module inputs and names starting with o are module outputs. Whether a module's output feeds the next module directly or is registered first does not concern us here; it is enough that the outputs carry the same values the algorithm computes. Behavioral synthesis deserves a mention: it can turn a behavioral description like the one above into RTL code automatically, but the problem, as I see it, is that there is no mature tool for it. So when we get this piece of code, we unroll its for loops and draw the following directed acyclic graph:

  

In what follows I use plain language as much as possible. This is only a point of view, and being too formal would defeat the purpose. Look at the figure above: it is derived from the nested for loop, with inputs in yellow and outputs in orange. Following the loop indices, the nodes are arranged in rows and columns, and each i or o is followed by its row and column number. For example, o22 is an output node in the second row and second column: the first digit is the row, the second is the column. Then what do the numbers 1, 2, and 3 on the lines mean? They are the coefficients of the multiplication terms in the for loop. If you carefully compare the for loop with the graph and want to confirm that the two are equivalent, you may still need me to explain what an orange node really is. So let me expand one orange node to show what it means, taking o12 as an example:

The small blue circles represent multipliers. The number on each arrowed line (the proper term is a directed edge) is the factor by which the signal is multiplied as it passes along that line. For example, if i01 is 3 and travels along an edge labeled 2, it arrives as 6. Three arrows in the figure point at o12: i01*2, i02*3, and o11*1 all flow into the o12 node. The node operates on these three signals by using two adders to sum them and produce the output o12.

So o12 can be read in two ways. First, as a node it is just an operation: adding the three incoming signals two at a time. Second, it is the signal o12, a variable that serves as an input to the next level of operations. By default, o12 refers to the signal, that is, the second meaning. By now you have seen how to derive a directed acyclic graph from a nested for loop. I copy the graph again below, and we start our discussion from it:
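The o12 node described above can be sketched in C as three weighted edges feeding two adders. The function names edge, node_o12, and check_edge_example and the long type are assumptions for illustration:

```c
#include <assert.h>

/* A directed edge: the signal is multiplied by the edge's weight. */
long edge(long signal, long weight)
{
    return signal * weight;
}

/* The o12 node: three weighted edges summed by two adders. */
long node_o12(long o11, long i01, long i02)
{
    long first_sum = edge(o11, 1) + edge(i01, 2);  /* adder 1 */
    return first_sum + edge(i02, 3);               /* adder 2 */
}

/* The example from the text: 3 travelling along a weight-2 edge gives 6. */
void check_edge_example(void)
{
    assert(edge(3, 2) == 6);
}
```

Every other orange node has the same shape; only the three incoming signals change.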

Given the directed acyclic graph above, how do we design the circuit? You can stop here and think before reading on. Do not assume a directed acyclic graph is something abstruse; it is just a signal flow graph.

Perhaps you have already thought of it: since each edge multiplies by 1, 2, or 3, use one multiplier per edge; since each orange node adds three numbers, replace each orange node with two adders. Implementing the directed acyclic graph directly in circuitry is then done. I have drawn the corresponding hardware diagram below. Is it what you had in mind?

The yellow parts are the primary inputs and each light blue part is an output. This circuit is purely combinational logic, and its advantage is easy to see: it is the fastest of the designs discussed here. Given the inputs, the circuit starts computing and, after a certain propagation delay, the results appear at the outputs, with no division into clock cycles. Following the data dependencies, computation proceeds from the upper left corner toward the lower right corner (do not worry about data dependencies for now; the later analysis covers them).

Now you have this hardware diagram in mind. Once the hardware diagram has formed in your head, writing Verilog just means "saying" that picture out loud in the Verilog language. Saying it to whom? To the verification engineer and to the synthesis tool. Following the framework above, we write the following code:

Then instantiate it 9 times at the top level, and the 9 outputs follow from the given inputs. This is a small example; unfortunately, the for loops in real designs are quite complex and the bit widths are often large. Suppose the matrix in this example were 100*100. This design would indeed be the fastest, but count the hardware it needs. Assuming N-bit inputs, each node needs two adders (at least 2N bits wide) and three multipliers (2N bits wide), so a 100*100 array needs 20,000 2N-bit adders and 30,000 2N-bit multipliers. A chip like that would simply lose money. The performance is excellent, but the area is beyond what anyone can afford, to the point where the design is not feasible at all.
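The resource count above is simple arithmetic, which can be sketched as follows. The struct and function names are assumptions for illustration, and the count covers only the adders and multipliers of the fully combinational version:

```c
#include <assert.h>

struct resources {
    long adders;
    long multipliers;
};

/* Fully combinational version: one node (2 adders, 3 multipliers) per output. */
struct resources combinational_cost(long rows, long cols)
{
    assert(rows > 0 && cols > 0);
    struct resources r;
    r.adders = 2 * rows * cols;
    r.multipliers = 3 * rows * cols;
    return r;
}
```

For the 3*3 example this gives 18 adders and 27 multipliers; for 100*100 it gives the 20,000 and 30,000 quoted above.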

At this point you may think: since the area is the problem, let me reduce the area to the extreme. You cleverly notice that every node in the 3*3 array performs the same operation, so why not use a single such unit and let a controller schedule one node's computation onto it in each clock cycle? All 9 nodes of the array (the light blue parts of the diagram) have the same form; only the signals differ. Stand at any node and you will find that all you need to do is multiply the values from your left, upper left, and top by their respective coefficients and add the products. How the controller does its job may be hard to picture; a later article introduces controllers specifically, so do not worry. Let us assume the controller can do this. Do not worry either that the controller itself will be large or slow: it will not, because controller logic is generally much smaller than the data path.

Imagine that we now have only one computing unit of the kind shown (2 adders and 3 multipliers).

For example:

First cycle: calculate o11 from i00, i01, i10

Second cycle: calculate o12 from o11, i01, i02

... ...

Last cycle: calculate o33 from o32, o22, o23
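The one-node-per-cycle schedule above can be modelled in software. In this minimal sketch each loop iteration stands for one clock cycle, the array K stands for the memory, and the loop itself plays the role of the controller. The names unit, run_single_unit, and N, the weights 1, 2, 3 (from the o12 example), and the row-by-row order are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define N 3  /* outputs form an N x N array; row 0 and column 0 of K are inputs */

/* The single shared computing unit: 3 multipliers and 2 adders. */
static long unit(long left, long upleft, long up)
{
    return 1 * left + 2 * upleft + 3 * up;
}

/*
 * Each loop iteration models one clock cycle: the unit computes exactly
 * one node and the result is written back to the memory K.
 * Returns the number of modelled cycles.
 */
int run_single_unit(long K[N + 1][N + 1])
{
    assert(K != NULL);
    int cycles = 0;
    for (int r = 1; r <= N; r++) {
        for (int c = 1; c <= N; c++) {
            K[r][c] = unit(K[r][c - 1], K[r - 1][c - 1], K[r - 1][c]);
            cycles++;
        }
    }
    return cycles;  /* always N * N */
}
```

For N = 3 the function reports 9 cycles. The cycle count grows with the number of nodes while the data path stays at one unit.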

This version has a small area, arguably the minimum. No matter what the loop bounds are, that is, no matter how large the array is, only the controller and the memory grow. Adding one storage cell to a memory does not increase area much: even a six-transistor SRAM cell costs nowhere near as much as 2 adders and 3 multipliers. So you end up with the smallest design. But if the array is 100*100 and someone asks about performance, your answer is 10,000 cycles, and I doubt any customer will be satisfied. The smallest area comes with a severe performance penalty.

So next we look for the balance point between performance and area, that is, a compromise design. A compromise in digital circuits does not involve many variables, unlike analog design, where the design "octagon" has eight variables such as linearity and noise to trade off.

Before making the compromise, we need to recognize one fact: data dependencies. Take o21 in the graph above as an example. Computing it needs i20, i10, and o11, and only once all three are available can o21 be computed. Real hardware has delay and cannot produce an output the instant it gets an input, so o21 is not available immediately; it can be computed only after i20, i10, and o11 are ready. This is like queuing at a bank. In the past, withdrawing money meant standing in a long line, unable to do anything else, and the time was wasted. Now you take a number, and the announcement system tells you when to go to the counter. Only one person is served at a time, but the long line is gone. The announcement system is the controller: it tells you when it is your turn. Back to our case, I draw a diagram below to show the data dependencies:

You can see that once the circuit's primary inputs (the yellow squares) are given, o11 (white) is the only node of the 3*3 array that can be computed. Computing o12 at that moment is impossible because o11 is not yet available. In the colored diagram, white is computed first; when white is done, red is computed, then gray, then green, and finally blue. This suggests a design that trades off area and performance: use only 3 computing units (a computing unit is one node of the graph above, that is, 3 multipliers and 2 adders). Why three rather than two? With two units, the gray group could not finish in one cycle and would need two. Such an inconsistent number of cycles per group is likely to complicate the controller, so we use 3 units. How does this compromise design operate?
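The color groups follow a simple pattern: a node in row r and column c depends only on its left, upper-left, and upper neighbours, so all nodes with the same r + c can be computed together. A minimal sketch under that observation; the function names and the term "wavefront" for a color group are assumptions for illustration:

```c
#include <assert.h>

/* Wavefront (1-based) in which node o[r][c] becomes computable. */
int wavefront_of(int r, int c)
{
    assert(r >= 1 && c >= 1);
    return r + c - 1;
}

/* Number of nodes on wavefront w of an n x n array (w = 1 .. 2n-1). */
int wavefront_width(int n, int w)
{
    assert(n >= 1 && w >= 1 && w <= 2 * n - 1);
    return (w <= n) ? w : 2 * n - w;
}
```

For the 3*3 array the widths are 1, 2, 3, 2, 1, matching white, red, gray, green, and blue. The widest group has 3 nodes, which is why 3 units are enough to finish every group in one cycle.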

Always remember that our circuit has three independent computing units. In the first cycle, the controller feeds i00, i01, and i10 to any one of the 3 units to compute the result. What about the other 2? Ignore them; their inputs can be anything. The controller knows which unit is doing useful work and stores that unit's result into memory. In the next cycle, the controller fetches i01, i02, and o11 for the inputs of one unit and i10, i20, and o11 for the inputs of a second unit. The remaining unit is again ignored, and the two selected units compute in parallel. The following cycles work the same way. I hope the whole operation is now clear.

What about the overall area and performance? There are 3 computing units (not 9), and the computation takes 5 clock cycles, one per color group (not 9). The gap becomes obvious as the design scales up; you can verify this yourself. So we arrive at a final compromise design whose area and performance are both acceptable. Below I draw the whole block diagram; this is the diagram that should be in your head as you write the Verilog code.
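The schedule described above can be checked with a small software model. This is a sketch under stated assumptions, not the hardware itself: the names unit, run_wavefronts, and N are hypothetical, the weights 1, 2, 3 come from the o12 example, and one loop step per group of up to `units` nodes stands for one clock cycle.

```c
#include <assert.h>

#define N 3  /* outputs form an N x N array; row 0 and column 0 of K are inputs */

/* One computing unit: 3 multipliers and 2 adders. */
static long unit(long left, long upleft, long up)
{
    return 1 * left + 2 * upleft + 3 * up;
}

/*
 * Process the array one color group (anti-diagonal) at a time.
 * Nodes in the same group are independent, so up to `units` of them
 * are issued per modelled clock cycle. Returns the cycle count.
 */
int run_wavefronts(long K[N + 1][N + 1], int units)
{
    assert(units >= 1);
    int cycles = 0;
    for (int w = 1; w <= 2 * N - 1; w++) {      /* group index */
        int issued = 0;                          /* nodes issued this cycle */
        for (int r = 1; r <= N; r++) {
            int c = w + 1 - r;                   /* r + c - 1 == w */
            if (c < 1 || c > N)
                continue;
            if (issued == units) {               /* all units busy: next cycle */
                cycles++;
                issued = 0;
            }
            K[r][c] = unit(K[r][c - 1], K[r - 1][c - 1], K[r - 1][c]);
            issued++;
        }
        cycles++;                                /* cycle that finishes the group */
    }
    return cycles;
}
```

Counting one cycle per group, the white, red, gray, green, and blue groups give 5 cycles with three units. With two units the gray group needs a second cycle, for 6 in total, and with one unit the model falls back to 9.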

Above is the block diagram of the final compromise design. The data path consists of three computing units, and its Verilog description is very simple. The controller is covered in the next article. If no dual-port SRAM is available, the memory can be built from registers. The memory controller is also simple: in essence it is an interface between the data path and the SRAM. So the Verilog description as a whole is not particularly difficult; the difficulty in writing Verilog lies in coordinating the various states and control signals.

That is all for simple digital circuit architecture. The main textbook I recommend is "Verilog HDL Advanced Digital Design". I recommend it highly: each of its examples comes in several architectural versions, so you can see the tradeoffs among them. The book is admittedly difficult, especially if you are not yet fluent with state machines. That does not matter; I will recommend material for learning state machines that will take you from the basics to proficiency. Again, the main purpose of this article is to recommend textbooks. The rest was written without much deliberation and may contain significant errors, so treat it as a reference only. If the halftone image processor in the original book is hard to follow and its signals seem complex, this simplified adaptation and my analysis of it may make "Verilog HDL Advanced Digital Design" easier to learn.
