The CPU core is mainly divided into two parts: the arithmetic unit and the controller.
(1) Arithmetic Unit
1. Arithmetic Logic Unit (ALU)
The ALU performs fixed-point arithmetic operations (addition, subtraction, multiplication, division), logical operations (AND, OR, NOT), and shift operations on binary data. Some CPUs also have a dedicated shifter for processing shift operations.
Generally, an ALU has two inputs and one output. The integer unit is also called the IEU (Integer Execution Unit). When we say that a CPU is "XX-bit", we are referring to the number of bits of data that the ALU can process.
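The "two inputs, one output, fixed bit width" description can be sketched in a few lines of Python. This is only an illustrative model: the function name `alu`, the operation names, and the default 8-bit width are assumptions, not taken from any real CPU.

```python
def alu(op: str, a: int, b: int = 0, width: int = 8) -> int:
    """Apply one operation to two inputs and return one width-bit output."""
    mask = (1 << width) - 1      # keeps every value inside `width` bits
    a &= mask
    b &= mask
    if op == "add":
        result = a + b           # overflow is discarded by the final mask
    elif op == "sub":
        result = a - b           # wraps around like two's complement hardware
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "not":
        result = ~a              # unary operation: the second input is ignored
    elif op == "shl":
        result = a << b
    elif op == "shr":
        result = a >> b
    else:
        raise ValueError(f"unknown operation: {op}")
    return result & mask
```

With the default width, `alu("add", 200, 100)` wraps around to 44, which is what "the number of bits the ALU can process" means in practice.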
2. Floating Point Unit (FPU)
The FPU is mainly responsible for floating-point operations and high-precision integer operations. Some FPUs also provide vector operations, while other CPUs have dedicated vector processing units.
3. General-Purpose Registers
The general-purpose register group is the fastest storage in the CPU. It holds the operands and intermediate results involved in operations.
The design of general-purpose registers differs greatly between RISC and CISC:
A) CISC designs usually have very few registers, mainly because of hardware cost limits at the time they were designed. For example, the x86 instruction set has only eight general-purpose registers. As a result, a CISC CPU spends most of its execution time accessing data in memory rather than in registers, which slows down the entire system.
B) RISC systems, by contrast, often have many general-purpose registers and use techniques such as overlapping register windows and register files to make full use of register resources.
C) To address the limitation that the x86 instruction set supports only eight general-purpose registers, recent CPUs from Intel and AMD adopt a technique called "register renaming". It allows an x86 CPU to break through the eight-register limit and use 32 or more registers internally.
D) However, this technique requires an extra clock cycle for register operations in order to rename the registers.
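The idea behind register renaming can be sketched as a mapping table from the eight architectural registers to a larger pool of physical registers. This is a simplified model under stated assumptions: the function name `rename`, the `(dest, src1, src2)` instruction format, and the pool size of 32 are hypothetical, and the sketch never returns physical registers to the free list as a real CPU would.

```python
def rename(instructions, num_arch=8, num_phys=32):
    """Rename (dest, src1, src2) triples of architectural register numbers.

    Returns the same instructions expressed in physical register numbers.
    """
    # Initially, architectural register i lives in physical register i.
    mapping = list(range(num_arch))
    free = list(range(num_arch, num_phys))   # unused physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read whichever physical register currently holds the value.
        p1, p2 = mapping[src1], mapping[src2]
        if not free:
            raise RuntimeError("out of physical registers (real hardware would stall)")
        # Every write gets a fresh physical register.
        pd = free.pop(0)
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed
```

For example, `rename([(0, 1, 2), (3, 0, 0), (0, 4, 5)])` sends the two writes to architectural register 0 to different physical registers, so the third instruction no longer has to wait for the second one to finish reading the old value.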
4. Special-Purpose Registers
A special-purpose register is usually a status register. It cannot be changed by the program; it is controlled by the CPU itself and indicates a particular state.
(2) Controller
The arithmetic unit can only perform operations; the controller is what controls the entire CPU.
1. Instruction Controller:
The instruction controller is a very important part of the controller. It fetches instructions, decodes them, and performs related operations, then hands them over to an execution unit (ALU or FPU) for execution. It also forms the address of the next instruction.
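The fetch, decode, execute, and next-address duties described above can be illustrated with a toy interpreter. Everything here is a made-up sketch: the three-instruction "instruction set" (`addi`, `jnz`, `halt`), the tuple encoding, and the `max_steps` safety limit are assumptions for illustration only.

```python
def run(program, num_registers=4, max_steps=10_000):
    """Execute a list of instruction tuples and return the final registers."""
    regs = [0] * num_registers
    pc = 0                                  # address of the current instruction
    for _ in range(max_steps):
        if pc >= len(program):
            break
        instruction = program[pc]           # fetch
        op, *args = instruction             # decode
        next_pc = pc + 1                    # default address of the next instruction
        if op == "addi":                    # regs[r] += immediate value
            r, value = args
            regs[r] += value
        elif op == "jnz":                   # jump to target if regs[r] is non-zero
            r, target = args
            if regs[r] != 0:
                next_pc = target
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
        pc = next_pc
    return regs
```

A small countdown loop such as `[("addi", 0, 3), ("addi", 1, 2), ("addi", 0, -1), ("jnz", 0, 1), ("halt",)]` shows how the next address is either the following instruction or a branch target.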
2. Timing Controller:
The timing controller provides control signals for each instruction in time order. It includes the clock generator and the multiplier definition unit. The clock generator produces a very stable pulse signal from a crystal oscillator; this is the CPU clock frequency. The multiplier definition unit defines how many times the memory frequency (bus frequency) the CPU clock frequency is.
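The relationship between the bus frequency and the multiplier is a single multiplication. A minimal sketch, where the function name and the example figures are hypothetical and not from the source:

```python
def core_clock_mhz(bus_clock_mhz: float, multiplier: float) -> float:
    """CPU clock frequency = bus (memory) frequency x multiplier."""
    return bus_clock_mhz * multiplier
```

For instance, an assumed 200 MHz bus with a 10x multiplier gives `core_clock_mhz(200, 10)`, that is, 2000 MHz.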
3. Bus Controller:
The bus controller mainly controls the CPU's internal and external buses, including the address bus, data bus, and control bus.
4. Interrupt Controller:
The interrupt controller is used to control various interrupt requests and queues interrupt requests based on their priorities. These requests are sent to the CPU one by one.
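Queuing requests by priority and handing them over one at a time can be sketched with a priority queue. This is a software analogy only: the class and method names are hypothetical, and the convention that a lower number means a higher priority is an assumption.

```python
import heapq


class InterruptController:
    """Toy model: queue interrupt requests and release them by priority."""

    def __init__(self):
        self._pending = []      # min-heap of (priority, arrival order, name)
        self._arrivals = 0      # tie-breaker: earlier requests go first

    def request(self, name, priority):
        """Queue a request; a lower number means a higher priority (assumed)."""
        heapq.heappush(self._pending, (priority, self._arrivals, name))
        self._arrivals += 1

    def next_interrupt(self):
        """Return the highest-priority pending request, or None if idle."""
        if not self._pending:
            return None
        return heapq.heappop(self._pending)[2]
```

Requests with equal priority are delivered in arrival order, because the arrival counter is the second element of each heap entry.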
(3) Design of the CPU Core
What determines CPU performance? Raw ALU speed does not play a decisive role, because ALU speeds are all about the same. The decisive factor in a CPU's performance is the design of the CPU core.
1. Superscalar
Since ALU speed cannot be greatly improved, is there an alternative? Parallel processing once again plays a powerful role. A so-called superscalar CPU is simply a CPU that integrates multiple ALUs, multiple FPUs, multiple decoders, and multiple pipelines to improve performance through parallelism.
Superscalar technology should be easy to understand, but note that you should not pay much attention to the number in front of "superscalar", as in "9-way superscalar". Different manufacturers define this number differently; it is mostly a marketing device.
2. Pipeline
The execution of an instruction can be divided into five steps: fetching the instruction, decoding the instruction, fetching the operands, computing (ALU), and writing back the result.
The first three steps are generally completed by the instruction controller, and the last two by the arithmetic unit.
In the traditional approach, all instructions are executed in sequence: the instruction controller completes the first three steps of the first instruction, then the arithmetic unit completes its last two steps; next the instruction controller completes the first three steps of the second instruction, then the arithmetic unit completes its last two steps, and so on. Obviously, while the instruction controller is working the arithmetic unit is mostly idle, and while the arithmetic unit is working the instruction controller is idle, which wastes considerable resources. The solution is easy to think of: once the instruction controller has completed the first three steps of the first instruction, it immediately starts on the second instruction, and the arithmetic unit does likewise. This forms a pipeline, in this case a two-stage pipeline.
Assume a superscalar system with three instruction control units and two arithmetic units. Then, as soon as the first instruction has been fetched, the fetch of the second instruction can begin. In the next cycle the first instruction is decoded while the second is fetched; after that, the third instruction is fetched, the second is decoded, and the first fetches its operands, and so on. This is a five-stage pipeline. In the ideal case, the throughput of a five-stage pipeline approaches five times that of a non-pipelined design.
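The overlap described above can be made concrete by computing which instruction occupies which stage in each clock cycle. This is an idealized sketch: it assumes one new instruction enters the pipeline every cycle and nothing ever stalls, and the stage and function names are chosen for illustration.

```python
STAGES = ["fetch", "decode", "operand", "execute", "writeback"]


def pipeline_schedule(num_instructions):
    """Return one dict per clock cycle mapping stage name -> instruction index."""
    if num_instructions <= 0:
        return []
    total_cycles = len(STAGES) + num_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            instr = cycle - s    # instruction i reaches stage s at cycle i + s
            if 0 <= instr < num_instructions:
                row[stage] = instr
        schedule.append(row)
    return schedule


def ideal_speedup(num_instructions):
    """Cycles without pipelining divided by cycles with pipelining."""
    if num_instructions <= 0:
        raise ValueError("need at least one instruction")
    unpipelined = num_instructions * len(STAGES)
    pipelined = len(pipeline_schedule(num_instructions))
    return unpipelined / pipelined
```

In cycle 2 (counting from 0) of `pipeline_schedule(3)`, instruction 2 is being fetched, instruction 1 decoded, and instruction 0 is fetching its operands. Under these assumptions the speedup depends on how many instructions flow through: it is 1 for a single instruction and gets closer to the number of stages as the instruction stream grows.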
A pipeline maximizes the use of CPU resources so that every component can work in every clock cycle, greatly improving efficiency. However, pipelines have two major problems: dependencies and branches.
In a pipeline, if the second instruction needs the result of the first instruction, the two are said to be dependent. Take the five-stage pipeline above as an example: when the second instruction needs to fetch its operands, the first instruction has not finished computing. If the second instruction fetched its operands anyway, it would get a wrong result. So the entire pipeline has to stall and wait for the first instruction to complete. This is a very troublesome problem, especially for long pipelines: in a 20-stage pipeline, for example, such a stall usually costs more than a dozen clock cycles. The current solution is out-of-order execution. Its principle is to insert unrelated instructions between two dependent instructions so that the pipeline keeps flowing. In the example above, after the first instruction is issued, the third instruction is executed next (assuming it is independent), and only then the second instruction.
In this way, by the time the second instruction needs to fetch its operands, the first instruction has just completed and the third is about to complete, so the pipeline does not stall. Of course, pipeline stalls still cannot be avoided completely, especially when there are many dependent instructions.
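The benefit of filling the gap between two dependent instructions can be shown with a small issue-time model. This is a deliberately simplified sketch, not how any particular CPU schedules: it assumes one instruction is issued per cycle, a fixed result latency of three cycles, an unbounded scheduling window, and dependencies given as lists of earlier instruction indices. All names are hypothetical.

```python
def in_order_issue(deps, latency=3):
    """Issue cycle of each instruction when issuing strictly in program order."""
    issue = []
    for d in deps:
        earliest = issue[-1] + 1 if issue else 0
        for j in d:                       # wait for every producer's result
            earliest = max(earliest, issue[j] + latency)
        issue.append(earliest)
    return issue


def out_of_order_issue(deps, latency=3):
    """Issue cycle of each instruction when any ready instruction may go first."""
    issue = [None] * len(deps)
    remaining = set(range(len(deps)))
    cycle = 0
    while remaining:
        for i in sorted(remaining):       # prefer the oldest ready instruction
            ready = all(
                issue[j] is not None and issue[j] + latency <= cycle
                for j in deps[i]
            )
            if ready:
                issue[i] = cycle
                remaining.remove(i)
                break                     # at most one issue per cycle
        cycle += 1
    return issue
```

With `deps = [[], [0], [], []]` (the second instruction needs the first one's result, the others are independent), in-order issue gives cycles `[0, 3, 4, 5]`, while out-of-order issue gives `[0, 3, 1, 2]`: the independent instructions fill the cycles that would otherwise be a stall.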
The other major problem is conditional branches. In the example above, if the first instruction is a conditional branch, the system does not know which instruction should be executed next. It must wait for the outcome of the first instruction before executing the second. Pipeline stalls caused by conditional branches are even more serious than those caused by dependencies. Branch prediction is therefore used to handle them. Although programs are full of branches and either direction is possible, in most cases the same direction is chosen. For example, the end of a loop is a branch: except for the last iteration, where we jump out, we always choose to continue the loop. Based on such regularities, branch prediction predicts what the next instruction will be and executes it before the outcome is known. Current branch prediction techniques achieve an accuracy of more than 90%. However, once a prediction is wrong, the CPU still has to flush the entire pipeline and return to the branch point, which costs many clock cycles. Further improving the accuracy of branch prediction is therefore still a research topic.
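The loop example can be tried out with a two-bit saturating counter, a common textbook predictor. The text does not say which prediction scheme is meant, so this choice, the function name, and the starting state are all assumptions made for illustration.

```python
def prediction_accuracy(outcomes, counter=2):
    """Fraction of branch outcomes a 2-bit saturating counter predicts correctly.

    outcomes: sequence of booleans, True meaning the branch was taken.
    counter:  starting state 0..3; states 2 and 3 predict "taken".
    """
    correct = 0
    total = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        total += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / total if total else 0.0
```

For a loop whose closing branch is taken nine times and then falls through once, `prediction_accuracy([True] * 9 + [False])` mispredicts only the final exit, which matches the intuition that loop branches almost always continue.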
The longer the pipeline, the more serious the dependency and branch problems become. So a longer pipeline is not necessarily better; finding a balance between speed and efficiency is what matters most.
Reprinted: http://hi.baidu.com/halleyzhang/blog/item/e45ed35c3ebb0442fbf2c09d.html