Processor Architecture: Understanding the Basic Operating Principles of the CPU (Deep Understanding of Computer Systems)

Source: Internet
Author: User
Tags: integer division


Processor Architecture


ISA

The set of instructions a processor supports, together with their byte-level encodings, is called its instruction set architecture (ISA).

Although the performance and complexity of the processors made by each vendor improve constantly, different models remain compatible at the ISA level. The ISA thus provides a conceptual abstraction layer.

This conceptual abstraction layer is the ISA model: the instruction encodings the CPU accepts, with instructions executed in sequence (one instruction is fetched, and the next begins only after it completes). However, the actual working mode of a modern processor may be quite different from the computational model implied by the ISA. By simultaneously processing different parts of multiple instructions, a processor achieves high performance, yet it must present externally visible results that conform to the ISA model.

Using clever methods to improve performance while maintaining a simpler, more abstract model of the system's function is a well-known idea in computer science (abstraction).

 

CPU Hardware Overview


Most modern circuit designs represent different bit values as high and low voltages on signal lines.

To implement a digital system, three main components are required:

① Combinational logic to compute functions on bits (such as the ALU)

② Memory elements to store bits (such as registers)

③ Clock signals to control the updating of the memory elements

 

Logic gates are the basic computing elements of a digital circuit. Each gate's output is a Boolean function of its input bit values.

By assembling many logic gates into a network, we can build a computational block called a combinational circuit. (It is the hardware analogue of an expression.)

 

The arithmetic/logic unit (ALU) is a very important combinational circuit. It has three inputs: two data inputs and one control input. Based on the control input's setting, the circuit performs different arithmetic or logical operations on the data inputs.
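As a sketch of this idea, the ALU can be modeled as a pure function of two data inputs and one control input. The function codes and the 32-bit word width below are illustrative assumptions, not the encoding of any particular ISA.

```python
# Minimal sketch of an ALU as combinational logic: a pure function of
# two data inputs and one control input. The particular encoding
# (0=add, 1=sub, 2=and, 3=xor) is hypothetical.
MASK32 = 0xFFFFFFFF  # model a 32-bit word width

def alu(ifun, a, b):
    if ifun == 0:      # add
        return (a + b) & MASK32
    if ifun == 1:      # subtract (b - a, as in "subl rA, rB")
        return (b - a) & MASK32
    if ifun == 2:      # bitwise and
        return a & b
    if ifun == 3:      # bitwise xor
        return a ^ b
    raise ValueError("unknown function code")

assert alu(0, 3, 4) == 7
assert alu(1, 1, 5) == 4
```

Because the function is stateless, the same inputs always yield the same output, which is exactly the combinational property described above.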

 

Memory and clock

By itself, a combinational circuit does not store any information. It simply reacts to its input signals, generating outputs equal to some function of the inputs. To create a sequential circuit, that is, a system that has state and performs computation on that state, we must introduce devices that store information as bits.

All the storage devices are controlled by a single clock, a periodic signal that determines when new values are to be loaded into the devices.

 

Most of the time, the register stays in a stable state (denoted x), with its output equal to its current state. Signals propagate through the combinational logic preceding the register, producing a new register input (denoted y); but as long as the clock stays low, the register's output remains unchanged. When the clock rises to high, the input signal is loaded into the register, and y becomes the register's state until the next rising clock edge.

Registers act as barriers between the combinational logic in different parts of a circuit. Values move from a register's input to its output only once per rising clock edge.
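This barrier behavior can be captured in a toy model of an edge-triggered register (a sketch for illustration, not a hardware description):

```python
# Toy model of an edge-triggered register: upstream combinational logic
# may drive the input (d) at any time, but the output (q) changes only
# when the clock rises.
class Register:
    def __init__(self, init=0):
        self.d = init   # input, driven by upstream combinational logic
        self.q = init   # output, visible to downstream logic

    def clock_rise(self):
        # On the rising edge, the input value is loaded into the register.
        self.q = self.d

r = Register(init=0)
r.d = 42          # upstream logic computes a new value...
assert r.q == 0   # ...but the output is unchanged while the clock is low
r.clock_rise()    # rising edge: the new value is loaded
assert r.q == 42
```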

 

The register file (a logical block composed of the general-purpose registers) has two read ports and one write port. Such a circuit can read the values of two program registers while updating the state of a third. Each port has an address input indicating which program register to select.

Although the register file is not a combinational circuit (it has internal storage), reading from it behaves like a combinational logic block with addresses as inputs and data as outputs.

 

Instruction Code

An important property of an instruction set is that its byte encodings must have a unique interpretation: any byte sequence is either the encoding of a unique instruction sequence or not a legal byte sequence. Because the first byte of each instruction has a unique combination of code and function values, given this byte we can determine the length and meaning of all the remaining bytes.

 

Each instruction is 1-6 bytes long, depending on the required fields. The first byte of each instruction indicates the instruction type: the high 4 bits are the code part (for example, 6 for an integer-operation instruction) and the low 4 bits are the function part (for example, 1 for subtraction within the integer class), so 0x61 is the encoding of the subtract instruction.
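The split of the first instruction byte into a code nibble and a function nibble can be computed directly (the 0x61 pairing follows the text's example):

```python
# The first byte of an instruction splits into a 4-bit code (high
# nibble) and a 4-bit function (low nibble). Per the example in the
# text, 0x61 = code 6 (integer operation), function 1 (subtract).
def split_opcode(byte0):
    icode = (byte0 >> 4) & 0xF  # instruction class
    ifun = byte0 & 0xF          # specific operation within the class
    return icode, ifun

assert split_opcode(0x61) == (6, 1)
```

Because this first byte is unique per instruction type, a decoder can determine the total instruction length from it alone.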

 

Processing an instruction involves a sequence of stages:


Fetch

In the fetch stage, the instruction bytes are read from memory into the instruction memory (inside the CPU), using the value of the program counter (PC) as the address.

It also computes the address of the instruction following the current one in sequence (that is, the PC value plus the length of the fetched instruction).

 

Decode

In the decode stage, up to two operands are read from the register file (the set of general-purpose registers). (That is, at most two registers can be read at a time.)

 

Execute

In the execute stage, the arithmetic/logic unit (ALU) is used for different purposes depending on the instruction type. For some instructions it performs the specified operation; for others it increments or decrements the stack pointer, computes an effective address, or simply adds 0 to pass an input through to the output.

The condition code register (CC) holds three condition bits. The ALU is responsible for computing the new condition code values. When a jump instruction executes, the branch signal Cnd is computed from the condition codes and the jump type.
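A minimal sketch of how an addition could set three condition bits (the bit names ZF/SF/OF and the 32-bit width are assumptions in the style of x86; the actual set of bits depends on the ISA):

```python
# Sketch of condition-code computation for a 32-bit addition:
# ZF (zero), SF (sign), OF (signed overflow). Illustrative only.
MASK32 = 0xFFFFFFFF

def add_and_set_cc(a, b):
    result = (a + b) & MASK32
    zf = int(result == 0)
    sf = int(result >> 31)  # sign bit of the result
    # Signed overflow: operands share a sign that differs from the result's.
    sa, sb, sr = a >> 31, b >> 31, result >> 31
    of = int(sa == sb and sa != sr)
    return result, {"ZF": zf, "SF": sf, "OF": of}

# 0x100 + 0x200 = 0x300, nonzero, so ZF = 0 (as in the je example later).
_, cc = add_and_set_cc(0x100, 0x200)
assert cc == {"ZF": 0, "SF": 0, "OF": 0}
```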

 

Memory

In the memory stage, the data memory (inside the CPU) reads or writes one memory word. Instructions and data occupy the same storage, but are accessed for different purposes.

 

Write back

In the write-back stage, up to two results can be written to the register file. The register file has two write ports: port E is used to write values computed by the ALU, and port M is used to write values read from the data memory.

 

PC update

The next PC value is selected from the signals computed in the previous stages, based on the instruction code and the branch flag.
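The selection described above can be sketched as a small function. Here valC, valM, and valP stand for the jump/call target, the value read from memory, and the next sequential address; the icode names and the absence of stalling are simplifying assumptions.

```python
# Hypothetical sketch of next-PC selection based on instruction code
# and branch flag. Assumed signals: valC = jump/call target, valM =
# return address read from memory, valP = next sequential address.
JXX, CALL, RET = "jXX", "call", "ret"

def select_pc(icode, cnd, valC, valM, valP):
    if icode == JXX and not cnd:
        return valP          # conditional branch not taken: fall through
    if icode in (JXX, CALL):
        return valC          # taken branch or call: target address
    if icode == RET:
        return valM          # return address popped from the stack
    return valP              # default: next sequential instruction

# A not-taken je falls through (as in the SEQ example later):
assert select_pc(JXX, cnd=0, valC=0x019, valM=0, valP=0x013) == 0x013
```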



We use SEQ (the "sequential" processor) as an example to explain the basic principles of the CPU. In each clock cycle, SEQ performs all of the steps needed to process a complete instruction. However, this requires a very long clock cycle time, so the clock frequency would be unacceptably low.



SEQ timing



Combinational logic requires no timing or control: whenever the inputs change, the values propagate through the logic gate network.

We also treat reads of the random-access memories (the register file, instruction memory, and data memory) just like combinational logic. (Writes to the random-access memories must wait for the clock's high phase.)

Since the instruction memory is used only to read instructions, we can treat this unit as combinational logic. (Writing instructions into the instruction memory is done by events outside the CPU and is not part of the CPU's timing.)

 

Each clock cycle, the program counter loads a new instruction address.

The condition code register is loaded only when an integer-operation instruction executes.

The data memory is written only when a memory-write mov, push, or call instruction executes.

 

To control the timing of operations in the processor, we only need to control the clocking of the registers and the memories, because an instruction's computed results are written either to registers or to memory.

We can regard fetch, decode, execute, and so on as one combinational-logic process (since they involve no register writes), with write-back as a separate process.

The entire process can be simplified as follows:



[Example]

There are the following commands:

0x000: irmovl $0x100, %ebx

0x006: irmovl $0x200, %edx

0x00c: addl %edx, %ebx

0x00e: je dest

0x013: rmmovl %ebx, 0(%edx)

0x019: dest: halt

 

In our SEQ processor, an instruction is executed in one clock cycle (that is, the interval between two successive rising edges).

 


At the start of clock cycle 3 (point 1), the rising edge loads address 0x00c into the program counter (PC). The memory control unit connected to the PC then fetches the addl instruction at address 0x00c from memory and loads it into the instruction memory. (Reading from memory is slow; this takes a long while, which is why a long clock cycle is needed to execute one instruction per cycle.) At the same time, the PC value plus the length of the addl instruction produces the new PC value, which travels along the bus and will be written into the PC at the next rising edge.

 

Combinational logic: when the input from the instruction memory changes, the value (the addl instruction) propagates through the logic gate network. As a result, the values of %edx and %ebx in the register file are read immediately (because reading the register file does not need to wait for a rising edge).

 

The values read from %edx and %ebx flow immediately into the combinational ALU. From the addl instruction propagated earlier, the ALU knows this is an addition, so the result valE of adding the two values is computed immediately. valE travels along the bus and can reach the register file instantly, but it cannot be written into the register file yet; it must wait for the next rising edge.

Throughout the cycle, the register file and the memory still hold the result values of the preceding instructions. (Points 1 and 2.)

 

At the start of clock cycle 4 (point 3), the rising edge writes the new PC value generated in the previous cycle into the program counter, and the addl result valE computed in the previous cycle is written into %ebx in the register file.

Because address 0x00e is now in the program counter, the jump instruction je is fetched and executed. Since the condition code ZF is 0, the branch is not taken. By the end of the cycle (point 4), the program counter has produced the new value 0x013, but the state of the registers and memory remains as the addl instruction set it until the next cycle begins.

 

[In this example, the clock controls the updating of the state elements, while values propagate through the combinational logic. This is enough to control the computation our SEQ performs for every instruction. Each time the clock rises from low to high, the processor begins executing a new instruction.]

 

[Read operations propagate through these units as if they were combinational logic, while write operations are controlled by the clock.]

 

[Note] (personal understanding)

Early CPUs without pipelines did not execute one instruction per cycle. Our SEQ processor is deliberately built to execute one instruction per cycle in order to explain CPU timing, so a clock cycle is very long (because we must wait for main memory to load the instruction into the instruction register, and some instructions must also wait for data to be written back to main memory).

 

If the execution time of the slowest instruction is used as the clock cycle, the clock granularity is too coarse, since different instructions require different amounts of time. With too coarse a granularity, some instructions finish early while the CPU sits idle waiting for the cycle to end. (Our six-stage division treats all instructions as a whole, while many instructions need only a few of the stages.)

 

Therefore, early CPU designers split the execution performed by one large block of combinational logic into several stages, each handled by a smaller block of combinational logic, inserting registers in between to hold the intermediate results. As in the pipelining mechanism described later, the next instruction can start entering once the previous one has moved on.

The advantage is that, because different instructions involve different stages, some instructions complete after only a few stages while others must pass through many. The time needed to run an instruction thus varies, and is at most the time of the slowest instruction. (In our SEQ processor, every instruction must pass through the same large block of combinational logic, so the clock cycle can only be set to the time of the most time-consuming instruction.)

[Compare this with the paging mechanism of memory management, which improves memory utilization.]

 

Intel's 8086, introduced in 1978, required multiple (typically 3-10) clock cycles to execute one instruction. A more advanced processor can sustain an execution rate of 2-4 instructions per clock cycle. In fact, each instruction takes much longer from start to finish, roughly 20 cycles or more, but the processor uses many clever techniques to process as many as 100 instructions at the same time.

 


Split line -------------------------------------------------------------------------------------------



Pipelining Principles


By organizing the steps needed to execute each instruction into a unified flow, we can implement the entire processor with a small number of hardware units and a single clock controlling the sequence of computation. The control logic must then route signals among these units and generate the appropriate control signals based on instruction types and branch conditions. (The CPU has three kinds of buses: the control bus, the address bus, and the data bus.)

 

The SEQ processor cannot make full use of its hardware units, since each unit is active for only part of the clock cycle. We will see that introducing pipelining achieves better performance.

 

In a pipelined system, the task to be performed is divided into several independent stages.

For example, in a car wash these stages might include spraying water, applying soap, scrubbing, waxing, and drying. Multiple customers are usually allowed to move through the system at the same time, rather than waiting for one customer to complete every step from start to finish before the next may begin.

When one car moves from the spraying stage into the next stage, another car can enter the spraying stage. The cars must generally move through the system at the same speed to avoid collisions.

 

An important feature of pipelining is that it increases system throughput, the total number of customers served per unit time, while slightly increasing latency, the time required to serve one customer. (For example, in a non-pipelined system a car that only needs a water spray could leave right away; in a pipelined system, whatever your requirements, you must pass through the entire sequence.)

 

(In the earlier design, the next instruction could enter the CPU only after the current one finished executing; the difference lies only in the granularity of the clock cycle. Pipelining allows multiple instructions to be in the CPU at once. Every instruction spends the same amount of time in the CPU: even if its work finishes within one cycle, it must wait through the remaining stages, which delays subsequent instructions.

Although with pipelining every instruction waits the same amount of time in the CPU (all sized to the most time-consuming instruction), their times overlap. Suppose an instruction spends 6 ms in the CPU: then 12 ms can process 7 instructions. Without pipelining, although a single instruction takes at most 6 ms, the times add up, and only 3 instructions can execute in 12 ms, since 12 = 6 + 2 + 4.)
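The 7-versus-3 comparison above can be checked with a few lines of arithmetic, under the stated assumptions: a 6 ms instruction split into six 1 ms pipeline stages, versus non-pipelined instructions taking 6, 2, and 4 ms.

```python
# Arithmetic behind the 7-vs-3 comparison: pipelined instruction i
# (0-based) finishes at (stages + i) * stage_ms; non-pipelined
# instructions run back to back.
def pipelined_completed(budget_ms, stages=6, stage_ms=1):
    count = 0
    while (stages + count) * stage_ms <= budget_ms:
        count += 1
    return count

def sequential_completed(budget_ms, durations):
    t, count = 0, 0
    for d in durations:
        if t + d > budget_ms:
            break
        t += d
        count += 1
    return count

assert pipelined_completed(12) == 7               # 7 instructions in 12 ms
assert sequential_completed(12, [6, 2, 4]) == 3   # 12 = 6 + 2 + 4
```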

 

[Example]

A simple non-pipelined hardware system:

It consists of some computational logic and a register that stores the results; a clock signal controls the loading of the register at regular intervals.

(The decoder in a CD player is such a system: the input signals are the bits read from the CD surface, and the logic circuit decodes them to produce audio signals. The computation block in the figure is implemented as combinational logic, meaning the signals pass through a series of logic gates, and after a certain delay the outputs become a function of the inputs.)



In this example, suppose the combinational logic requires 300 ps and loading the register requires 20 ps. In this implementation, the previous instruction must complete before the next can begin, so executing one instruction takes 320 ps; that is, the system's throughput is about 3.12 GIPS (billions of instructions per second).

 

A pipelined hardware system

Suppose the computation performed by the system is divided into three stages (A, B, and C), each requiring 100 ps, with a pipeline register placed between each pair of stages. Each instruction then moves through the system in three steps, needing three clock cycles from start to finish.

(The role of the pipeline registers: they act as barriers between the combinational logic of different parts of the circuit, saving the result computed by each stage's combinational logic. They are the registers inserted to separate the pipeline stages.)




In steady state, all three stages of the pipeline are active: in each clock cycle, one instruction leaves the system and a new one enters.

In this way, the time of one stage corresponds to running one instruction. In this system we set the clock cycle to 100 + 20 = 120 ps, giving a throughput of about 8.33 GIPS. (This is the steady-state throughput.)


(From a macro perspective, one instruction is run per clock cycle, with stages of multiple instructions overlapping within that cycle. From the perspective of a single instruction, three clock cycles are needed to execute it completely.)

 

We increased the system throughput by a factor of 8.33/3.12 = 2.67, at the cost of some added hardware (the pipeline registers) and a small increase in latency. The increased latency is due to the time overhead of the added pipeline registers.

The clock cycle time equals the time of one stage of the pipeline; in this way, one instruction completes per clock cycle.

[Note]

If the clock runs too fast, the consequences are disastrous: the values may not have had time to propagate through the combinational logic, so when the clock rises the register inputs are not yet valid. (That is, the clock cycle is shorter than the time of one pipeline stage.)

Slowing the clock down, however, does not affect the pipeline's behavior: the signals propagate to the pipeline register inputs, but the register state does not change until the clock rises. (That is, the clock cycle is longer than the time of one pipeline stage.)

Therefore, raising the clock frequency by changing the multiplier value can only go so far.

 

Limitations of Pipelines


1. Non-uniform stage division

The ideal pipelined system above had equal time for every stage, but in a real system the delays of the stages generally differ. The clock can run no faster than the slowest stage allows. (That is, the system throughput is limited by the speed of the slowest stage.)

 

2. Diminishing returns from deepening the pipeline

For example, suppose we divide the computation into six stages of 50 ps each, inserting a pipeline register between each pair of stages to obtain a six-stage pipeline.

The minimum clock cycle of this system is 50 + 20 = 70 ps, for a throughput of 14.29 GIPS, which is 14.29/8.33 = 1.71 times that of the three-stage pipeline. Because of the pipeline register delay, the throughput does not double; this delay becomes a limiting factor on pipeline throughput.
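The throughput figures used in this section all follow from one formula: the clock cycle is the slowest stage delay plus the pipeline-register delay, and throughput is its reciprocal.

```python
# Throughput arithmetic from the text: cycle time = stage delay +
# register delay (in ps); throughput in GIPS = 1000 / cycle_ps,
# since 1e12 ps/s divided by cycle_ps gives 1e9 instructions/s units.
def gips(stage_ps, reg_ps=20):
    cycle_ps = stage_ps + reg_ps
    return 1000.0 / cycle_ps

assert abs(gips(300) - 3.12) < 0.01   # unpipelined: 300 + 20 = 320 ps
assert abs(gips(100) - 8.33) < 0.01   # three stages: 100 + 20 = 120 ps
assert abs(gips(50) - 14.29) < 0.01   # six stages:   50 + 20 = 70 ps
```

The fixed `reg_ps` term is why halving the stage delay less than doubles the throughput.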

 

To increase the clock frequency, modern processors use deep pipelines (15 or more stages).

 

Branch Prediction


The goal of pipelined design is to issue a new instruction every clock cycle. To achieve this, we must determine the location of the next instruction immediately after fetching the current one.

However, if the fetched instruction is a conditional branch, we cannot determine whether the branch is taken until several cycles later, after the instruction has passed through the execute stage. Similarly, if the instruction is ret, the return address cannot be determined until the memory access completes.

 

For conditional branches, we can predict that the branch is taken, in which case the new PC value would be valC, or predict that it is not taken, in which case the new PC value would be valP.

For the ret instruction, the possible return addresses are almost unlimited, since the return address sits at the top of the stack and its contents could be anything. In our design we make no attempt to predict the return address; we simply stall the processing of new instructions until the ret instruction passes through the write-back stage.

 

In either case, we must have some way of handling prediction errors, since by the time the error is discovered the wrongly fetched instruction has already been partially executed.

(Pipeline penalty to be written)

 

Pipeline Hazards


With pipelining, problems can arise when there are dependencies between adjacent instructions.

These take two forms:

1. Data dependencies: the result computed by one instruction is used by a following instruction.

2. Control dependencies: one instruction determines the location of the next, as when executing jump, call, or return instructions.

When such dependencies can cause the pipeline to compute incorrectly, they are called hazards.

 

Using stalling to avoid data hazards

Stalling is a common technique for avoiding hazards: an instruction is held in the decode stage until the instruction that generates its source operands has passed through the write-back stage. This lets our processor avoid data hazards.

The stalling technique holds back one group of instructions in their stage while allowing other instructions to continue through the pipeline.

[Example]

irmovl $10, %edx

irmovl $3, %eax

addl %edx, %eax

halt

 

After the addl instruction is decoded, the control logic detects a data hazard on its two source registers and stalls it. (It finds that at least one instruction currently in the execute, memory, or write-back stage will update register %edx or %eax. addl needs the values of %eax and %edx in its next stage, but there is no guarantee they would be up-to-date.)

While stalling, the control logic inserts a bubble into the execute stage and repeats the decoding of addl in the next cycle.

It again finds hazards on the two source registers, inserts a bubble into the execute stage, and repeats the addl decode in the next cycle.

In effect, the machine dynamically inserts three nop instructions. (They are injected into the execute stage, not fetched from the start of the pipeline.)

irmovl $10, %edx

irmovl $3, %eax

bubble

bubble

bubble

addl %edx, %eax

halt

(This is like waiting in a queue where everyone ahead steps forward, but another person keeps cutting into the gap just in front of you: your position stays the same while those ahead advance. As long as there are gaps and people keep cutting in, you stay put.)

Once it is certain that the earlier instructions have updated the values of the two registers it needs, addl begins to move forward.

 

However, this solution performs poorly. Cases where one instruction updates a register and a following instruction uses the updated register are extremely common, and each can suspend the pipeline for up to three cycles, seriously reducing overall throughput.

 

Using forwarding to avoid data hazards


In the decode stage, the source operands are read from the register file, but the writes to those source registers may not happen until the write-back stage. Rather than stalling until the write completes, it is better to simply pass the value about to be written directly into pipeline register E as the source operand.

(That is, we need not wait for irmovl $10, %edx and irmovl $3, %eax to finish writing their updates into the registers before continuing with addl. In addl's decode stage, the values of %edx and %eax are already available: the decode logic does not read them from the register file, but uses the not-yet-written values from the later pipeline stages.)

This technique of passing a result directly from one pipeline stage to an earlier one is called data forwarding.



In cycle 4, the decode logic finds that register %edx is about to be written in the memory stage, while the new value of register %eax is still being computed in the execute stage. It uses these values, rather than the values read from the register file, as valA and valB.
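The forwarding decision just described can be sketched as a priority selection: prefer the newest pending value for a source register, falling back to the register file. The stage names and priority order here are illustrative assumptions, not a full PIPE design.

```python
# Sketch of the forwarding decision in the decode stage: check the
# execute, memory, and write-back stages (newest first) for a pending
# write to the source register; otherwise read the register file.
def forward(src, e_dst, e_val, m_dst, m_val, w_dst, w_val, regfile):
    if src == e_dst:
        return e_val        # value being computed in the execute stage
    if src == m_dst:
        return m_val        # value about to be written in the memory stage
    if src == w_dst:
        return w_val        # value in the write-back stage
    return regfile[src]     # no pending write: read the register file

regs = {"%eax": 0, "%edx": 0}
# Cycle 4 of the example: %edx is pending in memory, %eax in execute.
valA = forward("%edx", "%eax", 3, "%edx", 10, None, None, regs)
valB = forward("%eax", "%eax", 3, "%edx", 10, None, None, regs)
assert (valA, valB) == (10, 3)
```

Checking the execute stage first ensures that the most recent value for a register wins when several pending writes target it.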

 

Load/use data hazards


One class of data hazard cannot be handled by forwarding alone, because the memory read (in the memory access stage) occurs late in the pipeline.

Example:

mrmovl 0(%edx), %eax

addl %ebx, %eax

halt

The mrmovl instruction reads the value at 0(%edx) from memory, which happens in the memory access stage; but by then the addl instruction is already in the execute stage and has read the value of %eax. That is, because mrmovl obtains its operand relatively late, it is too late to forward the value to the following instruction.

We can combine stalling and forwarding to avoid load/use data hazards. (Since the value arrives too late to forward to the next instruction directly, we stall the following instruction for a cycle and then forward.)

 

When the mrmovl instruction is in the execute stage, the pipeline control logic finds that the instruction in the decode stage (addl) needs a value that is read from memory. It stalls the addl instruction in the decode stage for one cycle, causing a bubble to be injected into the execute stage. The value mrmovl reads from memory can then be forwarded from the memory stage to addl in the decode stage.

 

This use of a stall to handle a load/use hazard is called a load interlock. Load interlocks combined with forwarding suffice to handle all possible forms of data hazards.
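The load-interlock trigger condition can be sketched in one line: stall decode when the instruction in execute is a memory read whose destination matches one of decode's source registers. The instruction names and signal names below are illustrative.

```python
# Sketch of a load-interlock stall condition: the instruction in the
# execute stage reads memory into a register that the instruction in
# the decode stage needs as a source operand.
MEMORY_READS = {"mrmovl", "popl"}  # instructions that load a register from memory

def load_use_stall(e_icode, e_dstM, d_srcA, d_srcB):
    return e_icode in MEMORY_READS and e_dstM in (d_srcA, d_srcB)

# mrmovl 0(%edx), %eax followed by addl %ebx, %eax -> must stall:
assert load_use_stall("mrmovl", "%eax", "%ebx", "%eax") is True
# An unrelated destination register does not stall the pipeline:
assert load_use_stall("mrmovl", "%ecx", "%ebx", "%eax") is False
```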

 

Exception Handling


Exceptions can be generated internally, by the executing program, or externally, by some outside signal.

Three simple internal exceptions:

1. The halt instruction

2. An invalid instruction

3. An invalid access address

(There are also external exceptions: a network port receives a new packet, the user clicks the mouse, and so on.)

 

In the simplified ISA model, when the processor encounters an exception it stops and sets the appropriate status code. All instructions before the excepting instruction should have completed, and subsequent instructions should have no effect on the programmer-visible state.

In a more complete design, the processor would continue by invoking an exception handler, a routine that is part of the operating system.

 

★ In general, exception-handling logic is added to the pipeline structure by including a status code Stat in each pipeline register. If an exception occurs at some stage of an instruction's processing, this status field is set to indicate the exception type.

The instruction's exception status travels along the pipeline, together with the instruction's other information, until it reaches the write-back stage. There, the pipeline control logic detects the exception and halts execution.

 

An exception event does not disturb the flow of instructions in the pipeline, but the instructions behind it are prohibited from updating programmer-visible state (the condition code register and memory) until the excepting instruction reaches the final pipeline stage.

Because instructions reach the write-back stage in the same order in which they would execute in a non-pipelined processor, we can guarantee that the first excepting instruction reaches the write-back stage first. At that point program execution stops, and the status code in pipeline register W (write-back) is recorded as the program status.

 

Another problem: multi-cycle instructions

All the instructions in the processor instruction set designed so far involve simple operations, such as numeric addition, that can be completed in one cycle in the execute stage.

A more complete instruction set would also include integer multiplication and division and floating-point operations. In a pipelined processor like ours, floating-point addition might require 3 or 4 cycles, and integer division 32 cycles.

 

A simple way to implement multi-cycle instructions is to extend the execute-stage logic with integer and floating-point arithmetic units. An instruction then stays in the execute stage for as many clock cycles as it needs, which stalls the fetch and decode stages. This approach is easy to implement, but its performance is not very good.

 

Better performance can be achieved by handling the more complex operations with special hardware functional units that operate independently of the main pipeline. Typically there is one functional unit for integer multiplication and division, and one for floating-point operations (a coprocessor).

When an instruction enters the decode stage, it can be dispatched to one of these special units. While the unit performs the operation, the pipeline continues processing other instructions. Typically the floating-point unit is itself pipelined, so multiple instructions can execute concurrently in the main pipeline and in each unit.

 

Operations on different units must be synchronized to avoid errors.

If there are data dependencies between instructions executed by different units, the control logic may need to stall one part of the system until results computed by another part become available.

Different forms of forwarding are used to pass results from one part of the system to another, in the same way as forwarding between the stages of the PIPE pipeline. Although the overall design is more complex than PIPE, the same techniques of stalling, forwarding, and pipeline control can still be used to make the overall behavior match the sequential ISA model.

 

Interface with Storage System

In our pipelined CPU so far, we assumed that both the fetch unit and the data memory can read or write any location in memory within a single clock cycle.

In reality, however, we reference data by the virtual address of its storage location, which must be translated into a physical address before the actual read or write can occur. It is clearly unrealistic to complete all of this within one clock cycle. Worse, the value being accessed may reside on disk, which can take millions of clock cycles to read into processor memory.

 

Storage System:

The CPU's storage system consists of multiple hardware memories and the operating system software that manages virtual memory.

The storage system is organized as a hierarchy: fast but small memories hold subsets of the memory contents, with slower but larger memories serving as their backing store.

The layer closest to the processor is the cache, which provides fast access to the most frequently used memory locations. There are generally two first-level caches: one for reading instructions and one for reading and writing data.

Another type of cache memory, the TLB (translation look-aside buffer), provides fast translation from virtual addresses to physical addresses.

With the TLB and caches working together, it is indeed possible, in most cases, to read and write data within a single clock cycle.
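The fast paths through TLB and cache can be sketched with a toy model in which both are dictionaries and the fall-back paths stand in for the slower page-table walk and main-memory access. The page size and all names here are illustrative.

```python
# Toy sketch of a memory load through a TLB and a cache. Entirely
# illustrative: real hardware uses fixed-size set-associative arrays,
# not dictionaries.
PAGE = 4096  # assumed page size

def load(vaddr, tlb, page_table, cache, memory):
    vpn, offset = divmod(vaddr, PAGE)
    if vpn in tlb:                  # fast path: translation cached in TLB
        ppn = tlb[vpn]
    else:                           # slow path: walk the page table
        ppn = page_table[vpn]       # (a miss here would be a page fault)
        tlb[vpn] = ppn
    paddr = ppn * PAGE + offset
    if paddr in cache:              # fast path: data already in cache
        return cache[paddr]
    value = memory[paddr]           # slow path: fetch from main memory
    cache[paddr] = value
    return value

mem = {8 * PAGE + 4: 99}
assert load(1 * PAGE + 4, tlb={}, page_table={1: 8}, cache={}, memory=mem) == 99
```

When both lookups hit, the access involves only the two dictionary reads, which is the single-cycle case described above.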

 

Cache miss: a referenced location is not in the cache. In the best case, the missing data can be found in a higher-level cache or in the processor's main memory, requiring 3-20 clock cycles. Meanwhile, the pipeline simply stalls, holding the instruction in the fetch or memory access stage until the cache can complete the read or write.

 

Page fault: when the referenced storage location is actually in disk storage, the hardware raises a page-fault exception. Like other exceptions, this causes the processor to invoke the operating system's exception handler, which then initiates a transfer from disk to main memory.

By having the hardware invoke an operating system routine, which then returns control to the hardware, the hardware and the system software cooperate in handling page faults.

 

From the processor's point of view, the combination of stalling (to handle short cache misses) and exception handling (to handle long page faults) accounts for all of the unpredictability that the memory hierarchy introduces into memory accesses.

 

 

 

 






