A rare article worth sharing: exploring the CPU pipeline

Source: Internet
Author: User

A Journey Through the CPU Pipeline

Translation: @deuso_ict

As programmers, the CPU sits at the core of everything we do, so it can only help us to understand how the processor actually works.

How does the CPU work? How long does it take to execute one instruction? What does it mean when we hear that a new processor has a 12-stage, 18-stage, or even 31-stage pipeline?

Applications generally treat the CPU as a black box: instructions enter the CPU in order and come back out in order after execution, and what happens inside is usually unknown.

For us programmers, especially those who do performance optimization, it pays to learn the details of what goes on inside the CPU. After all, if you do not know its internal structure, how can you optimize for it?

This article focuses on the working principles of the x86 processor pipeline.

What preparation do you need?

First, to follow this article you need to know programming, ideally with a little assembly language. If you do not know what an instruction pointer is, this article may be difficult for you. You need to know what registers, instructions, and caches are; if you don't, go find out as soon as possible.

Second, how a CPU works is a huge and complex topic, and this article is only a quick glance; it is hard to cover it in detail in a single article. If I have left anything out, please let me know in the comments.

Third, I focus only on Intel processors and their x86 architecture. Of course, there are many processor architectures besides x86. And although AMD has contributed many new features to the x86 architecture, it was Intel that invented the x86 architecture and created the x86 instruction set, and most of its features were introduced by Intel. So, to keep the narrative simple and consistent, I focus only on Intel's processors.

Finally, by the time you read this article it is already outdated. Newer processors have been designed, and some will be released in the coming months. I am glad the technology is developing so fast, and I hope one day all of these techniques will look outdated, replaced by CPUs with even more amazing computing power.

The Basic Processor Pipeline

From a very broad perspective, the x86 processor architecture has not changed much in the past 35 years. Many features have been added, but the initial design (including almost all of the initial instruction set) is essentially preserved intact, even on the latest processors.

The original 8086 processor supported 14 registers, and they are still present in the latest processors. Four are general-purpose registers: AX, BX, CX, and DX. Four are segment registers used to help implement pointers: the code segment (CS), data segment (DS), extra segment (ES), and stack segment (SS). Four are index registers that point to memory addresses: the source index (SI), destination index (DI), base pointer (BP), and stack pointer (SP). One register holds the status flags. And finally, the most important one: the instruction pointer (IP).

The instruction pointer register is a pointer with a special job: it points to the next instruction to be executed.

All x86 processors follow the same basic pattern. First, the next instruction is fetched from the address indicated by the instruction pointer and decoded. After decoding comes the execution stage: some instructions read data from memory or write data to memory, others perform calculations or comparisons. When execution finishes, the instruction retires and the instruction pointer advances to the next instruction.

This fetch, decode, execute, retire sequence is the basic pattern of x86 instruction execution, and from the first 8086 to the latest Core i7 it has essentially been followed unchanged. Newer processors add more pipeline stages, but the basic pattern remains.
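
To make the pattern concrete, here is a minimal sketch, in C, of a toy fetch-decode-execute-retire loop. Everything in it (the three-opcode instruction set, the Inst struct, the register file) is invented for illustration; real x86 hardware is of course not organized this way.

#include <stdint.h>
#include <stdio.h>

/* A toy three-instruction ISA; the encoding is invented for this sketch. */
enum { OP_LOAD_IMM, OP_ADD, OP_HALT };

typedef struct { uint8_t op, dst, src, imm; } Inst;

int main(void) {
    Inst program[] = {            /* a tiny hand-assembled program */
        { OP_LOAD_IMM, 0, 0, 5 }, /* r0 = 5                        */
        { OP_LOAD_IMM, 1, 0, 7 }, /* r1 = 7                        */
        { OP_ADD,      0, 1, 0 }, /* r0 = r0 + r1                  */
        { OP_HALT,     0, 0, 0 },
    };
    int reg[4] = { 0 };
    int ip = 0;                            /* the instruction pointer */

    for (;;) {
        Inst inst = program[ip];           /* fetch at the IP         */
        switch (inst.op) {                 /* decode the opcode       */
        case OP_LOAD_IMM:                  /* execute...              */
            reg[inst.dst] = inst.imm;
            break;
        case OP_ADD:
            reg[inst.dst] += reg[inst.src];
            break;
        case OP_HALT:
            printf("r0 = %d\n", reg[0]);   /* prints r0 = 12          */
            return 0;
        }
        ip++;                              /* retire: advance the IP  */
    }
}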

What has changed over the past 35 years?

By today's standards the initial processor design was very simple. The original 8086's execution loop can be described as: fetch the instruction at the current instruction pointer, decode it, execute it, retire it, and then fetch the next instruction the instruction pointer points to.

Newer processors added new features, some new instructions, and some new registers. I will focus on the changes relevant to the subject of this article: those that affect the flow of instruction execution through the CPU. Other changes, such as virtual memory or parallel processing, are meaningful and interesting, but outside the scope of this topic.

An instruction cache was added to the processor in 1982. With an instruction cache the processor can read several instructions from memory at a time and keep them on chip, instead of fetching each instruction from memory individually. The instruction cache was only a few bytes, holding just a handful of instructions, but it greatly improved efficiency by eliminating a round trip between memory and processor for every single instruction.

In 1985 the 386 introduced a data cache and expanded the instruction cache design. Data accesses read more data into the data cache at a time, improving performance, and the data and instruction caches grew from a few bytes to several thousand bytes.

The 80486 processor, launched in 1989, introduced a five-stage pipeline. Now there is no longer at most one instruction inside the CPU: each pipeline stage works on a different instruction at the same time, which gave the 486 more than double the performance of a 386 at the same clock frequency. In the fetch stage of the five-stage pipeline, an instruction is pulled from the instruction cache (the 486's instruction cache is 8 KB). The second stage, decode, translates the instruction into specific internal operations. The third stage, translate, computes memory addresses and offsets. The fourth stage, execute, is where the instruction actually does its work, and in the fifth stage, write back, the result is written to a register or to memory. Because the processor works on several instructions at once, programs run much faster.

In 1993 Intel launched the Pentium processor. Because of legal issues Intel could not continue its original numbering, so "Pentium" replaced "586" as the name of the new processor. The Pentium modified the pipeline further than the 486: it added a second, independent superscalar pipeline. The main pipeline works like the 486's, while the second pipeline runs some simpler instructions, such as integer arithmetic, in parallel, completing them more quickly.

In 1995 Intel launched the Pentium Pro, a completely different design from its predecessors. It used many new features to improve performance, including out-of-order (OOO) execution and speculative execution. The pipeline was extended to 12 stages, and it introduced the idea of a "superpipeline," in which many instructions can be in flight at the same time. We will look at the out-of-order execution engine in detail later.

Between 1995 and 2002 the out-of-order engine was substantially improved several times. More registers were added to the processor. Single instruction, multiple data (SIMD) instructions were introduced, letting one instruction operate on several pieces of data at once. Existing caches grew larger and new caches were introduced. Some pipeline stages were split into several and others were merged, to better fit practical workloads. These changes mattered a great deal for overall performance, but none of them fundamentally changed how data flows through the processor.

The Pentium 4, launched in 2002, introduced hyper-threading. The out-of-order execution engine could execute instructions faster than the front of the processor could supply them, so for most applications the execution units sat idle much of the time and were not fully used even under heavy load. To keep the instruction stream flowing into the out-of-order engine, Intel added a second set of front-end components (in processor terms, the front end is the part that fetches, decodes, and register-renames instructions; after the front end, instructions wait to be dispatched into the out-of-order engine). Although there is really only one out-of-order execution engine, the operating system sees two processors. The front end contains two complete sets of x86 registers, and two instruction decoders work from two separate instruction pointers. All instructions are executed by the shared out-of-order engine, but the applications are none the wiser. When out-of-order execution completes, the results retire from the pipeline as before and are handed back to the appropriate virtual processor.

In 2006 Intel released the Core microarchitecture, branded "Core 2" for marketing reasons. Surprisingly, clock frequencies went down rather than up, and hyper-threading was removed. With a lower clock frequency, each pipeline stage can do more work. The out-of-order engine was also widened and extended, the various caches and queues grew correspondingly larger, and the processor was redesigned around dual-core and quad-core shared-cache structures.

In 2008 Intel began naming its processors Core i3, i5, and i7. These processors reintroduced hyper-threading. The main difference between the three series is the size of the internal caches.

Future processors: Intel's next-generation microarchitecture is called Haswell, said to be released on March 13, 2013. What is known so far is that it will have a 14-stage pipeline feeding an out-of-order execution engine, so it still follows the basic design ideas introduced with the Pentium Pro.

So what is a pipeline? What is an out-of-order execution engine? And how do they improve processor performance?

The CPU Instruction Pipeline

Based on the description so far, the process of instructions entering the pipeline, moving through it, and coming out the other end feels intuitive to us programmers.

The 80486 has a five-stage pipeline: fetch, main decode (D1), translate (D2, address computation), execute (EX), and write back (WB). One instruction can occupy each stage of the pipeline.

[Figure: instructions flowing through the 486's five-stage pipeline]

However, such a pipeline has an obvious weakness. Consider the following code, whose job is to exchange the contents of two variables:

XOR a, b
XOR b, a
XOR a, b

From the 8086 through the 386 there was no pipeline; the processor executed only one instruction at a time, so on those architectures this code runs without any problem.

But the 80486 was the first x86 processor with a pipeline. What happens when it executes this code? Watching many instructions run through a pipeline at once can be confusing, so you may want to refer back to the figure above.

In the first step, the first instruction enters the fetch stage. In the second step the first instruction moves to decode and the second instruction enters fetch. In the third step the first instruction moves to the translate stage, the second to decode, and the third enters fetch. In the fourth step, though, there is a problem: the first instruction enters the execute stage, but the other instructions cannot move forward. The second XOR instruction needs the value of a computed by the first XOR, but that value is not written back until the first instruction finishes executing. So the other instructions in the pipeline wait until the first instruction has passed through execute and write back; the second instruction must wait for the first to finish before it can advance, and the third must likewise wait for the second.

This phenomenon is called a pipeline stall, or a pipeline bubble.
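
You can get a feel for dependency stalls from ordinary C. In the sketch below (results vary with compiler and CPU, and an optimizer may transform the loops, so compile with low optimization and check the assembly if the numbers look odd), the first loop is one long dependency chain, each addition waiting for the one before it, while the second does the same additions split across two independent accumulators that the hardware can overlap.

#include <stdio.h>
#include <time.h>

#define N 200000000LL

int main(void) {
    long long a = 0, b = 0;
    clock_t t0;

    t0 = clock();
    for (long long i = 0; i < N; i++)
        a += i;                 /* every add depends on the one before */
    printf("one chain:  %.2fs (a=%lld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, a);

    t0 = clock();
    for (long long i = 0; i < N; i += 2) {
        a += i;                 /* two independent chains; the pipeline */
        b += i + 1;             /* can keep both running at once        */
    }
    printf("two chains: %.2fs (a+b=%lld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, a + b);
    return 0;
}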

Another pipeline problem is that some instructions execute quickly while others execute slowly. This problem became even more visible in the Pentium's dual-pipeline architecture.

The Pentium Pro has a 12-stage pipeline. When that number was first announced, every programmer who understood how pipelines work drew a sharp breath: if Intel had simply designed a deeper pipeline along the old lines, stalls and slow instructions would have crippled execution speed. But at the same time, Intel announced a completely different pipeline design, which it called the out-of-order core. It is hard to grasp the benefit of the change from the description alone, but Intel was convinced the improvements were exciting.

Let's take a deeper look at this out-of-order core!

The Out-of-Order Execution Pipeline

When describing the out-of-order pipeline, a picture is worth a thousand words, so this section follows the diagrams.

[Figure: CPU pipelines, from the 486 and Pentium to the out-of-order core]

The 80486 has a five-stage pipeline. This design was typical of real-world processors of its era, and it is efficient.

The Pentium's pipeline improves on the 486's: its two pipelines run in parallel, and each can hold instructions at different stages. At the same clock it can execute almost twice as many instructions as a 486.

But fast instructions waiting on slow ones remains a problem, even with parallel pipelines. The pipelines are still linear, and that leaves the processor facing a performance bottleneck it cannot climb over.

The out-of-order execution engine is a radical departure from the linear paths of earlier designs. It adds some complexity and introduces nonlinear paths.

The first change is in how instructions are fetched from memory into the processor's instruction cache. Modern processors can detect an imminent large branch (such as a function call) and load the instructions at the jump destination into the instruction cache ahead of time.

The decode stage was modified slightly. Instead of decoding only the single instruction at the instruction pointer, the Pentium Pro can decode up to three instructions per cycle, and today's processors (2008-2013) can decode up to four per clock cycle. Decoding produces many small fragments of work called micro-ops.

The next stage (or stages) is micro-op translation, followed by register aliasing, also called renaming. Many operations execute at the same time and out of order, so one instruction might read a register while another is writing it. Inside the processor, the original registers (AX, BX, CX, DX, and so on) are therefore translated (renamed) onto internal registers invisible to the programmer. Registers and memory addresses are mapped to temporary locations for each micro-op's use. Four micro-ops can be translated per cycle.
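
Renaming is easy to model in miniature. The toy table below (all names invented for this sketch) gives every write to an architectural register a fresh physical register, so a later write to AX can no longer clobber the value an older, still-in-flight instruction wants to read.

#include <stdio.h>

#define NUM_ARCH 4                      /* ax, bx, cx, dx             */

static int rename_table[NUM_ARCH];      /* arch reg -> physical reg   */
static int next_phys = 0;

/* A write is assigned a brand-new physical register (a real CPU
 * recycles them once the old value can no longer be needed).         */
static int rename_write(int arch) {
    rename_table[arch] = next_phys++;
    return rename_table[arch];
}

/* A read uses whichever physical register currently holds the value. */
static int rename_read(int arch) {
    return rename_table[arch];
}

int main(void) {
    for (int i = 0; i < NUM_ARCH; i++)
        rename_table[i] = next_phys++;  /* initial mapping: p0..p3    */

    /* mov ax,1 ; mov bx,ax ; mov ax,2 -- after renaming, the second
     * write to ax lands in a different physical register, so it no
     * longer conflicts with the pending read of the first value.     */
    printf("mov ax,1  writes p%d\n", rename_write(0));
    printf("mov bx,ax reads  p%d\n", rename_read(0));
    printf("mov ax,2  writes p%d\n", rename_write(0));
    return 0;
}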

Once micro-op translation is complete, the micro-ops enter a reorder buffer (ROB), which can hold up to 128 of them. On a hyper-threaded processor the ROB also interleaves micro-ops from the two virtual processors; in the ROB, the two virtual processors funnel their micro-ops toward the one shared out-of-order execution engine.

Micro-ops that are ready to execute are placed in the reservation station (RS), which can hold up to 36 micro-ops at a time.

Now the magic of the out-of-order engine begins. Different micro-ops execute simultaneously on different execution units, and each unit runs at full speed. As long as a micro-op's input data is ready and an appropriate execution unit is free, the micro-op can execute immediately, sometimes even jumping ahead of earlier micro-ops that are not yet ready. This way, a long-running operation does not block the operations behind it, and the cost of pipeline stalls shrinks dramatically.
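
The scheduling rule itself, "issue anything whose inputs are ready," is simple enough to simulate. Here is a toy model (structures and latencies invented for the sketch, with every micro-op taking one cycle) of three micro-ops in a reservation station; watch the independent multiply issue before the add that is stuck waiting on the load.

#include <stdio.h>

typedef struct {
    const char *name;
    int dep;          /* index of the micro-op it waits on, -1 = none */
    int done_cycle;   /* 0 = not executed yet                         */
} MicroOp;

int main(void) {
    MicroOp rs[] = {                     /* program order:            */
        { "load r1 <- [mem]",  -1, 0 },  /* 1: load                   */
        { "add  r2 <- r1+r3",   0, 0 },  /* 2: needs the load result  */
        { "mul  r4 <- r5*r6",  -1, 0 },  /* 3: depends on nothing     */
    };
    int n = 3, remaining = n;

    for (int cycle = 1; remaining > 0; cycle++) {
        for (int i = 0; i < n; i++) {
            if (rs[i].done_cycle)
                continue;                /* already executed          */
            int d = rs[i].dep;
            if (d >= 0 && !(rs[d].done_cycle && rs[d].done_cycle < cycle))
                continue;                /* operand not ready yet     */
            rs[i].done_cycle = cycle;    /* issue and execute         */
            remaining--;
            printf("cycle %d: issue %s\n", cycle, rs[i].name);
        }
    }
    /* Prints: cycle 1 issues the load and the mul (out of order),
     * cycle 2 issues the add once the load's result is available.    */
    return 0;
}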

The Pentium Pro's out-of-order engine has six execution units: two integer units, one floating-point unit, one load unit, one store-address unit, and one store-data unit. The two integer units differ: one handles complex integer operations, while the other can handle two simple operations at once. Under ideal conditions, the Pentium Pro's out-of-order engine can execute seven micro-ops in one clock cycle.

Today's out-of-order engines still have six execution units. The load, store-address, and store-data units are unchanged; the other three have evolved. Each of these three can execute basic arithmetic operations as well as more complex micro-ops, but each specializes in a different kind of micro-op, so together they work more efficiently. Under ideal conditions, today's out-of-order engines can execute eleven micro-ops in one clock cycle.

Eventually every micro-op is executed, and after a few more pipeline stages it exits the pipeline and retires. At that point the instruction is complete and the instruction pointer advances. From the programmer's point of view, the instruction simply entered the CPU at one end and came out the other, just as it did on the old 8086.

If you have read carefully, you will have noticed an important problem glossed over above: what happens when an instruction jumps? For example, what happens at an if or a switch? On older processors this meant flushing the pipeline and waiting while the instructions at the new jump destination were fetched.

With more than a hundred instructions queued inside the CPU, the performance cost of such a stall is severe: every instruction has to wait while the jump target is fetched and the pipeline restarts. The out-of-order engine must cancel all the micro-ops issued after the jump instruction, discard their results, return to the pre-execution state, and begin executing from the new address. This is hard work for the processor, it happens very frequently, and its impact on performance is large. Which brings us to another important capability of the out-of-order engine.

The answer is speculative execution. Speculative execution means that when a branch instruction is encountered, the out-of-order engine executes instructions from both directions of the branch. Once the branch's actual direction is determined, all the instructions from the wrong direction are discarded. By executing both jump directions at once, the stall caused by a branch jump is avoided. Processor designers also invented the branch prediction cache, which improves performance further when there are many branches. CPU stalls still happen, but these techniques reduce their frequency to an acceptable level.
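
The payoff of good prediction, and the cost of bad prediction, shows up in ordinary code. In this well-known experiment (a sketch; the numbers vary by CPU, and a compiler may turn the branch into a branch-free conditional move, which hides the effect), the same loop sums the large elements of the same array twice. With random data the branch mispredicts roughly half the time; after sorting, the branch goes the same way for long stretches and the loop typically runs several times faster.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)
#define REPS 100

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static long long sum_big(const int *v) {
    long long s = 0;
    for (int i = 0; i < N; i++)
        if (v[i] >= 128)            /* the branch under test          */
            s += v[i];
    return s;
}

int main(void) {
    static int v[N];
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;        /* unpredictable branch pattern   */

    clock_t t0 = clock();
    long long s = 0;
    for (int r = 0; r < REPS; r++) s += sum_big(v);
    printf("random: %.2fs (sum %lld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);

    qsort(v, N, sizeof v[0], cmp_int);  /* same data, predictable now */
    t0 = clock();
    s = 0;
    for (int r = 0; r < REPS; r++) s += sum_big(v);
    printf("sorted: %.2fs (sum %lld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);
    return 0;
}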

Finally, a processor with hyper-threading exposes two virtual processors that share one out-of-order execution engine. They share a reorder buffer and the out-of-order engine while appearing to the operating system as two separate processors, like this:

[Figure: two virtual processors sharing one out-of-order execution engine]

A hyper-threaded processor has two virtual processors, which feed more work to the out-of-order engine. Hyper-threading improves performance for typical applications, but some compute-intensive workloads quickly saturate the out-of-order engine on their own, and in those cases hyper-threading slightly reduces performance. Those cases are the minority, though; for day-to-day applications, hyper-threading usually provides roughly double the throughput.

Example

This may all seem a bit bewildering, so let's make it concrete with an example.

From the application's perspective, we are still running on the same instruction pipeline as the old 8086. The processor is a black box: it processes the instruction the instruction pointer points to, and when it is done, the results are there in memory.

From the instruction's own perspective, however, the journey is quite an adventure. Here is the inside story of one instruction on a current (circa 2008-2013) processor.

First: you are an instruction, and your program is running.

You wait patiently for the instruction pointer to reach you so the CPU will run you. When the instruction pointer is still 4 KB away, about 1,500 instructions, the CPU pulls you from memory into the instruction cache. Loading from memory into the instruction cache takes a while, but you are still far from your execution time, so there is plenty of time. This prefetch is the first stage of the pipeline.

The instruction pointer gets closer and closer, and when it is about 24 instructions away, you and the five instructions next to you are placed into the instruction queue.

This processor has four decoders, which together can handle one complex instruction and up to three simple ones per cycle. You happen to be a complex instruction, and decoding translates you into four micro-ops.

Decoding itself has several steps. One of them checks which data you will need and guesses whether you are likely to cause an address jump. The moment the decoder spots extra data you will require, it starts loading that data from memory into the data cache, without you ever knowing.

Your four micro-ops arrive at the register rename table. You tell it which memory address you need to read (FS:[eax+18h], say), and the rename table converts the registers you use into temporary internal ones for the micro-ops. Then your micro-ops enter the reorder buffer (ROB), where their program order is recorded, and immediately move on into the reservation station (RS).

The reservation station holds micro-ops that are ready to execute. Your third micro-op is picked up immediately and sent to port 5, the port that performs the operation directly. You have no idea why it was chosen first, but executed it is. A few clock cycles later, your first micro-op heads to port 2, the load unit (the load-address execution unit). The remaining micro-ops wait while other micro-ops pile up at each port; everyone is waiting for port 2 to load the data from cache and memory into a temporary slot.

They waited for a long time ...

A very long time ...

But while they waited for the first micro-op's data to come back, plenty of new instructions arrived. Fortunately the processor knows how to run them out of order: whichever micro-ops in the reservation station are ready execute first.

When the first micro-op's data arrives, the remaining two micro-ops are immediately sent to execution ports 0 and 1. Now all four micro-ops have run, and their results return to the reservation station.

As the micro-ops return, they hand in their "tickets," each reporting its temporary address. Using those addresses, the micro-ops are merged back together into you, one complete instruction. Finally, the CPU delivers your result and sends you off to retire.

At the door marked "retirement" you find a queue. When you join it, you discover you are standing right behind the very instructions you arrived with. Even though the execution order may have differed, the retirement order is preserved. Apparently the out-of-order engine really does know what it has done.

One by one, each instruction finally leaves the CPU, in exactly the order the instruction pointer pointed to them!

Conclusion

I hope this article has given readers a look behind the curtain of how a processor works. It is not magic.

Let's go back to the original questions. Now we should be able to give some good answers.

How does the processor work internally? In this complex process, instructions are first broken into smaller micro-ops, the micro-ops execute out of order as fast as possible, and then they retire in the original order. So from the outside, all instructions appear to execute in order. But we now know that inside, the processor handles instructions out of order and sometimes even runs branches speculatively.

How long does it take to run one instruction? On a processor without pipelining this had a simple answer, but on a modern processor the execution time of an instruction depends on what surrounds it and on the size and contents of the caches. An instruction has a minimum transit time through the processor, but that time can only loosely be called constant. A good programmer and compiler can keep many instructions running at once, so the time attributable to each instruction approaches zero. That near-zero figure does not mean an instruction's total latency is short; on the contrary, traversing the whole out-of-order engine and waiting on memory reads and writes takes considerable time.

What does it mean for a new processor to have a 12-, 18-, or even 31-stage pipeline? It means more instructions can be inside the factory at the same time; a very deep pipeline can have hundreds of instructions in flight. When everything goes well, the out-of-order engine keeps humming and achieves astonishing throughput. Unfortunately, a deep pipeline also means that a pipeline stall goes from being a tolerable cost to a performance nightmare, because hundreds of in-flight instructions must all wait for the pipeline to get moving again.

How can I use this knowledge to optimize my programs? Fortunately, the CPU handles the common cases well, and compilers have been optimizing for out-of-order processors for nearly twenty years. The CPU performs best when instructions and data flow in order, without annoying jumps. So, first: use simple code. Simple, direct code helps the compiler's optimizer recognize and optimize it. Avoid jumps where you can, and when you must jump, try to jump the same direction every time. Clever constructs such as dynamic jump tables look cool and can do powerful things, but neither the processor nor the compiler predicts them well, so complicated code tends to cause stalls and mispredictions that badly hurt performance.

Second, use simple data structures. Keeping data in order, adjacent, and contiguous prevents data stalls; choosing the right data structure and data layout can yield large performance gains, as the sketch after this paragraph shows. As long as you keep both code and data structures as simple as possible, the rest can safely be left to the compiler's optimizer.
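
The "keep data adjacent and contiguous" advice can be checked with a sketch like this one (sizes and layout are arbitrary choices for the demo). Both loops touch exactly the same elements; the first walks memory in order, which the caches and prefetcher reward, while the second visits the elements in a shuffled order and on most machines runs many times slower.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)            /* 16M ints, far larger than the caches */

int main(void) {
    int *data  = malloc(N * sizeof *data);
    int *order = malloc(N * sizeof *order);
    if (!data || !order) return 1;

    for (int i = 0; i < N; i++) { data[i] = i & 0xFF; order[i] = i; }
    for (int i = N - 1; i > 0; i--) {   /* Fisher-Yates shuffle of the */
        int j = rand() % (i + 1);       /* visiting order              */
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }

    long long s = 0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        s += data[i];                   /* sequential, cache-friendly  */
    printf("sequential: %.2fs (sum %lld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);

    s = 0;
    t0 = clock();
    for (int i = 0; i < N; i++)
        s += data[order[i]];            /* same work, scattered access */
    printf("scattered:  %.2fs (sum %lld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);

    free(data); free(order);
    return 0;
}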

Thank you for joining me on this trip!
