7.2 Optimization of pipeline

Last Update:2018-10-02 Source: Internet

Author: User

Computer Organization, 7 Pipelined Processors, 7.2 Pipeline Optimization

Compared with a single-cycle processor, pipelining improves performance, but if we slice the pipeline only along the steps an instruction naturally goes through, we do not take full advantage of the technique. So how can we tap more of the potential of pipelining? That is the question for this section.

We will again use the kitchen example to analyze the pipeline. Suppose we divide the cooking work into four steps, each taking one minute. Without a pipeline, one dish takes four minutes; with a pipeline it takes slightly more than four minutes, because the handover between stages costs a little extra time. In a pipelined processor, this corresponds to the delay of the pipeline registers between stages. For a single dish, then, the two approaches take roughly the same time. With many dishes, however, the advantage of the pipeline gradually shows: once the pipeline is full, a dish is finished every minute, whereas the non-pipelined kitchen finishes a dish only every four minutes. So if we divide the pipeline into four stages, we can raise performance to about four times the original.
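The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are made up for this sketch, and the small handover cost between stages is ignored.

```python
def sequential_time(dishes: int, stages: int, stage_minutes: int = 1) -> int:
    """Non-pipelined kitchen: every step of a dish finishes before the next dish starts."""
    return dishes * stages * stage_minutes


def pipelined_time(dishes: int, stages: int, stage_minutes: int = 1) -> int:
    """Pipelined kitchen: the first dish needs `stages` ticks to pass through,
    and after that one dish is finished per tick (handover cost ignored)."""
    if dishes == 0:
        return 0
    return (stages + dishes - 1) * stage_minutes


for dishes in (1, 4, 100):
    seq = sequential_time(dishes, stages=4)
    pipe = pipelined_time(dishes, stages=4)
    print(f"{dishes:>3} dishes: {seq} min sequential, {pipe} min pipelined, "
          f"speedup {seq / pipe:.2f}x")
```

For one dish the two times are equal; as the number of dishes grows, the speedup approaches the number of stages, here 4.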

That situation is idealized, however. In practice it is hard to make every stage take exactly the same time. Suppose cutting is complicated and takes 2 minutes, while every other step still takes 1 minute. How often should our trumpet player blow the horn? Clearly, every two minutes, because the clock period of the pipeline is set by its longest stage. If he still blew every minute, the next dish would be washed and handed to the cutting stage before the previous dish had been cut, which obviously cannot work. So the horn can sound only every 2 minutes, which means only the cutting stage is fully busy; every other stage works for one minute and then rests for one minute. As the owner of this restaurant, I obviously cannot accept that.

This is the problem of pipeline balance: a pipeline whose stages take unequal time is called an unbalanced pipeline. This pipeline still produces a dish every two minutes, but the performance gain is much smaller. And for a single dish, the non-pipelined way takes only 5 minutes, while the pipelined way takes more than 8 minutes (four stages at 2 minutes each, plus handover), which is much slower. For a pipelined processor, then, an unbalanced pipeline hurts both overall instruction throughput and the execution time of a single instruction. When dividing a pipeline, we should therefore make the time spent in each stage as equal as possible, even if the name of a stage then no longer exactly matches the work it does. For example, we could ask the washing stage to spend a little more time tearing up the vegetable leaves first, so that cutting takes less time. That is just one way; for this example we can consider another approach.
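A minimal sketch of the unbalanced kitchen follows. The stage names "cook" and "serve" are placeholders invented for illustration (the text only names washing and cutting), and handover time is again ignored.

```python
# Stage durations in minutes. "wash" and "cut" come from the example;
# "cook" and "serve" are hypothetical names for the other two steps.
STAGE_MINUTES = {"wash": 1, "cut": 2, "cook": 1, "serve": 1}


def analyze(stage_minutes: dict[str, int]) -> dict:
    """Summarize an in-order pipeline whose clock is set by its slowest stage."""
    period = max(stage_minutes.values())
    return {
        "period": period,                                    # minutes per horn blast
        "pipelined_latency": period * len(stage_minutes),    # one dish, pipelined
        "sequential_latency": sum(stage_minutes.values()),   # one dish, non-pipelined
        # Fraction of each period in which a stage is actually working.
        "utilization": {name: t / period for name, t in stage_minutes.items()},
    }


report = analyze(STAGE_MINUTES)
print(report)
```

Under these assumptions the period is 2 minutes, a single dish takes 8 minutes in the pipeline versus 5 without it, and every stage except cutting is busy only half of the time.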

The other approach is to divide the cutting stage into two steps: since it takes 2 minutes, we split it into two one-minute stages. Now every stage takes the same time, so we can set the period to 1 minute, and every minute each stage hands its result to the next. Note that such a split should not add new hardware resources; it should divide the existing hardware into two parts. Take cutting again. Suppose we are making shredded potato: the cutting stage first peels with a paring knife and then slices with a cleaver, 2 minutes in total. If peeling takes about a minute and cutting into strips takes another minute, we split the stage in two, putting the paring knife in the first step and the cleaver in the second. We do not need to buy new tools; we just separate the ones we already have. The adjusted pipeline is now a balanced pipeline.

In this pipeline, the time for a single dish drops back to about 5 minutes, roughly the same as the non-pipelined way. More importantly, in continuous operation it finishes a dish every minute, whereas the non-pipelined kitchen finishes one only every 5 minutes. So with this pipeline, performance can reach 5 times the original. A 4-stage pipeline gave us 4 times the original performance, and a 5-stage pipeline gives 5 times. Following this line of thought, if we keep subdividing, can we get even higher performance? Put simply, yes. This technique is called "superpipelining".
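The effect of splitting the slow stage can be checked with a short sketch. The helper names are hypothetical, the split is assumed to divide the work evenly, and handover time is ignored.

```python
def split_stage(stages: list[int], index: int, parts: int) -> list[int]:
    """Replace stages[index] with `parts` equal sub-stages.
    No work is added; the same work is just divided."""
    if stages[index] % parts != 0:
        raise ValueError("this sketch only handles splits into equal whole minutes")
    piece = stages[index] // parts
    return stages[:index] + [piece] * parts + stages[index + 1:]


def steady_state_speedup(stages: list[int]) -> float:
    """Non-pipelined: one dish per sum(stages) minutes.
    Pipelined: one dish per max(stages) minutes once the pipeline is full."""
    return sum(stages) / max(stages)


unbalanced = [1, 2, 1, 1]                              # wash, cut, and two 1-minute steps
balanced = split_stage(unbalanced, index=1, parts=2)   # cut becomes peel + slice

print(unbalanced, steady_state_speedup(unbalanced))
print(balanced, steady_state_speedup(balanced))
```

With the 2-minute stage in place the steady-state speedup is 2.5; after the split, all five stages take 1 minute and the speedup is 5.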

Of course, this technique is not as magical as its name suggests. We take the five-stage pipeline as the basic division; if some of its stages are subdivided further, increasing the depth of the pipeline, the result is called a superpipeline.

A superpipeline can run at a higher clock frequency and thus achieve higher instruction throughput. Take the basic five-stage pipeline, where the combinational logic in each stage has a delay of about 200 ps and the pipeline register has a delay of 50 ps; the clock period of this processor is 250 ps. If we build a 10-stage pipeline by cutting each of the five stages evenly in two, each stage has 100 ps of combinational delay plus 50 ps for the pipeline register, for a clock period of 150 ps. Superpipelining clearly brings a significant performance gain, but is it true that the more stages, the better?
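The clock-period numbers above follow from a one-line formula. In the sketch below, the constant and function names are chosen for illustration, and the combinational logic is assumed to split perfectly evenly across stages, as in the example.

```python
TOTAL_LOGIC_PS = 1000   # 5 stages x 200 ps of combinational logic (from the example)
REGISTER_PS = 50        # delay of one pipeline register (from the example)


def clock_period_ps(depth: int) -> float:
    """Clock period when the logic is divided evenly into `depth` stages."""
    return TOTAL_LOGIC_PS / depth + REGISTER_PS


def frequency_ghz(depth: int) -> float:
    """Clock frequency in GHz; a 1000 ps period corresponds to 1 GHz."""
    return 1000 / clock_period_ps(depth)


for depth in (5, 10):
    print(f"{depth:>2} stages: period {clock_period_ps(depth):.0f} ps, "
          f"frequency {frequency_ghz(depth):.2f} GHz")
```

Going from 5 to 10 stages shortens the period from 250 ps to 150 ps, so the clock frequency, and with it the ideal throughput, rises by a factor of about 1.67 rather than 2, because the register delay does not shrink.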

Let us analyze this more closely. In the five-stage pipeline, the latency of a single instruction is 1250 ps (5 x 250 ps); in the 10-stage pipeline it becomes 1500 ps (10 x 150 ps). So by subdividing the pipeline we raise the clock frequency and therefore the instruction throughput, but the execution time of a single instruction actually gets longer. The reason is that we have added pipeline registers. In the five-stage pipeline, the pipeline register accounts for about 20% of each clock period. In the 10-stage pipeline, the combinational delay of each stage is halved but the pipeline register delay does not change, so its share rises to about 33%. The more stages we divide the pipeline into, the larger the share of pipeline register delay, and the longer the latency of a single instruction. Moreover, as the number of stages grows, more instructions are needed to fill the pipeline, and the relationships among the instructions in flight at the same time become more complex, which brings further negative effects. We will analyze these effects in depth later. The conclusion is already clear, though: more pipeline stages are not always better.
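To see how these two costs grow with depth, here is a small sweep. It reuses the example's figures (1000 ps of total combinational logic, 50 ps per pipeline register) and assumes, hypothetically, that the logic can be split evenly at any depth; the function names are illustrative.

```python
LOGIC_PS = 1000   # total combinational delay of one instruction (5 x 200 ps)
REG_PS = 50       # delay of one pipeline register


def latency_ps(depth: int) -> int:
    """Single-instruction latency: depth x (LOGIC_PS / depth + REG_PS),
    which simplifies to LOGIC_PS + depth x REG_PS."""
    return LOGIC_PS + depth * REG_PS


def register_share(depth: int) -> float:
    """Fraction of each clock period spent in the pipeline register."""
    return REG_PS / (LOGIC_PS / depth + REG_PS)


for depth in (5, 10, 20, 40):
    print(f"{depth:>2} stages: latency {latency_ps(depth)} ps, "
          f"register share {register_share(depth):.0%}")
```

Latency grows linearly with depth, and the register share climbs toward 100%, so each further split buys a smaller reduction in the clock period.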

Now let us look at real processors. The 1993 Pentium used a five-stage pipeline. Early pipelined MIPS processors also had five stages, and later ones eight. Because MIPS was designed from the start with pipelined implementation in mind, it could easily gain performance from pipelining and superpipelining. The x86 instruction set was not designed with this in mind, and because it is more complex than MIPS, its pipeline is hard to subdivide. As a result, from the 1980s to the early 1990s, the performance of RISC processors, represented by MIPS, improved rapidly and posed a huge threat to x86 processors. But Intel, in a very difficult position, found a way to respond.

That approach was implemented starting with the 1995 Pentium Pro. The Pentium Pro is a 12-stage pipelined processor, or 14 stages by some ways of counting. Its core idea is that, inside the processor, hardware breaks complex x86 instructions into simple RISC-like instructions. This lets the processor use advanced RISC techniques while remaining compatible with software already written for the x86 instruction set. After several rounds of competition, x86, the representative of CISC, defeated the aggressive RISC processor vendors, although it did so using RISC techniques.

ARM, today's representative of RISC, did not have much influence at the time, at least in the computer field. The 1997 ARM9 used a five-stage pipeline, divided in basically the same way as MIPS. In 2002, ARM introduced the ARM11 with an 8-stage pipeline, which was widely used in embedded systems thanks to its low power consumption. Its limited performance, however, made it hard to enter the personal computer market.

At this stage, having defeated many RISC vendors, Intel began competing with AMD and other vendors within the x86 architecture. Pipeline depths grew steadily, from around 10 stages to more than 20, peaking with the 2004 Pentium 4 at 31 stages. We have just seen that more stages are not always better and that too deep a pipeline reduces performance. Intel obviously knew this, so why did it go that way? One important reason is that a deeper pipeline brings one very visible change: a higher clock frequency. We now know that high frequency does not mean good performance, but many ordinary consumers did not, and in those years Intel, AMD, and other vendors advertised processor frequency as the main performance indicator. So consumers buying a computer asked only, "What is your computer's frequency?" One vendor shipped 2 GHz; a few days later a competitor shipped 2.5 GHz; then the first matched it at 2.5 GHz, and soon someone shipped 3 GHz. The higher the frequency, the more buyers. This was the famous CPU frequency war of those years, and it drove the pipelines of mainstream processors deeper and deeper. Such a race cannot go on forever, though; once the frequency war could not continue, pipeline depth gradually came back down. For example, the 2006 Core 2 has 14 stages and the 2008 Core i7 has 16. The 2013 Core i7 also has 14 stages under normal operating conditions. That is the main picture for desktop processors.

As Steve Jobs launched the iPhone and iPad and ignited another market, let us look at pipeline depth there. The 2009 Cortex-A8, from the generation after the ARM11, has a 13-stage pipeline. The 2010 Cortex-A9 has an 11-stage pipeline, so the depth fell rather than rose. The 2011 Cortex-A15 has a 15-stage pipeline, and the 2013 Cortex-A57 kept that number. The A9, A15, and A57, along with other processors based on their architecture, are widely used in mid-range and high-end smartphones and tablets. We can see from this that the pipeline depth of today's mainstream processors stays at roughly 15 stages.

We have now seen that increasing pipeline depth effectively raises the clock frequency and thus instruction throughput. But the approach has many limitations, and today it is hard to gain performance by deepening the pipeline further. To keep tapping the potential of pipelined processors, we have to look for other optimizations.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
