An Analysis of the Microprocessor Trace Cache

Source: Internet
Author: User
Tags: prefetch

The problem is as follows: as the processor pipeline grows longer and clock speeds increase, the performance loss caused by branches becomes more and more noticeable. According to statistics, branch instructions account for about 10% of all instructions statically and about 15% dynamically; in other words, a conditional branch appears roughly once every 6 to 7 instructions. Suppose the pipeline is 25 stages deep: every time a conditional branch goes the wrong way, the entire pipeline has to be flushed, and that performance loss is intolerable. Branch prediction can reduce the overhead of conditional branches, but it cannot solve the problem completely.
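
To make these figures concrete, here is a trivial back-of-the-envelope calculation in C. The 15% dynamic share and the 25-stage depth are the article's numbers, not measurements.

```c
#include <stdio.h>

int main(void) {
    double branch_fraction = 0.15; /* dynamic share of branch instructions */
    int    pipeline_depth  = 25;   /* pipeline depth from the example */

    printf("one branch roughly every %.1f instructions\n",
           1.0 / branch_fraction);
    printf("cycles lost per full pipeline flush: %d\n", pipeline_depth);
    return 0;
}
```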

First, consider the execution cost of a branch instruction:

exectime = predicttime + failrate * failpenalty

Here exectime is the average time to execute a branch instruction; predicttime is the execution time when the prediction succeeds; failrate is the probability that the prediction fails; and failpenalty is the time needed to restore the pipeline after a misprediction. Lowering failrate relies mainly on improving the accuracy of branch prediction. Hints can be added to the program to tell the compiler which branch direction is more likely, but improving much beyond that is very difficult. The trace cache, however, can reduce both predicttime and failpenalty.
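
As a minimal sketch of this cost model, the C program below plugs in hypothetical numbers: a 1-cycle predicted cost, a 5% misprediction rate, and a 25-cycle penalty (matching the 25-stage pipeline above) are illustrative assumptions, not figures from the article.

```c
#include <stdio.h>

int main(void) {
    double predicttime = 1.0;  /* cycles when the prediction succeeds */
    double failrate    = 0.05; /* assumed misprediction probability */
    double failpenalty = 25.0; /* assumed pipeline-restore cost, cycles */

    double exectime = predicttime + failrate * failpenalty;
    printf("expected branch cost: %.2f cycles\n", exectime);
    return 0;
}
```

With these numbers a branch costs 2.25 cycles on average, more than twice the well-predicted case, which is why both failrate and failpenalty are worth attacking.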

For the x86 instruction set, decoding takes a very long time; a large part of the pipeline's work lies in decoding rather than execution. If this time could be removed or greatly reduced, instruction execution would speed up considerably. With this in mind, consider how real programs actually behave: the processor spends nearly all of its time executing loops. A modern processor can easily retire more than 10 billion instructions in one second; if there were no loops, the program executed in that second would have to be larger than 10 GB. In practice, executable programs are usually 10 KB to 20 MB, which means that, on average, every instruction in the program is executed somewhere between a thousand and a million times. Based on this observation, if the cost of decoding an instruction once can be spread over a thousand or even a million executions, instruction throughput will improve dramatically.
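
The reuse estimate above can be checked with a few lines of C. The assumption of roughly one instruction per byte is mine, added to keep the arithmetic simple; the resulting 500x to 1,000,000x range is in the same ballpark as the article's thousand-to-a-million figure.

```c
#include <stdio.h>

int main(void) {
    double retired_per_sec = 10e9; /* ~10 billion instructions/second */
    double small_program   = 10e3; /* ~10 KB executable, ~1 inst/byte */
    double large_program   = 20e6; /* ~20 MB executable, ~1 inst/byte */

    printf("avg executions per instruction, small program: %.0f\n",
           retired_per_sec / small_program);  /* 1000000 */
    printf("avg executions per instruction, large program: %.0f\n",
           retired_per_sec / large_program);  /* 500 */
    return 0;
}
```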

The trace cache is built on this idea. Once the instructions in a window have been decoded, the result is saved in a buffer; the next time those instructions are needed, they can be read directly from the buffer without decoding again. This effectively reduces the overhead of branch misprediction, for two reasons. First, failpenalty shrinks: the processor pipeline can be divided into a fetch-and-decode part and an execution part, and when a branch prediction fails, only the execution part of the pipeline has to be cleared, not the decode part. Second, predicttime shrinks: when a branch prediction succeeds, the micro-operations (uops) produced by decoding can be loaded directly from the trace cache, which eliminates the time the conditional instruction would otherwise spend in the decode stages. Given that conditional instructions make up about 15% of all instructions, this improvement is quite impressive.
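
The following toy C model shows the mechanism just described: decoded micro-operations are cached by fetch address, so the expensive decode step runs only on a miss. The direct-mapped layout, the four-uop line size, and all names are illustrative assumptions, not the organization of any real trace cache.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TC_LINES      256
#define UOPS_PER_LINE 4

typedef struct {
    bool     valid;
    uint32_t tag;                 /* fetch address of the cached trace */
    uint32_t uops[UOPS_PER_LINE]; /* decoded micro-operations */
} trace_line;

static trace_line tcache[TC_LINES];

/* Stand-in for the slow x86 decode stage. */
static void decode(uint32_t addr, uint32_t uops[UOPS_PER_LINE]) {
    for (int i = 0; i < UOPS_PER_LINE; i++)
        uops[i] = addr + i; /* fake uops, just for the demo */
}

/* Return decoded uops, paying the decode cost only on a miss. */
static const uint32_t *fetch_uops(uint32_t addr, bool *hit) {
    trace_line *line = &tcache[(addr / 16) % TC_LINES];
    *hit = line->valid && line->tag == addr;
    if (!*hit) { /* miss: decode once and fill the line */
        decode(addr, line->uops);
        line->tag   = addr;
        line->valid = true;
    }
    return line->uops;
}

int main(void) {
    /* A loop body re-entered many times: decode happens on the first
     * pass only; every later iteration hits the trace cache. */
    for (int iter = 0; iter < 3; iter++) {
        bool hit;
        fetch_uops(0x1000, &hit);
        printf("iteration %d: %s\n", iter,
               hit ? "hit (no decode)" : "miss (decode)");
    }
    return 0;
}
```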

The trace cache brings other benefits as well:

  • It eliminates the byte alignment required for instruction prefetch. To make full use of 16-byte instruction prefetch, function and loop entry addresses usually have to be aligned on 16-byte boundaries. Because the trace cache delivers micro-operations directly, this alignment is no longer needed (a sketch follows this list).
  • It eliminates decode pairing. Processors generally execute simple and complex instructions on different units, and to keep those units busy in parallel, the compiler has to pair instructions in particular patterns during optimization. Because the trace cache delivers micro-operations directly, pairing is unnecessary.
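
To illustrate the first point, here is a small C sketch of how that alignment is usually requested when a conventional instruction cache feeds the decoder. The function name is hypothetical, and the attribute syntax is GCC/Clang-specific (the command-line equivalent is -falign-functions=16).

```c
/* Without a trace cache, hot entry points are often padded to a
 * 16-byte boundary so that each instruction fetch returns a full
 * prefetch block starting at the entry. */
__attribute__((aligned(16)))
void hot_loop_body(int *data, int n) {
    for (int i = 0; i < n; i++)
        data[i] *= 2; /* placeholder work */
}
```

A trace cache makes this padding unnecessary because it supplies already-decoded micro-operations instead of raw instruction bytes.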
Although the trace cache has many advantages, its biggest drawback is that it occupies a large amount of chip area. As integration technology advances, this problem will gradually ease.

Reference: http://memcache.drivehq.com/memparam/Bench/Other/TraceCache.htm
