The problem is as follows: as processor pipelines grow longer and clock speeds increase, the performance loss caused by branches becomes more and more pronounced. Statistically, branch instructions account for about 10% of all instructions statically and about 15% dynamically; in other words, the processor encounters a conditional branch roughly every six to seven instructions. Suppose the pipeline depth is 25: every time a conditional branch forces the entire pipeline to be flushed, the performance loss is intolerable. Branch prediction can reduce the overhead of conditional branches, but it cannot solve the problem completely.
First, consider the average execution time of a branch instruction:
ExecTime = PredictTime + FailRate * FailPenalty
ExecTime is the average time to execute a branch instruction; PredictTime is the execution time when the prediction succeeds; FailRate is the probability that the prediction fails; and FailPenalty is the time needed to recover the pipeline after a misprediction. Reducing FailRate relies mainly on improving branch-prediction accuracy (for example, by adding hints in the program that tell the compiler which branch outcome is more likely), but pushing accuracy much further is very difficult. The trace cache, by contrast, can reduce both PredictTime and FailPenalty.
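To make the formula concrete, here is a minimal sketch that plugs in hypothetical numbers: a 95%-accurate predictor, a full 25-stage flush without a trace cache, and a smaller predicted-path cost plus a 15-stage flush with one. None of these figures come from the original text; only the shape of the trade-off matters.

```c
#include <stdio.h>

/* ExecTime = PredictTime + FailRate * FailPenalty */
static double exec_time(double predict_time, double fail_rate,
                        double fail_penalty)
{
    return predict_time + fail_rate * fail_penalty;
}

int main(void)
{
    double fail_rate = 0.05;  /* assumed: predictor is right 95% of the time */

    /* No trace cache: a misprediction flushes all 25 stages. */
    double no_tc = exec_time(1.0, fail_rate, 25.0);

    /* Trace cache: only the execution part (say 15 stages) is flushed,
     * and a hit skips the decode stages, shrinking PredictTime as well. */
    double tc = exec_time(0.5, fail_rate, 15.0);

    printf("avg cycles per branch, no trace cache:   %.2f\n", no_tc);
    printf("avg cycles per branch, with trace cache: %.2f\n", tc);
    return 0;
}
```

With these assumed numbers the average branch cost drops from 2.25 to 1.25 cycles, illustrating how cutting either term of the formula pays off.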
For the x86 instruction set, decoding takes a very long time; a large share of the pipeline's stages is devoted to it. If this time could be removed or greatly reduced, instruction execution would speed up substantially. With this in mind, look at what real programs actually do: the processor spends nearly all of its time executing loops. It has to, because a modern processor can easily consume more than 10 GB of instructions in one second; if there were no loops, the program file behind that one second of execution would exceed 10 GB. In fact, executable programs are typically 10 KB to 20 MB, which means each instruction in the program is executed on the order of 1,000 to 1,000,000 times on average. Based on this observation, if the cost of decoding an instruction once could be amortized over 1,000 or even 1,000,000 executions, instruction execution speed would improve dramatically.
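A back-of-envelope check of the reuse figures above (the 10 GB/s throughput and the program sizes come from the text; the 10-cycle decode latency is an assumption for illustration):

```c
#include <stdio.h>

int main(void)
{
    double bytes_per_sec = 10e9;  /* ~10 GB of instruction bytes per second */
    double small_prog    = 10e3;  /* 10 KB executable                       */
    double large_prog    = 20e6;  /* 20 MB executable                       */

    /* If all instruction bytes fetched in one second come from the program
     * image itself, each instruction must run this many times on average. */
    printf("reuse factor, 10 KB program: %.0fx\n", bytes_per_sec / small_prog);
    printf("reuse factor, 20 MB program: %.0fx\n", bytes_per_sec / large_prog);

    /* Amortizing one decode over that many executions makes its
     * per-execution cost vanish. */
    double decode_cycles = 10.0;  /* assumed decode latency */
    printf("decode cost per execution at 1000x reuse: %.2f cycles\n",
           decode_cycles / 1000.0);
    return 0;
}
```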
Trace Cache is built on this idea: the instructions in a window are decoded once and the resulting micro-operations (uops) are saved in a buffer; the next time the same instructions are needed, they are read directly from the buffer with no decoding at all. This effectively reduces the overhead of branch misprediction, for two reasons. First, FailPenalty shrinks: the pipeline can be divided into a fetch-and-decode part and an execution part, and when a branch prediction fails, only the execution part needs to be flushed, not the decode part. Second, PredictTime shrinks: when a branch prediction succeeds, the uops produced by an earlier decode are loaded straight from the trace cache, eliminating the time the conditional instruction would otherwise spend in the decode pipeline. Given that conditional branches account for about 15% of dynamically executed instructions, the improvement is considerable.
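The sketch below shows the mechanism in its simplest form: a direct-mapped table keyed by instruction address, where a hit returns cached uops and skips the decoder. This is an illustration only, not Intel's actual design; the Uop layout, the decode_x86 stub, and the indexing scheme are all invented for the example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TC_SETS  256   /* number of trace-cache entries (illustrative) */
#define MAX_UOPS 4     /* uops stored per entry (illustrative)         */

typedef struct { uint32_t op, dst, src; } Uop;  /* toy micro-op format */

typedef struct {
    uint64_t tag;      /* address whose decode result is cached */
    int      valid;
    int      nuops;
    Uop      uops[MAX_UOPS];
} TraceLine;

static TraceLine tcache[TC_SETS];

/* Stub standing in for the slow x86 decoder. */
static int decode_x86(uint64_t addr, Uop out[MAX_UOPS])
{
    out[0] = (Uop){ .op = (uint32_t)addr, .dst = 0, .src = 1 };
    return 1;
}

/* Fetch the uops for `addr`, paying the decode cost only on a miss. */
static int fetch_uops(uint64_t addr, Uop out[MAX_UOPS])
{
    TraceLine *line = &tcache[addr % TC_SETS];

    if (line->valid && line->tag == addr) {
        /* Hit: the decode pipeline is skipped entirely. */
        memcpy(out, line->uops, sizeof(Uop) * (size_t)line->nuops);
        return line->nuops;
    }

    /* Miss: decode once, then cache the result for later iterations. */
    int n = decode_x86(addr, line->uops);
    line->tag   = addr;
    line->valid = 1;
    line->nuops = n;
    memcpy(out, line->uops, sizeof(Uop) * (size_t)n);
    return n;
}

int main(void)
{
    Uop buf[MAX_UOPS];
    fetch_uops(0x400000, buf);  /* first pass: miss, pays decode cost */
    fetch_uops(0x400000, buf);  /* loop iteration: hit, reuses uops   */
    puts("second fetch served from the trace cache");
    return 0;
}
```

A real trace cache additionally stitches uops across predicted-taken branches into a single trace, and misprediction recovery restarts fetch from the trace cache rather than from the decoders, which is exactly why only the execution part of the pipeline has to be flushed.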
Trace Cache can also bring other benefits:
- It eliminates the byte alignment demanded by instruction prefetch. To make full use of 16-byte instruction prefetch, function and loop entry addresses usually have to be 16-byte aligned; because the trace cache delivers micro-operations directly, this alignment is no longer required.
- It eliminates decode pairing. A processor generally executes simple and complex instructions on different units, and to keep those units running in parallel, the compiler has to pair instructions during optimization so that the decoded micro-operations fall into particular patterns; because the trace cache delivers micro-operations directly, no such pairing is needed.
Although Trace Cache has many advantages, its biggest drawback is that it consumes a large amount of chip area. This problem will gradually ease as integration technology advances.

Reference: http://memcache.drivehq.com/memparam/Bench/Other/TraceCache.htm