For C/C++ programmers, inline assembly is not a new feature; it can help us make the most of the available computing power. However, most programmers rarely get the chance to actually use it. In fact, inline assembly serves only specific needs, especially in high-level programming languages.
This article describes two scenarios for the IBM Power processor architecture. The examples provided in this article show where inline assembly can be applied.
Scenario 1: Better libraries
The C/C++ programming languages support bitwise logical operations. In this case, the user works with the bit as the basic unit. The user has written an algorithm to calculate the number of bits occupied by a 32-bit variable.
Code A: Calculate the number of bits occupied
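inline int bit_taken(int data) {
    /* Shift an unsigned copy so that negative inputs also terminate. */
    unsigned int bits = (unsigned int)data;
    int taken = 0;

    while (bits) {
        bits = (bits >> 1);   /* drop the lowest bit */
        taken++;              /* one more bit position is occupied */
    }
    return taken;
}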
This code shows how to do the job with a loop and shift operations. If the user compiles the code at the highest optimization level (-O3 for GCC, -O5 for XLC), some optimizations (such as loop unrolling, constant propagation, and so on) are applied automatically, and the compiler may generate very fast code. But the basic idea of the algorithm has not changed.
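To make the cost of this approach concrete, here is a minimal sketch of the same loop-and-shift idea. The name bit_taken_loop and the unsigned parameter are assumptions made for illustration; they are not part of the original Code A.

```cpp
// Hypothetical helper: same loop-and-shift idea as Code A, on an unsigned value.
constexpr int bit_taken_loop(unsigned int data) {
    int taken = 0;
    while (data) {
        data >>= 1;  // discard the lowest bit
        ++taken;     // one iteration per occupied bit position
    }
    return taken;
}
```

The loop runs once per occupied bit position, so a value with its top bit set costs 32 iterations on a 32-bit variable, no matter how well the compiler optimizes the loop body.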
Listing A: Description of cntlzw
cntlzw (Count Leading Zeros Word) instruction
Purpose
Places the number of leading zeros of the source general-purpose register into a general-purpose register.
The cntlzw instruction returns the number of leading zeros. Take the number 15 as an example: its 32-bit binary representation is 0000 0000 0000 0000 0000 0000 0000 1111, so cntlzw reports 28 leading zeros in total. After rethinking the problem, the user decides to simplify the algorithm, as shown in Code B.
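The relationship between the leading-zero count and the number of occupied bits can be sketched in portable C++. The function count_leading_zeros32 below is a hypothetical software stand-in for what the instruction computes, not the instruction itself, and bits_occupied is likewise an illustrative name.

```cpp
#include <cstdint>

// Hypothetical software stand-in for cntlzw: counts the zero bits above
// the highest set bit of a 32-bit word (32 when the word is zero).
constexpr int count_leading_zeros32(std::uint32_t value) {
    int zeros = 0;
    for (std::uint32_t mask = 0x80000000u;
         mask != 0 && (value & mask) == 0;
         mask >>= 1) {
        ++zeros;
    }
    return zeros;
}

// Occupied bits = word width minus leading zeros.
constexpr int bits_occupied(std::uint32_t value) {
    return 32 - count_leading_zeros32(value);
}
```

For the value 15 this gives 28 leading zeros and therefore 32 - 28 = 4 occupied bits, which matches the result of the loop-based algorithm.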
Code B: Calculate the number of bits occupied, using inline assembly
#ifdef __arch_ppc__

inline int bit_taken(int data) {
    int taken;
    /* cntlzw writes the number of leading zeros of data into taken */
    asm("cntlzw %0,%1\n\t"
        : "=b" (taken)
        : "b" (data)
        );
    return sizeof(data) * 8 - taken;
}
#else
...
#endif
The __arch_ppc__ macro guards the new code so that it applies only to the PowerPC architecture. Compared with Code A, the new code has removed all loops and shifts. The user may then be pleased to see that the performance of bit_taken has improved: it runs faster on PowerPC. Applications that rely on bit_taken also perform better.
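One way to check such a guarded function on any machine is to pair the assembly path with a portable fallback. The sketch below is hedged: the name bit_taken_guarded and the loop-based fallback are assumptions made for illustration (the original leaves the #else branch unspecified), and __arch_ppc__ is assumed to be defined by the build for PowerPC targets.

```cpp
// Hypothetical wrapper: assembly path on PowerPC, portable loop elsewhere.
inline int bit_taken_guarded(int data) {
#ifdef __arch_ppc__
    int taken;
    // cntlzw places the number of leading zeros of data into taken.
    asm("cntlzw %0,%1" : "=b"(taken) : "b"(data));
    return static_cast<int>(sizeof(data) * 8) - taken;
#else
    // Assumed fallback (not from the original): loop-and-shift on an
    // unsigned copy, so negative inputs terminate as well.
    unsigned int bits = static_cast<unsigned int>(data);
    int taken = 0;
    while (bits) {
        bits >>= 1;
        ++taken;
    }
    return taken;
#endif
}
```

Both paths are intended to return the same values, so callers do not need to change when the code is built for a different architecture.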
This story shows not only that users can improve their algorithms with the processor's rich instruction set, but also that inline assembly is a powerful aid for improving performance. By embedding assembly code in C/C++, users can minimize the effort needed to modify their code.