Chen Shuo (giantchen_at_gmail)
blog.csdn.net/Solstice
Original article: http://blog.csdn.net/Solstice/article/details/5166912
Reversing a string, for example turning "12345" into "54321", is one of the simplest coding tasks. Even a beginner in C can easily write code like the following:
// Version 1: swap two values with a temporary variable. Good code.
void reverse_by_swap(char* str, int n)
{
    char* begin = str;
    char* end = str + n - 1;
    while (begin < end) {
        char tmp = *begin;
        *begin = *end;
        *end = tmp;
        ++begin;
        --end;
    }
}
This code is clear and straightforward, with no advanced tricks.
I don't remember when it started, but a Google search for "swap two numbers without a temporary variable" turns up many articles that swap two values with XOR instead of a temporary variable. The following is a typical implementation:
// Version 2: swap two values with XOR. Bad code.
void reverse_by_xor(char* str, int n)
{
    // Warning: bad code
    char* begin = str;
    char* end = str + n - 1;
    while (begin < end) {
        *begin ^= *end;
        *end ^= *begin;
        *begin ^= *end;
        ++begin;
        --end;
    }
}
Misled by outdated textbooks, some people believe that using one fewer variable saves a byte of memory and makes the program run faster. Neither is true, at least not here:
- This so-called "technique" is only slower on modern machines (I even doubt it was ever faster than the straightforward method). The straightforward method does two memory reads and two writes per pair of characters; as written, this "technique" does six reads, three writes, and three XORs (the compiler manages to optimize that down to four reads, three writes, and three XORs, as the assembly below shows).
- Nor does it save memory, because the tmp variable normally lives in a register (the assembly analysis later confirms this). Even if it did end up in the function's stack frame, the stack frame is already there, and since no further function is called, not a single byte would be saved.
- On the contrary, because there are more computation steps, more instructions are used and the compiled machine code gets longer. (That by itself is not a big problem; shorter code is not necessarily faster, as another example later shows.)
The only value of this trick is coping with perverse interview questions; it is worth knowing about, but it must never appear in production code. I cannot see any point in asking it as an interview question.
Worse still, some people collapse the three statements
    *begin ^= *end;
    *end ^= *begin;
    *begin ^= *end;
into a single statement:
    *begin ^= *end ^= *begin ^= *end;  // wrong
This is even worse: it is undefined behavior. In C, an object may be modified at most once within a statement; code like x = x++ is undefined. Nothing in the C language guarantees that this one-liner is equivalent to the three separate statements.
(To the language lawyers: I know the jargon term is "sequence point" and that a single statement may contain more than one sequence point. Please allow me the imprecise wording here.)
This is not a technique worth showing off; it only makes the code uglier and worse.
For reversing a string, C++ has an even simpler solution: call std::reverse from the standard library. Some people worry about the overhead of a function call; that worry is unfounded. Today's compilers automatically inline a simple function like std::reverse(), and the optimized assembly they generate is just as fast as Version 1.
// Version 3: use std::reverse to reverse a range. High-quality code.
void reverse_by_std(char* str, int n)
{
    std::reverse(str, str + n);  // requires <algorithm>
}
======== Part 2: what code does the compiler generate? ========
Note: reading the assembly the compiler generates is certainly an important way to understand a program's behavior, but never take what you see as eternal truth; it is only the truth of the moment. Hardware platforms and compilers will change. What matters is not why Version 1 is faster than Version 2, but how to discover that fact. Don't "guess"; benchmark.
Compiler: g++ 4.4.1 with -O2 -march=core2, on x86 Linux.
The assembly generated for Version 1 is:
.L3:
        movzbl  (%edx), %ecx
        movzbl  (%eax), %ebx
        movb    %bl, (%edx)
        movb    %cl, (%eax)
        incl    %edx
        decl    %eax
        cmpl    %eax, %edx
        jb      .L3
Translated into C:
register char bl, cl;
register char* eax;
register char* edx;
L3:
    cl = *edx;   // read
    bl = *eax;   // read
    *edx = bl;   // write
    *eax = cl;   // write
    ++edx;
    --eax;
    if (edx < eax) goto L3;
Two reads and two writes in total; the temporary variables occupy no memory, everything stays in registers. Given instruction-level parallelism and the cache, the six instructions in the middle probably take only three or four cycles.
The assembly for Version 2:
.L9:
        movzbl  (%edx), %ecx
        xorb    (%eax), %cl
        movb    %cl, (%eax)
        xorb    (%edx), %cl
        movb    %cl, (%edx)
        decl    %edx
        xorb    %cl, (%eax)
        incl    %eax
        cmpl    %edx, %eax
        jb      .L9
Translated into C:
// declarations as above
L9:
    cl = *edx;    // read
    cl ^= *eax;   // read, xor
    *eax = cl;    // write
    cl ^= *edx;   // read, xor
    *edx = cl;    // write
    --edx;
    *eax ^= cl;   // read, write, xor
    ++eax;
    if (eax < edx) goto L9;
Four reads, three writes, and three XORs in total, two more instructions than Version 1. More instructions do not automatically mean slower code, but here the XOR version benchmarks much slower than the temporary-variable version, because each instruction depends on the result of the previous one and they cannot execute in parallel.
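If you want to verify the difference yourself rather than trust the cycle counting, a minimal timing sketch follows. This is my illustration, not the measurement used in this article; it assumes C++11 <chrono> and is meant to be compiled together with the reverse_by_swap() and reverse_by_xor() definitions above.

#include <chrono>
#include <cstdio>
#include <vector>

void reverse_by_swap(char* str, int n);  // Version 1 above
void reverse_by_xor(char* str, int n);   // Version 2 above

// Time 'rounds' reversals of a 'len'-byte buffer; return seconds elapsed.
template <typename Func>
double time_reverse(Func f, int len, int rounds)
{
    std::vector<char> buf(len, 'x');
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i)
        f(&buf[0], len);
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    volatile char sink = buf[0];  // keep the work from being optimized away
    (void)sink;
    return elapsed.count();
}

int main()
{
    const int len = 1 << 20;
    const int rounds = 1000;
    std::printf("swap: %.3f s\n", time_reverse(reverse_by_swap, len, rounds));
    std::printf("xor : %.3f s\n", time_reverse(reverse_by_xor, len, rounds));
    return 0;
}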
Version 3 generates code that is just as fast as Version 1:
.L21:
        movzbl  (%eax), %ecx
        movzbl  (%edx), %ebx
        movb    %bl, (%eax)
        movb    %cl, (%edx)
        incl    %eax
.L23:
        decl    %edx
        cmpl    %edx, %eax
        jb      .L21
This tells us not to optimize on assumptions and not to underestimate the compiler. For a good survey of how smart today's compilers are, see http://www.linux-kongress.org/2009/slides/compiler_survey_felix_von_leitner.pdf
Bjarne Stroustrup said: "I like my code to be elegant and efficient. The logic should be straightforward to make it hard for bugs to hide, the dependencies minimal to ease maintenance, error handling complete according to an articulated strategy, and performance close to optimal so as not to tempt people to make the code messy with unprincipled optimizations. Clean code does one thing well." (Quoted from Clean Code; Chinese translation by Han Lei, http://www.china-pub.com/196266, with wording adjusted by Chen Shuo, who takes responsibility for any errors.)
The XOR trick is probably just the kind of unprincipled optimization Bjarne warns against; it hardly even counts as optimization. Code clarity comes first.
======== Part 3: why shorter code is not necessarily faster ========
A blog post I wrote two days ago on division of negative integers (http://blog.csdn.net/Solstice/archive/2010/01/06/5139302.aspx) quoted a piece of code that converts an integer to a string. The function repeatedly computes the quotient and remainder of an integer divided by 10. I assumed the compiler would use a div instruction; the code it actually generated surprised me:
.L2:
        movl    $1717986919, %eax
        imull   %ebx
        movl    %ebx, %eax
        sarl    $31, %eax
        sarl    $2, %edx
        subl    %eax, %edx
        movl    %edx, %eax
        leal    (%edx,%edx,4), %edx
        addl    %edx, %edx
        subl    %edx, %ebx
        movl    %ebx, %edx
        movl    %eax, %ebx
        movzbl  (%edi,%edx), %eax
        movb    %al, (%esi)
        addl    $1, %esi
        testl   %ebx, %ebx
        jne     .L2
A single div instruction is replaced by more than ten instructions. The compiler is no fool; it has its reasons. I won't explain the calculation in detail here; the basic idea is to turn division into multiplication by a reciprocal. The magic number 1717986919 shows up, which is 0x66666667 in hex, equal to (2^33 + 3) / 5.
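To make the trick concrete, here is a sketch of mine (not code from the original article) that mirrors what the assembly above computes for a signed 32-bit n. It assumes arithmetic right shift for negative signed values, which gcc on x86 provides.

#include <cassert>
#include <cstdint>

// q = (high 32 bits of n * 0x66666667) >> 2, minus the sign bit of n;
// this truncates toward zero exactly like n / 10 for 32-bit n.
int32_t div10(int32_t n)
{
    int64_t product = static_cast<int64_t>(n) * 1717986919;  // imull
    int32_t high = static_cast<int32_t>(product >> 32);      // %edx after imull
    return (high >> 2) - (n >> 31);                          // sarl, sarl, subl
}

int main()
{
    for (int32_t n = -1000000; n <= 1000000; ++n)
        assert(div10(n) == n / 10);
    return 0;
}

The multiply-high plus shift replaces the division; the remainder is then recovered with one multiply and subtract (the leal/addl/subl sequence in the assembly above).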
On modern processors multiplication is nearly as fast as addition and subtraction, and roughly an order of magnitude faster than division, so the compiler has good reason to generate such code. The classic The Practice of Programming, published more than ten years ago, shows how to do micro-benchmarking; both the method and the results are worth reading, although the numbers in it may be outdated by now.
The remarkable book Hacker's Delight (Chinese translation at http://www.china-pub.com/18801) presents a great many such fast-computation tricks; chapter 10 is devoted entirely to integer division by constants. I would not copy its arcane tricks into production code, but I trust that the authors of modern compilers know them and apply them sensibly to improve the quality of generated code. The days of beating the compiler by knowing assembly are over. I agree with the article "The 'C is Efficient' Language Fallacy" (http://scienceblogs.com/goodmath/2006/11/the_c_is_efficient_language_fa.php):
Making real applications run really fast is something that's done with the help of a compiler. Modern architectures have reached the point where people can't code effectively in assembler anymore - switching the order of two independent instructions can have a dramatic impact on performance in a modern machine, and the constraints that you need to optimize for are just more complicated than people can generally deal with.
So for modern systems, writing an efficient program is sort of a partnership. The human needs to carefully choose algorithms - the machine can't possibly do that. And the machine needs to carefully compute instruction ordering, pipeline constraints, memory fetch delays, etc. The two together can build really fast systems. But the two parts aren't independent: the human needs to express the algorithm in a way that allows the compiler to understand it well enough to be able to really optimize it.
Finally, a word about C++ templates. Suppose you want to write a routine that converts an integer to a string in an arbitrary radix. The C function declaration would be:
bool convert(char* buf, size_t bufsize, int value, int radix);
Since the radix is a compile-time constant, C++ can implement this with a function template that takes a non-type template parameter; the code inside the function is the same as the C version.
template<int Radix>
bool convert(char* buf, size_t bufsize, int value);
Templates do expand the code, but here that expansion is a good thing: the compiler can generate fast, specialized code for each constant radix. Abusing C++ templates is certainly wrong, but using them appropriately is not a problem.
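Below is a minimal sketch of how such a template might be written; it is my illustration, not the article's implementation, and the digit table, error convention, and buffer handling are my own choices. Because Radix is a compile-time constant, value % Radix and value /= Radix have constant divisors, so the compiler can apply the reciprocal trick shown above to each instantiation.

#include <algorithm>
#include <cstddef>

// Radix must be between 2 and 36.
template<int Radix>
bool convert(char* buf, std::size_t bufsize, int value)
{
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    char* p = buf;
    char* end = buf + bufsize;
    int v = value;
    do {
        if (p == end)
            return false;          // buffer too small
        int r = v % Radix;         // constant divisor: no div instruction
        v /= Radix;
        *p++ = digits[r < 0 ? -r : r];
    } while (v != 0);
    if (value < 0) {
        if (p == end)
            return false;
        *p++ = '-';
    }
    if (p == end)
        return false;
    *p = '\0';
    std::reverse(buf, p);          // digits were produced in reverse order
    return true;
}

For example, convert<16>(buf, sizeof buf, 255) writes "ff", and the <10> and <16> instantiations each get their own constant-divisor code.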
======== End of article. Reprinting is welcome; please keep the attribution and link. ========