In the first two articles, we discussed two aspects of program performance: the algorithm (in the broad sense, the approach used to solve the problem) and the compiler. Through these two aspects I wanted to show that a program's efficiency is hard to judge from the surface; superficial measures such as the length of the code tell us almost nothing, so effective optimization must rely on profiling. Now, suppose the two algorithms are essentially the same, and the compiler simply "translates" the code. Then... can we judge performance from the surface?
So let's look at one of the simplest examples. Assuming that DoSomethingA and DoSomethingB are fixed, which of the following two approaches do you think performs better?
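// Approach 1: do both things in each iteration of a single loop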
for (int i = 0; i < 100; i++)
{
    DoSomethingA();
    DoSomethingB();
}
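// Approach 2: split the work into two separate loops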
for (int i = 0; i < 100; i++)
    DoSomethingA();
for (int i = 0; i < 100; i++)
    DoSomethingB();
The logic of these two pieces of code is essentially the same. If the compiler simply "translates" them without optimization, the first approach performs fewer increments of i and fewer conditional jumps, so you might conclude that "very obviously" the first piece of code runs more efficiently. Unfortunately, things are not that simple, because another key factor that affects program performance is the cache.
"Caching" is everywhere. In the CPU, the fastest-performing storage device is a "register", but the number of registers is notoriously limited. Therefore, the CPU will have the L1 cache and L2 caches of multilevel caching mechanism. Among them, the performance of L2 cache is slower than L1 cache and register, but it is much faster than memory. When a core needs to retrieve data from memory, will get the data from the L1 cache, if L1 cache is not then will be from a number of nuclear shared L2 cache, and then will be taken from the memory-due to the operating system's virtual memory mechanism, may also be from the disk Exchange page to obtain data, The performance is naturally quite poor at this time.
Although a register holds only a word of data (say, 4 bytes), the L1 cache always fetches data from the L2 cache one "chunk" at a time, and such a chunk, a cache line, is typically 64 contiguous bytes. In other words, after the CPU reads data from one address, reading data from nearby addresses is faster, because that data is already sitting in the L1 cache. If a program can take advantage of this property of the CPU, its performance tends to be better (though naturally many other factors affect performance as well).
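If you would like to see the cache line at work, a tiny experiment is enough. The sketch below is mine, not from the original discussion, and it assumes a 64-byte line: the second loop reads only one int out of every 16, yet on typical hardware it takes far more than 1/16 of the time of the first, because nearly every access still pulls in a whole cache line.

using System;
using System.Diagnostics;

class CacheLineDemo
{
    static void Main()
    {
        int[] data = new int[16 * 1024 * 1024]; // 64 MB, far larger than any cache

        // Stride 1: consecutive ints share 64-byte cache lines,
        // so roughly 15 of every 16 reads hit the L1 cache.
        var sw = Stopwatch.StartNew();
        long sum = 0;
        for (int i = 0; i < data.Length; i++) sum += data[i];
        Console.WriteLine("stride 1:  {0} ms ({1})", sw.ElapsedMilliseconds, sum);

        // Stride 16: one int per assumed 64-byte line, so almost
        // every read has to load a fresh line from L2 or memory.
        sw.Restart();
        sum = 0;
        for (int i = 0; i < data.Length; i += 16) sum += data[i];
        Console.WriteLine("stride 16: {0} ms ({1})", sw.ElapsedMilliseconds, sum);
    }
}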
Locality is the term used to describe how well a program can take advantage of the cache. When we say a program has good locality, we mean it makes good use of the CPU's caching mechanism. Locality divides into two aspects: "spatial locality" and "temporal locality." The former means that after loading data at one address, the program goes on to load data near it; the latter means that after loading data at one address, the program loads the same data again within a short time. Either way, the goal is to read "hot" data from the faster cache. Why is a cold start always slow? Why do some people say a system runs faster a while after booting? The reason is much the same.
So now, can you tell which of the two approaches above is more efficient? Although the first approach reduces the number of increments and conditional jumps, it does two things in each iteration: by the time DoSomethingB executes, the data that DoSomethingA just brought into the cache may already have been evicted ("cooled"), and the next time DoSomethingA runs, that data has to be reloaded from a slower storage device. In the second approach, we execute the 100 calls to DoSomethingA and then the 100 calls to DoSomethingB "densely," so a large portion of the data accesses concentrate in the L1 cache, and the performance advantage is self-evident.
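You don't have to take my word for it; this is exactly the kind of question a little measurement settles. Here is a minimal timing harness, with array-walking stand-ins for DoSomethingA and DoSomethingB that I made up for the experiment; whether the split loops actually win depends on how the two working sets compare with your machine's cache sizes, which is rather the point.

using System;
using System.Diagnostics;

class LoopSplitDemo
{
    // Two working sets, sized so that each may fit in cache alone
    // but the pair may not; these sizes are arbitrary stand-ins,
    // so tune them for your own machine.
    static int[] dataA = new int[64 * 1024];
    static int[] dataB = new int[64 * 1024];

    static void DoSomethingA() { for (int i = 0; i < dataA.Length; i++) dataA[i]++; }
    static void DoSomethingB() { for (int i = 0; i < dataB.Length; i++) dataB[i]++; }

    static void Main()
    {
        // Approach 1: the two calls alternate, so each may evict
        // the other's data between iterations.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100; i++) { DoSomethingA(); DoSomethingB(); }
        Console.WriteLine("one loop:  {0} ms", sw.ElapsedMilliseconds);

        // Approach 2: each working set is reused "densely"
        // before moving on to the other.
        sw.Restart();
        for (int i = 0; i < 100; i++) DoSomethingA();
        for (int i = 0; i < 100; i++) DoSomethingB();
        Console.WriteLine("two loops: {0} ms", sw.ElapsedMilliseconds);
    }
}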
The first part of my earlier article "Computer Architecture and Program Performance" also discussed the effect of locality on program performance in more detail; you can refer to that as well.
Because program instructions are not the only factor in execution efficiency, judging program performance from the length of the code is also quite unreliable. Of course, judging performance from any single point of view in isolation is probably inappropriate. For example, that article mentioned using a global variable for the sake of program performance, though the author also considered it poor design; likewise, in the example we just discussed, doing more than one thing in a loop might also be worth refactoring. Using a global variable does save the overhead of instructions such as push and pop, but such a global variable, say a static variable stored somewhere on the heap, is not good practice from a locality standpoint. Conversely, thanks to the L1 cache, accessing a "parameter" or "local variable" on the call stack is hardly slower than accessing a register, so the overhead of a few push and pop instructions may not amount to much. Moreover, if the compiler/runtime inlines the method, even the push and pop instructions disappear.
I remember that some time ago a few friends posted rather "radical" statements on my blog, such as: studying low-level details is of no help in writing .NET programs, because even if you do understand them, C# offers no way to write inline assembly. I don't agree with that, because even a .NET program runs according to the laws of computer architecture, and we can still understand, to a certain degree, how our code will perform.
Take "locality" alone, and we can already explain quite a few things. For example, we know that each thread's call stack is 1 MB by default, so data on two different threads' call stacks will almost never land in the same cache line. Or, because of "temporal locality," the most recently used data is the most likely to still be in the cache, which is why the parallel library in .NET 4.0 tends to execute the most recently created task first when scheduling tasks from a thread's private queue. Or again: should you use two int arrays to hold the X and Y values of a series of coordinates, or construct a single array of Point structs to hold them? Although using two int arrays saves memory, from a locality standpoint you may find it more appropriate to store the X and Y values of the same coordinate together; a sketch of that comparison follows.
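Here is a minimal sketch of the coordinate example; the Point struct and the sizes are stand-ins of my own. When the X and Y of the same point are always consumed together, the struct layout keeps each pair within a single cache line, while the parallel arrays spread a pair across two lines, and the difference matters most when points are visited in an unpredictable order.

using System;

// A two-field struct, so X and Y of one point are adjacent in memory.
struct Point { public int X, Y; }

class LayoutDemo
{
    const int N = 1000000;

    static int[] xs = new int[N];         // layout 1: two parallel arrays
    static int[] ys = new int[N];
    static Point[] points = new Point[N]; // layout 2: one struct array

    // Parallel arrays: each (X, Y) pair touches two distinct cache lines.
    static long SumParallel(int[] order)
    {
        long sum = 0;
        foreach (int i in order) sum += xs[i] + ys[i];
        return sum;
    }

    // Struct array: the same pair usually costs a single line load,
    // since both fields sit in one 8-byte slot.
    static long SumStructs(int[] order)
    {
        long sum = 0;
        foreach (int i in order) sum += points[i].X + points[i].Y;
        return sum;
    }

    static void Main()
    {
        // Visit the points in a random order, as a lookup-heavy workload might.
        var rnd = new Random(42);
        int[] order = new int[N];
        for (int i = 0; i < N; i++) order[i] = rnd.Next(N);
        Console.WriteLine("{0} {1}", SumParallel(order), SumStructs(order));
    }
}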
My articles, in fact, keep emphasizing the "uncertainty" of judging a program's performance from the surface of the code. By the same token, you may find it difficult to "see" performance differences even with the assembly code (or snippets) right in front of you. This again illustrates the importance of profiling: reading code is static, while program execution, and profiling with it, are dynamic. A friend once asked me, "Have you been hooked on profilers lately?" In fact, by "profiling" I mean "a way of exploring the performance of a program," not any particular instrument or specific tool. Whether you use the VS Profiler or write yourself a CodeTimer, it is more reliable than "reading code."
Original: http://www.cnblogs.com/JeffreyZhao/archive/2010/01/12/talk-about-code-performance-3-locality.html