More on Performance: Experimental Observations on Locality and Performance

Source: Internet
Author: User

Why can the same algorithm show an order-of-magnitude difference in performance? The question came out of a discussion with a fellow netizen several months ago. He was writing a wedding-photography program. Wedding photos are usually large, often tens of millions of pixels, and he was running into performance problems developing in C#. Someone suggested he use XNA instead, but then a new problem appeared: loading a single image took several seconds. The C# code I wrote loads and converts images of tens of millions of pixels almost instantly. Why such a large gap? That is the subject of this article.

The cause lies mainly in program locality and cache hits. Let's start with an abstract image class:

Bitmap
{
    Width, Height;
    Data;
}

In general, such an object occupies two separate blocks of memory:

The program allocates one large block to hold the image data itself, and one small block for the Bitmap object, which holds Width, Height, and a reference to the image data.

As a result, the following two ways of writing an image operation can differ in performance:

Version A:

for (int y = 0; y < xxx.Height; y++)
{
    for (int x = 0; x < xxx.Width; x++)
    {
        Data[x, y] = ....
    }
}

Version B:

int width = xxx.Width;
int height = xxx.Height;
for (int y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        Data[x, y] = ....
    }
}

Version B uses the local variables width and height, which live on the stack, so B has better locality than A. Next, we test the performance difference between the two versions.

====

Experiment 1: Filling a 30-megapixel image in managed memory

public class Image
{
    public int Width { get; set; }
    public int Height { get; set; }
}

public class Bitmap : Image
{
    public int[] Data;

    public Bitmap(int width, int height)
    {
        this.Width = width;
        this.Height = height;
        Data = new int[width * height];
    }

    // Reads Width and Height into locals before the loops.
    public void Fill(int value)
    {
        int height = Height;
        int width = Width;

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                Data[y * width + x] = value;
            }
        }
    }

    // Reads the Width and Height properties directly inside the loops.
    public void FillEx(int value)
    {
        for (int y = 0; y < Height; y++)
        {
            for (int x = 0; x < Width; x++)
            {
                Data[y * Width + x] = value;
            }
        }
    }

    public static void Test()
    {
        Bitmap img = new Bitmap(5000, 6000);
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
    }
}

The results are shown in the following table (unit: ms).

Run       1           2     3     4
Fill      126 (82)    83    84    85
FillEx    100 (141)   99    100   99

The values in parentheses in column 1 are the results after swapping the execution order of Fill and FillEx. Whichever method of the class runs first pays a penalty, about 40 ms in this test. Comparing runs 2, 3, and 4, FillEx is slightly slower than Fill.
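The article does not say what causes the first-run penalty (JIT compilation and the first touch of freshly allocated memory are common suspects). One common way to keep it out of the measurements is an untimed warm-up run. The following is a minimal sketch under that assumption; TimeMs is a hypothetical helper built on System.Diagnostics.Stopwatch, not the CodeTimer class used in the experiments.

```csharp
using System;
using System.Diagnostics;

public static class WarmupDemo
{
    // Hypothetical helper (not the article's CodeTimer): runs an action once
    // and returns the elapsed wall-clock time in milliseconds.
    public static long TimeMs(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Same pixel count as the experiment: 5000 x 6000 = 30 million ints.
        int[] data = new int[5000 * 6000];

        Action fill = () =>
        {
            for (int i = 0; i < data.Length; i++)
            {
                data[i] = 1;
            }
        };

        // Untimed warm-up: any one-time cost is paid here, not in the timed runs.
        fill();

        for (int run = 1; run <= 4; run++)
        {
            Console.WriteLine("run {0}: {1} ms", run, TimeMs(fill));
        }
    }
}
```

With a warm-up in place, the order in which the two methods are timed should matter much less, which makes columns 1 to 4 directly comparable.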

====

Experiment 2: Filling a 30-megapixel image in unmanaged memory

public class UnmanagedBitmap : Image
{
    public IntPtr Data;

    public UnmanagedBitmap(int width, int height)
    {
        this.Width = width;
        this.Height = height;
        Data = Marshal.AllocHGlobal(sizeof(int) * width * height);
    }

    // Reads Width and Height into locals before the loops.
    public unsafe void Fill(int value)
    {
        int height = Height;
        int width = Width;
        int* p = (int*)Data;

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                *p = value;
                p++;
            }
        }
    }

    // Reads the Width and Height properties in the loop conditions only.
    public unsafe void FillEx(int value)
    {
        int* p = (int*)Data;
        for (int y = 0; y < Height; y++)
        {
            for (int x = 0; x < Width; x++)
            {
                *p = value;
                p++;
            }
        }
    }

    public static void Test()
    {
        UnmanagedBitmap img = new UnmanagedBitmap(5000, 6000);
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
        CodeTimer.Time("Fill", 1, () => { img.Fill(1); });
        CodeTimer.Time("FillEx", 1, () => { img.FillEx(2); });
    }
}

Test results:

Run       1           2     3     4
Fill      128 (83)    93    84    84
FillEx    88 (123)    90    84    84

It can be seen that there is almost no difference between fill and fillex.

====

In Experiment 1, the loop of FillEx is:

for (int y = 0; y < Height; y++)
{
    for (int x = 0; x < Width; x++)
    {
        Data[y * Width + x] = value;
    }
}

In Experiment 2, the loop of FillEx is:

public unsafe void FillEx(int value)
{
    int* p = (int*)Data;
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            *p = value;
            p++;
        }
    }
}

Comparing the two pieces of code against the experimental results, it appears that the JIT gives special treatment to the Width and Height properties when they are used in the for-loop conditions. That is why Fill and FillEx show no performance difference in Experiment 2. The Width property used inside the loop body, however, is not cached, which is why Fill and FillEx differ measurably in Experiment 1.

Still, the gap between the two versions in Experiment 1 is not large. Now look at Experiment 3.

====

Experiment 3: Two ways of writing the same code that differ by an order of magnitude

For the core classes used in this experiment, see "Releasing my high-performance pure C# image-processing base classes, and challenging the limits along the way :)". The code can be downloaded from http://smartimage.googlecode.com/svn/trunk/.

The two methods under test are shown below. They differ only in the loop conditions: ToBitmap copies map.Width and map.Height into local variables first, while ToBitmapEx reads the properties in every loop condition.

public unsafe void ToBitmap(Bitmap map)
{
    if (map == null) throw new ArgumentNullException("map");
    if (map.Width != this.Width || map.Height != this.Height)
    {
        throw new ArgumentException("size mismatch.");
    }

    if (map.PixelFormat != PixelFormat.Format32bppArgb)
    {
        throw new ArgumentException("only supports Format32bppArgb format.");
    }

    Int32 step = SizeOfT();
    Byte* t = (Byte*)StartIntPtr;

    BitmapData data = map.LockBits(new Rectangle(0, 0, map.Width, map.Height), ImageLockMode.ReadWrite, map.PixelFormat);
    try
    {
        // Cache the bitmap dimensions in locals before the loops.
        int width = map.Width;
        int height = map.Height;

        Byte* line = (Byte*)data.Scan0;

        for (int h = 0; h < height; h++)
        {
            Argb32* c = (Argb32*)line;
            for (int w = 0; w < width; w++)
            {
                m_converter.Copy(t, c);
                t += step;
                c++;
            }
            line += data.Stride;
        }
    }
    finally
    {
        map.UnlockBits(data);
    }
}

public unsafe void ToBitmapEx(Bitmap map)
{
    if (map == null) throw new ArgumentNullException("map");
    if (map.Width != this.Width || map.Height != this.Height)
    {
        throw new ArgumentException("size mismatch.");
    }

    if (map.PixelFormat != PixelFormat.Format32bppArgb)
    {
        throw new ArgumentException("only supports Format32bppArgb format.");
    }

    Int32 step = SizeOfT();
    Byte* t = (Byte*)StartIntPtr;

    BitmapData data = map.LockBits(new Rectangle(0, 0, map.Width, map.Height), ImageLockMode.ReadWrite, map.PixelFormat);
    try
    {
        Byte* line = (Byte*)data.Scan0;

        // map.Height and map.Width are read in every loop condition.
        for (int h = 0; h < map.Height; h++)
        {
            Argb32* c = (Argb32*)line;
            for (int w = 0; w < map.Width; w++)
            {
                m_converter.Copy(t, c);
                t += step;
                c++;
            }
            line += data.Stride;
        }
    }
    finally
    {
        map.UnlockBits(data);
    }
}

Test code:

public static void Test()
{
    ImageArgb32 src = new ImageArgb32(5000, 6000);
    System.Drawing.Bitmap dst = new System.Drawing.Bitmap(5000, 6000, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
    CodeTimer.Time("ToBitmap", 1, () => { src.ToBitmap(dst); });
    CodeTimer.Time("ToBitmapEx", 1, () => { src.ToBitmapEx(dst); });
    CodeTimer.Time("ToBitmap", 1, () => { src.ToBitmap(dst); });
    CodeTimer.Time("ToBitmapEx", 1, () => { src.ToBitmapEx(dst); });
    CodeTimer.Time("ToBitmap", 1, () => { src.ToBitmap(dst); });
    CodeTimer.Time("ToBitmapEx", 1, () => { src.ToBitmapEx(dst); });
    CodeTimer.Time("ToBitmap", 1, () => { src.ToBitmap(dst); });
    CodeTimer.Time("ToBitmapEx", 1, () => { src.ToBitmapEx(dst); });
}

Test results:

Run           1       2       3       4
ToBitmap      354     259     261     260
ToBitmapEx    7451    7441    7440    7445

Since the difference is so large, I did not repeat the experiment with the execution order swapped. The results show that ToBitmap is nearly 30 times faster than ToBitmapEx. Now you can see why the program mentioned at the beginning took several seconds to load an image.

Next, we make one small change to ToBitmapEx: map.Width is copied into a local variable before the loops and that local is used in the inner loop condition, while the outer loop still reads map.Height.

public unsafe void ToBitmapEx(Bitmap map)
{
    if (map == null) throw new ArgumentNullException("map");
    if (map.Width != this.Width || map.Height != this.Height)
    {
        throw new ArgumentException("size mismatch.");
    }

    if (map.PixelFormat != PixelFormat.Format32bppArgb)
    {
        throw new ArgumentException("only supports Format32bppArgb format.");
    }

    Int32 step = SizeOfT();
    Byte* t = (Byte*)StartIntPtr;

    BitmapData data = map.LockBits(new Rectangle(0, 0, map.Width, map.Height), ImageLockMode.ReadWrite, map.PixelFormat);
    try
    {
        Byte* line = (Byte*)data.Scan0;
        // The only change: cache map.Width for the inner loop.
        int width = map.Width;
        for (int h = 0; h < map.Height; h++)
        {
            Argb32* c = (Argb32*)line;
            for (int w = 0; w < width; w++)
            {
                m_converter.Copy(t, c);
                t += step;
                c++;
            }
            line += data.Stride;
        }
    }
    finally
    {
        map.UnlockBits(data);
    }
}

Test results:

Run           1            2      3      4
ToBitmap      313 (263)    261    261    260
ToBitmapEx    268 (313)    261    264    261

The results are almost the same.

====

Summary:

(1) In some cases (Experiment 2), the JIT fully optimizes program locality.

(2) In some cases (Experiment 1), the JIT partially optimizes program locality.

(3) In some cases (Experiment 3), the JIT does not optimize program locality at all.

Compiler optimization is conservative by principle: correctness must come first. Consider the following code:

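for (int i = 0; i < xxx.Width; i++)
{
    xxx.Width = 3;
}

The JIT cannot simply rewrite this as the following, because the loop body modifies xxx.Width and the two forms would then run a different number of iterations:

int w = xxx.Width;
for (int i = 0; i < w; i++)
{
    xxx.Width = 3;
}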
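
The difference between the two loops can be checked directly by counting iterations. This is a small sketch; the Box class and the HoistDemo methods are hypothetical names introduced only for this illustration.

```csharp
using System;

// Hypothetical stand-in for the "xxx" object in the example above.
public class Box
{
    public int Width { get; set; }
}

public static class HoistDemo
{
    // Re-reads Width in the loop condition on every iteration.
    public static int CountRereading(Box box)
    {
        int iterations = 0;
        for (int i = 0; i < box.Width; i++)
        {
            box.Width = 3;
            iterations++;
        }
        return iterations;
    }

    // Reads Width once before the loop (the hoisted form).
    public static int CountHoisted(Box box)
    {
        int iterations = 0;
        int w = box.Width;
        for (int i = 0; i < w; i++)
        {
            box.Width = 3;
            iterations++;
        }
        return iterations;
    }

    public static void Main()
    {
        Console.WriteLine(CountRereading(new Box { Width = 10 })); // prints 3
        Console.WriteLine(CountHoisted(new Box { Width = 10 }));   // prints 10
    }
}
```

Starting from a width of 10, the first form stops after 3 iterations because the condition sees the new value, while the hoisted form runs all 10. A compiler may only hoist the read when it can prove that nothing in the loop changes the value.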

Real code can be far more complicated than this example, so the JIT optimizes very cautiously and rarely reaches the optimal result. When writing high-performance programs, we should not rely on JIT optimization. Following the lesson of Experiment 3, copy data that lives in scattered places in memory and is used repeatedly into local variables on the stack.
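
As a minimal sketch of that advice, the method below copies everything the loops need into locals before entering them. The Picture class and LocalCacheDemo are hypothetical names for this illustration and are not part of the author's library; the row offset is also computed once per row, which is an extra step the article's code does not take.

```csharp
using System;

// Hypothetical image type: dimensions and pixel data are reached through properties.
public class Picture
{
    public int Width { get; set; }
    public int Height { get; set; }
    public int[] Data { get; set; }
}

public static class LocalCacheDemo
{
    public static void Fill(Picture pic, int value)
    {
        // Copy everything the loops need into locals first.
        int width = pic.Width;
        int height = pic.Height;
        int[] data = pic.Data;

        for (int y = 0; y < height; y++)
        {
            // Compute the row offset once per row instead of once per pixel.
            int rowStart = y * width;
            for (int x = 0; x < width; x++)
            {
                data[rowStart + x] = value;
            }
        }
    }

    public static void Main()
    {
        Picture pic = new Picture { Width = 4, Height = 3, Data = new int[12] };
        Fill(pic, 7);
        Console.WriteLine(string.Join(",", pic.Data));
    }
}
```

The hand-hoisted version is safe here only because Fill itself never changes the picture's dimensions or replaces its data array. That is knowledge the programmer has and the JIT may not be able to prove.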

====

A few more words. Performance problems in C# programs generally have little to do with the underlying mechanisms of the language. The root cause of poor UI performance is most likely over-encapsulation; a lightweight third-party UI library would surely perform well. Other performance problems mostly come down to design. The main performance difference between C# and C/C++ programs is one of emphasis: performance is a primary design goal of C/C++ public libraries, whereas the main C# libraries currently favor other goals in their design, and C# programmers likewise tend to focus on other aspects when writing programs. A well-designed C# program should reach no less than 50% of the performance of its C/C++ counterpart. For complex programs, where design is harder in C/C++ and a C# program can therefore be better designed, the figure should reach 70%.

In the legend of the Great Tang Dynasty, the Jing Zhongyue of Kou Zhong and Xu ziling has reached the stage, and the Jian Xin of shixian is bright, and Shi Zhixuan is very subtle, fan of fuanda (TMD Huang Yi, the old boy, did not give my family a lovely man's day magic power 18 layer to take a similar Jing Zhongyue to the environment, Jian Xin transparent, Wei, Fan I do not have such loud name !), They are all connected.

 
