The simplest way to use multiple threads to speed up time-consuming image processing algorithms in C# (on multi-core machines)

In image processing, many algorithms are inherently time-consuming because of their complexity, and an image holds far more data than a typical object. The execution speed of such algorithms therefore depends largely on hardware performance. Today's mainstream CPUs have at least two cores, and better ones have four or even eight. If we make full use of these resources, we can exploit the power of the machine and improve the performance of our algorithms.

In the single-core era, the main purpose of a multithreaded program was to keep the UI from freezing, and in general a multithreaded program ran slower than a single-threaded one. That was the common situation five or six years ago, when image programs written in VB6 were not necessarily much slower than those written in VC6. In the multi-core era, sensible use of multithreading can raise program speed roughly in proportion to the number of cores.

Most programming tools provide classes for working with threads. In VS2010, for example, the System.Threading and System.Threading.Tasks namespaces make it easier to write multithreaded programs. Using the Thread class directly is inconvenient, however, which is why parallel computing classes such as Parallel were added in later versions of C#. In practice, when it is combined with the Partitioner.Create method, this class turns out to be particularly well suited to parallel computation in image processing. For example, the following simple code implements a parallel version of the color inversion algorithm:

    // Requires System.Collections.Concurrent, System.Drawing,
    // System.Drawing.Imaging and System.Threading.Tasks; compile with /unsafe.
    private unsafe void Invert(Bitmap Bmp)
    {
        if (Bmp.PixelFormat == PixelFormat.Format24bppRgb)
        {
            // ReadWrite, because the pixels are modified in place.
            BitmapData BmpData = Bmp.LockBits(new Rectangle(0, 0, Bmp.Width, Bmp.Height), ImageLockMode.ReadWrite, Bmp.PixelFormat);
            Parallel.ForEach(Partitioner.Create(0, BmpData.Height), (H) =>
            {
                int X, Y, Width, Height, Stride;
                byte* Scan0, CurP;
                Width = BmpData.Width; Height = BmpData.Height; Stride = BmpData.Stride; Scan0 = (byte*)BmpData.Scan0;
                // H.Item1 and H.Item2 are the first row and the end row (exclusive) of this range.
                for (Y = H.Item1; Y < H.Item2; Y++)
                {
                    CurP = Scan0 + Y * Stride;
                    for (X = 0; X < Width; X++)
                    {
                        *CurP = (byte)(255 - *CurP);
                        *(CurP + 1) = (byte)(255 - *(CurP + 1));
                        *(CurP + 2) = (byte)(255 - *(CurP + 2));
                        CurP += 3;
                    }
                }
            });
            Bmp.UnlockBits(BmpData);
        }
    }

Compared with the classic inversion code, this version only adds Parallel.ForEach(Partitioner.Create(0, BmpData.Height), (H) => { ... }) and changes for (Y = 0; Y < Height; Y++) to for (Y = H.Item1; Y < H.Item2; Y++). The efficiency nevertheless compares as follows (laptop with an i3 CPU):

    Image size    Single thread (ms)    Multithreaded (ms)
    1024*768      4                     2
    1600*1200     11                    6
    4000*3000     78                    40

Another example is the Desaturate (color removal) algorithm from Photoshop. With parallel computing, the code is:

    private unsafe void Desaturate(Bitmap Bmp)
    {
        if (Bmp.PixelFormat == PixelFormat.Format24bppRgb)
        {
            // ReadWrite, because the pixels are modified in place.
            BitmapData BmpData = Bmp.LockBits(new Rectangle(0, 0, Bmp.Width, Bmp.Height), ImageLockMode.ReadWrite, Bmp.PixelFormat);
            Parallel.ForEach(Partitioner.Create(0, BmpData.Height), (H) =>
            {
                int X, Y, Width, Height, Stride;
                byte Red, Green, Blue, Max, Min, Value;
                byte* Scan0, CurP;
                Width = BmpData.Width; Height = BmpData.Height; Stride = BmpData.Stride; Scan0 = (byte*)BmpData.Scan0;
                for (Y = H.Item1; Y < H.Item2; Y++)
                {
                    CurP = Scan0 + Y * Stride;
                    for (X = 0; X < Width; X++)
                    {
                        Blue = *CurP; Green = *(CurP + 1); Red = *(CurP + 2);
                        if (Blue > Green) { Max = Blue; Min = Green; }
                        else { Max = Green; Min = Blue; }
                        if (Red > Max) Max = Red;
                        else if (Red < Min) Min = Red;
                        // Average of the largest and smallest channel.
                        Value = (byte)((Max + Min) >> 1);
                        *CurP = Value; *(CurP + 1) = Value; *(CurP + 2) = Value;
                        CurP += 3;
                    }
                }
            });
            Bmp.UnlockBits(BmpData);
        }
    }

The principle of desaturation is to take the average of the maximum and minimum of the R, G and B channel values of each pixel and use it as the new value of all three channels.
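The same row-range pattern can be tried without a Bitmap or unsafe code. The following is a minimal sketch, assuming the pixels are already in a managed byte array in 24bpp B, G, R order with a row stride in bytes. The class name PixelOps and the method signatures are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper: operates on a raw 24bpp buffer (B, G, R per pixel),
// where each row starts at y * stride and may end with padding bytes.
public static class PixelOps
{
    // Inverts every channel in place; each work item handles a range of rows.
    public static void Invert(byte[] pixels, int width, int height, int stride)
    {
        Parallel.ForEach(Partitioner.Create(0, height), range =>
        {
            int rowBytes = width * 3; // padding bytes are left untouched
            for (int y = range.Item1; y < range.Item2; y++)
            {
                int p = y * stride;
                for (int x = 0; x < rowBytes; x++)
                    pixels[p + x] = (byte)(255 - pixels[p + x]);
            }
        });
    }

    // Replaces each pixel with (max + min) / 2 of its three channels.
    public static void Desaturate(byte[] pixels, int width, int height, int stride)
    {
        Parallel.ForEach(Partitioner.Create(0, height), range =>
        {
            for (int y = range.Item1; y < range.Item2; y++)
            {
                int p = y * stride;
                for (int x = 0; x < width; x++, p += 3)
                {
                    byte b = pixels[p], g = pixels[p + 1], r = pixels[p + 2];
                    byte max = Math.Max(b, Math.Max(g, r));
                    byte min = Math.Min(b, Math.Min(g, r));
                    byte value = (byte)((max + min) >> 1);
                    pixels[p] = pixels[p + 1] = pixels[p + 2] = value;
                }
            }
        });
    }
}
```

Each row is written by exactly one range, so no locking is needed. Note that Partitioner.Create(0, height) requires height to be greater than 0.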
Here is the speed comparison for desaturation:

    Image size    Single thread (ms)    Multithreaded (ms)
    1024*768      5                     2
    1600*1200     15                    8
    4000*3000     117                   60

Color inversion and desaturation are both lightweight digital image algorithms, yet multithreading still shows a speed advantage on a multi-core CPU.

From these two simple examples, we can already summarize a few points about using Parallel.ForEach together with Partitioner.Create for parallel computing.

First, this kind of parallel programming is very convenient, especially for data stored in matrix form such as images, where algorithms almost always iterate row by row and then column by column, or column by column and then row by row.

Second, every variable whose value changes inside the parallel body must be declared inside the braces of the Parallel call. Otherwise the threads share the variable, and hard-to-explain errors can occur.

Third, a variable that is only read inside the parallel body, never modified, can be declared outside the braces. It is still recommended to declare it inside, because practice indicates that this is faster. The Width and Height variables above are examples.

Fourth, the start and end of the outer for loop over rows must be replaced by Item1 and Item2.

Now let's look at a slightly more complex algorithm. The example here is zoom blur. Anyone who has used Photoshop knows that most of its filters provide a real-time preview, but some, such as zoom blur, do not. The reason is that the computation is too heavy to run in real time. If we apply Photoshop's zoom blur to a large image, such as the 4000*3000 image above, and watch the CPU usage, all four cores are fully loaded. We can see that Photoshop also uses multithreading here.
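The rules about where variables live inside the parallel body can be sketched as follows. This is only an illustration under assumed conditions: the pixels sit in a managed 24bpp byte array with a row stride in bytes, and the names ParallelRules and RowSums are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical example: sums the channel values of every row of a 24bpp buffer.
public static class ParallelRules
{
    public static long[] RowSums(byte[] pixels, int width, int height, int stride)
    {
        long[] sums = new long[height];
        Parallel.ForEach(Partitioner.Create(0, height), range =>
        {
            // Read-only value copied into a local declared inside the body.
            int rowBytes = width * 3;
            // The row loop runs from Item1 (inclusive) to Item2 (exclusive).
            for (int y = range.Item1; y < range.Item2; y++)
            {
                // Variables that change (sum, p, x) are declared inside the
                // body, so every worker has its own copy.
                long sum = 0;
                int p = y * stride;
                for (int x = 0; x < rowBytes; x++)
                    sum += pixels[p + x];
                // Each index y belongs to exactly one range, so this write is not shared.
                sums[y] = sum;
            }
        });
        return sums;
    }
}
```

If sum were declared outside the lambda instead, all workers would update the same variable at the same time and the results would be unpredictable.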
The main code of the parallel version of the algorithm in C# is as follows:

    // Requires System.Collections.Concurrent, System.Drawing, System.Drawing.Imaging,
    // System.Runtime.InteropServices and System.Threading.Tasks; compile with /unsafe.
    public static unsafe void ZoomBlur(Bitmap Bmp, int SampleRadius = 100, int Amount = 100, int CenterX = 256, int CenterY = 256)
    {
        int Width, Height, Stride;
        // ReadWrite, because the result is written back into the bitmap.
        BitmapData BmpData = Bmp.LockBits(new Rectangle(0, 0, Bmp.Width, Bmp.Height), ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);
        Width = BmpData.Width; Height = BmpData.Height; Stride = BmpData.Stride;
        // Backup copy of the source pixels; the samples are read from it.
        byte* BitmapClone = (byte*)Marshal.AllocHGlobal(BmpData.Stride * BmpData.Height);
        CopyMemory(BitmapClone, (byte*)BmpData.Scan0, BmpData.Stride * BmpData.Height);
        Parallel.ForEach(Partitioner.Create(0, Height, Height / Environment.ProcessorCount), (H) =>
        {
            int SumRed, SumGreen, SumBlue, Fx, Fy, Fcx, Fcy;
            int X, Y, I;
            byte* Pointer, PointerC;
            uint* Row, RowP;
            // 16.16 fixed-point center coordinates, plus 0.5 for rounding.
            Fcx = (CenterX << 16) + 32768;
            Fcy = (CenterY << 16) + 32768;
            // One buffer of row addresses per range; addresses are stored as uint,
            // which assumes a 32-bit process.
            Row = (uint*)Marshal.AllocHGlobal(SampleRadius * 4);
            for (Y = H.Item1; Y < H.Item2; Y++)
            {
                Pointer = (byte*)BmpData.Scan0 + Stride * Y;
                Fy = (Y << 16) - Fcy;
                RowP = Row;
                for (I = 0; I < SampleRadius; I++)
                {
                    Fy -= ((Fy >> 4) * Amount) >> 10;
                    *RowP = (uint)(BitmapClone + Stride * ((Fy + Fcy) >> 16));
                    RowP++;
                }
                for (X = 0; X < Width; X++)
                {
                    Fx = (X << 16) - Fcx;
                    SumRed = 0; SumGreen = 0; SumBlue = 0;
                    RowP = Row;
                    for (I = 0; I < SampleRadius; I++)
                    {
                        Fx -= ((Fx >> 4) * Amount) >> 10;
                        // No need to hand-optimize * 3: the compiler turns it into lea eax, [eax + eax * 2]
                        PointerC = (byte*)*RowP + ((Fx + Fcx) >> 16) * 3;
                        SumBlue += *(PointerC);
                        SumGreen += *(PointerC + 1);
                        SumRed += *(PointerC + 2);
                        RowP++;
                    }
                    *(Pointer) = (byte)(SumBlue / SampleRadius);
                    *(Pointer + 1) = (byte)(SumGreen / SampleRadius);
                    *(Pointer + 2) = (byte)(SumRed / SampleRadius);
                    Pointer += 3;
                }
            }
            Marshal.FreeHGlobal((IntPtr)Row);
        });
        Marshal.FreeHGlobal((IntPtr)BitmapClone); // release the backup data
        Bmp.UnlockBits(BmpData);
    }

The CopyMemory function is declared as follows:

    [DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = true)]
    internal static extern unsafe void CopyMemory(byte* Dest, byte* src, int Length);

Let's first look at the speed improvement:

    Image size    Single thread (ms)    Multithreaded (ms)    PS (s)
    1024*768      926                   556                   0.7
    1600*1200     2986                  1214                  1.5
    4000*3000     21249                 604                   7.2

As the table shows, the larger the image, the larger the ratio between single-threaded and multithreaded time, and the greater the advantage of multithreading. The multithreaded C# version is also faster than Photoshop, but that does not mean Photoshop is doing a poor job. First, the two algorithms are not exactly the same. Second, Photoshop has additional processing of its own to do.

Looking more closely at the code above, note that the call Parallel.ForEach(Partitioner.Create(0, Height, Height / Environment.ProcessorCount), (H) => { ... }) passes Height / Environment.ProcessorCount as the range size. The purpose is to split the rows into about Environment.ProcessorCount ranges, so that the parallel computation uses only about that many workers. On the one hand this maximizes performance. On the other hand, and this is the main reason, Row = (uint*)Marshal.AllocHGlobal(SampleRadius * 4) is executed fewer times and therefore consumes less memory.
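To see what the range size argument does, the ranges can be listed directly. The sketch below is illustrative only: RangeDemo and RowRanges are hypothetical names, and the Math.Max guard is an addition that is not in the code above. It is there because Height / Environment.ProcessorCount is 0 when the image has fewer rows than the machine has cores, and Partitioner.Create rejects a range size that is not positive.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that lists the row ranges Partitioner.Create produces.
public static class RangeDemo
{
    public static List<Tuple<int, int>> RowRanges(int height, int workers)
    {
        // Guard (an assumption of this sketch): never pass a range size of 0.
        int rangeSize = Math.Max(1, height / workers);
        return Partitioner.Create(0, height, rangeSize)
                          .GetDynamicPartitions()
                          .OrderBy(r => r.Item1) // sorted for display
                          .ToList();
    }
}
```

A call such as RangeDemo.RowRanges(bitmapHeight, Environment.ProcessorCount) returns one tuple per range, where bitmapHeight stands for the image height. Because the division truncates, there can be one short extra range at the end, so a per-range buffer such as Row is allocated once per range and not once per row.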
