[Repost] Introduction to Program Design Based on the MMX Instruction Set

Source: Internet
Author: User

One

MMX Technology Introduction

Intel's MMX technology (a multimedia-oriented instruction set extension) can greatly improve the performance of applications that process two-dimensional graphics and images. It is suited to intensive processing of large amounts of data and large arrays, and the basic data units it operates on are bytes, words (16 bits), and doublewords (32 bits).
Visual Studio .NET 2003 supports the MMX instruction set through C++ functions, so MMX instructions can be used directly from C++ code without writing assembly code. The Intel Software Manuals [1] and the MSDN topics on MMX programming [2] help in grasping the essentials of MMX programming.
MMX technology implements the SIMD (single-instruction, multiple-data) execution model. Consider the following task: add a number to every element of a byte array. A traditional program implements it with this algorithm:
for every b in array        // each element b of the array
    b = b + n               // add the number n
Here are the details of its implementation:
for every b in array        // each element b of the array
{
    load b into a register
    add n to the register
    store the register back to memory
}
Processors that support the MMX instruction set have eight 64-bit registers, each of which can hold 8 bytes, 4 words, or 2 doublewords. The MMX instruction set provides instructions that load values (bytes, words, or doublewords) into these registers, perform arithmetic or logical operations on them, and store the results back to memory. With MMX, the algorithm above becomes:
for every 8 members in array        // take 8 consecutive bytes of the array as a group
{
    load these 8 bytes into an MMX register
    add n to each of the 8 bytes in the register with a single CPU instruction
    store the register back to memory
}
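The 8-bytes-per-step loop above can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the sample projects: the names `add_to_bytes` and `SKETCH_HAS_MMX` are hypothetical, the addition uses wraparound arithmetic (`_mm_add_pi8`), and the code falls back to a plain scalar loop on compilers or targets where the MMX functions are not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: MMX functions are usable with GCC/Clang on x86 (__MMX__)
// and with 32-bit MSVC (_M_IX86). Everywhere else only the scalar loop runs.
#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define SKETCH_HAS_MMX 1
#else
#define SKETCH_HAS_MMX 0
#endif

// Adds n to every byte of data, using wraparound arithmetic.
// Where MMX is available, 8 bytes are processed per loop iteration.
void add_to_bytes(std::uint8_t* data, std::size_t count, std::uint8_t n)
{
    assert(data != nullptr || count == 0);
    std::size_t i = 0;
#if SKETCH_HAS_MMX
    const __m64 addend = _mm_set1_pi8(static_cast<char>(n)); // n in all 8 byte lanes
    for (; i + 8 <= count; i += 8)
    {
        __m64 block;
        std::memcpy(&block, data + i, 8);      // load 8 bytes
        block = _mm_add_pi8(block, addend);    // 8 byte additions at once
        std::memcpy(data + i, &block, 8);      // store 8 bytes
    }
    _mm_empty();                               // EMMS: leave the MMX state clean
#endif
    for (; i < count; ++i)                     // leftover bytes (or all of them without MMX)
        data[i] = static_cast<std::uint8_t>(data[i] + n);
}
```

Calling `add_to_bytes(buffer, length, 10)` adds 10 to every byte of `buffer`; the scalar tail loop handles arrays whose length is not a multiple of 8.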
C++ programmers do not have to use the MMX registers and instructions directly. Instead, they can use the 64-bit data type __m64 and a set of C++ functions that perform the corresponding arithmetic and logical operations. Deciding which MMX registers the program uses and how to optimize the code is the job of the C++ compiler.
The Visual C++ MMXSwarm sample [4] in MSDN is an excellent example of image processing with MMX technology. It contains a number of wrapper classes that simplify MMX operations, and it shows how to process images in various formats (such as monochrome, 24-bit RGB, and 32-bit RGB). This article is only a brief introduction to MMX programming with Visual C++. If you are interested, refer to the MMXSwarm sample on MSDN.


Two

MMX Programming in Detail
Include files
All MMX instruction set functions become available by including the emmintrin.h file (the MMX functions themselves are declared in mmintrin.h, which emmintrin.h includes indirectly):
#include <emmintrin.h>
Because the compiler generates the MMX processor instructions directly, no .lib library file is needed.
The __m64 data type
A variable of this type is used as the operand of MMX instructions and should not be accessed directly. Variables of type __m64 are automatically aligned on 8-byte boundaries.
CPU support for the MMX instruction set
If your CPU supports the MMX instruction set, you can use the MMX C++ functions provided by Visual Studio .NET 2003. The Visual C++ CPUID sample [3] in MSDN shows how to detect whether a CPU supports the MMX and SSE instruction sets or other CPU features.
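As a rough illustration of such a check (this is not the code of the MSDN CPUID sample), the following sketch reads CPUID leaf 1 and decodes the EDX feature bits: bit 23 reports MMX, bit 25 SSE, and bit 26 SSE2. The names `CpuFeatures`, `decode_leaf1_edx`, and `query_cpu_features` are hypothetical, and on non-x86 targets the query simply reports that nothing is supported.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>          // __cpuid
#define SKETCH_CPUID_MSVC 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>           // __get_cpuid
#define SKETCH_CPUID_GNU 1
#endif

struct CpuFeatures
{
    bool mmx;
    bool sse;
    bool sse2;
};

// Decodes the EDX register returned by CPUID leaf 1.
CpuFeatures decode_leaf1_edx(std::uint32_t edx)
{
    CpuFeatures f;
    f.mmx  = ((edx >> 23) & 1u) != 0;   // bit 23: MMX
    f.sse  = ((edx >> 25) & 1u) != 0;   // bit 25: SSE
    f.sse2 = ((edx >> 26) & 1u) != 0;   // bit 26: SSE2
    return f;
}

// Queries the running CPU; reports "nothing supported" on non-x86 targets.
CpuFeatures query_cpu_features()
{
    std::uint32_t edx = 0;
#if defined(SKETCH_CPUID_MSVC)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    edx = static_cast<std::uint32_t>(regs[3]);
#elif defined(SKETCH_CPUID_GNU)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d))
        edx = d;
#endif
    return decode_leaf1_edx(edx);
}
```

A program would typically call `query_cpu_features()` once at startup and select the MMX code path only when the `mmx` field is true.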
Saturation arithmetic and wraparound mode
MMX technology supports a computation model called saturation arithmetic. In saturation mode, when a result overflows or underflows, the CPU automatically clamps it to the largest value the data type can represent (on overflow) or to the smallest (on underflow). Saturation arithmetic is well suited to image processing.
The following example illustrates the difference between saturation and wraparound modes. Suppose a byte variable holds the value 255 and 1 is added to it. In wraparound mode the result is 0 (the carry is discarded); in saturation mode the result is 255. Underflow is handled in a similar way: for unsigned bytes in saturation mode, 1 minus 2 is 0 (not -1). The MMX byte and word addition and subtraction instructions come in both modes: saturation and wraparound. The projects discussed in this article use only the saturating MMX instructions.
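The two modes can be mimicked in plain C++ for a single unsigned byte. This is a scalar sketch only; the function names are hypothetical, and MMX applies the same rule to all 8 bytes of a register at once.

```cpp
#include <cassert>
#include <cstdint>

// Wraparound addition: the carry out of bit 7 is discarded.
std::uint8_t add_wrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Saturating addition: results above 255 are clamped to 255.
std::uint8_t add_saturate(std::uint8_t a, std::uint8_t b)
{
    const int sum = a + b;
    return static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
}

// Wraparound subtraction: a negative result wraps around modulo 256.
std::uint8_t sub_wrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a - b);
}

// Saturating subtraction: results below 0 are clamped to 0.
std::uint8_t sub_saturate(std::uint8_t a, std::uint8_t b)
{
    const int diff = a - b;
    return static_cast<std::uint8_t>(diff < 0 ? 0 : diff);
}
```

With these helpers, `add_wrap(255, 1)` gives 0 while `add_saturate(255, 1)` gives 255, and `sub_saturate(1, 2)` gives 0 while `sub_wrap(1, 2)` gives 255.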
Programming examples
The following are examples of MMX applications built with Visual Studio .NET 2003. The sample archive can be downloaded from http://www.codeproject.com/cpp/mmxintro/MMX_src.zip. It contains two projects, both of them Visual C++ .NET projects based on the Microsoft Foundation Classes (MFC). You can also build the two projects yourself by following the descriptions below.
MMX8 Demo Project
MMX8 is a single document interface (SDI) application that performs simple processing of monochrome bitmaps with 8 bits per pixel. The source image and the processed image are displayed in the window. The new ATL (Active Template Library) class CImage is used to load images from resources and display them in the window. The program performs two operations on the image: inverting its colors and changing its brightness. Each operation can be implemented in one of the following ways:
Pure C++ code;
Code that uses the C++ MMX functions;
Code that uses MMX assembly instructions.
The time taken to process the image is displayed in the status bar.
Image inversion implemented in pure C++:

void CImg8Operations::InvertImageCPlusPlus(BYTE* pSource, BYTE* pDest, int nNumberOfPixels)
{
    for (int i = 0; i < nNumberOfPixels; i++)
    {
        *pDest++ = 255 - *pSource++;    // invert one pixel
    }
}

To find the C++ MMX functions to use, I consulted the descriptions of the MMX assembly instructions in the Intel Software Manuals. I first found the general description of the MMX instructions in Chapter 8 of Volume 1, and then the detailed description of each instruction in Volume 2, which also lists the C++ functions that correspond to it. I then looked up these C++ functions in MSDN. The MMX instructions and the corresponding C++ functions used in the MMX8 sample are shown in the following table:
MMX instruction   Visual C++ .NET MMX function   Functionality
EMMS              _mm_empty                      Empties the MMX state (clears the MMX registers) to avoid conflicts with floating-point operations.
PSUBUSB           _mm_subs_pu8                   Subtracts the 8 unsigned 8-bit bytes of one 64-bit value from the corresponding bytes of another, with saturation.
PADDUSB           _mm_adds_pu8                   Adds the corresponding 8 unsigned 8-bit bytes of two 64-bit values, with saturation.
Image inversion implemented with the Visual C++ .NET MMX functions:

void CImg8Operations::InvertImageC_MMX(BYTE* pSource, BYTE* pDest, int nNumberOfPixels)
{
    __int64 i = 0;
    i = ~i;                                 // 0xffffffffffffffff

    // process 8 pixels per loop iteration
    int nLoop = nNumberOfPixels / 8;

    __m64* pIn = (__m64*) pSource;          // pointer to the input byte array
    __m64* pOut = (__m64*) pDest;           // pointer to the output byte array
    __m64 tmp;                              // temporary work variable

    _mm_empty();                            // execute the MMX instruction emms: initialize the MMX registers
    __m64 n1 = Get_m64(i);

    for (int i = 0; i < nLoop; i++)
    {
        tmp = _mm_subs_pu8(n1, *pIn);       // unsigned subtraction with saturation
                                            // for each byte: tmp = n1 - *pIn
        *pOut = tmp;
        pIn++;                              // advance to the next 8 pixels
        pOut++;
    }

    _mm_empty();                            // execute the MMX instruction emms: clear the MMX registers
}

__m64 CImg8Operations::Get_m64(__int64 n)
{
    union __m64__m64
    {
        __m64 m;
        __int64 i;
    } mi;

    mi.i = n;
    return mi.m;
}

Although these functions complete in a very short time, I measured the time required by each of the three methods. Here are the results on my computer:
Pure C++ code: 43 ms
Code using the C++ MMX functions: 26 ms
Code using MMX assembly instructions: 26 ms
These timings are obtained only when the program is compiled as an optimized Release build.
To change the brightness of the image I used the simplest method: adding a value to, or subtracting a value from, the color value of every pixel in the image. This function is slightly more complicated than the previous one because the processing splits into two cases: increasing the pixel values and decreasing them.
Brightness change implemented in pure C++:

void CImg8Operations::ChangeBrightnessCPlusPlus(BYTE* pSource, BYTE* pDest, int nNumberOfPixels, int nChange)
{
    if (nChange > 255)
        nChange = 255;
    else if (nChange < -255)
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);
    int i, n;

    if (nChange > 0)    // increase the pixel values
    {
        for (i = 0; i < nNumberOfPixels; i++)
        {
            n = (int)(*pSource++ + b);
            if (n > 255)
                n = 255;
            *pDest++ = (BYTE) n;
        }
    }
    else                // decrease the pixel values
    {
        for (i = 0; i < nNumberOfPixels; i++)
        {
            n = (int)(*pSource++ - b);
            if (n < 0)
                n = 0;
            *pDest++ = (BYTE) n;
        }
    }
}

Brightness change implemented with the Visual C++ .NET MMX functions:

void CImg8Operations::ChangeBrightnessC_MMX(BYTE* pSource, BYTE* pDest, int nNumberOfPixels, int nChange)
{
    if (nChange > 255)
        nChange = 255;
    else if (nChange < -255)
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);
    int i;                                  // declared once so that all loops below can use it

    // copy b into each of the 8 bytes of c
    __int64 c = b;
    for (i = 1; i <= 7; i++)
    {
        c = c << 8;
        c |= b;
    }

    // process 8 pixels per loop iteration
    int nNumberOfLoops = nNumberOfPixels / 8;

    __m64* pIn = (__m64*) pSource;          // input byte array
    __m64* pOut = (__m64*) pDest;           // output byte array
    __m64 tmp;                              // temporary work variable

    _mm_empty();                            // execute the MMX instruction emms
    __m64 nChange64 = Get_m64(c);

    if (nChange > 0)
    {
        for (i = 0; i < nNumberOfLoops; i++)
        {
            tmp = _mm_adds_pu8(*pIn, nChange64);    // unsigned addition with saturation
                                                    // for each byte: tmp = *pIn + nChange64
            *pOut = tmp;
            pIn++;                                  // advance to the next 8 pixels
            pOut++;
        }
    }
    else
    {
        for (i = 0; i < nNumberOfLoops; i++)
        {
            tmp = _mm_subs_pu8(*pIn, nChange64);    // unsigned subtraction with saturation
                                                    // for each byte: tmp = *pIn - nChange64
            *pOut = tmp;
            pIn++;                                  // advance to the next 8 pixels
            pOut++;
        }
    }

    _mm_empty();                            // execute the MMX instruction emms
}

Note that the sign of the parameter nChange is checked only once per function call, outside the loop, rather than inside the loop body, where it would be checked thousands of times. Here is how long it takes to process the image on my computer:
Pure C++ code: 49 ms
Code using the C++ MMX functions: 26 ms
Code using MMX assembly instructions: 26 ms
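The MMX brightness function builds its 64-bit operand by copying the change value into each of the 8 bytes. The sketch below isolates that step; the function names are hypothetical. It uses an unsigned 64-bit type instead of __int64 so that shifting into the top bit is well defined, and it adds a multiplication-based variant that produces the same bit pattern.

```cpp
#include <cassert>
#include <cstdint>

// Copies b into all 8 bytes of a 64-bit value with the same
// shift-and-OR loop that the brightness function uses.
std::uint64_t replicate_byte(std::uint8_t b)
{
    std::uint64_t c = b;
    for (int i = 1; i <= 7; ++i)
    {
        c <<= 8;    // make room for the next copy
        c |= b;     // insert b into the low byte
    }
    return c;
}

// Alternative without a loop: multiplying by 0x0101010101010101
// places one copy of b in every byte position.
std::uint64_t replicate_byte_mul(std::uint8_t b)
{
    return b * 0x0101010101010101ULL;
}
```

For example, `replicate_byte(0x2A)` yields `0x2A2A2A2A2A2A2A2A`, which is the value every byte lane of the MMX register receives for a brightness change of 42.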

Three

MMX32 Demo Project
The MMX32 project processes RGB images with 32 bits per pixel. It performs two operations: inverting the image colors and changing the color balance of the image (multiplying each color component of every pixel by a given value).
MMX multiplication is considerably more complicated than addition and subtraction, because the result of a multiplication usually needs more bits than its operands. For example, if the operands are bytes (8 bits), the result is a word (16 bits). This requires additional conversions, and the difference in processing time between the MMX assembly code and the C++ code is not very large (5-10%).
Color balance change implemented with the Visual C++ .NET MMX functions:

void CImg32Operations::ColorsC_MMX(BYTE* pSource, BYTE* pDest, int nNumberOfPixels, float fRedCoefficient, float fGreenCoefficient, float fBlueCoefficient)
{
    int nRed = (int)(fRedCoefficient * 256.0f);
    int nGreen = (int)(fGreenCoefficient * 256.0f);
    int nBlue = (int)(fBlueCoefficient * 256.0f);

    // set the multiplication coefficients, one 16-bit word per color
    __int64 c = 0;
    c = nRed;
    c = c << 16;
    c |= nGreen;
    c = c << 16;
    c |= nBlue;

    __m64 nNull = _m_from_int(0);           // null
    __m64 tmp = _m_from_int(0);             // temporary work variable, initialized

    _mm_empty();                            // clear the MMX registers
    __m64 nCoeff = Get_m64(c);

    DWORD* pIn = (DWORD*) pSource;          // input doubleword array
    DWORD* pOut = (DWORD*) pDest;           // output doubleword array

    for (int i = 0; i < nNumberOfPixels; i++)
    {
        tmp = _m_from_int(*pIn);            // tmp = *pIn (the pixel is written to the low 32 bits of tmp)

        tmp = _mm_unpacklo_pi8(tmp, nNull); // convert the 4 low bytes of tmp to words;
                                            // the high byte of each word is taken from nNull (zero)

        tmp = _mm_mullo_pi16(tmp, nCoeff);  // multiply each word of tmp by the corresponding word of nCoeff,
                                            // keeping only the low 16 bits of each product in tmp

        tmp = _mm_srli_pi16(tmp, 8);        // shift each word of tmp right by 8 bits,
                                            // which is equivalent to dividing by 256

        tmp = _mm_packs_pu16(tmp, nNull);   // pack with saturation:
                                            // the 4 words of tmp become 4 bytes in the low 32 bits of tmp;
                                            // the 4 words of nNull become 4 bytes in the high 32 bits of tmp

        *pOut = _m_to_int(tmp);             // *pOut = low 32 bits of tmp

        pIn++;
        pOut++;
    }

    _mm_empty();
}

You can see the source code of the sample project for more details about this project.
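The fixed-point arithmetic applied to each color component can be followed in scalar form. The sketch below is a hypothetical helper (`scale_channel` is not part of the sample project) that mirrors the steps for one 16-bit lane: scale the coefficient by 256, multiply, keep the low 16 bits, shift right by 8, and pack to a byte. It assumes a non-negative coefficient; as with the low-word multiply, any product that does not fit in 16 bits loses its high bits.

```cpp
#include <cassert>
#include <cstdint>

// Scales one 8-bit color component by a coefficient, mirroring the
// per-word steps of the MMX color balance function.
// Assumption: coefficient is non-negative.
std::uint8_t scale_channel(std::uint8_t value, float coefficient)
{
    // coefficient in 8.8 fixed point, as in nRed = (int)(fRedCoefficient * 256.0f)
    const std::uint16_t coeff =
        static_cast<std::uint16_t>(static_cast<int>(coefficient * 256.0f));

    // multiply and keep only the low 16 bits of the product
    const std::uint16_t product =
        static_cast<std::uint16_t>(static_cast<std::uint32_t>(value) * coeff);

    // logical shift right by 8 bits, i.e. divide by 256
    const std::uint16_t shifted = static_cast<std::uint16_t>(product >> 8);

    // pack to an unsigned byte with saturation (a no-op here, since the
    // shifted value already fits in 8 bits)
    return static_cast<std::uint8_t>(shifted > 255 ? 255 : shifted);
}
```

For instance, a component of 200 with a coefficient of 0.5 becomes (200 * 128) >> 8 = 100.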
SSE2 Technology
SSE2 technology contains a set of integer instructions similar to those of MMX, but they operate on the 128-bit SSE registers. For example, changing the color balance of an image with SSE2 can be much more efficient than doing it in pure C++ code. SSE2 is also an extension of SSE: it operates not only on arrays of single-precision floating-point values but also on arrays of double-precision floating-point values. The MMXSwarm sample project, implemented in C++, uses not only the MMX functions but also the SSE2 functions for integer operations.
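To illustrate how closely the SSE2 integer functions follow their MMX counterparts, here is a sketch of the image inversion that processes 16 pixels per step with `_mm_subs_epu8`, the 128-bit analogue of `_mm_subs_pu8`. It is not taken from the sample projects: the names `invert_image_sse2` and `SKETCH_HAS_SSE2` are hypothetical, and the code falls back to a scalar loop when SSE2 is not enabled for the target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: SSE2 is enabled when the compiler defines __SSE2__ (GCC/Clang),
// targets x64 with MSVC (_M_X64), or 32-bit MSVC is built with /arch:SSE2.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Inverts an 8-bit image: dest[i] = 255 - source[i].
void invert_image_sse2(const std::uint8_t* source, std::uint8_t* dest, std::size_t count)
{
    assert((source != nullptr && dest != nullptr) || count == 0);
    std::size_t i = 0;
#if SKETCH_HAS_SSE2
    const __m128i allOnes = _mm_set1_epi8(static_cast<char>(0xFF)); // 255 in all 16 bytes
    for (; i + 16 <= count; i += 16)
    {
        // unaligned load of 16 pixels
        const __m128i pixels =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(source + i));
        // unsigned subtraction with saturation: 255 - pixel for each byte
        const __m128i inverted = _mm_subs_epu8(allOnes, pixels);
        // unaligned store of 16 pixels
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i), inverted);
    }
#endif
    for (; i < count; ++i)                 // leftover pixels (or all of them without SSE2)
        dest[i] = static_cast<std::uint8_t>(255 - source[i]);
}
```

Because the SSE2 integer instructions use the XMM registers rather than the MMX registers, this version does not need the `_mm_empty` calls that the MMX functions require.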

Reference Documentation:
[1] Intel Software Manuals: http://developer.intel.com/design/archives/processors/mmx/index.htm
[2] MSDN topic on MMX technology: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/Vcrefsupportformmxtechnology.asp
[3] Microsoft Visual C++ CPUID sample project: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/Vcsamcpuiddeterminecpucapabilities.asp
[4] Microsoft Visual C++ MMXSwarm sample project: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/Vcsammmxswarmsampledemonstratescimagevisualcsmmxsupport.asp
[5] Article by Matt Pietrek in the February 1998 issue of Microsoft Systems Journal: http://www.microsoft.com/msj/0298/hood0298.aspx

Reposted at: http://blog.itpub.net/8781179/viewspace-924611/
