MMX Optimization of 16-bit Alpha Blending

Source: Internet
Author: User
Tags: transparent color


Recently I started learning assembly programming. I have been working on my engine for more than 20 days and have picked up some experience that I would like to share. Graphics operations are the resource-eating tiger of a game: they devour an enormous number of CPU cycles. To speed up the graphics path of my game engine, I optimized several drawing routines using the surprisingly powerful MMX instructions. Why "surprisingly powerful"? You will see below.

MMX instructions are well suited to bulk data operations. MMX processors provide eight 64-bit registers, and MMX instructions operate on packed groups of data, which makes them ideal for images that would otherwise be processed pixel by pixel. Below I will use my alpha blending code as an example of MMX's processing power.

Some time ago I posted an article on my homepage about 16-bit alpha blending. At the time I could not write assembly, so I used an example program written in C: when alpha blending a 912x720 region, the frame rate was only 3-4 FPS, which is of no practical value. Now, with the MMX sword in hand, after some effort the frame rate has soared to 14 FPS, about four times faster than before. At 640x480, full-screen alpha blending reaches more than 20 frames per second.

Why such a large improvement? As I said at the start, the MMX registers are 64 bits wide and there are eight of them, mm0-mm7, and MMX instructions operate on them as packed bytes, words, dwords, or a single quadword. That is the key to the whole problem. Our 16-bit alpha blend uses two bytes per pixel, so one MMX register holds four such pixels; a single blend computation therefore mixes four pixels at once, which is why the speed increases roughly fourfold.

Now let's look at the actual code. wddest and wdres are the destination and source image data pointers. __depth is the alpha blending depth, with the value 0x0001000100010001 * ndepth. nmmxcount records how many MMX iterations are performed on each row. The not_mmx_point section handles the remainder of the copy rectangle's width modulo 4: whenever the width is not a multiple of four pixels, the leftover points must be processed with ordinary assembly code. I have only left an interface for this part and have not implemented it, because there are not enough registers and push/pop would be needed. __mask64 is an __int64 mask with the value 0x0001000100010001 * (short)m_ncolorkey, and __mask is the mask 0x001F001F001F001F used to strip the unwanted bits from each color channel (the image is in 555 format).

Note that each movq moves four pixels into an MMX register, and the packed operations then work on the four words in parallel. Also note that once MMX is used, the transparent color can no longer simply be skipped the way it was before; it must be handled. To keep the transparent parts from affecting the destination data I used a little trick; see the comments in my code for the details. This is something I worked out myself. For the complete code, please download my full japplib.
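For reference, here is a minimal plain-C sketch of the per-pixel blend that the MMX loop below vectorizes. It follows the formula quoted in the code comment (wddr = ((wddr - wdrr) * ndepth + (wdrr << 5)) >> 5) and the color-key behavior described above; the names blend555, depth and colorkey are mine, not the engine's, and this is not the actual japplib code.

static unsigned short blend555(unsigned short dst, unsigned short src,
                               int depth /* 0..32 */, unsigned short colorkey)
{
    if (src == colorkey)
        return dst;                 /* transparent source pixel: keep destination */

    int db = dst & 0x1F, dg = (dst >> 5) & 0x1F, dr = (dst >> 10) & 0x1F;
    int sb = src & 0x1F, sg = (src >> 5) & 0x1F, sr = (src >> 10) & 0x1F;

    /* channel = ((dst - src) * depth + (src << 5)) >> 5
               = (dst * depth + src * (32 - depth)) / 32 */
    int rb = ((db - sb) * depth + (sb << 5)) >> 5;
    int rg = ((dg - sg) * depth + (sg << 5)) >> 5;
    int rr = ((dr - sr) * depth + (sr << 5)) >> 5;

    return (unsigned short)((rr << 10) | (rg << 5) | rb);
}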

__asm
{
    // pusha;
    movq    mm6, __depth;
    mov     eax, dword ptr wddest;
    mov     ebx, dword ptr wdres;
    mov     cx, nuseh;              // height of the operation rectangle

add_next_row:
    cmp     cx, 0;
    je      all_end;
    xor     dx, dx;

next_mmx_point:
    cmp     dx, nmmxcount;
    je      not_mmx_point;
    movq    mm0, [eax];
    movq    mm1, [ebx];
    movq    mm7, mm1;
    pcmpeqw mm7, __mask64;          // handle the transparent color: the idea is to replace
    psubusw mm1, mm7;               // transparent source pixels with the destination color,
    pand    mm7, mm0;               // so no matter how the alpha formula is applied, the
    por     mm1, mm7;               // blended result for those pixels never changes
                                    // (the additive-blend effect works on the same principle)

    movq    mm2, mm0;               // G
    psrlw   mm2, 5;
    pand    mm2, __mask;
    movq    mm3, mm1;
    psrlw   mm3, 5;
    pand    mm3, __mask;

    movq    mm4, mm0;               // R    // wddr = ((wddr - wdrr) * ndepth + (wdrr << 5)) >> 5;
    psrlw   mm4, 10;
    pand    mm4, __mask;
    movq    mm5, mm1;
    psrlw   mm5, 10;
    pand    mm5, __mask;

    // psllw mm0, 1;                // B
    pand    mm0, __mask;
    // psllw mm1, 1;
    pand    mm1, __mask;

    psubsw  mm0, mm1;
    pmullw  mm0, mm6;
    psllw   mm1, 5;
    paddsw  mm0, mm1;
    psrlw   mm0, 5;
    // psrlw  mm0, 5;
    // paddusw mm0, mm1;

    psubsw  mm2, mm3;
    pmullw  mm2, mm6;
    psllw   mm3, 5;
    paddsw  mm2, mm3;
    psrlw   mm2, 5;
    // psrlw  mm2, 5;
    // paddusw mm2, mm3;

    psubsw  mm4, mm5;
    pmullw  mm4, mm6;
    psllw   mm5, 5;
    paddsw  mm4, mm5;
    psrlw   mm4, 5;
    // psrlw  mm4, 5;
    // paddusw mm4, mm5;

    // psllw mm0, 10;
    psllw   mm2, 5;
    psllw   mm4, 10;

    por     mm0, mm2;
    por     mm0, mm4;

    movq    [eax], mm0;

    add     eax, 8;
    add     ebx, 8;
    inc     dx;
    jmp     next_mmx_point;

not_mmx_point:
    xor     dx, dx;

not_mmx_next:
    cmp     dx, nnotmmx;            // leftover pixels (width mod 4); handling is left as a stub
    je      row_end;
    sub     eax, 2;
    sub     ebx, 2;
    inc     dx;
    jmp     not_mmx_next;

row_end:
    sub     eax, nunused2;          // per-row pointer adjustment
    sub     ebx, nunused1;
    dec     cx;
    jmp     add_next_row;
    // loop add_next_row;

all_end:
    // popa;
    emms;
}
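For readers who prefer intrinsics over inline assembly, here is a hedged sketch of the same inner-loop technique using the MMX intrinsics from <mmintrin.h> (available in 32-bit x86 builds). The function name blend4_mmx and its parameters are mine; this is an illustration of the approach, not the actual japplib code.

#include <mmintrin.h>   /* MMX intrinsics */

/* Blend four 555-format pixels at a time: dst and src each point at
   four consecutive 16-bit pixels, depth is 0..32, colorkey is the
   transparent color value. */
static void blend4_mmx(unsigned short *dst, const unsigned short *src,
                       int depth, unsigned short colorkey)
{
    __m64 mask = _mm_set1_pi16(0x001F);            /* __mask   */
    __m64 key  = _mm_set1_pi16((short)colorkey);   /* __mask64 */
    __m64 dep  = _mm_set1_pi16((short)depth);      /* __depth  */

    __m64 d = *(const __m64 *)dst;                 /* movq mm0, [eax] */
    __m64 s = *(const __m64 *)src;                 /* movq mm1, [ebx] */

    /* Color-key trick: where src == colorkey, replace src with dst so
       the blend leaves the destination untouched. */
    __m64 t = _mm_cmpeq_pi16(s, key);              /* pcmpeqw -> 0xFFFF / 0x0000 */
    s = _mm_subs_pu16(s, t);                       /* psubusw -> src or 0        */
    s = _mm_or_si64(s, _mm_and_si64(t, d));        /* pand/por -> src or dst     */

    /* Unpack the three 5-bit channels of both operands. */
    __m64 dg = _mm_and_si64(_mm_srli_pi16(d, 5),  mask);
    __m64 sg = _mm_and_si64(_mm_srli_pi16(s, 5),  mask);
    __m64 dr = _mm_and_si64(_mm_srli_pi16(d, 10), mask);
    __m64 sr = _mm_and_si64(_mm_srli_pi16(s, 10), mask);
    __m64 db = _mm_and_si64(d, mask);
    __m64 sb = _mm_and_si64(s, mask);

    /* channel = ((dst - src) * depth + (src << 5)) >> 5, per 16-bit lane */
    db = _mm_srli_pi16(_mm_adds_pi16(_mm_mullo_pi16(_mm_subs_pi16(db, sb), dep),
                                     _mm_slli_pi16(sb, 5)), 5);
    dg = _mm_srli_pi16(_mm_adds_pi16(_mm_mullo_pi16(_mm_subs_pi16(dg, sg), dep),
                                     _mm_slli_pi16(sg, 5)), 5);
    dr = _mm_srli_pi16(_mm_adds_pi16(_mm_mullo_pi16(_mm_subs_pi16(dr, sr), dep),
                                     _mm_slli_pi16(sr, 5)), 5);

    /* Repack 555 and store back to the destination. */
    *(__m64 *)dst = _mm_or_si64(db, _mm_or_si64(_mm_slli_pi16(dg, 5),
                                                _mm_slli_pi16(dr, 10)));
}

In a full blit you would call this once per group of four pixels along each row, fall back to the scalar blend555 sketch above for the width-mod-4 remainder, and execute _mm_empty() (emms) once after the whole rectangle rather than per call.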
