Ever since Yunfeng (Cloud Wind) started using a Pentium 200 MMX CPU in March this year, he has been thinking about how to use MMX technology to accelerate alpha blending, especially in the high-color (16-bit) mode that is commonly used today. Earlier, a discussion on a foreign game-programming mailing list concluded that MMX is not well suited to alpha blending of 16-bit colors. Let's first look at what MMX adds to the general instruction set, to understand the grounds for that claim.
The advantage of MMX is that its registers are 64 bits wide and support packed data: a register can hold 8 bytes, 4 words, or 2 doublewords, and the same operation is applied to all of them at once, which suits bulk data processing. Packed data can also be compared in a single instruction, which helps when testing a batch of pixels against the transparent color. An MMX CPU has eight MMX registers, which to some extent relieves the register shortage of the 80x86.
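To make the packed comparison concrete, here is a minimal sketch in portable C that models what a word-wise packed compare does to four 16-bit pixels held in one 64-bit value. It is not MMX code: the function names `cmpeq_4x16` and `splat_4x16` are made up for this illustration, and the loop only imitates the behavior of PCMPEQW lane by lane.

```c
#include <stdint.h>

/* Portable model of a packed word compare (PCMPEQW):
   each of the four 16-bit lanes becomes 0xFFFF if equal, 0x0000 if not. */
uint64_t cmpeq_4x16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        if (x == y)
            result |= (uint64_t)0xFFFF << (16 * lane);
    }
    return result;
}

/* Replicate one 16-bit value (for example the transparent color)
   into all four lanes. */
uint64_t splat_4x16(uint16_t v)
{
    return (uint64_t)v * 0x0001000100010001ULL;
}
```

Comparing four pixels against `splat_4x16(key)` yields a mask that marks every transparent pixel in one step, which is the benefit the paragraph above refers to.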
However, it also has many shortcomings. For example, the arithmetic instructions do not cover every packed data size (multiplication, for instance, exists only for words), the instructions do not affect the flags, and they cannot take immediate constants as data operands. The MMX instruction set is also very small (even a NOT operation cannot be done directly).
When the color depth is 24 or 32 bits, R, G, and B each occupy 8 bits, so the packed multiply instructions of MMX can be used neatly to implement alpha blending (MMX multiplication only needs the word-oriented PMULHW/PMULLW). This article aims to explore fast alpha blending of 16-bit colors, so that case is not discussed further here.
In 16-bit color, red, green, and blue occupy 5 or 6 bits each, which makes them hard to separate into packed fields and therefore hard to process with these MMX features. Another solution, of course, is to use an ARGB 4444 layout, in which 4 bits are the alpha channel and each color component occupies half a byte, and then apply a similar method.
If you have read the optimized 16-bit alpha blending algorithm that Yunfeng proposed last year, you may already have thought of extending that algorithm to MMX. Yes, perhaps you have understood already: that is the basic theoretical point of this article. The only problem is that we have to face the various defects of the MMX instruction set, which will show up step by step in the actual programming. (Below, while introducing the algorithm, Yunfeng will also introduce some techniques for using MMX.)
First, let's see whether the previous algorithm can be optimized further:
The key to alpha blending in 16-bit color is how to separate R, G, and B so that the subsequent multiplication results do not interfere with each other.
I proposed extending the 16-bit rrrrrggggggbbbbb to 32 bits and converting it into 00000gggggg00000rrrrr000000bbbbb. The green in the middle moves up into the high 16 bits, leaving gaps of 5 to 6 bits between the components. For a 5-bit component, an alpha level finer than 5 bits is meaningless, so as long as the alpha value is limited to 0 to 31, the three components can be multiplied at the same time without their carries interfering. This costs one extra operation to expand 16 bits to 32 bits and another to clear the gap bits to 0, and the result needs an equally complex inverse operation to go from 32 bits back to 16 bits.
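The expansion just described can be sketched in C as follows. This is only an illustration under stated assumptions: the names `spread565`, `pack565`, and `blend565` are invented here, and the blend weights are taken as alpha and 32 - alpha (so alpha may run from 0 to 32, a slightly wider range than the 0 to 31 mentioned in the text) so that both endpoints come out exact.

```c
#include <assert.h>
#include <stdint.h>

#define SPREAD_MASK 0x07E0F81Fu  /* 00000gggggg00000 rrrrr000000bbbbb */

/* Spread a 565 pixel: G moves to the high word, R and B stay in the low word. */
uint32_t spread565(uint16_t p)
{
    uint32_t x = p;
    return (x | (x << 16)) & SPREAD_MASK;
}

/* Inverse operation: fold a spread value back into a 565 pixel. */
uint16_t pack565(uint32_t x)
{
    x &= SPREAD_MASK;
    return (uint16_t)(x | (x >> 16));
}

/* Per component: (src * alpha + dst * (32 - alpha)) / 32.
   The gaps of 5 to 6 zero bits absorb the products, so one 32-bit
   multiplication handles all three components. */
uint16_t blend565(uint16_t src, uint16_t dst, unsigned alpha)
{
    assert(alpha <= 32);
    uint32_t s = spread565(src);
    uint32_t d = spread565(dst);
    return pack565((s * alpha + d * (32 - alpha)) >> 5);
}
```

Each call still pays for one spread per input and one pack for the result.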
The idea for the improvement is to split two pixels directly into rrrrr000000bbbbb00000gggggg00000 and 00000gggggg00000rrrrr000000bbbbb. After the first part is shifted right by 5 bits it becomes 00000rrrrr000000bbbbb00000gggggg, and each of the two doublewords can then process three components at once. When the results are ready, the first group is shifted left by 5 bits, back to its original position, and merged with the second. This saves several shift operations, and the data can be read 4 bytes at a time and written 4 bytes at a time, which is very efficient. On the traditional 80x86, however, two restrictions apply:
First, there are not enough CPU registers. The method needs four 32-bit registers to hold data. EAX, EBX, ECX, and EDX would be enough for that, but then the alpha blending can no longer be written inline in the blit operation; it has to be a subroutine that is called. (But it is worth a try, isn't it? If anyone gets it written, I hope to be shown the result. I have left an interface in the Wind Soul game library and described how to write the function in the comments.)
Second, in 2D games alpha blending is usually used to draw sprites rather than regular rectangular bitmaps, so a transparent color test is still needed. When two pixels are processed at once, this step is not easy to implement. (It is not that no method exists; the code is just long and complicated :-()
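Setting the two restrictions aside for a moment, the two-pixel split itself can be sketched in portable C. This is a sketch under assumptions, not the Wind Soul code: `blend565_x2` is a made-up name, both pixels share one alpha, and the weights are alpha and 32 - alpha with alpha from 0 to 32. Because the first half is shifted right by 5 before the multiplication, its products already sit in their final positions, so the "shift left by 5" after the implicit division by 32 cancels out and only a mask remains.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_HI 0xF81F07E0u /* rrrrr000000bbbbb 00000gggggg00000 */
#define MASK_LO 0x07E0F81Fu /* 00000gggggg00000 rrrrr000000bbbbb */

/* Blend two 565 pixels packed in one 32-bit word with a common alpha. */
uint32_t blend565_x2(uint32_t src, uint32_t dst, unsigned alpha)
{
    assert(alpha <= 32);
    unsigned inv = 32 - alpha;

    /* First half: R and B of the upper pixel, G of the lower pixel.
       Pre-shifting right by 5 leaves room above every component. */
    uint32_t a = ((src & MASK_HI) >> 5) * alpha
               + ((dst & MASK_HI) >> 5) * inv;

    /* Second half: G of the upper pixel, R and B of the lower pixel.
       Multiply in place, then shift the products down by 5. */
    uint32_t b = (src & MASK_LO) * alpha + (dst & MASK_LO) * inv;

    return (a & MASK_HI) | ((b >> 5) & MASK_LO);
}
```

Both pixels are read and written as a single 4-byte access, with no separate spread or pack step.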
MMX, however, provides eight registers and packed comparison instructions, which makes up for both shortcomings. In addition, the 64-bit registers make it possible to process four pixels at once. So from here on we implement the new idea only with MMX. (If you are interested in applying it with the traditional instruction set, blending two pixels at a time, and you write the actual code, please contact me. I very much hope the non-MMX alpha blending in Wind Soul can be optimized further.)
The principle of doing this with MMX is almost the same (rather simple, isn't it?). The source and destination data that are read are each split in two, giving four values held in four registers. Alpha blending is performed between the two pairs (so that six components are processed at once within each pair), and the results of the two pairs are then merged. From here on, however, we run into the problem that the eight MMX registers are not enough. MMX instructions cannot take a 64-bit immediate constant, so the masks used for splitting must stay resident in registers. If there were registers to spare, the inverses of the masks could be kept there too; unfortunately we cannot be that wasteful, because the transparent color still has to be handled.
For the transparent color, first compare the pixels with the transparent color to get a mask, then combine the blended pixels and the pixels of the original destination (which must be kept as a backup, taking up another register) with the mask through logic operations to obtain the final data to write to the destination. This requires many NOT operations, and Intel did not provide a NOT instruction in the MMX instruction set (@#$%^&!), so we have to do it indirectly with PANDN. (For example, first execute PCMPEQW mm0, mm0, which compares mm0 with itself and produces the constant FFFFFFFFFFFFFFFF; then PANDN mm1, mm0 inverts mm1.)
Here we can no longer use the packed multiplication of MMX (MMX cannot do 32-bit multiplication), so shifts, additions, and subtractions are used instead. As a result, a separate blending function has to be written for each alpha value. Finally, an array of function pointers is built that holds the blending function for each alpha level in order, and the function matching the required alpha value is looked up and called at run time.
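The PANDN trick and the transparent color merge can be modeled in portable C on 64-bit integers. This is only a sketch of the logic, not MMX code: `pandn`, `mmx_not`, and `select_by_key` are hypothetical names, and `key_mask` is assumed to hold FFFF in every 16-bit lane whose source pixel equals the transparent color.

```c
#include <stdint.h>

/* Model of "PANDN dst, src": dst = (NOT dst) AND src. */
uint64_t pandn(uint64_t dst, uint64_t src)
{
    return ~dst & src;
}

/* NOT done the MMX way: build all ones, then PANDN against them. */
uint64_t mmx_not(uint64_t x)
{
    uint64_t ones = ~(uint64_t)0;  /* what PCMPEQW mm0, mm0 leaves in mm0 */
    return pandn(x, ones);
}

/* Keep the backed-up destination pixel where the source was transparent,
   and take the blended pixel everywhere else. */
uint64_t select_by_key(uint64_t key_mask, uint64_t blended, uint64_t dst_backup)
{
    return (key_mask & dst_backup) | pandn(key_mask, blended);
}
```

In this model PANDN already folds the inversion into the AND, so the merge needs no separate all-ones constant; whether that ordering fits the register budget of a real MMX loop is left open here.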
In Wind Soul 0.07 the alpha blending algorithm was changed again (0.06 used the algorithm above; 0.07 does not). Thanks go to the netizen T & P (tapu@371.net) for the new idea. For alpha blending with a relatively small number of levels, such as 8, a simpler method can be used. Note that at alpha 50%, r = (R1 + R2)/2 is approximately equal to R1/2 + R2/2, which makes it easy to process R, G, and B at the same time: after shifting, only one simple AND is needed (0rrrrrggggggbbbb & 0111101111101111 = 0rrrr0ggggg0bbbb), and adding the two shifted values completes the alpha = 50% blend. This method avoids splitting and restoring the data, so it is faster. Earlier versions of Wind Soul already special-cased the 50% alpha level this way. However, the shift introduces an error of 1/32 or 1/64 on each component.
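The 50% case is short enough to write out. The sketch below is plain C rather than MMX, with invented function names; the four-pixel variant simply repeats the 16-bit mask in each lane of a 64-bit value, the width an MMX register would hold.

```c
#include <stdint.h>

/* 0111101111101111: clears the bit that each component's shift
   pushes into its lower neighbor. */
#define HALF_MASK_16 0x7BEFu
#define HALF_MASK_64 0x7BEF7BEF7BEF7BEFULL

/* 50% blend of two 565 pixels: a/2 + b/2 per component. */
uint16_t blend565_half(uint16_t a, uint16_t b)
{
    return (uint16_t)(((a >> 1) & HALF_MASK_16) + ((b >> 1) & HALF_MASK_16));
}

/* The same operation on four 565 pixels at once. Each lane's sum is
   at most 0xF7DE, so no carry crosses into the next lane. */
uint64_t blend565_half_x4(uint64_t a, uint64_t b)
{
    return ((a >> 1) & HALF_MASK_64) + ((b >> 1) & HALF_MASK_64);
}
```

Blending white with white gives 0xF7DE rather than 0xFFFF, which is the 1/32 (or 1/64 for green) error mentioned above.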
As the next step, we can extend the 50% case to 25%, 12.5%, or even finer. Consider R1 * 25% + R2 * 75%: it equals R2 + R1 * 25% - R2 * 25% = R2 + R1/4 - R2/4. Dividing by 4 works the same way as dividing by 2: (rrrrrggggggbbbbb >> 2) & 0011100111100111. Similarly, x * 37.5% + y * 62.5% = (x + y)/2 + y/8 - x/8. Using only shifts, additions, and subtractions, all the color components can be blended at the same time.
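The two formulas above translate directly into C. The helper and function names below are hypothetical, and the divide-by-8 mask 0001100011100011 is derived here by the same rule as the other two masks rather than taken from the text.

```c
#include <stdint.h>

/* Per-component division of a 565 pixel by 2, 4, and 8:
   shift, then clear the bits that leaked in from the neighbor above. */
uint16_t shr1_565(uint16_t p) { return (uint16_t)((p >> 1) & 0x7BEF); }
uint16_t shr2_565(uint16_t p) { return (uint16_t)((p >> 2) & 0x39E7); }
uint16_t shr3_565(uint16_t p) { return (uint16_t)((p >> 3) & 0x18E3); }

/* 25% src + 75% dst = dst + src/4 - dst/4 */
uint16_t blend565_25(uint16_t src, uint16_t dst)
{
    return (uint16_t)(dst + shr2_565(src) - shr2_565(dst));
}

/* 37.5% src + 62.5% dst = src/2 + dst/2 + dst/8 - src/8 */
uint16_t blend565_375(uint16_t src, uint16_t dst)
{
    return (uint16_t)(shr1_565(src) + shr1_565(dst)
                      + shr3_565(dst) - shr3_565(src));
}
```

Every component of the final value stays inside its own field, so the additions and subtractions can run on the whole pixel without splitting it.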
Now for the defects of this method. First, there is error: each shift can introduce an error of up to 1/32, and errors may accumulate over several operations, so the number of alpha levels cannot be large. In addition, when the alpha levels are divided too finely, the number of steps grows considerably, and the advantage of operating directly without splitting may be lost. Worse still, with MMX acceleration the masks used for the AND operations should be kept in registers (if they are in memory, MMX has no immediate operands, and fetching them through indirect addressing may slow things down even when the cache hits, which costs too much in a large blending operation). MMX has only eight registers, and with this many masks they are clearly not enough. Even so, this is a good method. For the new alpha sprites in Wind Soul 0.07, this change of algorithm brought a speed improvement of about 10%, while the loss of image quality is hardly noticeable :-)
Finally, let's discuss bitmaps with an alpha channel. Here every pixel has its own alpha value, and the bitmap structure should be laid out sensibly. Storing the alpha value together with the color information is not cost-effective and does not help high-speed processing. Instead, the alpha values of all pixels can be stored together. For 16-bit color, a reasonable number of alpha levels is 16 or fewer, so two alpha values fit in each byte. One register can then serve as a pointer into the alpha area: read the alpha value of the corresponding pixel and call the matching blending function. However, since each pixel of the bitmap may have a different alpha value, several pixels cannot be processed at once. Yunfeng has found another way to accelerate this; see the following section for details.
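A separate alpha area with two 4-bit values per byte might be accessed as sketched below. The nibble order is an assumption made for this example (even pixel in the low nibble, odd pixel in the high nibble), and the names `alpha_at` and `set_alpha_at` are invented; the article does not specify the layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Read the 4-bit alpha (0..15) of pixel i from a packed alpha plane.
   Assumed layout: even pixel in the low nibble, odd pixel in the high nibble. */
unsigned alpha_at(const uint8_t *plane, size_t i)
{
    uint8_t byte = plane[i >> 1];
    return (i & 1) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0F);
}

/* Store a 4-bit alpha for pixel i without disturbing its neighbor. */
void set_alpha_at(uint8_t *plane, size_t i, unsigned alpha)
{
    uint8_t *byte = &plane[i >> 1];
    if (i & 1)
        *byte = (uint8_t)((*byte & 0x0F) | ((alpha & 0x0F) << 4));
    else
        *byte = (uint8_t)((*byte & 0xF0) | (alpha & 0x0F));
}
```

The value returned by `alpha_at` could then serve as the index into a table of per-level blending functions.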