Reason
Recently I have been writing some optimized string functions, purely out of interest. Along the way I ran into a big pitfall when I tried to implement a 128-bit logical shift.
Logical shift
The MMX and SSE shift instructions naturally come to mind:
Logical left shift: PSLLW/PSLLD/PSLLQ, Shift Packed Data Left Logical
Logical right shift: PSRLW/PSRLD/PSRLQ, Shift Packed Data Right Logical
As the names suggest, W stands for word, D for doubleword (DWORD) and Q for quadword (QWORD). PSLLW performs a logical left shift on each word-sized group,
PSLLD on each DWORD-sized group, and PSLLQ on each QWORD-sized group. All of this looks fine.
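To make the grouping concrete, here is a minimal sketch in C. It assumes an x86 or x86-64 compiler with SSE2 and uses the intrinsics `_mm_slli_epi16`, `_mm_slli_epi32` and `_mm_slli_epi64`, which correspond to PSLLW, PSLLD and PSLLQ. The helper names (`load_both_halves`, `low64`, `shl4_words` and so on) are made up for this illustration, and the shift count of 4 is arbitrary.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 or x86-64 target */
#include <stdint.h>

/* Put x into both 64-bit halves of an XMM value. */
static __m128i load_both_halves(uint64_t x)
{
    uint64_t in[2] = { x, x };
    return _mm_loadu_si128((const __m128i *)in);
}

/* Return the low 64 bits of an XMM value. */
static uint64_t low64(__m128i v)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[0];
}

/* PSLLW: every 16-bit group is shifted left by 4 on its own. */
static uint64_t shl4_words(uint64_t x)
{
    return low64(_mm_slli_epi16(load_both_halves(x), 4));
}

/* PSLLD: every 32-bit group is shifted left by 4 on its own. */
static uint64_t shl4_dwords(uint64_t x)
{
    return low64(_mm_slli_epi32(load_both_halves(x), 4));
}

/* PSLLQ: every 64-bit group is shifted left by 4 on its own. */
static uint64_t shl4_qwords(uint64_t x)
{
    return low64(_mm_slli_epi64(load_both_halves(x), 4));
}

/* The same input gives three different results, because bits that leave
 * a group are dropped instead of moving into the neighbouring group. */
static void example_grouping(void)
{
    const uint64_t x = 0xF00FF00FF00FF00FULL;
    assert(shl4_words(x)  == 0x00F000F000F000F0ULL);
    assert(shl4_dwords(x) == 0x00FF00F000FF00F0ULL);
    assert(shl4_qwords(x) == 0x00FF00FF00FF00F0ULL);
}
```

Only the group size differs between the three calls. None of them carries bits from one group into the next, which is exactly what a shift over the whole register would need.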
Taking the logical left shift as the example, the details of the instruction can be found at:
http://moeto.comoj.com/project/intel/instruct32_hh/vc256.htm
or http://x86.renejeschke.de/html/file_module_x86_id_259.html
The right shift is analogous, so it is not described here.
The problem arises.
What we need is a logical shift over all 128 bits. SSE2 has the PSLLDQ instruction, where DQ stands for double quadword.
Isn't that exactly the 128-bit shift we need? No! Let's take a look at Intel's documentation:
PSLLDQ -- Packed Shift Left Logical Double Quadword
or
http://moeto.comoj.com/project/intel/instruct32_hh/vc255.htm
As follows:
We can see that, unfortunately, SSE2 does not provide a 128-bit shift with bit granularity. PSLLDQ shifts the 128-bit value by whole bytes only, so the smallest possible shift is one byte (eight bits), which is quite unreasonable. Considering that Intel does not really process data at 128-bit granularity (most SSE instructions work on elements of at most 64 bits; a double-precision floating-point number, for example, is 64 bits), fine, we can accept that. But Intel, are you serious? PSLLDQ only takes an imm8 operand. What does imm8 mean? It is an 8-bit immediate, which means the shift count has to be hard-coded as a constant in the assembly and cannot come from a register. What the f**k?
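The same restriction shows up in C. The sketch below assumes an x86 or x86-64 compiler with SSE2 and uses the `_mm_slli_si128` intrinsic, which corresponds to PSLLDQ and likewise expects a compile-time constant count. The helper names `shift_left_bytes` and `shift_left_bytes_mem` are hypothetical. The switch statement is just one possible workaround for a count known only at run time, not something taken from Intel's documentation.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 or x86-64 target */
#include <stdint.h>

/* Shift a 128-bit value left by n whole bytes, with n known only at run
 * time. PSLLDQ takes its count as an immediate, so every possible count
 * needs its own instruction, selected here with a switch. */
static __m128i shift_left_bytes(__m128i v, unsigned n)
{
    switch (n) {
    case 0:  return v;
    case 1:  return _mm_slli_si128(v, 1);
    case 2:  return _mm_slli_si128(v, 2);
    case 3:  return _mm_slli_si128(v, 3);
    case 4:  return _mm_slli_si128(v, 4);
    case 5:  return _mm_slli_si128(v, 5);
    case 6:  return _mm_slli_si128(v, 6);
    case 7:  return _mm_slli_si128(v, 7);
    case 8:  return _mm_slli_si128(v, 8);
    case 9:  return _mm_slli_si128(v, 9);
    case 10: return _mm_slli_si128(v, 10);
    case 11: return _mm_slli_si128(v, 11);
    case 12: return _mm_slli_si128(v, 12);
    case 13: return _mm_slli_si128(v, 13);
    case 14: return _mm_slli_si128(v, 14);
    case 15: return _mm_slli_si128(v, 15);
    default: return _mm_setzero_si128(); /* 16 bytes or more: nothing is left */
    }
}

/* Wrapper for plain byte arrays; index 0 is the least significant byte. */
static void shift_left_bytes_mem(const uint8_t in[16], uint8_t out[16], unsigned n)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, shift_left_bytes(v, n));
}

/* Shifting left by 3 bytes moves byte i to byte i + 3 and fills the
 * three lowest bytes with zeros. */
static void example_byte_shift(void)
{
    uint8_t in[16], out[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(i + 1);
    shift_left_bytes_mem(in, out, 3);
    assert(out[0] == 0 && out[1] == 0 && out[2] == 0);
    assert(out[3] == 1 && out[15] == 13);
}
```

A branch per count is the price of the imm8 operand. Even with it, the smallest non-zero step is still a whole byte.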
Fine, what can we do... You designed the CPU, so there is nothing we can do about it. If PSLLDQ accepted a reg32 or reg64 shift count, things would be much more convenient: we could first use PSLLDQ to shift by the whole-byte part of the count and then use PSLLQ to shift by the remaining bits (why PSLLQ is needed for the remainder will become clear later). But with only imm8 available, that approach is not feasible! This imm8 is a real pain... On top of that, PSLLQ is documented as shifting a 128-bit register by at most 16 bits at a time (what a rip-off), which means that with this approach we would need several if/jump branches...
Big pitfall begins
Well, let's take a step back. Since a 128-bit shift with bit granularity is not available, we can split it into two 64-bit shifts instead. That only costs one extra check and a few merge steps. It is less efficient than a true 128-bit shift, but there is no other choice...
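As a reference for what the two-halves idea has to compute, here is a minimal scalar sketch in plain C, without any SSE instructions. The type `u128` and the function `shl128` are hypothetical names used only here. This is a model of the intended result, not the SSE implementation discussed in this article.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit value stored as two 64-bit halves. */
typedef struct {
    uint64_t lo; /* bits 0..63   */
    uint64_t hi; /* bits 64..127 */
} u128;

/* Logical left shift of a 128-bit value by n bits, built from 64-bit
 * shifts. The check on n selects how the two halves are merged. */
static u128 shl128(u128 x, unsigned n)
{
    u128 r;
    if (n == 0) {
        r = x;                                   /* avoids a shift by 64 below */
    } else if (n < 64) {
        r.hi = (x.hi << n) | (x.lo >> (64 - n)); /* carry bits from lo into hi */
        r.lo = x.lo << n;
    } else if (n < 128) {
        r.hi = x.lo << (n - 64);                 /* lo moves entirely into hi */
        r.lo = 0;
    } else {
        r.hi = 0;                                /* everything is shifted out */
        r.lo = 0;
    }
    return r;
}

/* The top bit of the low half must show up as the bottom bit of the
 * high half after a shift by one. */
static void example_shl128(void)
{
    u128 x = { 0x8000000000000001ULL, 0 };
    u128 r = shl128(x, 1);
    assert(r.lo == 2 && r.hi == 1);
}
```

The `64 - n` term is the carry from the low half into the high half. Any version built from lane-wise shifts has to reproduce that carry somehow, because the lane-wise instructions do not move bits across the 64-bit boundary.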
OK, let's get started... GO! Now we switch to PSLLQ. Execute PSLLQ xmm0, 32 or PSLLQ xmm0, ecx (with ecx holding 32), and then? Why is xmm0 all zeros? What is going on?
Let's look back at Intel's documentation again:
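Focus on the two lines marked in red. When PSLLQ operates on a 64-bit register, the listing shows that it supports shift counts up to COUNT = 64 (strictly speaking the maximum is 63; it is a matter of how you count).
When PSLLQ operates on a 128-bit register, however, something strange happens: the listing gives a maximum of COUNT = 16 bits (strictly speaking 15). Note that current editions of Intel's Software Developer's Manual give COUNT > 63 as the zeroing threshold for PSLLQ with a 128-bit operand as well, so the 15 in this listing may well be a documentation error.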
If I had not gone back and reread Intel's documentation, and had simply kept failing to find the problem while debugging, who would ever have guessed that the shift is limited to 15 bits? Has Intel lost its mind? Why? On the MMX registers a shift of up to 63 bits is possible, so why not on the SSE registers? Yes, the MMX and SSE registers are different and separate, and the MMX registers are aliased onto the x87 floating-point registers. But you managed 64-bit shifts on the MMX registers, so why can a 128-bit SSE register be shifted by at most 15 bits? If you say it is hard to implement, fine, I accept that. I don't know why it would be so hard, but we have no choice but to accept it. Then how do you explain PSLLDQ, which shifts all 128 bits, even if only by whole bytes? Going by its name, PSLLDQ ought to provide a 128-bit shift with bit granularity; I can understand that it does not, for historical reasons. But there is no excuse for PSLLQ on a 128-bit SSE register shifting by at most 15 bits, is there? Is it really that hard? And if it really is, how did you implement the 128-bit byte-wise shift of PSLLDQ?
Seeking answers
With these questions in mind, we asked Mr. Google about "128-bit shift" and found that plenty of people had run into the same problem, for example:
Looking for sse 128 bit shift operation for non-immediate shift value
What is SSE !@#$% good for? #2: Bit vector operations
Finally, Mr. Google told us the best answer, from Intel's forum, here:
Missing instruction in SSE: PSLLDQ with _bit_ shift amount?
It reads as follows:
Solution
Dddd