SSE Special Instructions, Part 1

Source: Internet
Author: User

In practice, much of assembly-level optimization comes down to organizing data so that it fits the layout expected by the parallel (SIMD) instructions.

This section describes the data shuffle instructions, which are quite flexible to use. The details are as follows:

 

1. shufps xmm, xmm/m128, imm8 (0 ~ 255)

Description:

The ps (packed single-precision) suffix indicates that this is an SSE1 instruction.

The instruction treats the source operand and the destination register as four 32-bit doublewords each. The immediate imm8 is read as four 2-bit fields (each 00 ~ 11), and each field selects the doubleword that goes into one position of the result.

The high 64 bits of the destination register are filled with doublewords selected from the source operand; the low 64 bits are filled with doublewords selected from the destination register itself. If the source is a memory operand, its address must be 16-byte aligned.

The high 4 bits of imm8 select from the source operand, and the low 4 bits select from the destination register.

High 64 bits | Low 64 bits

Destination register: A (11) | A (10) | A (01) | A (00)
Source operand: B (11) | B (10) | B (01) | B (00)
Result in destination register: B (00 ~ 11) | B (00 ~ 11) | A (00 ~ 11) | A (00 ~ 11)
Each doubleword of the result is selected by the corresponding two bits of imm8.

Example:
Doubleword positions, high to low: (11) (10) (01) (00)
When xmm0 = 0x 090a0b0c 0d0e0f11 01020304 05060708,

xmm1 = 0x aabbccdd eeff1234 22334455 66778899,

imm8 selects, from high to low: (xmm1 10) (xmm1 01) (xmm0 11) (xmm0 00)

Run shufps xmm0, xmm1, 10011100b (binary 10 01 11 00),

xmm0 = 0x eeff1234 22334455 090a0b0c 05060708

If shufps xmm0, xmm1, 10101010b (binary 10 10 10 10) is run instead, the result is: xmm0 = 0x eeff1234 eeff1234 0d0e0f11 0d0e0f11
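The selection rule can be checked by hand with a small scalar model. The following is a minimal sketch in plain C, not the instruction itself; the names shufps_model and shufps_example_lane are hypothetical and exist only for illustration. Element 0 of each array stands for the lowest 32 bits of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of shufps: dst and src each hold four 32-bit lanes,
   with lane 0 being the lowest 32 bits of the register. */
void shufps_model(uint32_t dst[4], const uint32_t src[4], uint8_t imm8)
{
    uint32_t r[4];
    r[0] = dst[imm8 & 3];          /* bits 1:0 pick from the destination */
    r[1] = dst[(imm8 >> 2) & 3];   /* bits 3:2 pick from the destination */
    r[2] = src[(imm8 >> 4) & 3];   /* bits 5:4 pick from the source */
    r[3] = src[(imm8 >> 6) & 3];   /* bits 7:6 pick from the source */
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Hypothetical helper: runs the first example above and returns lane i of xmm0. */
uint32_t shufps_example_lane(int i)
{
    uint32_t xmm0[4] = { 0x05060708u, 0x01020304u, 0x0d0e0f11u, 0x090a0b0cu };
    const uint32_t xmm1[4] = { 0x66778899u, 0x22334455u, 0xeeff1234u, 0xaabbccddu };
    assert(i >= 0 && i < 4);
    shufps_model(xmm0, xmm1, 0x9C);   /* 0x9C = 10 01 11 00b */
    return xmm0[i];
}
```

In C code, compilers expose the real instruction through the _mm_shuffle_ps intrinsic in xmmintrin.h, which requires imm8 to be a compile-time constant.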

 

A common use of this instruction is to broadcast one scalar to all four lanes:

float f = 0.5f;

__asm

{

    movss  xmm2, f          // xmm2[0] = f (0.5), xmm2[1, 2, 3] = 0
    shufps xmm2, xmm2, 0    // xmm2[1, 2, 3] = xmm2[0]

    .....

}
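The same broadcast can be written with compiler intrinsics instead of inline assembly. This is a sketch under assumptions: the function name broadcast4 and the HAVE_SSE macro are hypothetical, and a plain loop is used as a fallback when SSE is not available at compile time.

```c
#include <assert.h>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* SSE1 intrinsics */
#define HAVE_SSE 1       /* hypothetical macro, used only in this sketch */
#endif

/* Broadcast f to four lanes and store them to out[0..3]. */
void broadcast4(float f, float out[4])
{
    assert(out != 0);
#ifdef HAVE_SSE
    __m128 v = _mm_set_ss(f);       /* lane 0 = f, lanes 1..3 = 0, like movss from memory */
    v = _mm_shuffle_ps(v, v, 0);    /* imm8 = 0: all four fields select lane 0 */
    _mm_storeu_ps(out, v);          /* unaligned store, so out needs no special alignment */
#else
    for (int i = 0; i < 4; i++)     /* portable fallback with the same result */
        out[i] = f;
#endif
}
```

Because source and destination are the same register here, both halves of imm8 select from the same data, which is why a single zero immediate is enough.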

2. shufpd xmm, xmm/m128, imm8 (0 ~ 255)

Description:

The pd (packed double-precision) suffix indicates that this is an SSE2 instruction.

imm8 (effective value) = imm8 (input value) mod 4, so only the low two bits are used.

The instruction treats the source operand and the destination register as two 64-bit quadwords each, and the two low bits of imm8 (each 0 ~ 1) specify the arrangement.
If the source is a memory operand, its address must be 16-byte aligned. The high 64 bits of the destination register receive the source quadword selected by bit 1, and the low 64 bits receive the destination quadword selected by bit 0.
High 64 bits | Low 64 bits
Destination register: A (1) | A (0)
Source operand: B (1) | B (0)
Result in destination register: B (0 ~ 1) | A (0 ~ 1)
Example:
When xmm0 = 0x1111111122222222 3333333344444444,
xmm1 = 0x5555555566666666 aaaaaaaacccccccc,

Run shufpd xmm0, xmm1, 10100110b

Because 10100110b mod 4 (10100110b & 11b) is 10b, the effective value is 10b.

The high bit is 1: it selects quadword 1 of the source register xmm1, 5555555566666666.

The low bit is 0: it selects quadword 0 of the destination register xmm0, 3333333344444444.

xmm0 = 0x5555555566666666 3333333344444444
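The two-bit selection can be modeled in a few lines of plain C. This is only an illustrative sketch; shufpd_model and shufpd_example are hypothetical names, and element 0 of each array stands for the low 64 bits of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of shufpd: element 0 is the low quadword, element 1 the high one. */
void shufpd_model(uint64_t dst[2], const uint64_t src[2], uint8_t imm8)
{
    uint64_t lo = dst[imm8 & 1];          /* bit 0 picks from the destination */
    uint64_t hi = src[(imm8 >> 1) & 1];   /* bit 1 picks from the source */
    dst[0] = lo;                          /* bits 7:2 of imm8 are ignored */
    dst[1] = hi;
}

/* Hypothetical helper: runs the example above and returns quadword i of xmm0. */
uint64_t shufpd_example(int i)
{
    uint64_t xmm0[2] = { 0x3333333344444444ull, 0x1111111122222222ull };
    const uint64_t xmm1[2] = { 0xaaaaaaaaccccccccull, 0x5555555566666666ull };
    assert(i == 0 || i == 1);
    shufpd_model(xmm0, xmm1, 0xA6);       /* 0xA6 = 10100110b, effective value 10b */
    return xmm0[i];
}
```

The corresponding intrinsic is _mm_shuffle_pd in emmintrin.h.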

3. pshuflw xmm, xmm/m128, imm8 (0 ~ 255)

Description:

First, the high 64 bits of the source operand are copied to the high 64 bits of the destination register. Then imm8 selects, for each of the four word positions in the low 64 bits of the destination register, one of the four 16-bit words in the low 64 bits of the source operand.
If the source is a memory operand, its address must be 16-byte aligned.

Low 64 bits
Low 64 bits of source operand: B (11) | B (10) | B (01) | B (00)
Low 64 bits of result in destination register: B (00 ~ 11) | B (00 ~ 11) | B (00 ~ 11) | B (00 ~ 11)

Example:
When xmm0 = 0x1111111122222222 3333 4444 5555 (its previous contents do not affect the result),
xmm1 = 0x5555555566666666 7777 8888 9999 CCCC,

Run pshuflw xmm0, xmm1, 10100110b (binary 10 10 01 10)
xmm0 = 0x5555555566666666 8888 8888 9999 8888
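A scalar model makes the word selection easy to verify. The sketch below is plain C written for illustration only; pshuflw_model and pshuflw_example_word are hypothetical names, and element 0 of each array stands for the lowest 16-bit word of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pshuflw: element 0 is the lowest 16-bit word. */
void pshuflw_model(uint16_t dst[8], const uint16_t src[8], uint8_t imm8)
{
    uint16_t r[8];
    for (int i = 0; i < 4; i++) {
        r[i] = src[(imm8 >> (2 * i)) & 3];  /* low half: picked from source words 0..3 */
        r[i + 4] = src[i + 4];              /* high half: copied unchanged */
    }
    for (int i = 0; i < 8; i++)
        dst[i] = r[i];
}

/* Hypothetical helper: runs the example above and returns word i of xmm0. */
uint16_t pshuflw_example_word(int i)
{
    uint16_t xmm0[8] = { 0 };               /* previous contents do not matter */
    const uint16_t xmm1[8] = { 0xCCCC, 0x9999, 0x8888, 0x7777,
                               0x6666, 0x6666, 0x5555, 0x5555 };
    assert(i >= 0 && i < 8);
    pshuflw_model(xmm0, xmm1, 0xA6);        /* 0xA6 = 10 10 01 10b */
    return xmm0[i];
}
```

With these operands the low four words come out, from high to low, as 8888 8888 9999 8888. The corresponding intrinsic is _mm_shufflelo_epi16 in emmintrin.h.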

 

4. pshufhw xmm, xmm/m128, imm8 (0 ~ 255)

Description:

First, the low 64 bits of the source operand are copied to the low 64 bits of the destination register. Then imm8 selects, for each of the four word positions in the high 64 bits of the destination register, one of the four 16-bit words in the high 64 bits of the source operand.
If the source is a memory operand, its address must be 16-byte aligned.
High 64 bits
High 64 bits of source operand: B (11) | B (10) | B (01) | B (00)
High 64 bits of result in destination register: B (00 ~ 11) | B (00 ~ 11) | B (00 ~ 11) | B (00 ~ 11)
Example:
When xmm0 = 0x3333 4444 5555 6666 (its previous contents do not affect the result),
xmm1 = 0x7777 8888 9999 CCCC 5555555566666666,

Run pshufhw xmm0, xmm1, 10100110b (binary 10 10 01 10)
xmm0 = 0x8888 8888 9999 8888 5555555566666666
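The high-half variant can be modeled the same way. Again this is only a sketch in plain C; pshufhw_model and pshufhw_example_word are hypothetical names, and element 0 of each array stands for the lowest 16-bit word, so the high 64 bits are elements 4 to 7.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pshufhw: elements 4..7 form the high 64 bits. */
void pshufhw_model(uint16_t dst[8], const uint16_t src[8], uint8_t imm8)
{
    uint16_t r[8];
    for (int i = 0; i < 4; i++) {
        r[i] = src[i];                                /* low half: copied unchanged */
        r[i + 4] = src[4 + ((imm8 >> (2 * i)) & 3)];  /* high half: picked from source words 4..7 */
    }
    for (int i = 0; i < 8; i++)
        dst[i] = r[i];
}

/* Hypothetical helper: runs the example above and returns word i of xmm0. */
uint16_t pshufhw_example_word(int i)
{
    uint16_t xmm0[8] = { 0 };                         /* previous contents do not matter */
    const uint16_t xmm1[8] = { 0x6666, 0x6666, 0x5555, 0x5555,
                               0xCCCC, 0x9999, 0x8888, 0x7777 };
    assert(i >= 0 && i < 8);
    pshufhw_model(xmm0, xmm1, 0xA6);                  /* 0xA6 = 10 10 01 10b */
    return xmm0[i];
}
```

The corresponding intrinsic is _mm_shufflehi_epi16 in emmintrin.h.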

5. pshufd xmm, xmm/m128, imm8 (0 ~ 255)

Description:

imm8 selects, for each of the four doubleword positions in the destination register, one of the four 32-bit doublewords of the source operand. If the source is a memory operand, its address must be 16-byte aligned.
High 64 bits | Low 64 bits
Source operand: B (11) | B (10) | B (01) | B (00)
Result in destination register: B (00 ~ 11) | B (00 ~ 11) | B (00 ~ 11) | B (00 ~ 11)
Example:
When xmm1 = 0x11111111 22222222 33333333 44444444,

Run pshufd xmm0, xmm1, 11010110b (binary 11 01 01 10)
xmm0 = 0x11111111 33333333 33333333 22222222
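The example can be reproduced from C with the _mm_shuffle_epi32 intrinsic. The following is a sketch under assumptions: the function name pshufd_d6 and the HAVE_SSE2 macro are hypothetical, the immediate is fixed at 0xD6 because the intrinsic needs a compile-time constant, and a scalar loop with the same selection rule is used when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical macro, used only in this sketch */
#endif

/* Apply imm8 = 11 01 01 10b (0xD6) to four doublewords.
   Element 0 is the lowest 32 bits of the register. */
void pshufd_d6(const uint32_t src[4], uint32_t out[4])
{
    assert(src != 0 && out != 0);
#ifdef HAVE_SSE2
    __m128i v = _mm_loadu_si128((const __m128i *)src);  /* unaligned load */
    v = _mm_shuffle_epi32(v, 0xD6);                     /* compiles to pshufd */
    _mm_storeu_si128((__m128i *)out, v);                /* unaligned store */
#else
    uint32_t r[4];                                      /* portable fallback */
    for (int i = 0; i < 4; i++)
        r[i] = src[(0xD6 >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        out[i] = r[i];
#endif
}
```

The unaligned load and store intrinsics are used here so that the arrays need no special alignment; the 16-byte alignment requirement applies when the instruction itself reads a memory operand.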

 

Summary:

1. The results of the shufps and shufpd instructions depend on both the source operand and the destination register.

2. The results of the pshuflw, pshufhw, and pshufd instructions depend only on the source operand and imm8, not on the previous contents of the destination register.

 

 

 

 
