SSE Instruction Set

Source: Internet
Author: User
The SSE and SSE2 instruction sets are very similar. Compared with SSE, SSE2 adds only a small number of capabilities: support for 64-bit (double-precision) floating-point operations and for 64-bit integer operations.

Why is SSE faster than traditional floating-point code? It uses 128-bit storage units, each of which holds four 32-bit floating-point numbers, so every SSE calculation is carried out on four floats at once.

Although SSE is faster in theory, it is subject to several restrictions. First, even though one instruction does the work of four traditional floating-point operations, the measured speedup is usually less than four times. SSE shows its advantage only when there is a large stream of data to process, which is where SIMD is effective. Second, the data type SSE supports is a group of four 32-bit floats (128 bits in total), which corresponds to float[4] in C and C++ and must be aligned on a 16-byte boundary for aligned loads and stores. This complicates input and output. In practice, the main thing that limits SSE performance is repeatedly copying data into the layout it requires.

If you are a C++ programmer who is not very familiar with assembly but still wants to use SSE to optimize a program, what can you do? Fortunately, Visual C++ wraps the instructions in C-level functions (intrinsics) and C-style data types. You define variables and call functions as in ordinary C++ code, and the SSE instructions are applied for you.
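The "four floats at once" idea and the 16-byte alignment requirement can be sketched as follows. This is a minimal illustration, not code from the original article: the function names add4_aligned and add4_unaligned are hypothetical, and it assumes an x86 compiler that provides <xmmintrin.h> and C++11 alignas. The intrinsics _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses, while _mm_loadu_ps and _mm_storeu_ps accept any address.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type

// Adds four floats from a and b with a single SSE addition.
// a, b and out must all be 16-byte aligned.
void add4_aligned(const float* a, const float* b, float* out) {
    __m128 va = _mm_load_ps(a);             // aligned load of 4 floats
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));  // 4 additions, aligned store
}

// Same operation, but the pointers may have any alignment.
void add4_unaligned(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);            // unaligned load
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); // unaligned store
}

// Hypothetical helper: builds aligned inputs and returns the sum of the results.
float demo_sum() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float r[4];
    add4_aligned(a, b, r);
    return r[0] + r[1] + r[2] + r[3];
}
```

The unaligned variant is the safer choice when the data comes from a buffer whose alignment is not under your control; the aligned variant matches the float[4] layout described above.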
To begin, include the header file that declares the data types and functions we need:

    #include <xmmintrin.h>

SSE has only one standard data type, __m128. It is defined as follows:

    typedef struct __declspec(intrin_type) __declspec(align(16)) __m128 {
        float m128_f32[4];
    } __m128;

Simplified, that is:

    struct __m128 {
        float m128_f32[4];
    };

For example, to define a __m128 variable and assign it four float values, you can write:

    __m128 S1 = {1.0f, 2.0f, 3.0f, 4.0f};

To change the element at index 2 (zero-based), you can write:

    S1.m128_f32[2] = 6.0f;

There are also several assignment intrinsics that make this data structure more convenient to use:

    S1 = _mm_set_ps1(2.0f);

This sets all four elements of S1.m128_f32 to 2.0f, which is much faster than assigning the elements one at a time.

    S1 = _mm_setzero_ps();

This sets all four floating-point numbers in S1 to zero.

There are other assignment intrinsics, but they are no faster than assigning the values yourself and are meant for special purposes. For more information, see MSDN -> Visual Studio -> Reference -> C++ Language -> Compiler Intrinsics -> MMX, SSE, and SSE2 Intrinsics -> Streaming SIMD Extensions (SSE).

Generally speaking, the name of every SSE intrinsic consists of three parts separated by underscores:

    _mm_set_ps1

mm indicates the multimedia extension instruction set.
set is an abbreviation of what the function does.
ps1 describes how the function affects the result variable. It is built from two letters. The first letter indicates how the result variable is affected: p (packed) means that the result is treated as a set of data.
Each element participates in the operation. s (scalar) means that only the first element of the result variable participates. The second letter gives the data type of the operands: s stands for 32-bit floating-point numbers, d for 64-bit floating-point numbers, i32 for 32-bit integers, and i64 for 64-bit integers.

The following example shows how the SSE intrinsics are used. Note that the code was written for VC7.1; it is not guaranteed to be fully compatible with other development platforms such as Dev-C++ or Borland C++.

To make the speed comparison easy, both the conventional version and the SSE-optimized version are given, together with a timing class, CTimer.

The algorithm scales a set of float values. ScaleValue1 is optimized with SSE instructions; ScaleValue2 is not. We test both on a float array of 10,000 elements, calling each function 100,000 times as in the loops of the test program. The program and its results follow:

    #include <xmmintrin.h>
    #include <windows.h>
    #include <string.h>   // memset
    #include <stdlib.h>   // system
    #include <iostream>
    using namespace std;

    // Stopwatch built on the Windows high-resolution performance counter.
    class CTimer
    {
    public:
        __forceinline CTimer(void)
        {
            QueryPerformanceFrequency(&m_Frequency);
            QueryPerformanceCounter(&m_StartCount);
        }

        __forceinline void Reset(void)
        {
            QueryPerformanceCounter(&m_StartCount);
        }

        // Returns the seconds elapsed since construction or the last Reset().
        __forceinline double End(void)
        {
            static __int64 nCurCount;
            QueryPerformanceCounter((PLARGE_INTEGER)&nCurCount);
            return double(nCurCount - *(__int64*)&m_StartCount) / double(*(__int64*)&m_Frequency);
        }

    private:
        LARGE_INTEGER m_Frequency;
        LARGE_INTEGER m_StartCount;
    };

    // SSE version: multiplies four floats per iteration.
    // pArray must be 16-byte aligned; elements past the last full group of 4 are not processed.
    void ScaleValue1(float *pArray, DWORD dwCount, float fScale)
    {
        DWORD dwGroupCount = dwCount / 4;
        __m128 e_Scale = _mm_set_ps1(fScale);
        for (DWORD i = 0; i < dwGroupCount; i++)
        {
            *(__m128*)(pArray + i * 4) = _mm_mul_ps(*(__m128*)(pArray + i * 4), e_Scale);
        }
    }

    // Conventional version: multiplies one float per iteration.
    void ScaleValue2(float *pArray, DWORD dwCount, float fScale)
    {
        for (DWORD i = 0; i < dwCount; i++)
        {
            pArray[i] *= fScale;
        }
    }

    #define ARRAYCOUNT 10000

    int __cdecl main()
    {
        float __declspec(align(16)) Array[ARRAYCOUNT];
        memset(Array, 0, sizeof(float) * ARRAYCOUNT);
        CTimer t;
        double dTime;

        t.Reset();
        for (int i = 0; i < 100000; i++)
        {
            ScaleValue1(Array, ARRAYCOUNT, 1000.0f);
        }
        dTime = t.End();
        cout << "Use SSE: " << dTime << " sec" << endl;

        t.Reset();
        for (int i = 0; i < 100000; i++)
        {
            ScaleValue2(Array, ARRAYCOUNT, 1000.0f);
        }
        dTime = t.End();
        cout << "Not use SSE: " << dTime << " sec" << endl;

        system("pause");
        return 0;
    }

Results:

    Use SSE: 0.997817 sec
    Not use SSE: 2.84963 sec

Note that __declspec(align(16)) is used as a modifier on the array definition. It aligns the array on a 16-byte boundary, which is required here because the SSE instructions used to read and write the array through __m128 pointers support only memory aligned in this way.

SSE

cvtsi2ss: Converts a 64-bit signed integer to a floating-point value and inserts it into a 128-bit parameter.
  Intrinsic: _mm_cvtsi64_ss
cvtss2si: Takes a 32-bit floating-point value and rounds it to a 64-bit integer.
  Intrinsic: _mm_cvtss_si64
cvttss2si: Takes a 32-bit floating-point value and truncates it to a 64-bit integer.
  Intrinsic: _mm_cvttss_si64

SSE2

cvtsd2si: Takes the lowest 64-bit floating-point value and rounds it to an integer.
  Intrinsic: _mm_cvtsd_si64
cvtsi2sd: Takes a 64-bit integer and converts it to a floating-point value.
  Intrinsic: _mm_cvtsi64_sd
cvttsd2si: Takes a 64-bit floating-point value and truncates it to a 64-bit integer.
  Intrinsic: _mm_cvttsd_si64
movnti: Writes 64-bit data to a specified memory location.
  Intrinsic: _mm_stream_si64
movq: Moves a 64-bit integer into a 128-bit parameter, or moves a 64-bit integer out of a 128-bit parameter.
  Intrinsics: _mm_cvtsi64_si128, _mm_cvtsi128_si64

SSSE3

pabsb/pabsw/pabsd: Takes the absolute value of signed integers.
  Intrinsics: _mm_abs_epi8, _mm_abs_epi16, _mm_abs_epi32, _mm_abs_pi8, _mm_abs_pi16, _mm_abs_pi32
palignr: Combines two parameters and shifts the result right.
  Intrinsics: _mm_alignr_epi8, _mm_alignr_pi8
phaddsw: Adds two parameters containing 16-bit signed integers, saturating each result to the range a 16-bit value can represent.
  Intrinsics: _mm_hadds_epi16, _mm_hadds_pi16
phaddw/phaddd: Adds two parameters containing signed integers.
  Intrinsics: _mm_hadd_epi16, _mm_hadd_epi32, _mm_hadd_pi16, _mm_hadd_pi32
phsubsw: Subtracts two parameters containing 16-bit signed integers, saturating each result to the range a 16-bit value can represent.
  Intrinsics: _mm_hsubs_epi16, _mm_hsubs_pi16
phsubw/phsubd: Subtracts two parameters containing signed integers.
  Intrinsics: _mm_hsub_epi16, _mm_hsub_epi32, _mm_hsub_pi16, _mm_hsub_pi32
pmaddubsw: Multiplies and adds 8-bit integers.
  Intrinsics: _mm_maddubs_epi16, _mm_maddubs_pi16
pmulhrsw: Multiplies 16-bit signed integers and shifts the result right.
  Intrinsics: _mm_mulhrs_epi16, _mm_mulhrs_pi16
pshufb: Selects and reorders (shuffles) 8-bit blocks of data from a 128-bit parameter.
  Intrinsics: _mm_shuffle_epi8, _mm_shuffle_pi8
psignb/psignw/psignd: Negates, zeroes, or preserves signed integers.
  Intrinsics: _mm_sign_epi8, _mm_sign_epi16, _mm_sign_epi32, _mm_sign_pi8, _mm_sign_pi16, _mm_sign_pi32

SSE4A

extrq: Extracts specified bits from the parameter.
  Intrinsics: _mm_extract_si64, _mm_extracti_si64
insertq: Inserts specified bits into the given parameter.
  Intrinsics: _mm_insert_si64, _mm_inserti_si64
movntsd/movntss: Writes data directly to a specified memory location without using the cache.
  Intrinsics: _mm_stream_sd, _mm_stream_ss

SSE4.1

dppd/dpps: Computes the dot product of two parameters.
  Intrinsics: _mm_dp_pd, _mm_dp_ps
extractps: Extracts a specified 32-bit floating-point value from the parameter.
  Intrinsic: _mm_extract_ps
insertps: Inserts a 32-bit floating-point value into a 128-bit parameter and zeroes selected elements.
  Intrinsic: _mm_insert_ps
movntdqa: Loads 128-bit data from a specified memory location.
  Intrinsic: _mm_stream_load_si128
mpsadbw: Computes eight offset sums of absolute differences.
  Intrinsic: _mm_mpsadbw_epu8
packusdw: Converts 32-bit signed integers to unsigned 16-bit integers using 16-bit saturation.
  Intrinsic: _mm_packus_epi32
pblendw/blendpd/blendps/pblendvb/blendvpd/blendvps: Blends two parameters, using different block sizes.
  Intrinsics: _mm_blend_epi16, _mm_blend_pd, _mm_blend_ps, _mm_blendv_epi8, _mm_blendv_pd, _mm_blendv_ps
pcmpeqq: Compares 64-bit integers for equality.
  Intrinsic: _mm_cmpeq_epi64
pextrb/pextrw/pextrd/pextrq: Extracts an integer from the input parameter.
  Intrinsics: _mm_extract_epi8, _mm_extract_epi16, _mm_extract_epi32, _mm_extract_epi64
phminposuw: Selects the smallest 16-bit unsigned integer and determines its index.
  Intrinsic: _mm_minpos_epu16
pinsrb/pinsrd/pinsrq: Inserts an integer into a 128-bit parameter.
  Intrinsics: _mm_insert_epi8, _mm_insert_epi32, _mm_insert_epi64
pmaxsb/pmaxsd: Takes signed integers from two parameters and selects the larger of each pair.
  Intrinsics: _mm_max_epi8, _mm_max_epi32
pmaxuw/pmaxud: Takes unsigned integers from two parameters and selects the larger of each pair.
  Intrinsics: _mm_max_epu16, _mm_max_epu32
pminsb/pminsd: Takes signed integers from two parameters and selects the smaller of each pair.
  Intrinsics: _mm_min_epi8, _mm_min_epi32
pminuw/pminud: Takes unsigned integers from two parameters and selects the smaller of each pair.
  Intrinsics: _mm_min_epu16, _mm_min_epu32
pmovsxbw/pmovsxbd/pmovsxbq/pmovsxwd/pmovsxwq/pmovsxdq: Converts signed integers to a larger size.
  Intrinsics: _mm_cvtepi8_epi16, _mm_cvtepi8_epi32, _mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64, _mm_cvtepi32_epi64
pmovzxbw/pmovzxbd/pmovzxbq/pmovzxwd/pmovzxwq/pmovzxdq: Converts unsigned integers to a larger size.
  Intrinsics: _mm_cvtepu8_epi16, _mm_cvtepu8_epi32, _mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64, _mm_cvtepu32_epi64
pmuldq: Multiplies 32-bit signed integers and stores the results as 64-bit signed integers.
  Intrinsic: _mm_mul_epi32
pmulld: Multiplies 32-bit signed integers.
  Intrinsic: _mm_mullo_epi32
ptest: Computes a bitwise comparison of two 128-bit parameters and returns a value based on the CF and ZF bits of the flags register.
  Intrinsics: _mm_testc_si128, _mm_testnzc_si128, _mm_testz_si128
roundpd/roundps: Rounds floating-point values.
  Intrinsics: _mm_ceil_pd, _mm_ceil_ps, _mm_floor_pd, _mm_floor_ps, _mm_round_pd, _mm_round_ps
roundsd/roundss: Combines two parameters, rounding a floating-point value taken from one of them.
  Intrinsics: _mm_ceil_sd, _mm_ceil_ss, _mm_floor_sd, _mm_floor_ss, _mm_round_sd, _mm_round_ss

SSE4.2

crc32: Computes a CRC-32C checksum of the parameters.
  Intrinsics: _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32, _mm_crc32_u64
pcmpestri/pcmpestrm: Compares two parameters of a specified length.
  Intrinsics: _mm_cmpestra, _mm_cmpestrc, _mm_cmpestri, _mm_cmpestrm, _mm_cmpestro, _mm_cmpestrs, _mm_cmpestrz
pcmpgtq: Compares two parameters.
  Intrinsic: _mm_cmpgt_epi64
pcmpistri/pcmpistrm: Compares two parameters.
  Intrinsics: _mm_cmpistra, _mm_cmpistrc, _mm_cmpistri, _mm_cmpistrm, _mm_cmpistro, _mm_cmpistrs, _mm_cmpistrz
popcnt: Counts the number of bits set to 1.
  Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64

Advanced bit manipulation

lzcnt: Counts the number of leading zeros in the parameter.
  Intrinsics: __lzcnt16, __lzcnt, __lzcnt64
popcnt: Counts the number of bits set to 1.
  Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64

Other new intrinsics

_InterlockedCompareExchange128: Compares two parameters.
_mm_castpd_ps/_mm_castpd_si128/_mm_castps_pd/_mm_castps_si128/_mm_castsi128_pd/_mm_castsi128_ps: Reinterprets a value as 32-bit floating-point values (ps), 64-bit floating-point values (pd), or integer values (si128).
_mm_cvtsd_f64: Extracts the lowest 64-bit floating-point value from the parameter.
_mm_cvtss_f32: Extracts a 32-bit floating-point value.
__rdtscp: Generates RDTSCP. Writes TSC_AUX[31:0] to memory and returns the 64-bit time stamp counter result.
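The difference between the conversion intrinsics (cvt, cvtt) and the cast intrinsics in the list is easy to miss, so here is a minimal sketch of it. It is an illustration under stated assumptions, not code from the original article: the helper names are hypothetical, it assumes an x86 target with SSE2 and the default round-to-nearest mode, and it uses _mm_cvtss_si32, _mm_cvttss_si32 and _mm_cvtsi128_si32, which are the 32-bit counterparts of the 64-bit intrinsics listed above, so that it also builds for 32-bit targets.

```cpp
#include <xmmintrin.h>  // SSE:  _mm_set_ss, _mm_cvtss_f32, _mm_cvtss_si32, _mm_cvttss_si32
#include <emmintrin.h>  // SSE2: _mm_castps_si128, _mm_cvtsi128_si32

// Reads the lowest float back out of a 128-bit value.
float low_float(float x) {
    __m128 v = _mm_set_ss(x);  // x in element 0, zeros in the other elements
    return _mm_cvtss_f32(v);
}

// Reinterprets the bits of a float as an integer. The cast changes only
// the type seen by the compiler; the bit pattern is left untouched.
int float_bits(float x) {
    __m128i bits = _mm_castps_si128(_mm_set_ss(x));
    return _mm_cvtsi128_si32(bits);  // lowest 32 bits
}

// Converts using the current rounding mode (round-to-nearest by default).
int round_to_int(float x) {
    return _mm_cvtss_si32(_mm_set_ss(x));
}

// Converts by truncating toward zero, regardless of the rounding mode.
int truncate_to_int(float x) {
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

For example, float_bits(1.0f) yields 0x3f800000 (the IEEE 754 encoding of 1.0f), while round_to_int(2.7f) yields 3 and truncate_to_int(2.7f) yields 2.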
