An ARM NEON implementation of _mm_movemask_ps


Without further ado: when optimizing vector operations with a SIMD instruction set, _mm_movemask_ps is essential for reading back the results of comparisons.

ARM NEON, however, provides no equivalent instruction.
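
For readers less familiar with the SSE side, here is a minimal scalar sketch of what _mm_movemask_ps computes: it collects the sign bit of each of the four floats into the low four bits of an integer. Because a SIMD comparison leaves a lane either all ones or all zeros, the sign bit is set exactly when that lane passed. The name movemask_ps_reference is made up for illustration and is not part of any library.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: scalar model of the value _mm_movemask_ps returns.
// Bit i of the result is the sign bit of v[i] (lane 0 = x, lane 3 = w).
inline int movemask_ps_reference(const float v[4]) {
    int mask = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint32_t bits;
        std::memcpy(&bits, &v[i], sizeof bits);  // read the raw IEEE-754 bits
        mask |= static_cast<int>(bits >> 31) << i;
    }
    return mask;
}
```

Any NEON replacement only has to reproduce this 4-bit result, or something that carries the same information.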

Reference: http://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon

I have not tested the answers in that post, but judging from their implementation they are not the best approach: neither their performance nor their robustness matches DirectXMath's.

The following is the implementation from the DirectXMath library:

    uint8x8x2_t vtemp = vzip_u8(vget_low_u8(vresult), vget_high_u8(vresult));
    uint16x4x2_t vtemp1 = vzip_u16((uint16x4_t) vtemp.val[0], (uint16x4_t) vtemp.val[1]);
    return (vget_lane_u32(vtemp1.val[1], 1) == 0xFFFFFFFFU);

Two zip operations interleave the four components so that the most significant byte of each component ends up packed into a single uint32_t value, which is then used to check the result.

For example, 0xFFFFFFFFU means that every component passed, while 0xFF0000FF means that the y and z components of the vector [x, y, z, w] failed the test.

There is nothing wrong with this in itself; the problem is that the zip instructions are fairly expensive.
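
To make the byte shuffling easier to follow, here is a rough scalar model of the two zips. It is only a sketch: the helper names (zip_u8_model, zip_u16_model, packed_high_bytes_model) are invented for illustration, and the 128-bit register is modelled as plain little-endian bytes rather than real NEON types.

```cpp
#include <array>
#include <cstdint>

using Bytes8 = std::array<std::uint8_t, 8>;

// Scalar stand-in for vzip_u8: interleave the bytes of a and b.
// lo receives the zipped first halves, hi the zipped second halves.
inline void zip_u8_model(const Bytes8& a, const Bytes8& b, Bytes8& lo, Bytes8& hi) {
    for (int i = 0; i < 4; ++i) {
        lo[2 * i] = a[i];
        lo[2 * i + 1] = b[i];
        hi[2 * i] = a[i + 4];
        hi[2 * i + 1] = b[i + 4];
    }
}

// Scalar stand-in for vzip_u16: the same idea on 16-bit lanes (byte pairs).
inline void zip_u16_model(const Bytes8& a, const Bytes8& b, Bytes8& lo, Bytes8& hi) {
    for (int i = 0; i < 2; ++i) {
        for (int k = 0; k < 2; ++k) {  // the two bytes of one 16-bit lane
            lo[4 * i + k] = a[2 * i + k];
            lo[4 * i + 2 + k] = b[2 * i + k];
            hi[4 * i + k] = a[2 * i + 4 + k];
            hi[4 * i + 2 + k] = b[2 * i + 4 + k];
        }
    }
}

// cmp holds the per-lane compare results for [x, y, z, w].
// Returns the 32-bit value the zip-based code inspects.
inline std::uint32_t packed_high_bytes_model(const std::uint32_t cmp[4]) {
    Bytes8 low{}, high{};
    for (int lane = 0; lane < 4; ++lane) {
        for (int k = 0; k < 4; ++k) {
            const auto byte = static_cast<std::uint8_t>(cmp[lane] >> (8 * k));
            (lane < 2 ? low : high)[(lane % 2) * 4 + k] = byte;
        }
    }
    Bytes8 t0{}, t1{}, u0{}, u1{};
    zip_u8_model(low, high, t0, t1);
    zip_u16_model(t0, t1, u0, u1);
    // Lane 1 of u1 viewed as 2 x 32-bit lanes: bytes 4..7, little-endian.
    std::uint32_t result = 0;
    for (int k = 0; k < 4; ++k) {
        result |= static_cast<std::uint32_t>(u1[4 + k]) << (8 * k);
    }
    return result;
}
```

In this model the packed bytes come out in the order x, z, y, w from least to most significant, so a compare result in which only x and w pass yields 0xFF0000FF, matching the example above.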

Here is my implementation. Mask and cElementIndex are spelled out here for the sake of the blog post; in the actual implementation both variables are constexpr.

On a Lumia 950 XL it ends up about 23% faster than the DirectXMath version (it uses one bitwise AND, one bitwise OR and one pairwise add instead of two zips).

    #include <arm_neon.h>

    constexpr const uint32_t cElementIndex[4]{1, 2, 4, 8};

    static inline uint32_t vmaskq_u32(uint32x4_t& CR)
    {
        static const uint32x4_t Mask = vld1q_u32(cElementIndex);
        // Extract the element index bitmask from the compare result.
        uint32x4_t vTemp = vandq_u32(CR, Mask);
        uint32x2_t vL = vget_low_u32(vTemp);   // get low 2 uint32
        uint32x2_t vH = vget_high_u32(vTemp);  // get high 2 uint32
        vL = vorr_u32(vL, vH);
        vL = vpadd_u32(vL, vL);
        return vget_lane_u32(vL, 0);
    }

One extra note on my implementation: cElementIndex {1,2,4,8} could actually be {1,1,1,1} or any other non-zero values, as long as all you need is a zero versus non-zero test.

The point of using {1,2,4,8} is to make the final result easier to interpret and to match the movemask convention.
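
The AND, OR and pairwise-add steps can be traced on ordinary integers. The sketch below is a scalar model only, not NEON code; mask_reduce_model and its weights parameter are hypothetical names, with weights standing in for the per-lane constants.

```cpp
#include <cstdint>

// Scalar model of the AND / OR / pairwise-add reduction.
// cr:      per-lane compare results for [x, y, z, w] (all ones = pass, 0 = fail)
// weights: per-lane constants, e.g. {1, 2, 4, 8}
inline std::uint32_t mask_reduce_model(const std::uint32_t cr[4],
                                       const std::uint32_t weights[4]) {
    std::uint32_t t[4];
    for (int i = 0; i < 4; ++i) {
        t[i] = cr[i] & weights[i];        // models vandq_u32
    }
    const std::uint32_t l0 = t[0] | t[2];  // models vorr_u32(low, high), lane 0
    const std::uint32_t l1 = t[1] | t[3];  // models vorr_u32(low, high), lane 1
    return l0 + l1;                        // models vpadd_u32, lane 0
}
```

With weights {1,2,4,8} the four terms occupy distinct bits, so the final add behaves like an OR and a compare result where x, z and w pass gives 13 (binary 1101). With {1,1,1,1} the same input gives 2, which is still non-zero but no longer says which lanes passed.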

For example: if (vmaskq_u32(intersectionResult) & 0x6) // true when the y or z component of intersectionResult [x, y, z, w] passed; compare the masked value against 0x6 to require both.
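
The difference between "any of these lanes passed" and "all of these lanes passed" is easy to get wrong, so here is a small sketch of both tests on the 4-bit mask (x = 1, y = 2, z = 4, w = 8). The helper names any_lane_passed and all_lanes_passed are assumptions made for this example.

```cpp
#include <cstdint>

// Hypothetical helpers operating on a movemask-style result.
// mask:  4-bit lane mask (x = 1, y = 2, z = 4, w = 8)
// lanes: the lanes of interest, e.g. 0x6 for y and z

// True if at least one of the selected lanes passed.
constexpr bool any_lane_passed(std::uint32_t mask, std::uint32_t lanes) {
    return (mask & lanes) != 0;
}

// True only if every selected lane passed.
constexpr bool all_lanes_passed(std::uint32_t mask, std::uint32_t lanes) {
    return (mask & lanes) == lanes;
}
```

A bare `if (mask & 0x6)` corresponds to any_lane_passed; requiring both y and z needs the comparison against 0x6.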

Later I will post another gotcha I ran into: the NEON implementation of _mm_div_ps (two Newton-Raphson iterations ...).

