Let's get straight to the point. When optimizing vector operations with SIMD instruction sets, _mm_movemask_ps is an important way to get at the results of comparisons.
But ARM NEON does not provide this instruction.
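To make the goal concrete, here is a minimal scalar sketch of what _mm_movemask_ps computes on SSE: it gathers the sign (top) bit of each of the four 32-bit lanes into the low four bits of an integer. The name movemask_reference is made up for this illustration and is not part of any library.

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of _mm_movemask_ps: bit i of the result is the sign bit of lane i.
// movemask_reference is a hypothetical name used only in this sketch.
inline std::uint32_t movemask_reference(const float (&v)[4]) {
    std::uint32_t mask = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint32_t bits;
        std::memcpy(&bits, &v[i], sizeof bits);  // read the raw IEEE-754 bits
        mask |= (bits >> 31) << i;               // move the sign bit to position i
    }
    return mask;
}
```

A compare result (all ones or all zeros per lane) has its top bit set exactly when the lane passed, so the same bit pattern serves as a pass/fail mask.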
Reference: http://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon
I haven't tested the approach in that post, but judging from the implementation it is not the best way to do it: its performance and robustness are not as good as DirectXMath's.
The following is the implementation from the DirectXMath library:
    uint8x8x2_t vtemp = vzip_u8(vget_low_u8(vResult), vget_high_u8(vResult));
    uint16x4x2_t vtemp1 = vzip_u16(vreinterpret_u16_u8(vtemp.val[0]), vreinterpret_u16_u8(vtemp.val[1]));
    return (vget_lane_u32(vreinterpret_u32_u16(vtemp1.val[1]), 1) == 0xFFFFFFFFU);
Here two zips are used to interleave the four components, finally producing a uint32_t value made up of the high (most significant) byte of each component, which is then used to check the result.
For example, 0xFFFFFFFFU means all components passed. 0xFF0000FF indicates that the y and z components of the vector [x,y,z,w] did not pass.
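The byte layout can be checked without NEON hardware. The sketch below emulates the two zips on plain byte arrays, assuming a little-endian lane layout and compare lanes that hold either 0 or 0xFFFFFFFF; zip8, zip16 and zip_mask_model are hypothetical names for this illustration only.

```cpp
#include <array>
#include <cstdint>

using Bytes8 = std::array<std::uint8_t, 8>;

// Emulates vzip_u8: interleave the bytes of a and b.
inline void zip8(const Bytes8& a, const Bytes8& b, Bytes8& lo, Bytes8& hi) {
    for (int i = 0; i < 4; ++i) {
        lo[2 * i] = a[i];     lo[2 * i + 1] = b[i];
        hi[2 * i] = a[i + 4]; hi[2 * i + 1] = b[i + 4];
    }
}

// Emulates vzip_u16: interleave the 16-bit elements (byte pairs) of a and b.
inline void zip16(const Bytes8& a, const Bytes8& b, Bytes8& lo, Bytes8& hi) {
    for (int i = 0; i < 2; ++i) {
        lo[4 * i]     = a[2 * i];     lo[4 * i + 1] = a[2 * i + 1];
        lo[4 * i + 2] = b[2 * i];     lo[4 * i + 3] = b[2 * i + 1];
        hi[4 * i]     = a[2 * i + 4]; hi[4 * i + 1] = a[2 * i + 5];
        hi[4 * i + 2] = b[2 * i + 4]; hi[4 * i + 3] = b[2 * i + 5];
    }
}

// lanes = compare result for [x, y, z, w].
inline std::uint32_t zip_mask_model(const std::array<std::uint32_t, 4>& lanes) {
    Bytes8 low{}, high{};
    for (int i = 0; i < 8; ++i) {
        low[i]  = static_cast<std::uint8_t>(lanes[i / 4] >> (8 * (i % 4)));      // bytes of x, y
        high[i] = static_cast<std::uint8_t>(lanes[2 + i / 4] >> (8 * (i % 4)));  // bytes of z, w
    }
    Bytes8 a0{}, a1{}, b0{}, b1{};
    zip8(low, high, a0, a1);
    zip16(a0, a1, b0, b1);
    // 32-bit lane 1 of the second zip output holds the top bytes x3, z3, y3, w3.
    return static_cast<std::uint32_t>(b1[4]) |
           (static_cast<std::uint32_t>(b1[5]) << 8) |
           (static_cast<std::uint32_t>(b1[6]) << 16) |
           (static_cast<std::uint32_t>(b1[7]) << 24);
}
```

Read from the least significant byte upward, the result holds the top bytes of x, z, y and w, which is why a failing y and z show up as the two middle bytes being zero.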
There is nothing wrong with this in itself; the problem is that the zip instructions are somewhat expensive.
Here is my implementation. Mask and celementindex are spelled out here for the sake of this post; in the actual implementation both variables are constexpr.
The final performance on a Lumia 950 XL is about 23% faster than DirectXMath (using a bitwise AND, a bitwise OR and a pairwise add (padd) instead of two zips).
    constexpr const uint32_t celementindex[4]{1, 2, 4, 8};

    static inline uint32_t vmaskq_u32(uint32x4_t& CR)
    {
        static const uint32x4_t Mask = vld1q_u32(celementindex);
        // Extract element index bitmask from compare result.
        uint32x4_t vtemp = vandq_u32(CR, Mask);
        uint32x2_t VL = vget_low_u32(vtemp);   // get low 2 uint32
        uint32x2_t VH = vget_high_u32(vtemp);  // get high 2 uint32
        VL = vorr_u32(VL, VH);
        VL = vpadd_u32(VL, VL);
        return vget_lane_u32(VL, 0);
    }
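For experimenting off-device, the same AND / OR / pairwise-add sequence can be laid out so that it also builds on machines without NEON. This is only a sketch: movemask_from_lanes is a hypothetical name, the input is taken as a plain array of four compare results rather than a uint32x4_t, and the scalar branch is a model of the NEON steps, not something taken from the original code.

```cpp
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// lanes = compare result for [x, y, z, w], each lane 0 or 0xFFFFFFFF.
// Returns a 4-bit mask: bit 0 = x, bit 1 = y, bit 2 = z, bit 3 = w.
inline std::uint32_t movemask_from_lanes(const std::uint32_t (&lanes)[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const std::uint32_t bits[4] = {1, 2, 4, 8};
    uint32x4_t t = vandq_u32(vld1q_u32(lanes), vld1q_u32(bits));  // keep one bit per lane
    uint32x2_t s = vorr_u32(vget_low_u32(t), vget_high_u32(t));   // [x|z, y|w]
    s = vpadd_u32(s, s);                                          // (x|z) + (y|w) in both lanes
    return vget_lane_u32(s, 0);
#else
    // Scalar model of the same three steps, for hosts without NEON.
    const std::uint32_t x = lanes[0] & 1u, y = lanes[1] & 2u;
    const std::uint32_t z = lanes[2] & 4u, w = lanes[3] & 8u;
    const std::uint32_t lane0 = x | z;  // bitwise OR of low and high halves, lane 0
    const std::uint32_t lane1 = y | w;  // bitwise OR of low and high halves, lane 1
    return lane0 + lane1;               // pairwise add; the bits never overlap
#endif
}
```

Because each lane contributes a distinct bit, the pairwise add can never carry, so it behaves like a final OR across the two remaining lanes.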
One more note on my implementation: if all you need is a zero/non-zero test, celementindex {1,2,4,8} could actually be {1,1,1,1} or any other non-zero values.
However, {1,2,4,8} makes the final result easier to interpret and matches the movemask convention.
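The difference between the two tables can be seen in a small scalar model of the AND / OR / pairwise-add steps. The names Lanes and weighted_mask are invented for this sketch, and the lanes are assumed to hold 0 or 0xFFFFFFFF.

```cpp
#include <array>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;

// Scalar model: AND each lane with its weight, OR the low and high halves,
// then add the two remaining lanes (what vpadd_u32 does).
inline std::uint32_t weighted_mask(const Lanes& cmp, const Lanes& weights) {
    const std::uint32_t lane0 = (cmp[0] & weights[0]) | (cmp[2] & weights[2]);
    const std::uint32_t lane1 = (cmp[1] & weights[1]) | (cmp[3] & weights[3]);
    return lane0 + lane1;
}
```

With {1,1,1,1} the result is non-zero whenever any lane passed, but a pass in x and a pass in z both give 1, so the lanes cannot be told apart; with {1,2,4,8} every combination maps to its own value.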
For example:
    if (vmaskq_u32(intersectionResult) & 0x6) // true if the y or z component of intersectionResult[x,y,z,w] passed
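Testing with a plain & answers "did any of these components pass"; requiring all of them needs a comparison against the same bits. A small sketch of both checks on the 4-bit mask, where LaneBit, any_lane_passed and all_lanes_passed are hypothetical helpers and not part of the code above:

```cpp
#include <cstdint>

// Bit assigned to each component in the {1,2,4,8} convention.
enum LaneBit : std::uint32_t { kLaneX = 1, kLaneY = 2, kLaneZ = 4, kLaneW = 8 };

// True if at least one of the selected components passed.
inline bool any_lane_passed(std::uint32_t mask, std::uint32_t lanes) {
    return (mask & lanes) != 0;
}

// True only if every selected component passed.
inline bool all_lanes_passed(std::uint32_t mask, std::uint32_t lanes) {
    return (mask & lanes) == lanes;
}
```

For a mask of 0x2 (only y passed), any_lane_passed(mask, kLaneY | kLaneZ) is true while all_lanes_passed(mask, kLaneY | kLaneZ) is false.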
Later I will post another pitfall I ran into: the NEON implementation of _mm_div_ps (two Newton iterations ...).