Implementation and optimization of the Hadamard transform in x264

Last Update:2018-07-25 Source: Internet

Author: User

Hadamard Transformation Theory

Many pages already introduce the theory, so I will not copy it here; here are two links.

The first is the section on the Hadamard transform in the "Computer Image Processing and Analysis" courseware from Harvey Mudd College.

(Jason Garrett-Glaser, the main developer of x264, studied at this school. It is a great engineering college.)

http://fourier.eng.hmc.edu/e161/lectures/wht/index.html

Approximate outline:

1. Definition of the Hadamard matrix

2. Fast Hadamard transform algorithm (Hadamard ordered)

3. Definition of the sequency-ordered Hadamard matrix (H.264 uses this definition)

4. Fast Hadamard transform algorithm (sequency ordered)
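To make items 1 and 3 of the outline concrete, here is a minimal sketch that builds the 4x4 Hadamard matrix in natural (Hadamard) order and counts the sign changes in each row. The function names are made up for this illustration and do not come from the courseware or from x264.

```c
#include <stdio.h>

#define N 4  /* transform size; must be a power of two */

/* Entry (i, j) of the natural-order Hadamard matrix:
 * +1 if i & j has an even number of set bits, otherwise -1. */
int hadamard_entry(unsigned i, unsigned j)
{
    unsigned v = i & j;
    int sign = 1;
    while (v) {
        sign = -sign;
        v &= v - 1;   /* clear the lowest set bit */
    }
    return sign;
}

/* Sequency of row i: how many times the sign flips along the row. */
int row_sequency(unsigned i)
{
    int changes = 0;
    for (unsigned j = 1; j < N; j++)
        if (hadamard_entry(i, j) != hadamard_entry(i, j - 1))
            changes++;
    return changes;
}

/* Print the rows sorted by sequency, i.e. the sequency-ordered matrix. */
void print_sequency_ordered(void)
{
    for (int s = 0; s < N; s++)
        for (unsigned i = 0; i < N; i++)
            if (row_sequency(i) == s) {
                for (unsigned j = 0; j < N; j++)
                    printf("%3d", hadamard_entry(i, j));
                printf("   (natural row %u, sequency %d)\n", i, s);
            }
}
```

In natural order the four rows have 0, 3, 1 and 2 sign changes; sorting them by that count gives the sequency-ordered matrix.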

http://www.cnblogs.com/xkfz007/articles/2616143.html

This is a Chinese blog post, "SATD Implementation analysis in X264".

Its advantage is that it is in Chinese, haha. The steps are also more detailed. Its weakness is that the part about H.264's use of sequency ordering is unclear.

Hadamard implementation in JM86

In the function SATD:

If use_hadamard is set, the residual block is Hadamard-transformed first, and then the absolute values of the 16 transformed coefficients are summed to give the cost.
{
    /*===== Hadamard transform =====*/
    /* d and diff refer to the same 16-entry residual block; m[16] is scratch. */

    /* Stages 1 and 2: butterflies across entries k, k+4, k+8, k+12. */
    m[ 0] = d[ 0] + d[12];
    m[ 4] = d[ 4] + d[ 8];
    m[ 8] = d[ 4] - d[ 8];
    m[12] = d[ 0] - d[12];
    m[ 1] = d[ 1] + d[13];
    m[ 5] = d[ 5] + d[ 9];
    m[ 9] = d[ 5] - d[ 9];
    m[13] = d[ 1] - d[13];
    m[ 2] = d[ 2] + d[14];
    m[ 6] = d[ 6] + d[10];
    m[10] = d[ 6] - d[10];
    m[14] = d[ 2] - d[14];
    m[ 3] = d[ 3] + d[15];
    m[ 7] = d[ 7] + d[11];
    m[11] = d[ 7] - d[11];
    m[15] = d[ 3] - d[15];

    d[ 0] = m[ 0] + m[ 4];
    d[ 8] = m[ 0] - m[ 4];
    d[ 4] = m[ 8] + m[12];
    d[12] = m[12] - m[ 8];
    d[ 1] = m[ 1] + m[ 5];
    d[ 9] = m[ 1] - m[ 5];
    d[ 5] = m[ 9] + m[13];
    d[13] = m[13] - m[ 9];
    d[ 2] = m[ 2] + m[ 6];
    d[10] = m[ 2] - m[ 6];
    d[ 6] = m[10] + m[14];
    d[14] = m[14] - m[10];
    d[ 3] = m[ 3] + m[ 7];
    d[11] = m[ 3] - m[ 7];
    d[ 7] = m[11] + m[15];
    d[15] = m[15] - m[11];

    /* Stages 3 and 4: butterflies within each group of four consecutive entries. */
    m[ 0] = d[ 0] + d[ 3];
    m[ 1] = d[ 1] + d[ 2];
    m[ 2] = d[ 1] - d[ 2];
    m[ 3] = d[ 0] - d[ 3];
    m[ 4] = d[ 4] + d[ 7];
    m[ 5] = d[ 5] + d[ 6];
    m[ 6] = d[ 5] - d[ 6];
    m[ 7] = d[ 4] - d[ 7];
    m[ 8] = d[ 8] + d[11];
    m[ 9] = d[ 9] + d[10];
    m[10] = d[ 9] - d[10];
    m[11] = d[ 8] - d[11];
    m[12] = d[12] + d[15];
    m[13] = d[13] + d[14];
    m[14] = d[13] - d[14];
    m[15] = d[12] - d[15];

    d[ 0] = m[ 0] + m[ 1];
    d[ 1] = m[ 0] - m[ 1];
    d[ 2] = m[ 2] + m[ 3];
    d[ 3] = m[ 3] - m[ 2];
    d[ 4] = m[ 4] + m[ 5];
    d[ 5] = m[ 4] - m[ 5];
    d[ 6] = m[ 6] + m[ 7];
    d[ 7] = m[ 7] - m[ 6];
    d[ 8] = m[ 8] + m[ 9];
    d[ 9] = m[ 8] - m[ 9];
    d[10] = m[10] + m[11];
    d[11] = m[11] - m[10];
    d[12] = m[12] + m[13];
    d[13] = m[12] - m[13];
    d[14] = m[14] + m[15];
    d[15] = m[15] - m[14];

    /*===== sum up =====*/
    /* Add the absolute values of the 16 coefficients, then halve. */
    for (dd = diff[k = 0]; k < 16; dd = diff[++k])
    {
        satd += (dd < 0 ? -dd : dd);
    }
    satd >>= 1;
}

The procedure is clear; see "Fast Hadamard transform algorithm (Hadamard ordered)" in the outline above.
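The unrolled listing can also be written as one 4-point butterfly applied with two different strides. The following is only a sketch under my own naming (butterfly4 and satd4x4 are not JM functions). It performs the same kind of additions and subtractions, but some coefficients may land in a different position than in the JM listing, which does not change the sum of absolute values.

```c
#include <stdlib.h>

/* In-place 4-point butterfly on x[0], x[stride], x[2*stride], x[3*stride]. */
void butterfly4(int *x, int stride)
{
    int a = x[0], b = x[stride], c = x[2 * stride], d = x[3 * stride];
    int m0 = a + d, m1 = b + c, m2 = b - c, m3 = a - d;
    x[0]          = m0 + m1;
    x[stride]     = m0 - m1;
    x[2 * stride] = m2 + m3;
    x[3 * stride] = m3 - m2;
}

/* SATD of a 4x4 residual block held in diff[16]. */
int satd4x4(const int diff[16])
{
    int d[16], satd = 0;
    for (int k = 0; k < 16; k++)
        d[k] = diff[k];
    for (int k = 0; k < 4; k++)      /* entries k, k+4, k+8, k+12 */
        butterfly4(d + k, 4);
    for (int k = 0; k < 4; k++)      /* entries 4k .. 4k+3 */
        butterfly4(d + 4 * k, 1);
    for (int k = 0; k < 16; k++)
        satd += abs(d[k]);
    return satd >> 1;
}
```

For example, a block of sixteen ones has a single nonzero coefficient of 16, so satd4x4 returns 8.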

One thing may be unclear. Most material covers only the one-dimensional Hadamard transform; for the two-dimensional case it just says to transform the rows of the image first and then transform the columns of the result. I was confused by this for a long time (I should eat more walnuts), but with this implementation in front of us it becomes clear: the one-dimensional transform is simply applied in two passes. The two passes can be done in either order; the result is the same.

In the first pass, the 4x4 block is split horizontally into four 1x4 vectors and a one-dimensional transform is applied to each, four transforms in total. That completes the row transform.

In the second pass, the resulting 4x4 matrix is split vertically into four 4x1 vectors and a one-dimensional transform is applied to each, again four in total. That completes the two-dimensional transform.

Non-optimized C version of the Hadamard transform in x264

In commit 5dc0aae2f900064d1f58579929a2285ab289a436, which is the first version,

in the function pixel_satd_wxh:

/* Row pass: one 4-point transform per row of the residual block. */
for( d = 0; d < 4; d++ )
{
    int s01, s23;
    int d01, d23;

    s01 = diff[d][0] + diff[d][1]; s23 = diff[d][2] + diff[d][3];
    d01 = diff[d][0] - diff[d][1]; d23 = diff[d][2] - diff[d][3];

    tmp[d][0] = s01 + s23;
    tmp[d][1] = s01 - s23;
    tmp[d][2] = d01 - d23;
    tmp[d][3] = d01 + d23;
}
/* Column pass: transform each column and accumulate the absolute values. */
for( d = 0; d < 4; d++ )
{
    int s01, s23;
    int d01, d23;

    s01 = tmp[0][d] + tmp[1][d]; s23 = tmp[2][d] + tmp[3][d];
    d01 = tmp[0][d] - tmp[1][d]; d23 = tmp[2][d] - tmp[3][d];

    i_satd += abs( s01 + s23 ) + abs( s01 - s23 ) + abs( d01 - d23 ) + abs( d01 + d23 );
}
You can see that the algorithm differs somewhat from the one in JM86, because it uses the sequency-ordered Hadamard transform.
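To see the difference in ordering, the two 4-point butterflies can be isolated and run on the same input. This is a sketch based on my reading of the two listings; the function names are made up for the comparison.

```c
#include <stdlib.h>

/* 4-point transform using the add/subtract pattern of the JM listing. */
void wht4_jm(const int in[4], int out[4])
{
    int m0 = in[0] + in[3], m1 = in[1] + in[2];
    int m2 = in[1] - in[2], m3 = in[0] - in[3];
    out[0] = m0 + m1;   /* + + + + */
    out[1] = m0 - m1;   /* + - - + */
    out[2] = m2 + m3;   /* + + - - */
    out[3] = m3 - m2;   /* + - + - */
}

/* 4-point transform using the pattern of the x264 listing. */
void wht4_x264(const int in[4], int out[4])
{
    int s01 = in[0] + in[1], s23 = in[2] + in[3];
    int d01 = in[0] - in[1], d23 = in[2] - in[3];
    out[0] = s01 + s23;   /* + + + +  (0 sign changes) */
    out[1] = s01 - s23;   /* + + - -  (1 sign change)  */
    out[2] = d01 - d23;   /* + - - +  (2 sign changes) */
    out[3] = d01 + d23;   /* + - + -  (3 sign changes) */
}

/* Sum of absolute values, the only thing SATD uses. */
int sum_abs4(const int v[4])
{
    return abs(v[0]) + abs(v[1]) + abs(v[2]) + abs(v[3]);
}
```

For the input {1, 2, 3, 4} the first function gives {10, 0, -4, -2} and the second gives {10, -4, 0, -2}. The coefficients are the same and only their order differs, so the sum of absolute values (16 here), and therefore the SATD, is the same.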

To be honest, I did not fully understand the explanation of the "Fast Hadamard transform algorithm (sequency ordered)" in the courseware; perhaps an expert can explain it. The code itself is not hard to follow, though, and is a typical butterfly algorithm. I assume its order differs from JM86 because it uses sequency ordering. As I understand it, sequency ordering re-sorts the rows of the Hadamard matrix by their number of sign changes: the row with the fewest sign changes goes at the top and the row with the most at the bottom. This should concentrate more of the energy in the upper-left corner, which helps the compression steps that follow.

Optimized C version of the Hadamard transform in x264

See the patch "1.6x faster satd_c (and sa8d and hadamard_ac) with pseudo-simd".

static NOINLINE int x264_pixel_satd_4x4( pixel *pix1, intptr_t i_pix1, pixel *pix2, intptr_t i_pix2 )
{
    sum2_t tmp[4][2];
    sum2_t a0, a1, a2, a3, b0, b1;
    sum2_t sum = 0;
    /* Row pass: each sum2_t holds two coefficients, one in the low half
     * and one in the high half (shifted by BITS_PER_SUM). */
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = pix1[0] - pix2[0];
        a1 = pix1[1] - pix2[1];
        b0 = (a0+a1) + ((a0-a1)<<BITS_PER_SUM);
        a2 = pix1[2] - pix2[2];
        a3 = pix1[3] - pix2[3];
        b1 = (a2+a3) + ((a2-a3)<<BITS_PER_SUM);
        tmp[i][0] = b0 + b1;
        tmp[i][1] = b0 - b1;
    }
    /* Column pass: two iterations instead of four, since each one
     * transforms two packed columns at once. */
    for( int i = 0; i < 2; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        a0 = abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
        sum += ((sum_t)a0) + (a0>>BITS_PER_SUM);
    }
    return sum >> 1;
}
Looking carefully, you can see that the fast Hadamard transform algorithm is symmetric.

So the main idea of this optimization is to rearrange the data during the first (row) transform, packing two results into the high and low halves of a single integer

(a layout like tmp[i][0] = x|x; tmp[i][1] = x|x), so that the subsequent column transform can process two values in parallel.
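The packing idea can be tried in isolation. The sketch below assumes 16-bit lanes inside a 32-bit integer (BITS_PER_SUM of 16, as I would expect for 8-bit pixels). pack2, low_lane and high_lane are helper names invented for this example, and abs2 follows the per-lane absolute value trick as I understand it from x264.

```c
#include <stdint.h>

#define BITS_PER_SUM 16          /* assumed lane width */
typedef uint16_t sum_t;          /* one lane */
typedef uint32_t sum2_t;         /* two packed lanes */

/* Pack two small signed values: lo in the low lane, hi in the high lane.
 * A negative lo borrows from the high lane; abs2 undoes that borrow. */
sum2_t pack2(int lo, int hi)
{
    return (sum2_t)lo + ((sum2_t)hi << BITS_PER_SUM);
}

/* Absolute value of both lanes at once. */
sum2_t abs2(sum2_t a)
{
    /* s is 0xFFFF in every lane whose sign bit is set, 0 elsewhere. */
    sum2_t s = ((a >> (BITS_PER_SUM - 1)) & (((sum2_t)1 << BITS_PER_SUM) + 1)) * ((sum_t)-1);
    return (a + s) ^ s;
}

/* Read the lanes back; only meaningful once both lanes are non-negative. */
int low_lane(sum2_t a)  { return (sum_t)a; }
int high_lane(sum2_t a) { return (int)(a >> BITS_PER_SUM); }
```

With x = pack2(3, -5) and y = pack2(-7, 2), a single addition x + y yields the lanes (-4, -3), and abs2 of that gives 4 and 3. One integer operation does the work of two, which is what the column pass exploits.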

A normal 4x4 fast Hadamard transform needs 8 additions per one-dimensional transform, so its cost is 8 * 4 (row transforms) + 8 * 4 (column transforms) = 64 additions.

The new algorithm needs 4 * 8 + 2 * 8 = 48, a speedup of about 1.3x (64 / 48).
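As a sanity check, the packed version can be compared against a plain scalar version. This is a self-contained sketch, not the x264 source: the typedefs, BITS_PER_SUM value, abs2 body and the written-out butterfly (in place of the HADAMARD4 macro) are my assumptions for 8-bit pixels, and both function names are made up.

```c
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_SUM 16      /* assumed: 8-bit pixels, two 16-bit lanes */
typedef uint8_t  pixel;
typedef uint16_t sum_t;
typedef uint32_t sum2_t;

/* Per-lane absolute value of two packed 16-bit sums. */
sum2_t abs2(sum2_t a)
{
    sum2_t s = ((a >> (BITS_PER_SUM - 1)) & (((sum2_t)1 << BITS_PER_SUM) + 1)) * ((sum_t)-1);
    return (a + s) ^ s;
}

/* Packed version: two coefficients per integer in the column pass. */
int satd4x4_packed(const pixel *pix1, intptr_t i_pix1,
                   const pixel *pix2, intptr_t i_pix2)
{
    sum2_t tmp[4][2], sum = 0;
    for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2) {
        sum2_t a0 = pix1[0] - pix2[0];
        sum2_t a1 = pix1[1] - pix2[1];
        sum2_t b0 = (a0 + a1) + ((a0 - a1) << BITS_PER_SUM);
        sum2_t a2 = pix1[2] - pix2[2];
        sum2_t a3 = pix1[3] - pix2[3];
        sum2_t b1 = (a2 + a3) + ((a2 - a3) << BITS_PER_SUM);
        tmp[i][0] = b0 + b1;
        tmp[i][1] = b0 - b1;
    }
    for (int i = 0; i < 2; i++) {
        sum2_t t0 = tmp[0][i] + tmp[1][i], t1 = tmp[0][i] - tmp[1][i];
        sum2_t t2 = tmp[2][i] + tmp[3][i], t3 = tmp[2][i] - tmp[3][i];
        sum2_t a = abs2(t0 + t2) + abs2(t0 - t2) + abs2(t1 + t3) + abs2(t1 - t3);
        sum += (sum_t)a + (a >> BITS_PER_SUM);   /* add low and high lanes */
    }
    return (int)(sum >> 1);
}

/* Scalar reference: one coefficient per int, four column iterations. */
int satd4x4_scalar(const pixel *pix1, intptr_t i_pix1,
                   const pixel *pix2, intptr_t i_pix2)
{
    int tmp[4][4], sum = 0;
    for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2) {
        int a0 = pix1[0] - pix2[0], a1 = pix1[1] - pix2[1];
        int a2 = pix1[2] - pix2[2], a3 = pix1[3] - pix2[3];
        int s01 = a0 + a1, s23 = a2 + a3, d01 = a0 - a1, d23 = a2 - a3;
        tmp[i][0] = s01 + s23;
        tmp[i][1] = s01 - s23;
        tmp[i][2] = d01 - d23;
        tmp[i][3] = d01 + d23;
    }
    for (int j = 0; j < 4; j++) {
        int s01 = tmp[0][j] + tmp[1][j], s23 = tmp[2][j] + tmp[3][j];
        int d01 = tmp[0][j] - tmp[1][j], d23 = tmp[2][j] - tmp[3][j];
        sum += abs(s01 + s23) + abs(s01 - s23) + abs(d01 - d23) + abs(d01 + d23);
    }
    return sum >> 1;
}
```

Calling both functions on the same pair of 4x4 blocks (stride 4 for contiguous arrays) should return the same value. The packed version runs its column loop twice where the scalar one runs it four times.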

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
