New Progress in FFT Lens Effect

When reprinting, please credit the source as the KlayGE game engine. The permanent link of this article is http://www.klayge.org/?p=1920

The FFT lens effect is now complete and integrated into the KlayGE development version, under the name fftlenseffectpostprocess. A command-line tool has also been written to generate the texture used by the lens effect. Although the FFT method can produce a variety of complex lens effects in a single pass, its current performance is lower than that of the default approach based on multiple Gaussian blurs. The biggest overhead is the FFT itself. The rest of this article focuses on the progress and future of GPU FFT.

Not long ago, when I implemented FFT with a pixel shader (PS), I mentioned that a compute shader (CS) implementation would be more efficient. The earlier ocean example uses a CS4 FFT for wave simulation (adapted from an NVIDIA sample), and its input and output are 1D buffers. To make it more general, the input and output were changed to 2D textures when the code moved into the KlayGE core. Unfortunately, CS4 cannot write to textures, so an extra PS pass is needed to convert the buffer into a texture. The CS5 implementation is basically the same as the CS4 one, except that it reads and writes directly. Now that both the CS4 and CS5 FFTs are implemented, their performance can finally be compared.

Algorithm differences

Because of output restrictions, the pixel shader FFT can only implement the simplest radix-2 algorithm, which takes two complex numbers as input and produces two complex numbers each time. A compute shader can write to arbitrary locations, so it can implement the more complex radix-4, radix-8, or even mixed-radix algorithms. Here I implement radix-8, which processes 8 complex numbers at a time. As a result, a 512x512 transform needs only 6 passes (3 per dimension, since 512 = 8^3), far fewer than the 18 passes (9 per dimension, since 512 = 2^9) of the PS version.
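
The pass counts can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one pass per radix stage and a separate set of passes for rows and for columns; the function name `fft_passes` is illustrative and not part of KlayGE.

```python
def fft_passes(size: int, radix: int) -> int:
    """Passes for a size x size 2D FFT, assuming one pass per radix stage."""
    stages = 0
    n = 1
    while n < size:
        n *= radix
        stages += 1
    if n != size:
        raise ValueError(f"{size} is not a power of {radix}")
    return 2 * stages  # one set of stages for rows, one for columns


print(fft_passes(512, 2))  # 18 (radix-2, as in the PS version)
print(fft_passes(512, 8))  # 6  (radix-8, as in the CS version)
```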

Another difference is that, because of CS4 restrictions, the CS4 FFT stores its data in a float2 buffer (real and imaginary parts) with only three channels. The RGB values of the pixels are arranged in planar order: R0/R1/R2...Rn G0/G1/G2...Gn B0/B1/B2...Bn. The PS and CS5 implementations use four channels in interleaved order: R0G0B0A0/R1G1B1A1/R2G2B2A2...RnGnBnAn. In terms of computation alone, CS4 therefore processes one channel fewer, so its scalar computation and bandwidth consumption are lower than those of PS and CS5.
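
The two arrangements can be illustrated with plain lists. This is only a sketch of the element ordering described above; the function names are hypothetical, and the string labels stand in for the actual complex values.

```python
def planar_rgb(pixels):
    """Planar order: R0 R1 ... Rn, G0 G1 ... Gn, B0 B1 ... Bn (three channels)."""
    return [p[c] for c in range(3) for p in pixels]


def interleaved_rgba(pixels):
    """Interleaved order: R0 G0 B0 A0, R1 G1 B1 A1, ... (four channels)."""
    return [p[c] for p in pixels for c in range(4)]


pixels = [("R0", "G0", "B0", "A0"), ("R1", "G1", "B1", "A1")]
print(planar_rgb(pixels))        # ['R0', 'R1', 'G0', 'G1', 'B0', 'B1']
print(interleaved_rgba(pixels))  # ['R0', 'G0', 'B0', 'A0', 'R1', 'G1', 'B1', 'A1']
```

The planar list holds 3 values per pixel and the interleaved one holds 4, which is where the difference in computation and bandwidth comes from.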

Performance Comparison

The input here is 512x512. Forward FFT and inverse FFT (IFFT) are compared on high-end, mid-range, and low-end graphics cards.

FFT/IFFT time in ms:

Implementation  NV GTX 580  AMD HD 6670  NV 9800 GT     NV NVS 4200M
PS              2.12/2.28   7.09/7.22    7.99/8.16      33.30/33.30
CS4             0.70/0.70   2.44/2.47    3.33/3.45      10.36/9.80
CS5             0.76/0.76   2.40/2.39    Not supported  8.86/8.81

The results show that the CS4 implementation takes roughly a third of the time of the PS one (about 2.4 to 3.4 times as fast, depending on the card and the transform direction). This comes partly from the reduced computation of radix-8 and partly from the smaller number of passes. Interestingly, except on the high-end GTX 580, CS5 is slightly faster than CS4 on the cards that support it. On NVIDIA cards, I/O in 64-bit formats is more efficient than in other formats, which also gives CS4 some advantage.

This result also suggests that CS5 still has room for optimization: it does one third more computation than CS4 (four channels instead of three) and its I/O efficiency is not the highest, yet it is already on par with or even faster than CS4.
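
The ratios behind this comparison can be recomputed from the forward-FFT timings in the table. The sketch below is only arithmetic on those published numbers; the dictionary layout and the function name `speedup` are illustrative assumptions.

```python
# Forward FFT times in ms, copied from the table above (None = not supported).
timings_ms = {
    "GTX 580": {"PS": 2.12, "CS4": 0.70, "CS5": 0.76},
    "HD 6670": {"PS": 7.09, "CS4": 2.44, "CS5": 2.40},
    "9800 GT": {"PS": 7.99, "CS4": 3.33, "CS5": None},
    "NVS 4200M": {"PS": 33.30, "CS4": 10.36, "CS5": 8.86},
}


def speedup(card: str, baseline: str, impl: str):
    """How many times faster `impl` is than `baseline`, or None if unsupported."""
    base = timings_ms[card][baseline]
    time = timings_ms[card][impl]
    if base is None or time is None:
        return None
    return base / time


for card in timings_ms:
    cs4 = speedup(card, "PS", "CS4")
    cs5 = speedup(card, "CS4", "CS5")
    cs5_text = "n/a" if cs5 is None else f"{cs5:.2f}x"
    print(f"{card}: CS4 vs PS {cs4:.2f}x, CS5 vs CS4 {cs5_text}")
```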

Future

For this version, work on the FFT lens effect is finished for now. Beyond further optimization, several algorithm-level improvements are worth trying.

Data format

All intermediate formats are currently ABGR32F. With ABGR16F, some artifacts appear in dark areas. If those artifacts are acceptable, or if some compensation method is used to improve the accuracy of 16F, the speed can be improved considerably. The input and output formats are not restricted in the same way: the input can be ABGR16F or the even smaller B10G11R11F, and the output can be ABGR16F.
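
One way to get a feel for the precision gap is to round values through IEEE 754 half precision. This minimal sketch uses the `e` format code of Python's `struct` module; it only illustrates the resolution of a 16-bit float and is not taken from the KlayGE implementation. Whether this loss of small contributions next to large values is the actual cause of the dark-area artifacts is an assumption.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Half precision has an 11-bit significand, so a small term added to a
# large one can vanish entirely.
print(to_half(2048.0 + 1.0))   # 2048.0, the +1 is lost
print(to_half(1000.0 + 0.2))   # 1000.0, the spacing near 1000 is 0.5
print(to_half(0.1) == 0.1)     # False, 0.1 is only approximated
```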

Pure real number FFT

Because the input is an image, it contains only real numbers, and the FFT of purely real data is conjugate-symmetric: the second half of the output mirrors the first. This can be exploited to improve the algorithm. In FFTW, this kind of transform is called r2c. For an input of length N, it outputs only N/2 + 1 complex numbers instead of N, which reduces both I/O and computation. Another direction to try is the DCT. As a special case of the FFT, the DCT with a suitable addressing arrangement also obeys the convolution theorem, so a DCT whose input and output are both purely real could also reduce I/O.
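
The symmetry can be verified on a small example. This is a sketch using a naive O(N^2) DFT for clarity, not an efficient r2c transform; the helper names `dft`, `rfft`, and `full_from_half` are illustrative and the sample values are arbitrary.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform, O(N^2), for illustration only."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def rfft(x):
    """Keep only bins 0..N/2 of a real input, as an r2c transform would."""
    return dft(x)[: len(x) // 2 + 1]


def full_from_half(half, n):
    """Rebuild all n bins from the first n/2 + 1 using X[n-k] = conj(X[k])."""
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.5, 4.0]  # arbitrary real samples
half = rfft(signal)
rebuilt = full_from_half(half, len(signal))
print(len(half))  # 5, i.e. N/2 + 1 for N = 8
print(all(abs(a - b) < 1e-9 for a, b in zip(rebuilt, dft(signal))))  # True
```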

sFFT

In January this year, an MIT team proposed the sparse fast Fourier transform (sFFT) algorithm. When the input signal is sparse, sFFT can be several times faster than the traditional FFT. The input to the lens effect contains only the pixels whose brightness exceeds a certain threshold, so it is highly sparse, and sFFT might therefore be usable for acceleration.
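
How sparse such an input is can be measured directly. The sketch below applies a simple bright-pass threshold and reports the fraction of non-zero pixels; the function names, the threshold, and the tiny sample image are all hypothetical, and this does not implement sFFT itself.

```python
def bright_pass(image, threshold):
    """Keep pixels brighter than threshold and zero out the rest."""
    return [[v if v > threshold else 0.0 for v in row] for row in image]


def nonzero_fraction(image):
    """Fraction of pixels that are non-zero (lower means sparser)."""
    total = sum(len(row) for row in image)
    nonzero = sum(1 for row in image for v in row if v != 0.0)
    return nonzero / total


# Hypothetical 4x4 luminance image with two bright spots.
image = [
    [0.1, 0.2, 0.1, 0.0],
    [0.3, 5.0, 0.2, 0.1],
    [0.1, 0.2, 0.4, 0.3],
    [0.0, 0.1, 0.2, 8.0],
]
print(nonzero_fraction(bright_pass(image, 1.0)))  # 0.125
```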
