Optimization of SALVIA 0.5.2


Summary

The optimization of SALVIA 0.5.2 was a process full of ups and downs, but the outcome is simple to state: on a Core 2 Duo T5800 (2.0 GHz x 2), performance on the benchmark scenes improved by roughly 60% and 26%.

Background

SALVIA's rendering process is mainly divided into the following stages:

- fetch the vertices to be transformed according to the index buffer;
- transform the vertices with the vertex shader; each transformed vertex is output as several float4 attributes;
- rasterize the triangle: SALVIA's rasterizer splits a triangle into 4x4 pixel blocks, handles partially covered blocks with a coverage mask, and interpolates the attributes for each pixel;
- send the interpolated pixels to the pixel shader;
- blend the shaded results into the back buffer with the blend shader.

Test scenes: Sponza, with about 0.26 million faces and some 20 diffuse textures (1024x1024); PartOfSponza, with about 200 faces and 4 diffuse textures (1024x1024); and ComplexMesh, with about 20 thousand faces, no textures, and lighting only.

In the initial version (v1231) the main performance bottleneck is the interpolation stage, which accounts for more than half of the total time (50%-70%). The other stages either have limited impact on performance or leave little room for optimization, so the past week of optimization concentrated on interpolation.

There are two common implementations of linear interpolation: UV (barycentric) interpolation, and accumulation of ddx and ddy. UV interpolation first computes the pixel's u and v (basically from area ratios; if you have forgotten how, review your middle-school geometry) and then applies the interpolation formula

    pixel = v0 * u + v1 * v + v2 * (1 - u - v)

The accumulation method instead picks a master vertex, computes ddx and ddy at that vertex, and evaluates

    pixel = v0 + ddx * offset_x + ddy * offset_y

In graphics, however, the interpolation also has to be perspective-corrected to obtain results that are linear in 3D space, and the correction is carried out in perspective space. First v0, v1 and v2 are transformed into perspective space, giving projected_v0, projected_v1 and projected_v2. UV interpolation then becomes

    pixel = (projected_v0 * u + projected_v1 * v + projected_v2 * (1 - u - v)) / pixel_w

and the ddx/ddy accumulation formula becomes

    pixel = (projected_v0 + projected_ddx * offset_x + projected_ddy * offset_y) / pixel_w

He Yong (Graphixer) has also written a renderer; it is much faster than mine (about 4-6 times) and uses UV interpolation. The renderer gameKnife wrote in two weeks is seven times faster than the half-finished product I have spent five years on; his solution is to lerp along the scanline and then lerp again to each pixel.
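To make the two schemes concrete, here is a minimal sketch for a single scalar attribute. This is not SALVIA's code: the struct and function names are invented for illustration, and it assumes that pixel_w in the formulas above denotes the interpolated 1/w (each attribute is divided by w at the vertices, and 1/w is interpolated alongside the attributes).

    // Sketch only: perspective-correct interpolation of one scalar attribute,
    // written both ways. Not SALVIA's actual types or functions.
    struct projected_vertex
    {
        float attr_over_w; // attribute already divided by the vertex's w
        float inv_w;       // 1/w, interpolated exactly like the attributes
    };

    // UV (barycentric) style: weight the three projected vertices by u, v, 1-u-v.
    float interpolate_uv(const projected_vertex& v0, const projected_vertex& v1,
                         const projected_vertex& v2, float u, float v)
    {
        float w2    = 1.0f - u - v;
        float attr  = v0.attr_over_w * u + v1.attr_over_w * v + v2.attr_over_w * w2;
        float inv_w = v0.inv_w       * u + v1.inv_w       * v + v2.inv_w       * w2;
        return attr / inv_w; // perspective correction: divide by interpolated 1/w
    }

    // ddx/ddy accumulation style: start from a master vertex and add per-pixel steps.
    float interpolate_ddx_ddy(const projected_vertex& v0, const projected_vertex& ddx,
                              const projected_vertex& ddy, float offset_x, float offset_y)
    {
        float attr  = v0.attr_over_w + ddx.attr_over_w * offset_x + ddy.attr_over_w * offset_y;
        float inv_w = v0.inv_w       + ddx.inv_w       * offset_x + ddy.inv_w       * offset_y;
        return attr / inv_w;
    }

Both variants end with the same per-pixel division; they differ only in how the linear part in screen space is evaluated.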
SALVIA adopts the accumulation method:

    struct vs_output
    {
        float4 attributes[MAX_ATTRIBUTE_COUNT];
    };

    // attributes at the corner of the 4x4 block, in perspective (projected) space
    vs_output projected_corner =
        projected_v0 + projected_ddx * offset_x + projected_ddy * offset_y;

    // per-pixel perspective-correction value
    float inv_w;

    // the final output: a 4x4 block of pixel inputs
    pixel_input px_in[4][4];

    vs_output projected_scanline_start = projected_corner;
    for (int i = 0; i < 4; ++i)
    {
        vs_output projected_pixel = projected_scanline_start;
        for (int j = 0; j < 4; ++j)
        {
            // convert from perspective space back to linear space and output
            px_in[i][j] = unproject(projected_pixel);
            // accumulate in the x direction (perspective space)
            projected_pixel += projected_ddx;
        }
        // accumulate in the y direction (perspective space)
        projected_scanline_start += projected_ddy;
    }

Before this round of optimization MAX_ATTRIBUTE_COUNT was fairly large; in v1231 it is 32. Obviously not all attributes need to be computed, so a small trick was applied: only the attributes actually in use are touched. To reduce branching at the same time, the per-attribute loops are even written as templates, for example

    template <int N>
    void sub_n(vs_output& out, const vs_output& v0, const vs_output& v1)
    {
        for (int i = 0; i < N; ++i)
        {
            out.attributes[i] = v0.attributes[i] - v1.attributes[i];
        }
    }

and selected through function pointers, so that the compiler can unroll the loops and eliminate the per-attribute branch. Judging from the actual compiled code, however, this part is not expanded into the desired form, perhaps because the compiler considers x86 branch prediction good enough already. This "optimization" is present in v1231.

First round of optimization: unproject, operator+= and operator=

The first profiling run used the PartOfSponza and Sponza benchmarks; unproject, operator+= and operator= together take about 15-20% of the time. Their original implementations are ordinary scalar code that neither requires alignment nor uses SIMD, so it was natural to expect a big win from SIMD. In v1232 the intermediate vertices and pixel inputs are therefore allocated with 16-byte alignment, and unproject, operator+= and operator= are rewritten with SSE (a sketch of such a rewrite is shown below). In the benchmark scores PartOfSponza improved by about 20%, but no noticeable frame-rate improvement showed up for ComplexMesh and Sponza. In fact I had been told before the optimization that, because of the way modern CPUs execute scalar code (superscalar, out-of-order execution and so on), four-wide SSE often ends up only about 50% faster than the equivalent scalar code. Moreover these functions consist of extremely simple instructions, and the bottleneck falls squarely on the arithmetic itself: after the unproject rewrite, for example, the hot spot is _mm_mul_ps (3.7%), which leaves no further room for optimization.
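For reference, here is a minimal sketch of what such an SSE rewrite can look like. It is an illustration under assumptions rather than SALVIA's actual implementation: the aligned_vs_output type, its layout and the attribute count are made up, and only operator+= and unproject are shown.

    #include <xmmintrin.h> // SSE intrinsics: _mm_add_ps, _mm_mul_ps, _mm_set1_ps

    // Illustrative 16-byte-aligned attribute block; not SALVIA's real type.
    const int MAX_ATTRIBUTE_COUNT = 32; // the v1231/v1232 value; later reduced to 8

    struct alignas(16) aligned_vs_output
    {
        __m128 attributes[MAX_ATTRIBUTE_COUNT]; // each attribute is a float4
        float  inv_w;                           // 1/w, used for perspective correction
    };

    // operator+=: one packed addition per float4 attribute.
    inline aligned_vs_output& operator+=(aligned_vs_output& lhs, const aligned_vs_output& rhs)
    {
        for (int i = 0; i < MAX_ATTRIBUTE_COUNT; ++i)
        {
            lhs.attributes[i] = _mm_add_ps(lhs.attributes[i], rhs.attributes[i]);
        }
        lhs.inv_w += rhs.inv_w;
        return lhs;
    }

    // unproject: multiply every attribute by w = 1 / inv_w to return to linear space.
    inline aligned_vs_output unproject(const aligned_vs_output& projected)
    {
        aligned_vs_output result;
        const __m128 w = _mm_set1_ps(1.0f / projected.inv_w); // the single scalar division
        for (int i = 0; i < MAX_ATTRIBUTE_COUNT; ++i)
        {
            result.attributes[i] = _mm_mul_ps(projected.attributes[i], w); // the profiled hot spot
        }
        result.inv_w = projected.inv_w;
        return result;
    }

Each call boils down to a short run of very simple packed instructions, which is consistent with the observation above that the remaining time is spent in _mm_mul_ps itself.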
Second round of optimization: adjusting the interpolation algorithm

Another profiling run preceded the second round. Since PartOfSponza's performance was by then basically satisfactory, the goal of this round was mainly to improve Sponza. The functions at the top of the profile were again small ones: sub_n, unproject, operator+= and tex2D. Routine optimization of sub_n left performance unchanged, which was expected, so the second round concentrated on the interpolation algorithm itself.

Before touching the algorithm I made a rough estimate of the cost of the existing code. Assuming each pixel has N attributes to interpolate, the existing algorithm costs, per pixel on average:

- corner (amortized over the 16 pixels of a block): 3N/16 reads + 2N/16 multiplications + 2N/16 additions + N/16 writes;
- += in x (per pixel): 2N reads + N additions + N writes;
- unproject (per pixel): N reads + 1 scalar division + N multiplications + N writes;
- += in y (amortized over a row of 4 pixels): 2N/4 reads + N/4 additions + N/4 writes;
- = in y (amortized): N/4 reads + N/4 writes.

Because each of these operations is called through a function pointer, none of them is optimized across call boundaries. So the first step was to merge some of the operations, for example += with *, to reduce the reads and writes; unfortunately the effect was not obvious. The second step was to go after the algorithm itself. Accumulation exists to save multiplications, but it may cost too much memory traffic, so instead the formula

    pixel = (projected_v0 + projected_ddx * offset_x + projected_ddy * offset_y) / pixel_w

is applied directly, which costs 3N reads, 2N multiplications, 2N additions, another N multiplications and N writes per pixel (assuming registers are sufficient), not counting the corner computation. Compared with accumulation this trades roughly 3N/4 reads, N/2 + N writes and N/4 additions for 2N extra multiplications. I had assumed the code was I/O-bound, so this looked like it could buy some performance; the results showed the trade was simply not worth it, and overall performance neither rose nor fell.

Third round of optimization: reducing memory usage

Although every operation touches only the attributes in use, the storage itself is still wasted, and the large memory footprint may cost some performance, so MAX_ATTRIBUTE_COUNT was reduced from 32 to 8. The result was astonishing: performance instantly improved by 20-30%. On top of that, SSE, for reasons unknown, began to pay off: enabling it now improved performance by another 10-15%. My guess is that the paging frequency dropped and the cache hit rate rose, but without a tool like VTune this is hard to verify.

Fourth round of optimization: an extra bonus from reduced precision sensitivity

After the previous round of optimization, PartOfSponza ran into a precision problem. Because triangles are not clipped against the top, bottom and side planes of the view frustum, very large triangles can appear, and once the starting point is chosen badly, large interpolation errors can follow. Earlier versions used /fp:precise to reduce the chance of the problem appearing, but SSE makes that remedy harder to apply, so instead I took some measures to improve the accuracy itself. After all the major problems were fixed, compiling the whole of SALVIA with /fp:fast finally brought a performance gain of about 0-10%.
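To put the third-round change in perspective, here is a rough footprint calculation. The layout is an assumption carried over from the sketches above (each attribute is a 16-byte float4), not SALVIA's exact data structures:

    #include <cstdio>

    // Rough footprint of one 4x4 block of interpolated pixel inputs, assuming each
    // attribute is a 16-byte float4. Illustrative only, not SALVIA's real layout.
    int main()
    {
        const int block_pixels    = 4 * 4; // SALVIA rasterizes 4x4 pixel blocks
        const int attribute_bytes = 16;    // sizeof(float4)

        for (int max_attributes : {32, 8}) // the v1231 value vs. the third-round value
        {
            const int per_pixel = max_attributes * attribute_bytes;
            const int per_block = per_pixel * block_pixels;
            std::printf("MAX_ATTRIBUTE_COUNT = %2d: %3d bytes per pixel, %4d bytes per 4x4 block\n",
                        max_attributes, per_pixel, per_block);
        }
        return 0;
    }

With 32 attribute slots each 4x4 block carries 8 KB of interpolated inputs, against 2 KB with 8 slots; on a Core 2 with a 32 KB L1 data cache that difference alone is enough to make the cache-hit-rate explanation plausible.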
