Gkengine Rendering Optimization

Document directory
  • First round, battle against resolution
  • Second round to solve new problems
  • The third round introduces the hybrid rendering pipeline
  • Rendering Configuration:
  • Test results:
  • Comparison of results:

Last time I wrote up part one of the gkengine development summary. I then posted the binary demo on opengpu, where it attracted a lot of attention.

Link to the opengpu post:

http://www.opengpu.org/forum.php?mod=viewthread&tid=15246

Many experts there raised the issue of performance, so I put part two of the development summary on hold. Instead I analyzed the rendering process carefully, found the performance bottleneck, tried to break through it, then analyzed the next bottleneck, and so on.

With the optimization done, the renderer structure has been refactored substantially, which should make the rest of the rendering summary easier to write.

As an aside: the earlier gkengine binary demo has been uploaded to the project homepage.

https://gkengine.codeplex.com/

The gkengine project is going open source, and the code is being tidied up. Once that is done, it will be uploaded to CodePlex hosting. Everyone is welcome to join and learn from each other!

In the end, after two weeks of spare time, the performance of the demo's opening shot was improved to 240% of the original. Some of the analysis and optimization methods used along the way are interesting enough to record.

Mission start

It was a weekend. I first locked the demo's camera in place in code, then started up the Intel GPA graphics analysis tool.

In short, GPA is Intel's graphics analysis tool, similar to PIX from the DirectX SDK and NVIDIA's well-known PerfHUD. GPA's biggest advantages are its ease of use and its stability, which I personally consider the weak points of PIX and PerfHUD. So I use GPA for general, simple analysis tasks.

The download page for the tool is here: http://software.intel.com/en-us/vcsource/tools/intel-gpa

GPA has since been updated to version 2013 R2, but I personally think the older 4.3 release is the best one to use: it supports NVIDIA and ATI graphics cards well, while later versions are less friendly to them. Intel also does not seem to offer previous versions for download, so you have to search for them on other websites...

With it we get precise GPU timings and detailed resource information for each stage:

Rendering performance: GTX 560, 104 frames/s, 9.6 ms/frame

This is the GPA timing analysis of the first shot in the 0425 demo build. In brief, the following stand out as unreasonable bottlenecks:

SSAO: full-screen AO computation, which takes half of the overall scene shading time.

Shadowmaskgen: shadow mask generation. Its total cost is about the same as SSAO.

Postprocess: fog, HDR, DOF, and other post-processing effects consume a lot of time.

Reflectmap: reflection map generation. This scene has no reflective surface, so reflection map generation should have been culled.

First round, battle against resolution

The bottlenecks above are all post-processing algorithms, and all of them are pixel-compute-intensive. So if we can significantly cut the amount of pixel computation while largely preserving rendering quality, performance can be doubled.

1. SSAO optimization

The SSAO shader has 169 assembly instructions, 11 of them sampling instructions, and it runs for every pixel of the rendering resolution each frame. Given the complexity of the scene, I optimized the rotation-texture sampling and switched SSAO to half-resolution processing (downsample). The computation drops to 1/4, with very little loss in quality.

2. Shadow mask optimization

Using the same strategy, half-size rendering, the computation is reduced to 1/4. However, rendering errors can then appear in the subsequent shading phase.

Because the mask is half-size, some pixels in the full-size shading phase pick up a non-shadow value (unshadowed white fringes appear along the edge of the tree trunk and inside the shadow behind the huts).

The solution is to take a second shadow sample in the shading phase, at the pixel to the lower right of each sample point. The minimum of the two values is used as the shadow value of the current pixel, which filters out the error. This approach has an unavoidable drawback: it can produce a dark fringe where an object shadows itself. However, the artifacts caused by these dark fringes are entirely acceptable.
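
The two-sample filter described above can be sketched on the CPU as follows. This is only an illustration under stated assumptions: `ShadowMask` and `ShadowForPixel` are hypothetical names, the mask is assumed to store 1.0 for lit and 0.0 for shadowed, and the "lower right" neighbour is taken one mask texel away, which the article does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-resolution shadow mask: 1.0 = fully lit, 0.0 = fully shadowed (assumed convention).
struct ShadowMask {
    int width;
    int height;
    std::vector<float> texels;  // row-major, width * height entries

    // Point sample with coordinates clamped to the mask border.
    float At(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Shadow term for the full-resolution pixel (px, py): read the mask texel that
// covers the pixel and its lower-right neighbour, then keep the darker value.
float ShadowForPixel(const ShadowMask& mask, int px, int py) {
    assert(mask.width > 0 && mask.height > 0);
    const int mx = px / 2;  // half-size mask: two screen pixels per texel
    const int my = py / 2;
    const float here = mask.At(mx, my);
    const float lowerRight = mask.At(mx + 1, my + 1);
    return std::min(here, lowerRight);
}
```

Taking the minimum can only darken a pixel, never brighten it, which is why the white fringes disappear and a slight dark fringe can appear instead.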

3. Post-process optimization

The previous post-process chain contained a lot of back-and-forth RT (render target) copy operations (the purple rectangular blocks). After allocating the RTs sensibly and reordering the passes, some RT stretch operations can be removed, which improves post-process efficiency.

4. Ultimate optimization: variable rendering resolution

So far we have used downsampled rendering for individual features to reduce the pixel computation load, but the improvement on mobile platforms was still not obvious, so an ultimate solution was needed. Testing in Photoshop showed that rendering the image at 3/4 size and then scaling it up to full size costs little image quality, while the pixel count drops to nearly 1/2 (0.75 x 0.75 is about 0.56). The performance gain is significant. So I added a scale attribute to the texture manager: every texture except the backbuffer is downsampled, and after rendering completes the result is stretched to the backbuffer.
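
The arithmetic behind the render scale can be checked with a small sketch. The helper names `ScaledTargetSize` and `PixelFraction` are hypothetical; the engine's actual texture manager interface is not shown in this article.

```cpp
#include <cassert>
#include <cmath>

struct Size {
    int width;
    int height;
};

// Size of a render target after applying the global render scale
// (hypothetical helper, not the engine's API).
Size ScaledTargetSize(Size backbuffer, float scale) {
    assert(scale > 0.0f && scale <= 1.0f);
    Size s;
    s.width = static_cast<int>(std::lround(backbuffer.width * scale));
    s.height = static_cast<int>(std::lround(backbuffer.height * scale));
    return s;
}

// Fraction of the full-resolution pixel count that still has to be shaded.
double PixelFraction(Size backbuffer, Size target) {
    return (static_cast<double>(target.width) * target.height) /
           (static_cast<double>(backbuffer.width) * backbuffer.height);
}
```

For a 1280x720 backbuffer, a scale of 0.75 gives a 960x540 target, that is 56.25% of the pixels, and a scale of 0.5 (the half-size SSAO and shadow mask) gives 25%.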

 

Second round to solve new problems

The first round of optimization is over, and so is the fight against resolution. Rendering performance improved by nearly 100%. I added a post on opengpu at that point: performance had indeed gone up, but some experts raised the issue of reduced quality.

1. Add a sharpening pass for low-resolution rendering

It is true that rendering at 3/4 resolution discards nearly half of the pixels, so quality inevitably drops. As tan you said, the result looks like an image that has been through lossy compression. But can this loss be compensated?

After some research, I found that a slight upscale can be compensated by sharpening. So I tried a sharpening algorithm I had used before: blur the image with a weak Gaussian blur, then linearly extrapolate from the blurred result through the source image to strengthen per-pixel contrast and so sharpen the image.

color = lerp(blur, curr, sharpvalue); // sharpvalue is greater than 1
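
To see why a lerp factor above 1 sharpens, here is the same expression for a single channel on the CPU. `Lerp` follows the HLSL definition; the final clamp to [0, 1] is an assumption added for an LDR target and is not part of the shader line above.

```cpp
#include <algorithm>
#include <cassert>

// Same semantics as HLSL lerp: a + (b - a) * t. With t > 1 the result
// overshoots b, moving further away from a.
float Lerp(float a, float b, float t) {
    return a + (b - a) * t;
}

// Sharpen one channel: push the current value away from its blurred version.
float Sharpen(float blurred, float current, float sharpValue) {
    assert(sharpValue >= 1.0f);
    const float v = Lerp(blurred, current, sharpValue);
    return std::min(1.0f, std::max(0.0f, v));  // saturate (assumed LDR output)
}
```

A pixel that equals its blurred neighbourhood is left unchanged, while a pixel brighter than its neighbourhood (for example current 0.6 against blurred 0.4 with a factor of 1.5) is pushed to about 0.7, which increases local contrast.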

2. Use manual mipmaps to fix excessive terrain graininess

Because the terrain's multi-layer blend is computed directly in the shader, and the tiled texture coordinates are produced in the shader with frac, enabling mipmaps causes sampling errors (the texcoord computed by frac is not continuous). Previously, mipmapping was simply disabled for these textures. A simple method solves the problem: using the pixel's linear depth, we can compute the mipmap level to sample manually (avoiding the automatic ddx-based computation, which gives discontinuous values at tile boundaries), and then use tex2Dlod to fetch the corresponding mipmap level.
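
The article does not give the depth-to-mip formula, so the following is only one plausible mapping. It assumes the texel footprint grows roughly linearly with distance, so each doubling of depth moves one mip level down; `mip0Distance` is a hypothetical tuning value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Pick a mip level from linear view depth.
// Assumption: every doubling of depth beyond mip0Distance selects the next
// smaller mip. mip0Distance is the depth up to which mip 0 is used.
float MipFromLinearDepth(float linearDepth, float mip0Distance, int mipCount) {
    assert(mip0Distance > 0.0f && mipCount > 0);
    const float ratio = std::max(linearDepth / mip0Distance, 1.0f);
    const float lod = std::log2(ratio);
    return std::min(lod, static_cast<float>(mipCount - 1));  // clamp to last mip
}
```

Because depth varies smoothly across the screen, the level computed this way has no jumps at tile boundaries, unlike a level derived from the wrapped texture coordinate.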

Then look at the GPA data again.

At this point the earlier bottlenecks have been pushed down into a reasonable range. Rendering time is now mostly concentrated in shadow map generation, the Z pass, and the general pass, which is a reasonable cost distribution for a rendering pipeline.

However, note that the two blocks marked in yellow in the chart take a considerable amount of time, more than SSAO and the shadow mask.

Both belong to the terrain system, the blocks that cover the most screen pixels. Analyzing their shader assembly shows that the number of samples per pixel reaches an astonishing 26 (7 in the Z pass, 19 in the general pass)!

3. Choosing between tex2Dlod and tex2Dgrad

However, the number of tex_ld and tex_ldl samples we actually wrote is not that large. After some searching, it turns out that tex2Dlod, which explicitly specifies the mip level, costs more than tex2D or tex2Dgrad: each call shows up as two sampling instructions in GPA.

Therefore, you can replace tex2Dlod with tex2Dgrad and, instead of specifying the mip level directly, compute the ddx/ddy gradients manually and pass them in. The number of samples per pixel is directly halved.
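
The effect of supplying gradients by hand can be illustrated numerically. The sketch below uses the textbook single-axis mip formula and assumes the manual gradient is taken from the continuous coordinate before frac is applied; the article does not spell out how its gradients are computed, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// HLSL-style frac.
float Frac(float x) {
    return x - std::floor(x);
}

// Mip level from a per-pixel UV gradient and the texture size in texels
// (single-axis textbook formula).
float MipFromGradient(float duPerPixel, float textureSize) {
    const float texelsPerPixel = std::fabs(duPerPixel) * textureSize;
    return std::max(0.0f, std::log2(std::max(texelsPerPixel, 1e-8f)));
}

// Gradient as derived from the wrapped coordinate of two neighbouring pixels.
float WrappedGradient(float u0, float u1) {
    return Frac(u1) - Frac(u0);
}

// Gradient of the continuous coordinate, the kind of value one would pass to
// tex2Dgrad by hand.
float ContinuousGradient(float u0, float u1) {
    return u1 - u0;
}
```

For two neighbouring pixels at u = 0.999 and u = 1.001 on a 512-texel texture, the continuous gradient selects a level close to 0, while the wrapped gradient jumps by almost a full tile and selects a level near 9. That jump is the seam artifact described earlier.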

4. Merge the specular texture into the alpha channel of the diffuse texture

I also found that giving each terrain layer its own specular texture is somewhat wasteful. Simply reduce the specular value to a single channel and store it in the diffuse alpha, which removes the cost of sampling the specular texture.

So far, the number of terrain samples has been reduced from 26 to 9, and the rendering cost of the terrain blocks has been cut to 50%.
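
An offline packing step for the merged texture might look like the sketch below. The Rec. 601 luma weights are an assumption (the article only says the specular value becomes monochrome), and `Rgba8`, `SpecularToMono` and `PackSpecularIntoAlpha` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct Rgba8 {
    std::uint8_t r, g, b, a;
};

// Collapse an RGB specular texel to one channel.
// Assumption: Rec. 601 luma weights; any monochrome conversion would do.
std::uint8_t SpecularToMono(Rgba8 spec) {
    const float luma = 0.299f * spec.r + 0.587f * spec.g + 0.114f * spec.b;
    return static_cast<std::uint8_t>(luma + 0.5f);  // round to nearest
}

// Store the monochrome specular value in the diffuse texel's alpha channel,
// so one texture fetch returns both diffuse color and specular intensity.
Rgba8 PackSpecularIntoAlpha(Rgba8 diffuse, Rgba8 spec) {
    assert(diffuse.a == 255);  // sketch assumes the diffuse alpha was unused
    diffuse.a = SpecularToMono(spec);
    return diffuse;
}
```

In the shader, the terrain layer then reads the color from rgb and the specular intensity from a of the same sample.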

The third round introduces the hybrid rendering pipeline

After the first two rounds of optimization, performance has been squeezed close to the limit. If you do not enable sharpening, the P resolution can already reach FPS or above on gtx560. To optimize further without sacrificing image quality, only a few avenues are left:

  • 1. Reduce DP (draw calls)
  • 2. Reduce shader complexity

For the second point, keeping the rendered result unchanged means a long loop of optimizing and testing.

For the first point, the current rendering pipeline is deferred lighting, the approach CryENGINE 3 uses.

Advantage: it keeps deferred rendering's benefit of decoupling the lighting computation, preserves rich material effects for the main light source, and gives a rendering process with low bandwidth overhead.

Disadvantage: all opaque objects must be rendered twice: once in the Z pass, which outputs normals and linear depth, and once in the general pass, which uses the generated lighting data and the main light source for traditional material shading.

In its GDC 2013 talk, Crytek proposed the concept of hybrid deferred shading, which mixes the previously implemented deferred lighting with traditional deferred shading: common materials use deferred shading and need only one draw call, while complex materials go through the deferred lighting process as before.

So I decided to first introduce a traditional deferred shading implementation, measure its performance, and build an architecture that allows rendering pipelines to be switched flexibly at run time, and only then consider a hybrid approach.

To introduce a deferred shading pipeline that can be switched and mixed, the previous rendering process has to be reworked. The previous process was:

shadowmapgen -> zpass -> ssao -> deferred lighting -> shadowmask -> general pass -> fog/HDR/DOF -> msaa -> output

Traditional deferred shading follows the same overall flow, but it needs different algorithms in the Z pass, deferred lighting, and general pass stages.

So I abstracted each stage as an ibasepipe, wrapped each algorithm in a pipe-derived class using the strategy pattern, and had the main rendering flow call each pipe in turn. Choosing which pipe implementation fills each slot is what assembles the rendering algorithm.

The rendering process therefore becomes:

pipe[shadowmapgen] -> pipe[zpass] -> pipe[ssao] -> pipe[shadowmask] -> pipe[deferred lighting] -> pipe[general pass] -> pipe[postprocess] -> output
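
A minimal sketch of this pipe arrangement is shown below. Only the interface name comes from the article (written here as `IBasePipe`); every method and class name is hypothetical, the pipes just record their names instead of rendering, and for brevity only the Z pass has two interchangeable implementations.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Common interface for one pipeline stage.
class IBasePipe {
public:
    virtual ~IBasePipe() = default;
    virtual void Execute(std::vector<std::string>& trace) = 0;
};

// Stage whose algorithm is the same in both pipeline modes.
class NamedPipe : public IBasePipe {
public:
    explicit NamedPipe(std::string name) : name_(std::move(name)) {}
    void Execute(std::vector<std::string>& trace) override { trace.push_back(name_); }
private:
    std::string name_;
};

// Two interchangeable strategies for the Z pass slot.
class ZPassDeferredLighting : public IBasePipe {
public:
    void Execute(std::vector<std::string>& trace) override {
        trace.push_back("zpass: depth + normal");
    }
};

class ZPassDeferredShading : public IBasePipe {
public:
    void Execute(std::vector<std::string>& trace) override {
        trace.push_back("zpass: depth + normal + albedo");
    }
};

enum class PipelineMode { DeferredLighting, DeferredShading };
using PipeList = std::vector<std::unique_ptr<IBasePipe>>;

// Assemble the stage list; the mode only decides which strategy fills a slot.
PipeList BuildPipeline(PipelineMode mode) {
    PipeList pipes;
    pipes.push_back(std::unique_ptr<IBasePipe>(new NamedPipe("shadowmapgen")));
    if (mode == PipelineMode::DeferredShading) {
        pipes.push_back(std::unique_ptr<IBasePipe>(new ZPassDeferredShading()));
    } else {
        pipes.push_back(std::unique_ptr<IBasePipe>(new ZPassDeferredLighting()));
    }
    for (const char* stage : {"ssao", "shadowmask", "deferred lighting",
                              "general pass", "postprocess"}) {
        pipes.push_back(std::unique_ptr<IBasePipe>(new NamedPipe(stage)));
    }
    return pipes;
}

// The main rendering flow only walks the list; it does not know which
// algorithm each pipe implements.
std::vector<std::string> RenderFrame(const PipeList& pipes) {
    std::vector<std::string> trace;
    for (const auto& pipe : pipes) {
        pipe->Execute(trace);
    }
    return trace;
}
```

Switching pipelines at run time then amounts to rebuilding the list with a different mode, while the frame loop itself stays unchanged.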

Deferred shading also needs a new Z pass and a shader for the final combine pass. And because an independent general pass can no longer be used, some special effects of the earlier materials are necessarily lost.

For the G-buffer, the previous layout was depth | R32F + normal + gloss | RGBA8. Deferred shading needs the object's material information recorded, so at least one layer of albedo color has to be added. The G-buffer therefore gains one more MRT on top of the original layout, albedo diffuse + spec | RGBA8, which outputs the albedo color and a monochrome specular value.

Physical attributes such as Fresnel are not written for now. In the future I will consider compressing the normal data to free up one more channel in the normal target for storing them.

After the rendering process was changed to deferred shading, DP was halved, and the bandwidth pressure on the G-buffer increased by 50%. Overall, however, performance improved by only about 5%.
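
The 50% figure is consistent with the G-buffer layouts described above, as a quick calculation shows. The constants follow the formats named in the article; the helper names are hypothetical, and the 960x540 size used in the note below assumes the 0.75 render scale on a 1280x720 backbuffer.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per pixel of each G-buffer target named in the article.
constexpr int kDepthR32F = 4;         // depth | R32F
constexpr int kNormalGlossRGBA8 = 4;  // normal + gloss | RGBA8
constexpr int kAlbedoSpecRGBA8 = 4;   // albedo diffuse + spec | RGBA8 (new MRT)

constexpr int kDeferredLightingBpp = kDepthR32F + kNormalGlossRGBA8;         // 8
constexpr int kDeferredShadingBpp = kDeferredLightingBpp + kAlbedoSpecRGBA8; // 12

// Total G-buffer size in bytes for one frame at the given resolution.
std::int64_t GBufferBytes(int width, int height, int bytesPerPixel) {
    assert(width > 0 && height > 0 && bytesPerPixel > 0);
    return static_cast<std::int64_t>(width) * height * bytesPerPixel;
}

// Relative increase from `before` to `after` (0.5 means +50%).
double RelativeGrowth(std::int64_t before, std::int64_t after) {
    return static_cast<double>(after - before) / static_cast<double>(before);
}
```

At 960x540 the G-buffer grows from 4,147,200 bytes (8 bytes per pixel) to 6,220,800 bytes (12 bytes per pixel), an increase of exactly 50%.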

Unfortunately, deferred shading requires more uniform material attribute settings. As a result, materials whose properties were set up for deferred lighting render differently under deferred shading, as the following table shows. Considering that the improvement is not significant, the deferred lighting rendering pipeline remains the default.

Mission accomplished

With the task complete, here is a summary of the final performance.

Rendering Configuration:

1280x720 resolution, 359,000 triangles, 362 draw calls

All special effects enabled

Render scale 0.75; SSAO and shadow mask downsampled by one level (half size)

Test results:

Test platform                              Frame rate   Frame time
Intel i5 2500K & NVIDIA GTX 560            241 fps      4.14 ms
Intel i7 3720QM & NVIDIA gt0000m           140 fps      7.14 ms
Intel i5 2500K & Intel HD Graphics 3000    30 fps       33.33 ms

Comparison of results:

The bottom two renders use the full rendering size with full-size shadow mask and SSAO. They essentially represent the rendering quality at the start of the optimization.
