Optimize Triangle Mesh Vertex

Source: Internet
Author: User

For personal use only, do not reprint, do not use for any commercial purposes.

For the same triangle mesh, is it faster to render with a TriangleList or a TriangleStrip, and where does the speed come from? Most people think that the strip is faster because it makes better use of the cache. That answer is actually wrong. For a mesh arranged in the same way, the only advantage of a strip is that it needs fewer indices: n + 2 indices for n triangles, whereas a list needs n * 3. Less data does mean lower bandwidth usage, but in practice rendering speed is barely affected. For either a list or a strip, the index data itself is small compared with the vertex data, so this minor advantage is hardly noticeable.
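To make the index counts concrete, here is a minimal Python sketch. It assumes the whole mesh can be drawn as one unbroken strip and that indices are 16 bits wide. The function names are hypothetical and exist only for illustration.

```python
def index_count(num_triangles, topology):
    """Number of indices needed to draw num_triangles triangles."""
    if topology == "strip":
        # One unbroken strip: the first triangle needs 3 indices,
        # and every later triangle adds a single index.
        return num_triangles + 2
    if topology == "list":
        # A list stores all three corners of every triangle.
        return num_triangles * 3
    raise ValueError(f"unknown topology: {topology}")


def index_bytes(num_triangles, topology, bytes_per_index=2):
    """Index buffer size in bytes (16-bit indices assumed by default)."""
    return index_count(num_triangles, topology) * bytes_per_index


if __name__ == "__main__":
    n = 10_000
    for topology in ("strip", "list"):
        print(topology, index_count(n, topology), index_bytes(n, topology))
```

For 10,000 triangles this gives 10,002 indices for the strip and 30,000 for the list, a difference of roughly 40 KB with 16-bit indices.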

Can the same mesh render at different speeds depending on how it is submitted? If so, what makes the difference? The answer is obviously yes; otherwise this article would not exist. What determines the rendering efficiency of a mesh is the order of its indices, that is, the order in which its triangles are arranged. A good order can significantly reduce the amount of vertex and pixel processing.

This comes down to the hardware architecture. In addition to the well-known video memory on the graphics card, there is a very fast cache on the GPU chip itself. Depending on the resource type, this cache is divided into a texture cache, a vertex cache, and so on. Here we focus on the vertex cache.

[Figure: the GPU pipeline before the NVIDIA G80 architecture]

The pipeline has two vertex caches (vcache): one called the pre-T&L cache and the other called the post-T&L cache. The pre-T&L cache stores vertices waiting to be processed, and the post-T&L cache stores vertices after transformation (that is, after the vertex shader). The existence of these two caches is what makes optimization possible. Both caches are FIFO queues. Assume that you want to render a regular triangle mesh:

[Figure: regular triangle mesh with numbered vertices]

The vertex indices are:

1, 2, 15 | 15, 2, 16 | 2, 3, 16 | 16, 3, 17 | ... | 15, 16, 29 | 29, 16, 30 | ...

The vcache state evolves as follows:

1, 2, 15 ---> 1, 2, 15, 16 ---> 1, 2, 15, 16, 3 ---> 1, 2, 15, 16, 3, 17 ---> ... ---> xx, 15, 16, 29 ---> xx, xx, 15, 16, 29, 30 ---> ...

The same vertex appears only once in the vcache. When the second triangle is rendered, vertices 15 and 2 are already in the pre-T&L cache, so the GPU does not need to fetch them again; it only needs to load vertex 16, which reduces the bandwidth requirement and the vertex fetch latency. But that is not the most important benefit. The larger speedup comes when the GPU is about to process vertices 15 and 2 and finds that both already exist in the post-T&L cache: it skips the vertex shader for them completely and reuses the cached results. In a closed mesh, a vertex is usually shared by 5 to 6 triangles (see vertex 16). With an ideal index order such a vertex is processed only once; in the worst case it is processed 6 times, which is a huge difference. Because the vcache capacity is limited, however, by the time the second row of triangles is rendered, vertices 15 and 16 have already been pushed out of the cache, so the GPU has to compute them again.
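The effect of cache capacity can be illustrated with a small simulation. This is a minimal sketch, not a model of any real GPU: it assumes a plain FIFO post-T&L cache, a grid with 14 vertices per row numbered from 1, and the row-by-row triangle order described above. The function names are hypothetical.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size):
    """Count vertex shader runs for an index stream under a FIFO cache."""
    cache = deque()
    runs = 0
    for v in indices:
        if v in cache:
            continue  # hit: reuse the cached result (a FIFO is not reordered)
        runs += 1  # miss: the vertex has to be transformed
        cache.append(v)
        if len(cache) > cache_size:
            cache.popleft()  # evict the oldest entry
    return runs


def grid_triangle_list(cols, rows):
    """Row-by-row triangle list for a grid of cols x rows vertices (1-based)."""
    indices = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            a = r * cols + c + 1  # top-left vertex of the current quad
            indices += [a, a + 1, a + cols, a + cols, a + 1, a + cols + 1]
    return indices


if __name__ == "__main__":
    indices = grid_triangle_list(14, 3)  # 42 vertices, two rows of triangles
    for size in (24, 32):
        print(size, count_vertex_shader_runs(indices, size))
```

Under these assumptions a 24-entry cache runs the vertex shader 56 times for 42 vertices, because the whole middle row is evicted before it is reused, while a 32-entry cache runs it 42 times, once per vertex.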

Now that we know how the vcache works, how can we use it to work out the optimal order of triangle indices? Take the regular triangle mesh above as an example. The ideal index order looks like this:

1, 2, 3 | 4, 5, 6 | 7, 8, 9 | 10, 11, 12 | ..., 14 | + | 1, 2, 15 | 15, 2, 16 | 2, 3, 16 | ...

For ease of discussion, I have written the indices in two sections. If you look carefully, you will see that the first section is actually a series of degenerate triangles. They do not draw anything; they just load all the vertices of the first row into the cache. When the actual triangles are then rendered, the vertices of each row are loaded into the vcache in sequence:

Vcache state after the degenerate triangles: the vertices of the first row;
Vcache state after the first row of triangles: ..., 22, 23, ..., 29;
Vcache state after the second row of triangles: 15, 16, 17, 18, 19, 20, 21, ..., 29, 30, 31, 32, ..., 44
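The priming idea can be sketched in a few lines of Python. This is an illustration under stated assumptions: a plain FIFO cache with 28 entries, a grid with 14 vertices per row numbered from 1, and a padding rule (repeating the last vertex) that is my own choice for completing the final degenerate triangle. All function names are hypothetical.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size):
    """Count vertex shader runs for an index stream under a FIFO cache."""
    cache = deque()
    runs = 0
    for v in indices:
        if v in cache:
            continue  # hit: the transformed vertex is reused
        runs += 1
        cache.append(v)
        if len(cache) > cache_size:
            cache.popleft()  # evict the oldest entry
    return runs


def priming_triangles(cols):
    """Degenerate triangles that touch every vertex of the first row.

    All vertices lie on one row, so every triangle has zero area.
    Padding by repeating the last vertex is an assumption of this sketch.
    """
    row = list(range(1, cols + 1))
    while len(row) % 3:
        row.append(row[-1])
    return row


def grid_triangle_list(cols, rows):
    """Row-by-row triangle list for a grid of cols x rows vertices (1-based)."""
    indices = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            a = r * cols + c + 1  # top-left vertex of the current quad
            indices += [a, a + 1, a + cols, a + cols, a + 1, a + cols + 1]
    return indices


if __name__ == "__main__":
    cols, rows = 14, 3
    primed = priming_triangles(cols) + grid_triangle_list(cols, rows)
    print(count_vertex_shader_runs(primed, 28), "runs for", cols * rows, "vertices")
```

With these settings the simulated cache transforms each of the 42 vertices exactly once.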

Obviously, this is the ideal index order: every vertex is fully reused. The real situation is much more complicated, though. First, the example assumes that the vcache can hold at least 28 vertices. Second, the algorithm depends heavily on the vcache size. Finally, it only works for regular meshes (which makes it suitable for terrain).

For the first problem, you can use IDirect3DQuery9 with the VCACHE query to obtain the actual vcache size; however, ATI cards never supported this query. DX10 had a similar query in its beta, but unfortunately it was removed in later versions. What we do know is that on NVIDIA GeForce 4 to 7 series cards the vcache holds at least 24 vertices, and on the 8 series at least 32. On ATI cards, apart from the DX10 generation, most vcaches hold only 14. When the vcache holds only 24 vertices, computing the indices for the mesh above becomes a little more complicated; one feasible approach is to split the mesh into two columns and render them one after the other.

For complex models, people have already invented many excellent algorithms, such as the OptimizeMesh function in DX. Forsyth also devised an algorithm that is independent of the vcache size. Finally, because triangle list indices are more flexible, it is easier to arrange an optimal triangle order with a list.
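The column-splitting approach mentioned above can be tried in the same kind of toy simulation. The sketch below assumes a plain FIFO cache with 24 entries and a grid with 14 vertices per row; the band width of 7 quads is an arbitrary choice for illustration, and the function names are hypothetical.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size):
    """Count vertex shader runs for an index stream under a FIFO cache."""
    cache = deque()
    runs = 0
    for v in indices:
        if v in cache:
            continue  # hit: the transformed vertex is reused
        runs += 1
        cache.append(v)
        if len(cache) > cache_size:
            cache.popleft()  # evict the oldest entry
    return runs


def banded_triangle_list(cols, rows, band_quads):
    """Triangle list that walks the grid in vertical bands of band_quads quads.

    Inside each band the triangles are emitted row by row, so a row of the
    band is short enough to survive in a small cache until it is reused.
    """
    indices = []
    for start in range(0, cols - 1, band_quads):
        stop = min(start + band_quads, cols - 1)
        for r in range(rows - 1):
            for c in range(start, stop):
                a = r * cols + c + 1  # top-left vertex of the current quad
                indices += [a, a + 1, a + cols, a + cols, a + 1, a + cols + 1]
    return indices


if __name__ == "__main__":
    cols, rows, cache_size = 14, 3, 24
    full_rows = banded_triangle_list(cols, rows, cols - 1)  # one band = plain row order
    two_bands = banded_triangle_list(cols, rows, 7)
    print("row by row:", count_vertex_shader_runs(full_rows, cache_size))
    print("two bands: ", count_vertex_shader_runs(two_bands, cache_size))
```

In this model the plain row-by-row order needs 56 vertex shader runs with a 24-entry cache, while the two-band order needs 42. Both orders draw the same triangles; only the order changes.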

The above covers only one aspect of mesh optimization, called vertex-level optimization. There is also pixel-level optimization, which has emerged in recent years along with early-z. Because that algorithm is complex, only the basic principle is introduced here. Within the same mesh, suppose two faces A and B face the same direction but one occludes the other. If A occludes B and A is rendered first, the pixels produced by B are rejected by early-z, which avoids useless pixel shader work.
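The early-z principle can be shown with a one-dimensional depth buffer. This is a deliberately simplified sketch: each face is a flat span of pixels at a constant depth, smaller depth means closer to the camera, and the pixel shader is assumed to run only for pixels that pass the depth test. The names are hypothetical.

```python
def count_pixel_shader_runs(faces, width):
    """Count pixel shader runs when faces are drawn in the given order.

    Each face is a tuple (depth, x_start, x_end) covering pixels
    x_start .. x_end - 1 at a constant depth.
    """
    depth_buffer = [float("inf")] * width
    runs = 0
    for depth, x_start, x_end in faces:
        for x in range(x_start, x_end):
            if depth < depth_buffer[x]:  # early-z test passes
                depth_buffer[x] = depth
                runs += 1  # the pixel shader runs for this pixel
            # otherwise the pixel is rejected before shading
    return runs


if __name__ == "__main__":
    face_a = (1.0, 0, 8)  # nearer face, occludes B
    face_b = (2.0, 0, 8)  # farther face, fully hidden behind A
    print("A then B:", count_pixel_shader_runs([face_a, face_b], 8))
    print("B then A:", count_pixel_shader_runs([face_b, face_a], 8))
```

Drawing A first shades 8 pixels; drawing B first shades 16, because every pixel of B is shaded and then overwritten by A.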

The benefit of vertex optimization is obvious: performance can improve significantly without modifying the existing program. So how do we actually optimize a mesh? Fortunately, tools for this already exist. The OptimizeMesh function mentioned above and NVIDIA's NvTriStrip tool both perform vertex-level optimization, while ATI's Tootle not only performs vertex optimization but also uses more advanced algorithms to optimize for occlusion within the model.

Note that most of the discussion above is based on DX9-level hardware. On DX10-level graphics cards, the hardware has a better caching mechanism thanks to the unified architecture; even so, optimizing vertex order during model preprocessing is still worthwhile. Back to the opening question: whether you use a list or a strip, the index storage format itself makes little difference. What matters is how the indices are organized. Because a list is easier to optimize, an optimized list is usually faster on modern hardware.

 
