Translation: Clayman
Clayman_joe@yahoo.com.cn
For personal use only; do not use for any commercial purposes. For more information, contact the author ^_^.
Note: This article is a translation of Chapter 3 of GPU Gems 2. The translator's ability is limited; for details, refer to the original article.
In interactive programs, one of the most important ways to enrich the user experience is to present a world full of interesting objects. From countless blades of grass and trees to everyday clutter, all of these improve the final image and help users maintain their "suspension of disbelief". Only when users believe in the world and feel immersed in it will they become emotionally attached to it, which is the Holy Grail of game development.
From the rendering point of view, achieving this effect comes down to rendering a large number of small objects. In general, these objects are similar to each other, with only small differences in color, position, and orientation. For example, all trees in a forest have similar geometry but differ in color and height. To the user, a forest made of many unique-looking trees feels authentic, so they believe in it and their gaming experience is enriched.
However, rendering a large number of small objects, each made of only a few polygons, with current GPUs and graphics libraries carries a large performance cost. Graphics APIs such as Direct3D and OpenGL are not designed to render objects with only a few polygons thousands of times per frame. This article discusses how to use Direct3D to render many unique instances of the same geometry. Black & White 2 is an example of a game that uses this technique.
3.1 Why Geometry Instancing
In Direct3D, submitting triangle data to the GPU is a relatively slow operation. Wloka 2003 shows that with Direct3D a 1 GHz CPU can render only 10,000 to 40,000 batches per second. For modern CPUs, this value can be expected to lie between 30,000 and 120,000 batches per second (about 1,000 to 4,000 batches per frame at 30 frames per second). This is too few! It means that if we want to render a forest and submit one tree per batch, we cannot render more than 4,000 trees no matter how many polygons each tree contains, and even then the CPU would have no time left for other tasks. Of course we do not want this situation. In the application, we want to minimize render state and texture changes and, at the same time, render the same triangles many times in one batch with a single Direct3D method call. This reduces the CPU time spent submitting batches and leaves CPU resources for physics, AI, and other systems.
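The per-frame figures above follow from a simple division. As a minimal sketch (the function names are hypothetical and not part of the original article), the batch budget and the number of batches needed for a given instance count can be computed like this:

```cpp
#include <cassert>

// Hypothetical helper: how many batches fit into one frame, given a measured
// CPU submission rate (batches per second) and a target frame rate.
unsigned int BatchesPerFrame(unsigned int batchesPerSecond,
                             unsigned int framesPerSecond)
{
    assert(framesPerSecond > 0);
    return batchesPerSecond / framesPerSecond;
}

// Hypothetical helper: number of batches needed to draw instanceCount objects
// when each batch can hold instancesPerBatch of them (rounded up).
unsigned int BatchesNeeded(unsigned int instanceCount,
                           unsigned int instancesPerBatch)
{
    assert(instancesPerBatch > 0);
    return (instanceCount + instancesPerBatch - 1) / instancesPerBatch;
}
```

With the numbers quoted in the text, 30,000 to 120,000 batches per second at 30 frames per second gives 1,000 to 4,000 batches per frame. Drawing 4,000 trees one per batch uses that entire budget, while packing, say, 40 trees into each batch would need only 100 batches.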
3.2 Definitions
Let's first define a series of concepts related to geometry instancing.
3.2.1 Geometry Packet
A geometry packet is a description of a packet of geometry to be instanced: a collection of vertices and indices. A geometry packet can be described in terms of vertex information, including position, texture coordinates, normal, tangent space, and bone information for skinning, together with indices into the vertex stream. Such a description maps directly to an efficient way of submitting geometry.
A geometry packet is an abstract description of a piece of geometry in model space, independent of the current rendering context.
The following is a possible description of a geometry packet. It contains not only the geometry information but also the bounding sphere of the object:
struct GeometryPacket
{
    Primitive mPrimType;
    void* mVertices;
    unsigned int mVertexStride;
    unsigned short* mIndices;
    unsigned int mVertexCount;
    unsigned int mIndexCount;
    D3DXVECTOR3 mSphereCentre;
    float mSphereRadius;
};
3.2.2 Instance Attributes
For each instance, typical attributes include the model-to-world transformation matrix, the instance color, and an animation player that provides the bones used to skin the geometry packet.
struct InstanceAttributes
{
    D3DXMATRIX mModelMatrix;
    D3DCOLOR mInstanceColor;
    AnimationPlayer* mAnimationPlayer;
    unsigned int mLevels;
};
3.2.3 Geometry Instance
A geometry instance is a geometry packet combined with a specific set of attributes. It ties a geometry packet directly to the instance attributes used to render it, and so contains the complete description of the object to be submitted to the GPU.
struct GeometryInstance
{
    GeometryPacket* mGeometryPacket;
    InstanceAttributes mInstanceAttributes;
};
3.2.4 Render and Texture Context
The render context is the current GPU render state (such as alpha blending and testing states, and the active render target). The texture context is the set of currently active textures. Render and texture contexts are usually modularized as classes.
class RenderContext
{
public:
    // Begin the render context and make its render state active
    void Begin(void);
    // End the render context and restore previous render states if necessary
    void End(void);
private:
    // Any description of the current render state and pixel and vertex shaders.
    // The D3DX effect framework is especially useful here
    ID3DXEffect* mEffect;
    // Application-specific render states
    // ...
};
class TextureContext
{
public:
    // Set current textures to the appropriate texture stages
    void Apply(void) const;
private:
    Texture mDiffuseMap;
    Texture mLightMap;
    // ...
};
3.2.5 Geometry Batch
A geometry batch is a collection of geometry instances together with the render states and texture context used to render that collection. To simplify the class design, it usually maps directly to a single DrawIndexedPrimitive() call. The following is an abstract interface for a geometry batch class:
class GeometryBatch
{
public:
    // Remove all instances from the geometry batch
    virtual void ClearInstances(void);
    // Add an instance to the collection and return its ID. Return -1 if it can't accept more instances.
    virtual int AddInstance(GeometryInstance* instance);
    // Commit all instances, to be called once before the render loop begins and after every change to the instances collection
    virtual unsigned int Commit(void) = 0;
    // Update the geometry batch, eventually prepare GPU-specific data ready to be submitted to the driver, fill vertex and
    // index buffers as necessary, to be called once per frame
    virtual void Update(void) = 0;
    // Submit the batch to the driver, typically implemented with a call to DrawIndexedPrimitive
    virtual void Render(void) const = 0;
private:
    GeometryInstancesCollection mInstances;
};
3.3 Implementation
The engine's renderer uses geometry instancing only through the abstract GeometryBatch interface, which hides the specific instancing implementation while providing services to manage instances, update data, and render batches. This lets the engine concentrate on sorting batches to minimize render and texture state changes, while GeometryBatch takes care of the actual implementation and of communicating with Direct3D.
The pseudocode below implements a simple rendering loop:
// Update phase
Foreach GeometryBatch in ActiveBatchesList
    GeometryBatch.Update();
// Render phase
Foreach RenderContext
Begin
    RenderContext.BeginRendering();
    RenderContext.CommitStates();
    Foreach TextureContext
    Begin
        TextureContext.Apply();
        Foreach GeometryBatch in the texture context
            GeometryBatch.Render();
    End
End
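The loop above can be sketched in C++ with stand-in types. Everything here is an assumption made for illustration: MockBatch only counts calls, the integer context IDs replace real RenderContext and TextureContext objects, and the grouping by std::map is just one simple way to sort batches by state.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a geometry batch: it only counts calls.
struct MockBatch
{
    MockBatch(int renderContext, int textureContext)
        : renderContextId(renderContext), textureContextId(textureContext),
          updates(0), renders(0) {}
    void Update() { ++updates; }
    void Render() { ++renders; }
    int renderContextId;
    int textureContextId;
    int updates;
    int renders;
};

struct FrameStats
{
    int renderContextChanges;
    int textureContextChanges;
};

// Update every batch once, then render all batches 'passes' times, grouped by
// render context and texture context so that state changes are minimized.
FrameStats RenderFrame(std::vector<MockBatch>& batches, int passes)
{
    assert(passes > 0);
    for (MockBatch& batch : batches)
        batch.Update();                              // update phase

    // render context -> texture context -> batches using both
    std::map<int, std::map<int, std::vector<MockBatch*> > > groups;
    for (MockBatch& batch : batches)
        groups[batch.renderContextId][batch.textureContextId].push_back(&batch);

    FrameStats stats = { 0, 0 };
    for (int pass = 0; pass < passes; ++pass)        // render phase
    {
        for (auto& renderGroup : groups)
        {
            ++stats.renderContextChanges;            // BeginRendering / CommitStates
            for (auto& textureGroup : renderGroup.second)
            {
                ++stats.textureContextChanges;       // TextureContext.Apply
                for (MockBatch* batch : textureGroup.second)
                    batch->Render();
            }
        }
    }
    return stats;
}
```

With two passes (for example a shadow pass and a main pass), each batch is updated once and rendered twice, and the number of state changes depends only on the number of distinct contexts, not on the number of batches.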
To update all batches once and render them multiple times, the update and render phases should be kept separate. This is particularly useful when rendering shadow maps or water-surface reflection and refraction. We discuss four implementations of GeometryBatch here and analyze the performance characteristics of each technique by comparing memory footprint and flexibility.
Here is a rough summary:
* Static batching: the fastest way to instance geometry. Each instance is transformed once to world space, its attributes are applied, and it is then submitted to the GPU. Static batches are simple, but they are also the least flexible.
* Dynamic batching: the slowest way to instance geometry. Every frame, each instance is transformed, has its attributes applied, and is streamed to the GPU. Dynamic batches fully support skinning and are also the most flexible.
* Vertex constants instancing: a hybrid implementation. The geometry of each instance is copied multiple times and uploaded to GPU memory once. Instance attributes are reset every frame through vertex constants, and a vertex shader completes the geometry instancing.
* Batching with the geometry instancing API: using the geometry instancing API provided by DirectX 9, which has full hardware support on GeForce 6 Series GPUs. This is an efficient and highly flexible way to instance geometry. Unlike the other methods, it does not need to copy the geometry packet multiple times into the Direct3D vertex stream.
3.3.1 Static Batching
With static batching, we want to transform all instances once and copy them into a static vertex buffer. The biggest advantages of static batching are its efficiency and the fact that almost every GPU on the market supports it.
To implement static batching, first create a vertex buffer (and an index buffer, of course) to hold the transformed geometry. Make sure this buffer is large enough to store all the instances we want to handle. Because we fill the buffer only once and never modify it again, we can use the D3DUSAGE_WRITEONLY flag in Direct3D to hint the driver to place the buffer in the fastest available video memory:
HRESULT res;
res = lpDevice->CreateVertexBuffer(MAX_STATIC_BUFFER_SIZE, D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &mStaticVertexStream, 0);
ENGINE_ASSERT(SUCCEEDED(res));
Depending on the application type or the engine's memory management scheme, the buffer can be created with either the D3DPOOL_MANAGED or the D3DPOOL_DEFAULT flag.
Next, implement the Commit() method. It fills the vertex and index buffers with the transformed geometry data to be rendered. The pseudocode of the Commit() method is as follows:
Foreach GeometryInstance in Instances
Begin
    Transform geometry in mGeometryPacket to world space with instance mModelMatrix
    Apply other instance attributes (like instance color)
    Copy transformed geometry to the vertex buffer
    Copy indices (with the right offset) to the index buffer
    Advance current pointer to the vertex buffer
    Advance current pointer to the index buffer
End
Now we only need to submit the prepared data with the DrawIndexedPrimitive() method. The implementations of the Update() and Render() methods are very simple and are not discussed here.
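The Commit() steps can be sketched on the CPU with plain containers instead of Direct3D buffers. This is only an illustration under stated assumptions: Vec3, Matrix4, Packet, Instance, and StaticBuffers are simplified hypothetical types, only positions are transformed, and the matrix uses the row-vector convention (translation in the fourth row).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Matrix4 { float m[4][4]; };   // row-vector convention: p' = p * M

// Transform a point by a model matrix (w is assumed to be 1).
Vec3 TransformPoint(const Vec3& p, const Matrix4& M)
{
    return { p.x * M.m[0][0] + p.y * M.m[1][0] + p.z * M.m[2][0] + M.m[3][0],
             p.x * M.m[0][1] + p.y * M.m[1][1] + p.z * M.m[2][1] + M.m[3][1],
             p.x * M.m[0][2] + p.y * M.m[1][2] + p.z * M.m[2][2] + M.m[3][2] };
}

// Helper for building a pure translation matrix.
Matrix4 Translation(float x, float y, float z)
{
    Matrix4 M = {{ { 1.0f, 0.0f, 0.0f, 0.0f },
                   { 0.0f, 1.0f, 0.0f, 0.0f },
                   { 0.0f, 0.0f, 1.0f, 0.0f },
                   { x,    y,    z,    1.0f } }};
    return M;
}

struct Packet { std::vector<Vec3> vertices; std::vector<unsigned short> indices; };
struct Instance { const Packet* packet; Matrix4 model; };
struct StaticBuffers { std::vector<Vec3> vertices; std::vector<unsigned short> indices; };

// Transform every instance to world space once and append it to shared buffers.
StaticBuffers CommitStatic(const std::vector<Instance>& instances)
{
    StaticBuffers out;
    for (const Instance& inst : instances)
    {
        // Indices of this instance are offset by the vertices already written.
        const std::size_t base = out.vertices.size();
        assert(base + inst.packet->vertices.size() <= 65536);  // 16-bit indices
        for (const Vec3& v : inst.packet->vertices)
            out.vertices.push_back(TransformPoint(v, inst.model));
        for (unsigned short index : inst.packet->indices)
            out.indices.push_back(static_cast<unsigned short>(base + index));
    }
    return out;
}
```

The index offset is the important detail: without it, every instance would refer back to the vertices of the first one.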
Static batching is the fastest way to render a large number of instances, and a single batch can contain different types of geometry packets, but it also has some serious limitations:
* Large memory footprint: depending on the size of the geometry packets and the number of instances to render, memory usage can grow large. For large scenes, we should reserve the space required by the geometry in advance. Falling back to AGP memory is possible but should be avoided as much as possible.
* No support for different levels of detail: because all instances are copied into the vertex buffer once at commit time, it is difficult to choose an effective level of detail for every situation, and the polygon budget is poorly used as a result. A semi-static method can work around this: we put all levels of detail of a given instance in the vertex buffer and choose different index values each frame to select the correct level of detail. However, this makes the implementation clumsy and defeats the original purpose of the method, which is to be simple and efficient.
* No support for skinning.
* No direct support for moving instances: for efficiency, instance movement should be implemented with vertex shader logic and dynamic batches. The resulting solution is essentially vertex constants instancing.
The next method removes these limitations, gaining flexibility at the expense of rendering speed.
3.3.2 Dynamic Batching
Dynamic batching overcomes the limitations of static batching at the cost of lower rendering efficiency. Its biggest advantage is the same as that of static batching: it also works on GPUs without an advanced programmable pipeline.
First, create a vertex buffer (and the corresponding index buffer) with the D3DUSAGE_DYNAMIC and D3DPOOL_DEFAULT flags. These flags ensure that the buffer is placed in the memory location best suited to our dynamic updates.
HRESULT res;
res = lpDevice->CreateVertexBuffer(MAX_DYNAMIC_BUFFER_SIZE, D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY, 0, D3DPOOL_DEFAULT, &mDynamicVertexStream, 0);
It is important to select the correct MAX_DYNAMIC_BUFFER_SIZE value. There are two strategies for choosing it:
* Select a value large enough to hold all possible instances in each frame.
* Select a value large enough to hold a certain number of instances.
The first strategy keeps the updating and rendering of batches independent to a certain extent. Updating a batch means streaming all its data into the dynamic buffer, while rendering only submits the geometry with the DrawIndexedPrimitive() method. This strategy can occupy a large amount of graphics memory (video memory or AGP memory), and in the worst case it becomes unreliable, because we cannot guarantee that the buffer is large enough throughout the life of the application.
The second strategy requires interleaving geometry streaming and rendering: when the dynamic buffer is full, the geometry is submitted for rendering and the buffer contents are discarded, ready for more instances to be streamed. For best performance it is very important to use the correct locking flags: lock the dynamic buffer with D3DLOCK_DISCARD at the start of each batch of instances, and with D3DLOCK_NOOVERWRITE for each new instance to be streamed. The disadvantage of this strategy is that the buffer must be locked again and the geometry streamed again every time a batch needs to be rendered, for example when rendering shadows.
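The interleaving of streaming and rendering in the second strategy can be modelled on the CPU. The class below is a hypothetical sketch, not Direct3D code: it only tracks how full a fixed-size buffer is and reports which lock mode would be appropriate for each write, with LockMode values standing in for D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE.

```cpp
#include <cassert>
#include <cstddef>

// Lock modes of this sketch; they mirror D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE.
enum class LockMode { Discard, NoOverwrite };

// Hypothetical CPU-side model of a fixed-size dynamic vertex buffer.
class StreamingBuffer
{
public:
    explicit StreamingBuffer(std::size_t capacity)
        : mCapacity(capacity), mUsed(0), mDrawCalls(0) {}

    // Reserve room for one instance and report which lock mode to use.
    LockMode Append(std::size_t vertexCount)
    {
        assert(vertexCount <= mCapacity);
        if (mUsed + vertexCount > mCapacity)
            Flush();  // buffer full: render what was streamed, then start over
        const LockMode mode = (mUsed == 0) ? LockMode::Discard
                                           : LockMode::NoOverwrite;
        mUsed += vertexCount;
        return mode;
    }

    // Stand-in for DrawIndexedPrimitive() followed by discarding the contents.
    void Flush()
    {
        if (mUsed == 0)
            return;
        ++mDrawCalls;
        mUsed = 0;
    }

    std::size_t Used() const { return mUsed; }
    unsigned int DrawCalls() const { return mDrawCalls; }

private:
    std::size_t mCapacity;
    std::size_t mUsed;
    unsigned int mDrawCalls;
};
```

The first write into an empty buffer uses the discard mode, subsequent writes append without overwriting, and a draw is issued whenever the next instance would not fit. A final Flush() at the end of the frame submits whatever remains.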
Choose between the two strategies based on the application type and its specific requirements. Here, for simplicity and clarity, we choose the first one, but add a little complexity: dynamic batching naturally supports skinning, so we implement that along the way.
The Update() method is similar to the Commit() method discussed in 3.3.1, but it must be executed every frame. Here is the pseudocode:
Foreach GeometryInstance in Instances
Begin
    Transform geometry in mGeometryPacket to world space with instance mModelMatrix
    If instance needs skinning, request a set of bones from mAnimationPlayer and skin geometry
    Apply other instance attributes (like instance color)
    Copy transformed geometry to the vertex buffer
    Copy indices (with the right offset) to the index buffer
    Advance current pointer to the vertex buffer
    Advance current pointer to the index buffer
End
In this case, the Render() method simply calls the DrawIndexedPrimitive() method.
Some of the images are from the Internet; they are the same as the illustrations in the original book.
Thank you very much for redrawing the flowcharts in this document.
Blog: http://blog.csdn.net/soilwork
3.3.3 Vertex Constants Instancing
In the vertex constants instancing method, we use vertex constants to store instance attributes. In terms of rendering performance, vertex constants batches are very fast and support moving instances, but these features come at the expense of flexibility.
The main limitations of this method are:
* The number of instances per batch is limited by the number of constants available. Generally, a single draw call cannot handle more than 50 to 100 instances. However, this is enough to reduce the CPU load of issuing draw calls.
* Skinning is not supported: all vertex constants are used to store instance attributes.
* It requires hardware that supports vertex shaders.
First, prepare a static vertex buffer (and an index buffer) that stores multiple copies of the same geometry packet. Each copy is stored in model space and corresponds to one instance in the batch.
The original vertex format must be updated to add an integer index to each vertex. This value is constant for each instance and indicates which instance a particular copy of the geometry packet belongs to. This is similar to palette skinning, where each vertex contains indices pointing to the one or more bones that affect it.
The updated vertex format is as follows:
struct InstanceVertex
{
    D3DXVECTOR3 mPosition;
    // Other vertex properties ...
    WORD mInstanceIndex[4]; // Direct3D requires SHORT4
};
After all instances have been added to the geometry batch, the Commit() method prepares the vertex buffer with the layout described above.
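As a rough sketch of what such a Commit() step produces, the function below replicates one geometry packet per instance and stamps each copy with its instance index. The types and names are hypothetical simplifications (positions only, std::vector instead of a Direct3D vertex buffer). Following the shader comment in this section, the stored index is assumed to be premultiplied by the five constants each instance uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each instance is assumed to use five vertex constants:
// four for the model matrix and one for the color.
const unsigned int kConstantsPerInstance = 5;

struct PacketVertex { float x, y, z; };

struct InstanceVertex
{
    float x, y, z;                    // position, still in model space
    unsigned short instanceIndex[4];  // only element 0 is used here
};

// Copy the packet once per instance; every vertex of copy i stores the index
// of the first constant that belongs to instance i.
std::vector<InstanceVertex> BuildInstancedVertices(
    const std::vector<PacketVertex>& packet, unsigned int instanceCount)
{
    std::vector<InstanceVertex> out;
    out.reserve(packet.size() * instanceCount);
    for (unsigned int i = 0; i < instanceCount; ++i)
    {
        assert(i * kConstantsPerInstance <= 65535u);  // must fit in 16 bits
        const unsigned short firstConstant =
            static_cast<unsigned short>(i * kConstantsPerInstance);
        for (const PacketVertex& v : packet)
        {
            const InstanceVertex iv = { v.x, v.y, v.z, { firstConstant, 0, 0, 0 } };
            out.push_back(iv);
        }
    }
    return out;
}
```

The index buffer would be replicated in the same way, with each copy offset by the number of vertices written before it.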
Next, the attributes of each instance to be rendered are loaded. We assume that the attributes consist only of the model matrix, which describes the instance position and orientation, and the instance color.
GPUs of the DirectX 9 generation provide up to 256 vertex constants; we use 200 of them to store instance attributes. In our example, each instance needs four constants for the model matrix and one constant for the color. Each instance therefore needs five constants, so each batch can contain up to 40 instances.
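The constant budget works out as a simple division. The helpers below are hypothetical and only restate that arithmetic, together with the register at which a given instance's data would start:

```cpp
#include <cassert>

// Hypothetical helper: how many instances fit into the constants reserved
// for instance attributes.
unsigned int MaxInstancesPerBatch(unsigned int constantsAvailable,
                                  unsigned int constantsPerInstance)
{
    assert(constantsPerInstance > 0);
    return constantsAvailable / constantsPerInstance;
}

// Hypothetical helper: first constant register used by a given instance.
unsigned int FirstConstantOfInstance(unsigned int firstRegister,
                                     unsigned int instanceIndex,
                                     unsigned int constantsPerInstance)
{
    return firstRegister + instanceIndex * constantsPerInstance;
}
```

With 200 constants and 5 constants per instance, MaxInstancesPerBatch gives 40, matching the figure in the text.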
The following is the Update() method. The actual instancing is done in the vertex shader.
D3DXVECTOR4 instancesData[MAX_NUMBER_OF_CONSTANTS];
unsigned int count = 0;
for (unsigned int i = 0; i < GetInstancesCount(); ++i)
{
    // Write model matrix, one row per constant
    instancesData[count++] = *(D3DXVECTOR4*)&mInstances[i].mModelMatrix._11;
    instancesData[count++] = *(D3DXVECTOR4*)&mInstances[i].mModelMatrix._21;
    instancesData[count++] = *(D3DXVECTOR4*)&mInstances[i].mModelMatrix._31;
    instancesData[count++] = *(D3DXVECTOR4*)&mInstances[i].mModelMatrix._41;
    // Write instance color
    instancesData[count++] = ConvertColorToVec4(mInstances[i].mColor);
}
// Upload 'count' float4 constants starting at the first instance constant
lpDevice->SetVertexShaderConstantF(INSTANCES_DATA_FIRST_CONSTANT, (const float*)instancesData, count);
Below is the vertex shader:
// Vertex input declaration
struct vsInput
{
    float4 position : POSITION;
    float3 normal : NORMAL;
    // Other vertex data
    int4 instance_index : BLENDINDICES;
};
vsOutput VertexConstantsInstancingVS(in vsInput input)
{
    vsOutput output;
    // Get the instance index; the index is premultiplied by 5 to take account of the number of constants used by each instance
    int instanceIndex = ((int[4])(input.instance_index))[0];
    // Access each row of the instance model matrix
    float4 m0 = InstanceData[instanceIndex + 0];
    float4 m1 = InstanceData[instanceIndex + 1];
    float4 m2 = InstanceData[instanceIndex + 2];
    float4 m3 = InstanceData[instanceIndex + 3];
    // Construct the model matrix
    float4x4 modelMatrix = {m0, m1, m2, m3};
    // Get the instance color
    float4 instanceColor = InstanceData[instanceIndex + 4];
    // Transform input position and normal to world space with the instance model matrix
    float4 worldPosition = mul(input.position, modelMatrix);
    float3 worldNormal = mul(input.normal, (float3x3)modelMatrix);
    // Output position, normal, and color
    output.position = mul(worldPosition, ViewProjectionMatrix);
    output.normal = mul(worldNormal, (float3x3)ViewProjectionMatrix);
    output.color = instanceColor;
    // Output other vertex data
    return output;
}
The Render() method sets the view and projection matrices and calls DrawIndexedPrimitive() to submit all instances.
In real code, the rotation part of the model matrix can be stored as a quaternion, which saves two constants and raises the maximum number of instances to about 70. The matrix is then rebuilt in the vertex shader. Of course, this also increases the complexity of the code and the execution time.
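To illustrate the quaternion variant, here is a CPU-side sketch of the transform a vertex shader would have to perform when each instance is described by a rotation quaternion and a translation (two constants) instead of a full matrix (four constants). The types and function names are hypothetical, the quaternion is assumed to be normalized, and the rotation is applied directly rather than by rebuilding a matrix first.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };   // assumed to be unit length

Vec3 Cross(const Vec3& a, const Vec3& b)
{
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Rotate v by the unit quaternion q using
// v' = v + 2 * cross(q.xyz, cross(q.xyz, v) + q.w * v).
Vec3 Rotate(const Quat& q, const Vec3& v)
{
    const Vec3 u = { q.x, q.y, q.z };
    const Vec3 c = Cross(u, v);
    const Vec3 t = { c.x + q.w * v.x, c.y + q.w * v.y, c.z + q.w * v.z };
    const Vec3 r = Cross(u, t);
    return { v.x + 2.0f * r.x, v.y + 2.0f * r.y, v.z + 2.0f * r.z };
}

// Model-to-world transform from a rotation and a translation.
Vec3 TransformPoint(const Quat& rotation, const Vec3& translation, const Vec3& p)
{
    const Vec3 r = Rotate(rotation, p);
    return { r.x + translation.x, r.y + translation.y, r.z + translation.z };
}
```

With rotation, translation, and color taking three constants per instance, 200 constants hold 66 instances, which is roughly the figure of 70 mentioned above. Note that this representation does not carry scaling, so it only suits instances that are rotated and moved.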
3.3.4 Batching with the Geometry Instancing API
The last method is batching with the geometry instancing API, which was introduced in DirectX 9 and is fully implemented in hardware by GeForce 6 Series GPUs. As more hardware supports the geometry instancing API, this technique will become more interesting. It needs only a very small amount of memory and little CPU intervention. Its only drawback is that it can handle only instances of the same geometry packet.
DirectX 9 provides the following function to access the geometry instancing API:
HRESULT SetStreamSourceFreq(UINT StreamNumber, UINT FrequencyParameter);
StreamNumber is the index of the target stream, and FrequencyParameter specifies the instance frequency of that stream, that is, how the stream's data is shared among instances.
We first create two vertex buffers: a static buffer that stores the single geometry packet to be instanced multiple times, and a dynamic buffer that stores the instance data. The figure shows the two data streams.
Commit() must ensure that all instances use the same geometry packet, and it copies the geometry into the static buffer.
Update() simply copies all instance attributes into the dynamic buffer. Although it is similar to the Update() method of a dynamic batch, it minimizes CPU intervention and graphics bus (AGP or PCI-E) bandwidth. In addition, we can allocate a vertex buffer large enough for all instance attributes without worrying about memory consumption, because the attributes of each instance occupy only a small fraction of the memory of a whole geometry packet.
The Render() method sets up the two streams with the correct stream frequencies and then calls DrawIndexedPrimitive() to render all instances in the same batch. The code is as follows:
unsigned int instancesCount = GetInstancesCount();
// Set up stream source frequency for the first stream to render instancesCount instances
// D3DSTREAMSOURCE_INDEXEDDATA tells Direct3D we'll use indexed geometry for instancing
lpDevice->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instancesCount);
// Set up first stream source with the vertex buffer containing geometry for the geometry packet
// (the last argument of SetStreamSource is the vertex stride in bytes)
lpDevice->SetStreamSource(0, mGeometryInstancingVB[0], 0, mGeometryPacketDecl);
// Set up stream source frequency for the second stream; each set of instance attributes describes one instance to be rendered
lpDevice->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1);
// Set up second stream source with the vertex buffer containing all instances' attributes
lpDevice->SetStreamSource(1, mGeometryInstancingVB[1], 0, mInstancesDataVertexDecl);
The GPU virtually duplicates the vertices of the first stream and packs them together with the data of the second stream. The vertex shader input contains the vertex position in model space and the instance attributes, including the model matrix used to transform the vertex to world space. The code is as follows:
// Vertex input declaration
struct vsInput
{
    // Stream 0
    float4 position : POSITION;
    float3 normal : NORMAL;
    // Stream 1
    float4 model_matrix0 : TEXCOORD0;
    float4 model_matrix1 : TEXCOORD1;
    float4 model_matrix2 : TEXCOORD2;
    float4 model_matrix3 : TEXCOORD3;
    float4 instance_color : COLOR;
};
vsOutput GeometryInstancingVS(in vsInput input)
{
    vsOutput output;
    // Construct the model matrix
    float4x4 modelMatrix =
    {
        input.model_matrix0,
        input.model_matrix1,
        input.model_matrix2,
        input.model_matrix3
    };
    // Transform input position and normal to world space with the instance model matrix
    float4 worldPosition = mul(input.position, modelMatrix);
    float3 worldNormal = mul(input.normal, (float3x3)modelMatrix);
    // Output position, normal, and color
    output.position = mul(worldPosition, ViewProjectionMatrix);
    output.normal = mul(worldNormal, (float3x3)ViewProjectionMatrix);
    output.color = input.instance_color;
    // Output other vertex data ...
    return output;
}
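What the hardware does with the two streams can be emulated on the CPU to make the idea concrete. This is a sketch under stated assumptions: the types are hypothetical, the per-instance data is reduced to a simple offset instead of a model matrix and color, and the function only shows which pairs of elements reach the vertex shader.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PacketVertex { float x, y, z; };                     // element of stream 0
struct InstanceData { float offsetX, offsetY, offsetZ; };   // element of stream 1
struct ShaderInput  { PacketVertex vertex; InstanceData instance; };

// Emulate the fetch: stream 1 advances once per instance, while stream 0
// is read again from its beginning for every instance.
std::vector<ShaderInput> EmulateInstancedFetch(
    const std::vector<PacketVertex>& stream0,
    const std::vector<InstanceData>& stream1)
{
    std::vector<ShaderInput> inputs;
    inputs.reserve(stream0.size() * stream1.size());
    for (const InstanceData& instance : stream1)
    {
        for (const PacketVertex& vertex : stream0)
        {
            const ShaderInput input = { vertex, instance };
            inputs.push_back(input);
        }
    }
    return inputs;
}
```

The vertex shader sees one input per (vertex, instance) pair, yet only one copy of the geometry packet and one small record per instance are stored. On the GPU the combined list is never materialized in memory, which is why this method uses so little of it.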
Because CPU load and memory usage are both minimized, this technique can efficiently render a large number of copies of the same geometry, which makes it an ideal solution for games. Its disadvantages are that it requires hardware support and that skinning cannot be implemented easily.
To implement skinning, the bone information of all instances can be stored in a texture, and the correct bones are then selected for each instance. This requires the vertex texture fetch feature of Shader Model 3.0. If this technique is used, the performance cost of vertex texture access is uncertain and should be measured.
3.4 Conclusion
This article described the concept of geometry instancing and four different techniques for efficiently rendering the same geometry multiple times. Each technique has its own advantages and disadvantages, and no single solution fits every situation that can occur in a game scene. Choose the appropriate method based on the application type and the kind of objects being rendered.
The following methods are recommended for some typical scenarios:
* For indoor scenes that contain a large number of static instances of the same geometry, static batching is the best choice because the instances rarely move.
* For outdoor scenes that contain a large number of animated instances, such as real-time strategy games with hundreds of soldiers, dynamic batching may be the best choice.
* For outdoor scenes that contain a large amount of vegetation and trees whose attributes often need to be modified (for example, to sway in the wind), as well as for particle systems, the geometry instancing API may be the best choice.
Generally, the same application uses two or more of these methods. In that case, an abstract geometry batch interface hides the specific implementations and makes the engine easier to modularize and manage. It also greatly reduces the effort of implementing geometry instancing for the program as a whole.
(In the figure, the static buildings use static batching, while the trees use the geometry instancing API.)
The complete PDF document is attached. For a complete demo, see the Instancing sample in the NVIDIA SDK or download it directly here. You can also refer to the Instancing sample in the DirectX SDK.