Judging from the current Ogre 2.1 code, this update brings, in summary, a wealth of changes: new design patterns, SoA data structures (for SIMD and DoD), new threading patterns, a new rendering process and scene-update flow, a new material management system, a new mesh format, and a new compositor. It is fair to say that Ogre 2.x and Ogre 1.x are no longer the same engine, whether in efficiency or in the rendering ideas they adopt.
This article mainly draws on two documents. One is OGRE 2.0 Proposal Slides.odp, in which dark_sylinc, one of the current Ogre maintainers, lays out the direction of the Ogre 2.0 changes based on comparisons with other engines and related benchmarks. The other is OGRE 2.0 Porting Manual Draft.odt, the porting manual, which in simple terms describes the specific locations and details of the Ogre 2.0 modifications. Both documents are very valuable; they are the essence of the new Ogre changes, and from them we can learn how a C++ game engine is refactored for efficiency and usability. Reading them is a rewarding experience.
Download the latest version from https://bitbucket.org/sinbad/ogre; inside the doc folder there are many documents. I have organized them so that each part below covers the reason for a change, its location, and the relevant code. Since the source documents are all in English, my understanding may contain errors; you are welcome to point them out, as I do not want to mislead anyone. This article covers only the new model and new features; the v1 namespace (Ogre 1.x functionality kept alongside its Ogre 2.x counterparts) will not be discussed here.
Problems and suggestions in Ogre 1.x
Cache misses
Take a look at the author's slides; the figures are particularly vivid.
This is the function that determines whether a model is visible to the current camera and, if visible, adds it to the render queue. If you look at the Ogre 2.1 code now, this method is gone; not because the check disappeared, but because the rendering process changed. The logic now lives in MovableObject::cullFrustum, which contains the algorithm that intersects the camera frustum with the model's AABB.
This function runs for every model every frame, so if it causes cache misses the cost is significant.
The same applies to the commonly used function that returns a model's world position; it too runs for every model, and a cache miss costs just as much.
So how do we improve this? As shown above, simply removing those checks means either extra computation or incorrect results. The author's answer is to restructure the rendering process so the conditional checks are no longer needed. More on this later.
Inefficient scene traversal and updates.
You can see that every scene update repeatedly checks whether an update is needed and then updates: lots of unnecessary variables, dirty-state tracking, and far too many branches, all of which cause cache misses and are cache unfriendly. (Is a single if really that destructive? At engine level such branches do add up, and in the new rendering process many of these ifs are removed.)
The slides then point out that in the Ogre rendering process SceneManager::_renderScene() is called too many times: once for the shadow map, once per render_scene pass in the compositor, and the culling results are never reused, so each _renderScene culls again. In particular, with multiple render_scene passes in the compositor, the models in the render queue are culled from scratch every time, which is wasted work.
Putting these two points together, the render queue and rendering flow had to change substantially. Drawing on other commercial rendering engines, the author gives the new implementation proposed for Ogre 2.0; judging from the Ogre 2.1 code I have read, the design in the diagram below has already been implemented.
This figure also touches on the thread-related part that follows. This is the Ogre 2.x rendering process, and you can see that the new compositor is part of the Ogre core, no longer an optional component; it has of course also received a considerable number of updates, making it more powerful and easier to use. The details are covered later in the sections devoted to the new Ogre 2.x rendering process and compositor.
SIMD, DoD, SoA
Before reading further, let us introduce what a DoD-based design is. DoD (data-oriented design) stands opposite the OOD (object-oriented design) behind everyday OOP. Some background reading:
The dispute between DoD and OOD: Data-Oriented Design vs. Object-Oriented Design
Data-Oriented Design: what DoD is, why to use it, and when to use it
Data-Oriented Design (the CSDN translation of the first article)
Read them carefully if interested. To summarize DoD's advantages over OOP: compact data, efficient parallelization, and cache friendliness.
First look at this Stack Overflow question: http://stackoverflow.com/questions/12141626/data-oriented-design-in-oop. The two code versions are as follows:
// DoD
void updateAims(float* aimDir, const AimingData* aim, Vec3 target, uint count)
{
    for (uint i = 0; i < count; i++)
    {
        aimDir[i] = dot3(aim->positions[i], target) * aim->mod[i];
    }
}

// OOP
class Bot
{
    Vec3 position;
    float mod;
    float aimDir;

    void updateAim(Vec3 target)
    {
        aimDir = dot3(position, target) * mod;
    }
};

void updateBots(Bot* pBots, uint count, Vec3 target)
{
    for (uint i = 0; i < count; i++)
        pBots[i].updateAim(target);
}

DoD vs OOP
One of the answers explains why the first version is efficient: in the second version, every field access drags the whole object through the cache, wasting bandwidth and pulling unneeded data into cache lines, so the miss rate is high. The first version streams a contiguous block of floats at a time, making far better use of the cache.
Take the most common operation in a game, getting the MVP matrix of each model. OOP tells us to fetch the model object first, which also contains much other content that is useless here, consuming cache space and lowering the hit rate. DoD instead stores each field together, so all positions can be fetched in one stream; see the SoA section below.
Here are two more concepts: SoA (Structure of Arrays, not the SOA a Baidu search would give you) and AoS (Array of Structures). Roughly speaking, SoA is the data layout commonly used in DoD, corresponding to the AoS layout common in OOP.
Simply put, SoA stores each field of a group of elements contiguously, as in my rewrite of the Ogre 2.x code below.
#include <cassert>
#include <cstring>
#include <iostream>
using namespace std;

struct Vector3
{
    float x = 0;
    float y = 0;
    float z = 0;

    Vector3() {}
    Vector3(float nx, float ny, float nz)
    {
        x = nx; y = ny; z = nz;
    }
    float& operator[](const size_t i)
    {
        assert(i < 3);
        return *(&x + i);
    }
};

struct Quaternion
{
    float x;
    float y;
    float z;
    float w;
};

// OOD (object-oriented design)
struct Transform
{
    Vector3 pos;
    Vector3 scale;
    Quaternion orient;

    void move(Vector3 move)
    {
        for (int i = 0; i < 3; i++)
            pos[i] += move[i];
    }
};

// SIMD
struct ArrayVector3
{
    float x[4];
    float y[4];
    float z[4];

    ArrayVector3()
    {
        memset(x, 0, sizeof(float) * 4);
        memset(y, 0, sizeof(float) * 4);
        memset(z, 0, sizeof(float) * 4);
    }
    Vector3 getIndex(int index)
    {
        assert(index >= 0 && index < 4);
        return Vector3(x[index], y[index], z[index]);
    }
    void setIndex(int index, float fx, float fy, float fz)
    {
        assert(index >= 0 && index < 4);
        x[index] = fx;
        y[index] = fy;
        z[index] = fz;
    }
};

struct ArrayQuaternion
{
    float x[4];
    float y[4];
    float z[4];
    float w[4];
};

// SoA (structure of arrays)
struct ArrayTransformSoA
{
    ArrayVector3* pos;
    ArrayVector3* scale;
    ArrayQuaternion* orient;
    int mIndex = 0;

    ArrayTransformSoA()
    {
        pos = new ArrayVector3();
        scale = new ArrayVector3();
        orient = new ArrayQuaternion();
    }
    ~ArrayTransformSoA()
    {
        // Caveat: this sketch assumes pos/scale/orient still point at the
        // memory new'ed above; if an external manager re-points them into a
        // pool, these deletes are invalid. This is illustrative code only.
        delete pos;
        delete scale;
        delete orient;
    }
    void move(Vector3 move)
    {
        // memory layout: xxxxyyyyzzzz
        float* soa = reinterpret_cast<float*>(pos);
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                soa[i * 4 + j] += move[i];
    }
    void setPos(float x, float y, float z)
    {
        pos->setIndex(mIndex, x, y, z);
    }
    Vector3 getPos()
    {
        return pos->getIndex(mIndex);
    }
};

void soaVsAos()
{
    // AoS (array of structures)
    Transform aosArray[4];
    ArrayTransformSoA soaArray;
    Vector3 moveAdd(4.0f, 2.0f, 1.0f);
    for (int i = 0; i < 4; i++)
        aosArray[i].move(moveAdd);
    soaArray.move(moveAdd);
    for (int i = 0; i < 4; i++)
    {
        cout << aosArray[i].pos.x << endl;
        soaArray.mIndex = i;
        cout << soaArray.getPos().x << endl;
    }
    cout << "" << endl;
}
ArrayVector3 and ArrayTransformSoA above are the SoA-related layout and operations; the parts annotated SIMD really are organized for SIMD use. I will not say much about SIMD here (if I get the chance I will cover it in detail later); for now we only need to know that SSE2 processes 128 bits per instruction, i.e. four 32-bit floats at once. In the move method of ArrayTransformSoA above, the inner loop corresponds to a single SSE2 instruction; in simple terms, roughly a 4x speedup, and it can be called one of the simplest safe forms of parallelism: no threads to set up and no synchronization to care about. Concretely, the game can operate on four vectors at a time for moves, scales, and matrix operations (with limitations on which four are packed together; see later). It is also cache friendly, as the illustration above shows.
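To make the "one instruction per field" point concrete, here is a minimal sketch using SSE2 intrinsics (my own illustration, not Ogre code; the helper name moveSoA is hypothetical):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: adds (dx, dy, dz) to four vectors stored as
// xxxxyyyyzzzz. Each _mm_add_ps processes four floats in one instruction,
// replacing the scalar inner loop of ArrayTransformSoA::move.
void moveSoA(float* soa /* [12]: xxxx yyyy zzzz */, float dx, float dy, float dz)
{
    const float d[3] = { dx, dy, dz };
    for (int i = 0; i < 3; ++i)
    {
        __m128 field = _mm_loadu_ps(soa + i * 4);  // load 4 x (or y, or z) values
        __m128 delta = _mm_set1_ps(d[i]);          // broadcast the offset
        _mm_storeu_ps(soa + i * 4, _mm_add_ps(field, delta));
    }
}
```

The real Ogre classes wrap this in ArrayVector3 operators; this sketch only shows why the xxxxyyyyzzzz layout maps so directly onto the instruction set.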
The figure shows a typical SoA layout. For a single object the storage is no longer contiguous: its y is 4*sizeof(float) away from its x, whereas in an object-oriented struct they would be adjacent. But across the object list each field is contiguous; as the figure above shows, the memory layout should be xxxxyyyyzzzz, not xyzxyzxyzxyz. As I said before, DoD also uses the SoA layout, but this packing of four is specifically for SIMD, so it is not the core DoD design in Ogre. The real DoD core is Ogre's ArrayMemoryManager class; dragging that class out directly would be hard to follow, so below is my rewrite of the Ogre 2.x ArrayMemoryManager, keeping only the core to aid understanding.
// DoD (data-oriented design)
#include <vector>

class ArrayTransformManager
{
private:
    enum ElementType
    {
        Pos,
        Scale,
        Orient,
        ElementCount
    };
    int elements[ElementCount];
    std::vector<char*> memoryPool;
    int totalSize = 0;
    int maxMemory = 32;
    int nextSlot = 0;   // current slot

public:
    ArrayTransformManager()
    {
        elements[Pos] = 3 * sizeof(float);
        totalSize += elements[Pos];
        elements[Scale] = 3 * sizeof(float);
        totalSize += elements[Scale];
        elements[Orient] = 4 * sizeof(float);
        totalSize += elements[Orient];
        memoryPool.resize(ElementCount);
    }
    void initialize()
    {
        for (int i = 0; i < ElementCount; i++)
        {
            int byteCount = elements[i] * maxMemory;
            memoryPool[i] = new char[byteCount];
            memset(memoryPool[i], 0, byteCount);
        }
    }
    void createArrayTransform(ArrayTransformSoA& outTransform)
    {
        int current = nextSlot++;
        // current = 0, nextSlotIdx = 0, nextSlotBase = 0
        // current = 3, nextSlotIdx = 3, nextSlotBase = 0
        // current = 4, nextSlotIdx = 0, nextSlotBase = 4
        // current = 5, nextSlotIdx = 1, nextSlotBase = 4
        // current = 7, nextSlotIdx = 3, nextSlotBase = 4
        // current = 8, nextSlotIdx = 0, nextSlotBase = 8
        int nextSlotIdx = current % 4;
        int nextSlotBase = current - nextSlotIdx;

        // Free the constructor-allocated blocks before re-pointing into the
        // pool. (Caveat: the ArrayTransformSoA destructor will still try to
        // delete the pool pointers; this is illustrative code only.)
        delete outTransform.pos;
        delete outTransform.scale;
        delete outTransform.orient;

        outTransform.mIndex = nextSlotIdx;
        outTransform.pos = reinterpret_cast<ArrayVector3*>(
            memoryPool[Pos] + nextSlotBase * elements[Pos]);
        outTransform.scale = reinterpret_cast<ArrayVector3*>(
            memoryPool[Scale] + nextSlotBase * elements[Scale]);
        outTransform.orient = reinterpret_cast<ArrayQuaternion*>(
            memoryPool[Orient] + nextSlotBase * elements[Orient]);
        outTransform.setPos(nextSlotIdx, nextSlotIdx, nextSlotIdx);
    }
};

void testDod()
{
    ArrayTransformManager transformDOD;
    transformDOD.initialize();
    ArrayTransformSoA transform0;
    transformDOD.createArrayTransform(transform0);
    ArrayTransformSoA transform1;
    transformDOD.createArrayTransform(transform1);
    ArrayTransformSoA transform2;
    transformDOD.createArrayTransform(transform2);
    ArrayTransformSoA transform3;
    transformDOD.createArrayTransform(transform3);
    ArrayTransformSoA transform4;
    transformDOD.createArrayTransform(transform4);
    cout << transform0.getPos().x << endl;
    cout << transform1.getPos().x << endl;
    cout << transform2.getPos().x << endl;
    cout << transform3.getPos().x << endl;
    cout << transform4.getPos().x << endl;
}
// ArrayTransformManager corresponds to ArrayMemoryManager in Ogre
This combines the SIMD and DoD designs. Ignore the SIMD part and look at the initialization: maxMemory is the maximum number of transforms, elements holds the byte size of each field of a transform, and memoryPool keeps each field's storage contiguous across all transforms. Each call to createArrayTransform hands out an ArrayTransformSoA; every four consecutive ArrayTransformSoA share the same pos/scale/orient blocks (as in the figure above), differing only in mIndex, which indicates the slot currently occupied within the SoA pack. Compared with a bare ArrayVector3, the data organization is the same; the difference is that the pool holds maxMemory entries while one ArrayVector3 holds four Vector3s. In short, this is the SoA data structure, used for SIMD and DoD respectively.
The 1st and 2nd slide documents mainly cover two points: first, the rendering process (detailed later in the text); second, the SIMD/DoD changes to basic data formats and operations. I will set aside the later parts on vertex formats and shaders for now; readers can look at them themselves. I have not found the corresponding code yet and do not know whether it has been completed.
Finally, the slides also point out that design patterns were overused in Ogre, meaning OOD was too wasteful. Macros VIRTUAL_L0, VIRTUAL_L1, VIRTUAL_L2 now control the level of polymorphism, and by default virtual functions are disabled: for example SceneNode::getPosition() and SceneNode::setPosition() cannot be overridden by default. If you define a SceneNode subclass and override those functions, you need to raise the polymorphism level and recompile yourself. Ogre 1.x's best-known design patterns were likewise removed along with the rendering-process change; I believe that once you see the relevant updates, you will agree this change is huge.
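The mechanism behind those macros can be sketched like this (my own illustration of the idea; the names OGRE_DEMO_POLY_LEVEL and DemoSceneNode are hypothetical, and the real macros live in Ogre's headers):

```cpp
// Sketch of compile-time polymorphism levels: each VIRTUAL_Ln macro expands
// to `virtual` only when the build enables that level, otherwise to nothing,
// so hot functions pay no vtable cost by default.
#define OGRE_DEMO_POLY_LEVEL 0

#if OGRE_DEMO_POLY_LEVEL >= 1
    #define VIRTUAL_L1 virtual
#else
    #define VIRTUAL_L1
#endif

struct DemoSceneNode
{
    float mX = 0;
    // Non-virtual by default: cannot be overridden unless the build
    // raises the polymorphism level and the engine is recompiled.
    VIRTUAL_L1 void setPosition(float x) { mX = x; }
    VIRTUAL_L1 float getPosition() const { return mX; }
};
```

With the level at 0 these calls are resolved statically; raising the level restores overridability at the price of virtual dispatch.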
"We don ' t care really how long it takes" ogre take advantage of Ood, at the customer level.
Ogre 2.x Porting Manual:
Models, scenes, and nodes:
1. After Ogre 2.0, many Ogre objects drop their name in favor of an IdObject. The deeper point is that many of the original containers were maps keyed by name; now the key is an id, and since ids are auto-generated they carry little meaning, so the containers become FastArray, a lightweight vector-like class written by the Ogre developers themselves.
The Sorted Vector pattern part I
The Sorted Vector pattern part II
Because scene work is mostly bulk operations, such as updating the position and AABB of every model, fast iteration is our biggest requirement, and a vector is more space efficient with contiguous memory blocks (AoS, SIMD, DoD). The map's advantages, such as quick lookup and random insertion and deletion, are rarely needed here. So many of the maps in Ogre 1.x were an unwise choice.
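To make the tradeoff concrete, here is a small sketch of my own (the struct and function names are hypothetical): the same bulk per-frame update over a contiguous vector and over a node-based map. The vector sweep is one linear pass over contiguous memory; the map walk chases scattered tree nodes.

```cpp
#include <map>
#include <vector>

struct Aabb { float minX, maxX; };

// Bulk update over a contiguous vector: one linear sweep, cache friendly.
float sweepVector(std::vector<Aabb>& v, float grow)
{
    float total = 0;
    for (Aabb& a : v) { a.maxX += grow; total += a.maxX - a.minX; }
    return total;
}

// Same update over a keyed map: node-based storage, scattered in memory.
float sweepMap(std::map<int, Aabb>& m, float grow)
{
    float total = 0;
    for (auto& kv : m) { kv.second.maxX += grow; total += kv.second.maxX - kv.second.minX; }
    return total;
}
```

Both produce the same result; the difference is purely in memory-access behavior, which is exactly what the engine-level change targets.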
Among the containers, what Ogre uses internally for most aggregation is FastArray, a lightweight std::vector replacement that eliminates the heavy bounds checking and iterator validation. It follows most of the std::vector interface, so things like std::for_each work fine, but it is not fully standard: for example FastArray myArray(5) does not automatically zero-initialize the 5 elements. Using it outside the engine is not recommended by default because, as we said earlier, unlike standard std::vector it puts efficiency first and removes all bounds checks and related validation; use it only if we know exactly how it should be used.
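A miniature sketch of the idea (my own illustration, not Ogre's actual FastArray): a vector-like container whose sized constructor leaves the contents uninitialized and whose operator[] does no range checking. For brevity it only works for trivially-destructible element types.

```cpp
#include <cstdlib>
#include <cstddef>

// Hypothetical miniature of the FastArray idea: raw storage, no zero-fill
// on sized construction, operator[] without bounds checking.
// (Sketch only: restricted to trivially-destructible T.)
template <typename T>
class TinyFastArray
{
    T*     mData;
    size_t mSize;
public:
    explicit TinyFastArray(size_t n)
        : mData(static_cast<T*>(std::malloc(sizeof(T) * n))), mSize(n) {}
        // note: contents are uninitialized, unlike std::vector<T>(n)
    ~TinyFastArray() { std::free(mData); }

    T&       operator[](size_t i)       { return mData[i]; }  // no range check
    const T& operator[](size_t i) const { return mData[i]; }
    T*       begin()                    { return mData; }
    T*       end()                      { return mData + mSize; }
    size_t   size() const               { return mSize; }
};
```

Because begin()/end() return raw pointers, standard algorithms and range-for still work, which matches the "follows most of the std::vector interface" claim above.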
2. How data such as MovableObject and Node is stored and viewed.
If you read the earlier SoA part carefully, this part needs little elaboration: a node's placement information lives in a Transform, a MovableObject's information lives in ObjectData, and the corresponding fields are packed four at a time for SIMD instruction acceleration. So the Transform behind one node actually holds 4 transforms' worth of data, laid out as xxxxyyyyzzzz in memory, and the node finds its own data through the Transform's mIndex (in the range [0, 4)).
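As a quick illustration of that lookup (my own sketch; the function name readPackedPosition is hypothetical): given the xxxxyyyyzzzz layout, a node with pack index mIndex strides over the pack to fetch its own components.

```cpp
#include <cstddef>

// Hypothetical: one pack holds 4 nodes, fields laid out xxxxyyyyzzzz.
// A node with mIndex in [0, 4) fetches its (x, y, z) by striding the pack.
void readPackedPosition(const float* pack /* [12] */, size_t mIndex,
                        float& x, float& y, float& z)
{
    x = pack[0 * 4 + mIndex];   // x block starts at offset 0
    y = pack[1 * 4 + mIndex];   // y block starts at offset 4
    z = pack[2 * 4 + mIndex];   // z block starts at offset 8
}
```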
To keep the SIMD stream going without a mass of null checks, if a Transform pack holds only three nodes, the last slot is not set to NULL; a dummy pointer is used in its place. This is a commonly used DoD technique.
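A minimal sketch of the "dummy instead of NULL" idea (my own illustration, not Ogre's actual code; NodePos, fillPack, and movePackX are hypothetical names): empty slots in a pack point at a shared dummy object, so the per-slot loop needs no null branch.

```cpp
#include <cstddef>

struct NodePos { float x, y, z; };

// Shared dummy written to by unused slots; its contents are never read.
static NodePos g_dummyNode = { 0, 0, 0 };

// Fill a pack of 4 slots; unused entries point at g_dummyNode rather than
// NULL, so the update loop below needs no "if (slot[i])" branch.
void fillPack(NodePos* slots[4], NodePos* real[], size_t count)
{
    for (size_t i = 0; i < 4; ++i)
        slots[i] = (i < count) ? real[i] : &g_dummyNode;
}

void movePackX(NodePos* slots[4], float dx)
{
    for (size_t i = 0; i < 4; ++i)
        slots[i]->x += dx;   // branchless: dummies absorb the writes
}
```

The wasted writes to the dummy cost far less than the branch misprediction (and broken SIMD stream) that a null check would introduce.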
3. In Ogre 1.x a MovableObject was in the scene only after being attached to a SceneNode; Ogre 2.x no longer has that concept, and the node is just what the MovableObject uses to manipulate and store its placement. This follows from the rendering-process change: the old pipeline walked the node hierarchy to find every MovableObject within the viewport, whereas now a MovableObject is registered with the corresponding ObjectMemoryManager when it is generated (which, in the same way, assigns it an SoA slot and keeps the pointer, releasing it when it is removed from the scene), so as far as the ObjectMemoryManager is concerned the MovableObject is always in the scene. But the node still holds the placement information; without it, the object cannot be rendered in the scene.
Attaching/detaching operations still drive visibility automatically: attaching sets visible to true (you can then set it to false yourself), and detaching sets visible to false. If you manually set it to true while detached, an error occurs.
If something is already attached to one SceneNode and you attach it to another SceneNode, you need to detach it first; otherwise an assertion fires.
4. All MovableObjects need a SceneNode, including lights and cameras, which must all be attached to a SceneNode. The reason is simple: originally lights and cameras differed from ordinary MovableObjects in two ways, first that they are not rendered themselves, and second that they carried their own placement information. But now the SceneNode is no longer part of the render path and just stores placement, so lights and cameras, like any ordinary MovableObject, use a SceneNode to hold their placement and must be attached to one to have a position.
5. Changing a node's local position no longer immediately yields the corresponding world position. Unlike the Ogre 1.x version, where setPosition set a flag indicating the parent chain needed updating and getDerivedPosition checked the flag and updated the parents on demand (a cache-unfriendly design), Ogre 2.x removes those flags: updates do not walk the parents, and all nodes are updated once per frame in updateAllTransforms. That means after setPosition you need the current frame to run (i.e. updateAllTransforms to be called) before getDerivedPosition returns the correct value. If you really need it immediately, you can use _getDerivedPositionUpdated, but that should be the exception; if possible, change your design. At the same time, the single getDerivedPosition of Ogre 1.x is split into two methods in Ogre 2.x, and the if judgment is removed.
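The deferred-update contract can be sketched like this (my own illustration; DemoNode and its single-axis "parent" are hypothetical simplifications, not Ogre's API):

```cpp
#include <vector>

// Hypothetical node: local position plus a cached world (derived) position.
// There is no dirty flag; the cache only refreshes in updateAllTransforms,
// which is assumed to run once per frame over every node in one tight pass.
struct DemoNode
{
    float localX = 0;
    float parentX = 0;          // stand-in for the parent chain
    float derivedX = 0;         // stale until the frame update runs

    void setPosition(float x)        { localX = x; }       // no flags, no walk
    float getDerivedPosition() const { return derivedX; }  // may be stale!
    float getDerivedPositionUpdated()                      // forced refresh
    { derivedX = parentX + localX; return derivedX; }
};

void updateAllTransforms(std::vector<DemoNode>& nodes)
{
    for (DemoNode& n : nodes)   // one linear sweep, cache friendly
        n.derivedX = n.parentX + n.localX;
}
```

The design choice is the whole point: no per-call branching or dirty tracking, just one predictable bulk pass per frame, with the "Updated" variant as an explicit escape hatch.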
6. Node and MovableObject are split into dynamic and static. Static tells Ogre "I will not update my position every frame", which allows two optimizations: one, saving CPU by not updating the position and the corresponding AABB every frame; two, telling the GPU that some models can be batched together for rendering. Dynamic models may only be attached to dynamic nodes, and static models only to static nodes. A dynamic node can contain static child nodes, while a static node cannot contain dynamic child nodes (except the root), for the simple reason that a static node is rarely updated; putting a dynamic child under it leaves Ogre unsure whether to update it or not.
7. Before Ogre 2.0, we knew that final rendering dealt only in Renderable and Pass: during the scene visibility check each MovableObject pushed all its current Renderables into the RenderQueue, and users could even add a Renderable to the RenderQueue without a MovableObject. After Ogre 2.1, a Renderable must enter the RenderQueue together with its MovableObject, because in the new rendering system the LOD level, skeletal animation, and MVP matrix needed at render time are all stored directly on the MovableObject. For instance, in Ogre 2.1 the MVP matrix is stored directly on the corresponding MovableObject and is no longer acquired through Renderable::getWorldTransforms. Please see the relevant QueuedRenderable reference material for details.
Also removed is the visitor pattern that the original Ogre 2.0 used during the scene visibility check to determine shadow receivers; this pattern cost too much to be worth it. Of course, after Ogre 2.0 everything shadow-related changed: only shadow mapping (and its many variant techniques) is supported, stencil shadow support is removed, and you can see that MovableObject is no longer a subclass of ShadowCaster, the class that wrapped shadow-related behavior.
Another change is the introduction of the new mesh format and VAOs; the original VBO classes were changed to the corresponding VaoManager. A detailed analysis will be given later.
SIMD, DoD, threads:
The SIMD and DoD design was described earlier; here I will simply mention a few classes. The files are under OgreMain/Math/Array/SSE2.
The code extracted and rewritten from Ogre 2.x, as I showed earlier (ArrayVector3, ArrayTransformSoA, ArrayTransformManager), is exactly what applies here; refer to the SIMD/SoA/DoD listings above rather than repeating them. They relate the SIMD and DoD designs in Ogre to the surrounding context to aid understanding.
To help understand SIMD and DoD in Ogre:
1. My arrayvector3 corresponds to Ogre's ArrayVector3: a SIMD SoA structure for use with SSE2 instructions.
2. Ogre's Transform corresponds to my ArrayTransformSoA: the placement information of a Node.
3. Ogre's ArrayMemoryManager corresponds to my ArrayTransformManager: it generates the DoD memory arrangement.
Of course, the real ArrayMemoryManager does much more: removing slots, tracking deleted slots, automatically preferring deleted slots when allocating, and automatic cleanup after too many blank slots accumulate in the queue. The simulated ArrayTransformManager has none of this.
In Ogre 2.x the thread is no longer a dispensable feature, nor a toy, nor the simple "one logic thread, one render thread" usage, because threads are used in too many places in Ogre 2.x to list. Below, only SceneManager::updateSceneGraph()'s use of the new threading is covered; let's look at the basics.
Look at the following code first.
void SceneManager::startWorkerThreads()
{
#if OGRE_PLATFORM != OGRE_PLATFORM_EMSCRIPTEN
    mWorkerThreadsBarrier = new Barrier( mNumWorkerThreads+1 );
    mWorkerThreads.reserve( mNumWorkerThreads );
    for( size_t i=0; i<mNumWorkerThreads; ++i )
    {
        ThreadHandlePtr th = Threads::CreateThread( THREAD_GET( updateWorkerThread ), i, this );
        mWorkerThreads.push_back( th );
    }
#endif
}

unsigned long updateWorkerThread( ThreadHandle *threadHandle )
{
    SceneManager *sceneManager = reinterpret_cast<SceneManager*>( threadHandle->getUserParam() );
    return sceneManager->_updateWorkerThread( threadHandle );
}
THREAD_DECLARE( updateWorkerThread );

unsigned long SceneManager::_updateWorkerThread( ThreadHandle *threadHandle )
{
#if OGRE_PLATFORM != OGRE_PLATFORM_EMSCRIPTEN
    size_t threadIdx = threadHandle->getThreadIdx();
    while( !mExitWorkerThreads )
    {
        mWorkerThreadsBarrier->sync();
        if( !mExitWorkerThreads )
        {
#else
    size_t threadIdx = 0;
#endif
            switch( mRequestType )
            {
            case CULL_FRUSTUM:
                cullFrustum( mCurrentCullFrustumRequest, threadIdx );
                break;
            case UPDATE_ALL_ANIMATIONS:
                updateAllAnimationsThread( threadIdx );
                break;
            case UPDATE_ALL_TRANSFORMS:
                updateAllTransformsThread( mUpdateTransformRequest, threadIdx );
                break;
            case UPDATE_ALL_BOUNDS:
                updateAllBoundsThread( *mUpdateBoundsRequest, threadIdx );
                break;
            case UPDATE_ALL_LODS:
                updateAllLodsThread( mUpdateLodRequest, threadIdx );
                break;
            case UPDATE_INSTANCE_MANAGERS:
                updateInstanceManagersThread( threadIdx );
                break;
            case BUILD_LIGHT_LIST01:
                buildLightListThread01( mBuildLightListRequestPerThread[threadIdx], threadIdx );
                break;
            case BUILD_LIGHT_LIST02:
                buildLightListThread02( threadIdx );
                break;
            case USER_UNIFORM_SCALABLE_TASK:
                mUserTask->execute( threadIdx, mNumWorkerThreads );
                break;
            default:
                break;
            }
#if OGRE_PLATFORM != OGRE_PLATFORM_EMSCRIPTEN
            mWorkerThreadsBarrier->sync();
        }
    }
#endif
    return 0;
}

void SceneManager::fireWorkerThreadsAndWait(void)
{
#if OGRE_PLATFORM == OGRE_PLATFORM_EMSCRIPTEN
    _updateWorkerThread( NULL );
#else
    mWorkerThreadsBarrier->sync(); //Fire threads
    mWorkerThreadsBarrier->sync(); //Wait for them to complete
#endif
}

void SceneManager::updateSceneGraph()
{
    //TODO: Enable auto tracking again, first manually update the tracked scene nodes for correct math. (dark_sylinc)
    // Update scene graph for this camera (can happen multiple times per frame)
    /*{
        // Auto-track nodes
        AutoTrackingSceneNodes::iterator atsni, atsniend;
        atsniend = mAutoTrackingSceneNodes.end();
        for (atsni = mAutoTrackingSceneNodes.begin(); atsni != atsniend; ++atsni)
        {
            (*atsni)->_autoTrack();
        }
        // Auto-track camera if required
        camera->_autoTrack();
    }*/

    OgreProfileGroup("updateSceneGraph", OGREPROF_GENERAL);

    // Update controllers
    ControllerManager::getSingleton().updateAllControllers();

    highLevelCull();
    _applySceneAnimations();
    updateAllTransforms();
    updateAllAnimations();
#ifdef OGRE_LEGACY_ANIMATIONS
    updateInstanceManagerAnimations();
#endif
    updateInstanceManagers();
    updateAllBounds( mEntitiesMemoryManagerUpdateList );
    updateAllBounds( mLightsMemoryManagerCulledList );

    {
        // Auto-track nodes
        AutoTrackingSceneNodeVec::const_iterator itor = mAutoTrackingSceneNodes.begin();
        AutoTrackingSceneNodeVec::const_iterator end  = mAutoTrackingSceneNodes.end();

        while( itor != end )
        {
            itor->source->lookAt( itor->target->_getDerivedPosition() + itor->offset,
                                  Node::TS_WORLD, itor->localDirection );
            itor->source->_getDerivedPositionUpdated();
            ++itor;
        }
    }

    {
        // Auto-track camera if required
        CameraList::const_iterator itor = mCameras.begin();
        CameraList::const_iterator end  = mCameras.end();
        while( itor != end )
        {
            (*itor)->_autoTrack();
            ++itor;
        }
    }

    buildLightList();

    //Reset the list of render RQs for all cameras that are in a PASS_SCENE (except shadow passes)
    uint8 numRqs = 0;
    {
        ObjectMemoryManagerVec::const_iterator itor = mEntitiesMemoryManagerCulledList.begin();
        ObjectMemoryManagerVec::const_iterator end  = mEntitiesMemoryManagerCulledList.end();
        while( itor != end )
        {
            numRqs = std::max<uint8>( numRqs, (*itor)->_getTotalRenderQueues() );
            ++itor;
        }
    }

    CameraList::const_iterator itor = mCameras.begin();
    CameraList::const_iterator end  = mCameras.end();
    while( itor != end )
    {
        (*itor)->_resetRenderedRqs( numRqs );
        ++itor;
    }

    // Reset these
    mStaticMinDepthLevelDirty = std::numeric_limits<uint16>::max();
    mStaticEntitiesDirty = false;

    for( size_t i=0; i<OGRE_MAX_SIMULTANEOUS_LIGHTS; ++i )
        mAutoParamDataSource->setTextureProjector( 0, i );
}

void SceneManager::updateAllTransformsThread( const UpdateTransformRequest &request, size_t threadIdx )
{
    Transform t( request.t );
    const size_t toAdvance = std::min( threadIdx * request.numNodesPerThread,
                                       request.numTotalNodes );

    //Prevent going out of bounds (usually in the last threadIdx, or
    //when there are fewer nodes than ARRAY_PACKED_REALS)
    const size_t numNodes = std::min( request.numNodesPerThread,
                                      request.numTotalNodes - toAdvance );

    t.advancePack( toAdvance / ARRAY_PACKED_REALS );

    Node::updateAllTransforms( numNodes, t );
}
//-----------------------------------------------------------------------
void SceneManager::updateAllTransforms()
{
    mRequestType = UPDATE_ALL_TRANSFORMS;
    NodeMemoryManagerVec::const_iterator it = mNodeMemoryManagerUpdateList.begin();
    NodeMemoryManagerVec::const_iterator en = mNodeMemoryManagerUpdateList.end();

    while( it != en )
    {
        NodeMemoryManager *nodeMemoryManager = *it;
        const size_t numDepths = nodeMemoryManager->getNumDepths();

        size_t start = nodeMemoryManager->getMemoryManagerType() == SCENE_STATIC ?
                                                    mStaticMinDepthLevelDirty : 1;

        //Start from the first level (not root) unless static (start from the first dirty)
        for( size_t i=start; i<numDepths; ++i )
        {
            Transform t;
            const size_t numNodes = nodeMemoryManager->getFirstNode( t, i );

            //nodesPerThread must be a multiple of ARRAY_PACKED_REALS
            size_t nodesPerThread = ( numNodes + (mNumWorkerThreads-1) ) / mNumWorkerThreads;
            nodesPerThread        = ( (nodesPerThread + ARRAY_PACKED_REALS - 1) / ARRAY_PACKED_REALS ) *
                                    ARRAY_PACKED_REALS;

            //Send them to worker threads (dark_sylinc). We need to go depth by depth because
            //we may depend on parents which could be processed by different threads.
            mUpdateTransformRequest = UpdateTransformRequest( t, nodesPerThread, numNodes );
            fireWorkerThreadsAndWait();
            //Node::updateAllTransforms( numNodes, t );
        }

        ++it;
    }

    //Call all listeners
    SceneNodeList::const_iterator itor = mSceneNodesWithListeners.begin();
    SceneNodeList::const_iterator end  = mSceneNodesWithListeners.end();

    while( itor != end )
    {
        (*itor)->getListener()->nodeUpdated( *itor );
        ++itor;
    }
}

SceneManager::updateSceneGraph
Assume there are n worker threads. An analysis of this code:
The Barrier object mWorkerThreadsBarrier synchronizes the main thread with the worker threads. It is simulated with two semaphores used alternately; with only one semaphore, a worker thread racing ahead to the next sync point could consume a permit meant for the current one, and the system would eventually deadlock.
mNumThreads: the worker threads plus the main thread, i.e. n+1.
mLockCount: the number of threads currently blocked at the barrier.
Worker thread entry point: updateWorkerThread -> _updateWorkerThread (an infinite loop).
fireWorkerThreadsAndWait in the main thread and _updateWorkerThread in each worker thread both call mWorkerThreadsBarrier->sync() twice per pass. Below, unless stated otherwise, sync is shorthand for mWorkerThreadsBarrier->sync().
1. A Barrier object mWorkerThreadsBarrier is created; its current semaphore starts at 0.
2. n worker threads are created. Because the current semaphore's value is 0, the first sync in each worker's _updateWorkerThread loop blocks (WaitForSingleObject). Eventually mLockCount reaches n.
3. When the main thread updates the scene, say via updateSceneGraph->updateAllTransforms above with the request type already set, it calls fireWorkerThreadsAndWait; its first sync brings mLockCount to mNumThreads. The current semaphore is then released to the number of worker threads, n, and the barrier switches to the other semaphore, whose value is 0. The workers blocked on the current semaphore can now perform the requested task (e.g. _updateWorkerThread->updateAllTransformsThread).
4. Once the main thread and the n workers have all passed the first sync and done their work, each arrives at the second sync. The first n threads to arrive (possibly including the main thread) block on the next semaphore, whose value is 0, so everyone waits until the last thread finishes and reaches the second sync. As in step 3, mLockCount == mNumThreads, so the current semaphore is released n times and the barrier flips to the other semaphore (value 0 again). All threads thus cross the second sync together. Each worker then loops around to the first sync of its next iteration; because that semaphore is 0, WaitForSingleObject blocks, and we are back in the state of step 2.
5. The process then repeats: the main thread calls fireWorkerThreadsAndWait's first sync again, and so on. Note that thread ordering is indeterminate: the last thread to arrive in step 4 may be any of them, and the same is true in step 3; a worker may even pass the second sync and reach the first sync of its next loop before the main thread has returned from fireWorkerThreadsAndWait. None of this matters: the synchronization only needs to guarantee that everyone starts together and everyone waits until all have finished, and that is satisfied.
This is just one example of the new threading scheme in Ogre. Throughout the rendering process, updating all nodes, all animations, and all model AABBs is threaded, and the data these methods operate on is laid out as DOD-friendly SOA structures. As the DOD material linked earlier explains, such layouts are easier to parallelize and give better cache hits; the slides referenced above contain concrete DOD-vs-OOD cache comparisons.
HLMS: a brief introduction
Ogre 2.0 has abandoned the FFP (fixed-function pipeline), which frankly should have been given up long ago. In Ogre 1.9 you could replace the FFP with the RTSS component, but in Ogre 2.0 the fixed-function APIs are simply gone. That does not mean you must hand-write shader code, or must use RTSS, just to render a model; instead you use the new high-level material system, HLMS. HLMS can be seen as merging the core functionality of the old Material system and RTSS, and it is more convenient, flexible and efficient to use.
Before describing HLMS it helps to understand the rendering pipeline (a fellow blogger's article on the OpenGL pipeline explains shaders in terms of the classic pipeline, covering both the FFP and the programmable pipeline). HLMS uses a block-based scheme, which has several benefits. First, each block of state can be reused, reducing memory and bandwidth and improving cache hits. Second, D3D and OpenGL are state machines; grouping renderables that share the same blocks lets them be rendered together, reducing state switches and improving rendering efficiency. This is why the author says the old Material system is inefficient and no longer recommended. The author also deliberately notes that although this block pattern looks like D3D11, he is an OpenGL fan and developed HLMS entirely under OpenGL; he simply arrived at the same ideas as the D3D11 designers.
Macroblocks are rasterizer state: depth read/write settings, culling mode, and so on. Blendblocks are like a D3D11 blend state, containing the blend mode and its factors. Samplerblocks are like sampler states in D3D11 or sampler objects in GL3+, containing filtering information, texture addressing modes (wrap, clamp, etc.), and other sampling settings.
Macroblock: depth checking, culling mode, polygon display mode and similar per-fragment pipeline state; similar to ID3D11RasterizerState in D3D11.
Blendblock: the alpha blending operations applied in per-fragment processing; similar to ID3D11BlendState.
Samplerblock: a collection of properties for sampling a texture; similar to D3D11_SAMPLER_DESC.
Datablock: not much of this is visible in OgreMain; look at OgreHlmsPbs and the corresponding folder Media\hlms\pbs to see what is going on. A datablock works together with a Renderable to populate the shader code, much as RTSS did.
A datablock contains all of the above blocks and plays the role of the old material. For example, where an old Ogre 1.x Renderable called setMaterial, the new Renderable calls setDatablock.
Properties that the old material passed between the vertex and fragment shaders, such as surface colour, are now carried by datablocks. At first the alpha_test settings in there puzzled me, but looking directly at Media\hlms\pbs, e.g. PixelShader_ps.glsl for GLSL, you can see that fragments are discarded directly according to the corresponding alpha_test setting, much like the fixed-function alpha test. One point I am still not clear on is whether this abandons the per-fragment alpha test stage or simply moves it forward into the fragment shader, since the per-fragment stage runs after the fragment shader. Other per-fragment operations are handled by the separate blocks above.
To summarize:
These are just partial translations of the two documents, briefly introducing some of the new features in Ogre 2.x. Even from this small part we can see that this is a completely different Ogre engine. Detailed analyses will follow:
1. The new rendering process.
2. The new mesh format and the introduction of the VAO.
3. HLMS in detail.
4. The new compositor in detail.
5. The new threading in detail.
Ogre 2.1 and efficient rendering with OpenGL3+
Before DX10 and OpenGL3+, both APIs were a mixture of fixed-function and programmable pipelines, and Ogre 1.x likewise combined the two in its design. From OpenGL3+ and DX10 on, the fixed-function pipeline was removed and shader functionality was further completed and extended. Correspondingly, Ogre 2.x wraps DX11 and OpenGL3+, completely discards the fixed-function content, and targets the programmable pipeline exclusively.
Ogre 1.x's rendering process has always been a target of complaints. Apart from Ogre 1.x's own instancing, which batches models that share a material but is, as its users know, very restrictive, each Renderable is rendered once per pass, which means both a large number of state switches and a large number of draw calls. These two points are the main reasons why Ogre 1.x's performance has always been low. Ogre 2.x reduces state switching through improvements to the rendering process itself, and reduces draw calls through those improvements combined with the newly introduced APIs.
The documents mentioned earlier claim that, without using instancing, meshes can be batched, even different meshes. When I first read that I thought the document was wrong, or that I had misunderstood, and did not dare to write it down. Now, looking at the code, I have to say the current rendering design (combined with the latest APIs) is remarkable: merging draws of the same mesh is nothing special, but merging different meshes into one draw call is. And it no longer matters whether you write instancing yourself, as with the manual instance batches in Ogre 1.x; it is now fully automatic.
For instance, consider the following code in Ogre 2.1.
for( int i=0; i<4; ++i )
{
    for( int j=0; j<4; ++j )
    {
        Ogre::String meshName;
        if( i == j )
            meshName = "Sphere1000.mesh";
        else
            meshName = "Cube_d.mesh";

        Ogre::Item *item = sceneManager->createItem( meshName,
                Ogre::ResourceGroupManager::AUTODETECT_RESOURCE_GROUP_NAME,
                Ogre::SCENE_DYNAMIC );
        if( i % 2 == 0 )
            item->setDatablock( "Rocks" );
        else
            item->setDatablock( "Marble" );
        item->setVisibilityFlags( 0x000000001 );

        size_t idx = i * 4 + j;
        mSceneNode[idx] = sceneManager->getRootSceneNode( Ogre::SCENE_DYNAMIC )->
                createChildSceneNode( Ogre::SCENE_DYNAMIC );
        mSceneNode[idx]->setPosition( (i - 1.5f) * armsLength, 2.0f,
                                      (j - 1.5f) * armsLength );
        mSceneNode[idx]->setScale( 0.65f, 0.65f, 0.65f );
        mSceneNode[idx]->roll( Ogre::Radian( (Ogre::Real)idx ) );
        mSceneNode[idx]->attachObject( item );
    }
}
As shown above, there is a 4x4 grid of models: one diagonal is all spheres, the rest are cubes; even rows use the material Rocks, odd rows use Marble. The number of glDraw* calls (draw calls) needed is only two or four, depending on hardware support. How is that done? In Ogre 2.1, as the 16 models are added to the render queue, sort IDs are generated from material, mesh and so on, giving roughly the order Rocks[Sphere0-0, Sphere2-2, Cube0-1, Cube0-2, Cube0-3, Cube2-1, ...], Marble[Sphere1-1, Sphere3-3, Cube1-2, Cube1-3, ...]. The eight models under Rocks then need only one or two draw calls, and likewise for Marble. How Ogre 2.1 achieves this is explained by the relevant new OpenGL3+ APIs below.
Instanced and indirect drawing APIs
void glDrawArraysInstancedBaseInstance(GLenum mode, GLint first, GLsizei count, GLsizei instanceCount, GLuint baseInstance); // non-indexed, instanced direct rendering
Draws instanceCount instances of the set of geometry formed by mode, first and count (the same parameters that glDrawArrays() takes). For each instance the built-in variable gl_InstanceID is incremented and the new value is passed to the vertex shader, so different instances can be given different vertex attributes. In addition, baseInstance sets an index offset for instanced vertex attributes, changing the position from which OpenGL fetches them.
void glDrawElementsInstancedBaseVertexBaseInstance(GLenum mode, GLsizei count, GLenum type, const GLvoid *indices, GLsizei instanceCount, GLint baseVertex, GLuint baseInstance); // indexed, instanced direct rendering
Draws instanceCount instances of the geometry formed by mode, count, indices and baseVertex (the same parameters that glDrawElementsBaseVertex() takes). As with glDrawArraysInstanced(), gl_InstanceID is incremented for each instance and passed to the vertex shader to distinguish the instances' vertex attributes, and baseInstance sets an index offset for instanced vertex attributes, changing the position from which OpenGL fetches them.
void glMultiDrawArraysIndirect(GLenum mode, const void *indirect, GLsizei drawCount, GLsizei stride); // non-indexed indirect rendering
Draws multiple sets of primitives, with all the relevant parameters stored in a buffer object. A single call to glMultiDrawArraysIndirect() dispatches drawCount independent drawing commands, each taking the same parameters as glDrawArraysIndirect(). The DrawArraysIndirectCommand structures are spaced stride bytes apart; if stride is 0, the structures form a tightly packed array.
void glMultiDrawElementsIndirect(GLenum mode, GLenum type, const void *indirect, GLsizei drawCount, GLsizei stride); // indexed indirect rendering
Draws multiple sets of primitives, with all the relevant parameters stored in a buffer object. A single call to glMultiDrawElementsIndirect() dispatches drawCount independent drawing commands, each taking the same parameters as glDrawElementsIndirect(). The DrawElementsIndirectCommand structures are spaced stride bytes apart; if stride is 0, the structures form a tightly packed array.
These functions are documented in the SDK pages on the OpenGL website; the explanations above are from the 8th edition of the OpenGL red book, and comparing the two makes them easier to understand. The 1st and 2nd are direct-draw versions; the 3rd and 4th are the indirect versions corresponding to the 1st and 2nd. If the current environment supports indirect rendering, the earlier example needs only two draw calls, one per material, and different meshes can still share one draw call. The direct-draw path needs four: two draw calls per material, one per mesh type, with each mesh type automatically instanced.
Specifically, at render time the models in the queue are ordered Rocks[sphere, sphere, cube, cube, ...], Marble[sphere, sphere, cube, cube, ...]. The material Rocks is applied (i.e. the corresponding shader code is bound) and the VBO bound. The first sphere starts a draw call; for the second sphere, only the draw call's instanceCount parameter is incremented. On reaching the first cube, a new draw-call parameter block is added (a DrawArraysIndirectCommand structure for the non-indexed versions 1 and 3, a DrawElementsIndirectCommand structure for the indexed versions 2 and 4). Note how baseInstance changes here (within the same material it changes whenever the mesh changes); at this point it is 2 (this corresponds to the baseInstance parameter of the functions above, and relates to the drawId discussed later). In the direct path, each parameter block becomes its own draw call (APIs 1 and 2 above); with indirect rendering a single draw call (APIs 3 and 4 above) finishes the whole batch. Then the material Marble is applied and the process repeats.
New buffer operations
In OpenGL3+, VBOs, IBOs, UBOs and TBOs can all live in the same buffer. So unlike Ogre 1.x, where each HardwareBuffer generated its own GL buffer, Ogre 2.1's BufferPacked does not call glGenBuffers itself; it records a position inside one large buffer, and GPU-CPU data exchange goes through a BufferInterface. Because VBO, IBO, UBO and TBO data are now managed uniformly, the corresponding VertexBufferPacked, IndexBufferPacked, ConstBufferPacked and TexBufferPacked are much simpler to deal with than the old HardwareVertexBuffer, HardwareIndexBuffer, HardwareUniformBuffer and HardwarePixelBuffer: buffer creation is delegated to the VaoManager and GPU-CPU interaction to the BufferInterface, whereas each old HardwareBuffer handled its own buffer creation and GPU-CPU transfers. Buffers used to be divided into GL_ARRAY_BUFFER, GL_ELEMENT_ARRAY_BUFFER, GL_UNIFORM_BUFFER, GL_TEXTURE_BUFFER and so on; in OpenGL3+ a buffer is just a place to store data, regardless of type, and you can put whatever you want in it. UBOs and TBOs must be bound to different binding indices for different shaders, so their implementation differs slightly from VBOs and IBOs; see the ConstBufferPacked and TexBufferPacked code for details.
BufferType encodes the GPU's and CPU's access permissions; different permissions get different implementations. Briefly:
BT_IMMUTABLE: the GPU has read-only access, the CPU has no access. Generally used for textures and mesh data.
BT_DEFAULT: the GPU can read and write, the CPU has no access. Used by RTT (FBO and similar techniques) and transform feedback.
BT_DYNAMIC_DEFAULT: GPU-readable, CPU-writable. Commonly used for particle systems and frequently updated buffer blocks (such as UBOs).
BT_DYNAMIC_PERSISTENT: like BT_DYNAMIC_DEFAULT, except that the buffer can stay mapped for client reads and writes while GL commands such as glDrawElements still consume it.
BT_DYNAMIC_PERSISTENT_COHERENT: unlike BT_DYNAMIC_PERSISTENT, the GPU sees the CPU's modifications immediately, without an explicit flush.
For types 3, 4 and 5, see a fellow blogger's article "Advanced OpenGL (III): persistent-mapped buffers" on buffer storage; Ogre 2.1 also uses buffer storage to improve efficiency. Buffer storage brings two things. First, only one map is needed: the pointer is kept, avoiding repeated map/unmap calls (hence the name persistent-mapped buffer). Second, it provides more control, such as the BufferType enumeration above.
In Ogre 2.1's GL3+ VaoManager, at initialization BT_IMMUTABLE and BT_DEFAULT together default to a 128 MB pool, and the remaining BT_DYNAMIC_DEFAULT, BT_DYNAMIC_PERSISTENT and BT_DYNAMIC_PERSISTENT_COHERENT get 32 MB per pool. Because the CPU has no access to BT_IMMUTABLE or BT_DEFAULT, those two are handled as one pool.
Assume for now that the environment supports buffer storage. All subsequent VBO/IBO/UBO/TBO allocations go through GL3PlusVaoManager::allocateVbo. Briefly: at the start, BT_IMMUTABLE and BT_DEFAULT are allocated a minimum of 128 MB and each remaining BufferType a minimum of 32 MB, with glBufferStorage called with different flags depending on the BufferType. Each time another request of some BufferType comes in, the manager checks whether the corresponding pool still has space to allocate; if so it carves off a block and records the block's starting offset in the resulting BufferPacked. In the corresponding BufferInterface, mVboPoolIdx records the index of the block's pool, and mVboName is the GL id of that pool's large buffer.
To update the GPU with CPU-side data, call BufferInterface::map.
BT_IMMUTABLE and BT_DEFAULT lack the GL_MAP_WRITE_BIT flag and cannot be mapped directly; instead a StagingBuffer completes the transfer CPU -> staging GPU buffer -> final GPU buffer. StagingBuffer::map copies data from CPU to GPU, then StagingBuffer::unmap moves it from the staging buffer to its final GPU location, copying between buffers via GL_COPY_READ_BUFFER and GL_COPY_WRITE_BUFFER (search for those two keywords for specifics, or read the GL3PlusStagingBuffer class). As this shows, the extra hop makes the transfer heavier, so it is best not to modify BT_IMMUTABLE and BT_DEFAULT buffers casually; generally their data is uploaded only at initialization.
The remaining BufferTypes, such as BT_DYNAMIC_DEFAULT, use buffer storage as described above: the buffer is mapped only once and the pointer kept in mMappedPtr. Subsequent maps update data directly through this mMappedPtr, with the bookkeeping handled by the GL3PlusDynamicBuffer class (which is commented), because GL3+ cannot map the same buffer twice simultaneously (i.e. map again without an unmap). The lesson: buffer blocks that are updated repeatedly should use these types; updating their data is very fast.
If the current environment does not support buffer storage, buffers are handled as in Ogre 1.x, using glBufferData: BT_IMMUTABLE and BT_DEFAULT use the flag GL_STATIC_DRAW, everything else GL_DYNAMIC_DRAW. For CPU-to-GPU updates, BT_IMMUTABLE and BT_DEFAULT are treated as above, while the remaining BufferTypes, lacking buffer storage, must call glMapBufferRange again on every data update.
Rendering-related classes and process
Knowing how the new buffers work, we can look at the related classes and then at how rendering proceeds through them.
VertexArrayObject (wrapping a VAO): different VBOs can share one buffer, and a VAO essentially stores the VBO/IBO binding information plus the vertex attribute state (glVertexAttribPointer). In Ogre 2.1, as noted above, VBO, IBO, UBO and TBO all live inside one big buffer, so in general creating a mesh (a mesh can have several submeshes, each submesh getting one VAO) creates the corresponding VAOs. For the same submesh, both mVaoName and mRenderQueueId are the same; for different submeshes, mVaoName is generally still the same while mRenderQueueId differs.
mVaoName: the VAO id, corresponding to one vertex layout. A layout here means the OpenGL render operation type (points, lines, triangle strips, etc.), the VBO and IBO, the index type (16-bit or 32-bit), and the vertex attributes (glVertexAttribPointer). If multiple submeshes use the same vertex layout they can share one VAO, and in Ogre 2.1 this is very common: because multiple VBOs/IBOs generally share one buffer, identical vertex formats mean identical layouts.
mRenderQueueId: a segmented uint32, split in two. The low segment (values 0-511) is an id issued by the current VaoManager (incremented on each createVertexArrayObject call); the high segment (the rest, up to uint32's maximum) represents the corresponding mVaoName. This design lets you sort on the packed number: with different mVaoName the two values differ greatly, while with the same mVaoName but different creation ids they differ only slightly, so when a mesh has several submeshes (the VaoManager id incrementing each time), identical submeshes generally end up adjacent.
Renderable: as in Ogre 1.x, this ties material and data together in the render queue. The difference is that the material is no longer a Material (with property settings for the fixed pipeline) but an HlmsDatablock (used primarily to generate the shader code), and the data is no longer the VBO and IBO objects directly but the VAO. mHlmsHash, like mRenderQueueId above, is a segmented number in two parts. The first part (values 0-8191) is an index into the current Hlms type's list of render-property combinations; render properties include whether there is skeletal animation, the number of textures, whether alpha test is enabled, which matrices (world, view, etc.) are needed, and so on. The second part is the Hlms type: PBS (physically based shading), UNLIT (no lighting; for GUI, particles, emissive surfaces), TOON (cartoon shading), or LOW_LEVEL (the Ogre 1.9 material rendering mode).
QueuedRenderable: in the original Ogre 1.x, the render queue held a Renderable plus its corresponding Pass; now it holds QueuedRenderables. The hash inside a QueuedRenderable is used mainly for sorting within the queue. It is a segmented uint64; in the non-transparent case it is divided into several fields: the texture occupies bits 15-25, the mesh hash bits 26-39, the Hlms hash (the Renderable's mHlmsHash) bits 40-49, the transparency flag bit 60 (a bool needs only one bit), and the render queue id the topmost bits; see RenderQueue::addRenderable for the details. After sorting, objects are therefore ordered by queue id, then transparency, material, mesh and texture. This is very important: during rendering this order guarantees that models can be merged correctly and that state switching is minimized, improving efficiency.
HlmsCache: the Hlms generates the corresponding shaders from the Renderable's mHlmsHash (the index into the Hlms's render-property list); see Hlms::createShaderCacheEntry for details.
Its hash is segmented as before: a uint32 whose first 15 bits are the hash within the HlmsCache of the current specific Hlms type, and whose remaining 17 bits are the corresponding Renderable's mHlmsHash.
Its type is the Hlms type: PBS (physically based shading), UNLIT (no lighting; for GUI, particles, emissive surfaces), TOON (cartoon shading), or LOW_LEVEL (the Ogre 1.9 material rendering mode).
Its shaders are the various programs generated by the specific Hlms type: vertex, geometry, tessellation, fragment.
With these classes in hand, we can revisit the initial 16-model example: how the sorting and merging happen, with a brief walk through the rendering process.
The current camera culls the scene, collecting all visible Renderables. Each worker thread generates the segmented hash for its Renderables (used for sorting: material first, then mesh) from the Renderable's material (here an HlmsDatablock, not an Ogre 1.x Pass) and related data, wraps the MovableObjects into its per-thread render queue, and all per-thread queues are then merged into the current queue.
Rendering of the queued models then begins. For each Renderable an HlmsCache is produced: from the Renderable's mHlmsHash all the material's properties are looked up and the Hlms of the current type populates the shader code of the HlmsCache. This is generated only once; the HlmsCache is cached afterwards.
Then, as described earlier: when the VAO differs (which generally means the material differs), the VAO must be rebound (note this is required on DX11/12 too) and a new draw call generated. For multiple models under the same material (same mVaoName): if the next model is the same as the previous one (same mRenderQueueId), just increment the current draw call's instance count; if it differs (different mRenderQueueId), add another draw-call parameter structure. Where the environment supports indirect rendering, all the parameter structures are merged into one structure array rendered at once, so many instances of many different meshes are drawn in a single call; otherwise, each call renders the instances of one mesh at a time.
The models in these instanced draws generally sit at different positions. At initialization the corresponding VaoManager generates a buffer of 4096 drawIds (uint32 values 0, 1, ..., 4095), used via glVertexAttribDivisor (on the drawId attribute) and baseInstance (see APIs 1 and 2 earlier). The world matrices of the many models across the many instances are placed in a TBO shared by the draw calls (which is why baseInstance is needed). The drawId vertex attribute has glVertexAttribDivisor set to 1 so that each instance gets its own drawId; since the drawIds stored in the array buffer increase from 0 by 1 per instance, this achieves an effect similar to gl_InstanceID. baseInstance provides each draw call's drawId offset (because the draw calls share the TBO, different draws' drawIds must be shifted by baseInstance), so drawId can be used as an index into the TBO to fetch the model matrix, and likewise any other shared per-instance contents (similar to gl_InstanceID, except that baseInstance does not affect the value of gl_InstanceID). Part of a vertex shader generated by HlmsPbs follows.
#version 330 core
#extension GL_ARB_shading_language_420pack : require

out gl_PerVertex
{
    vec4 gl_Position;
};

layout(std140) uniform;

mat4 UNPACK_MAT4( samplerBuffer matrixBuf, uint pixelIdx )
{
    vec4 row0 = texelFetch( matrixBuf, int((pixelIdx << 2u)) );
    vec4 row1 = texelFetch( matrixBuf, int((pixelIdx << 2u) + 1u) );
    vec4 row2 = texelFetch( matrixBuf, int((pixelIdx << 2u) + 2u) );
    vec4 row3 = texelFetch( matrixBuf, int((pixelIdx << 2u) + 3u) );

    return mat4( row0.x, row1.x, row2.x, row3.x,
                 row0.y, row1.y, row2.y, row3.y,
                 row0.z, row1.z, row2.z, row3.z,
                 row0.w, row1.w, row2.w, row3.w );
}

mat4x3 UNPACK_MAT4x3( samplerBuffer matrixBuf, uint pixelIdx )
{
    vec4 row0 = texelFetch( matrixBuf, int((pixelIdx << 2u)) );
    vec4 row1 = texelFetch( matrixBuf, int((pixelIdx << 2u) + 1u) );
    vec4 row2 = texelFetch( matrixBuf, int((pixelIdx << 2u) + 2u) );

    return mat4x3( row0.x, row1.x, row2.x,
                   row0.y, row1.y, row2.y,
                   row0.z, row1.z, row2.z,
                   row0.w, row1.w, row2.w );
}

in vec4 vertex;
in vec4 qtangent;
in vec2 uv0;
in uint drawId;

out block
{
    flat uint drawId;
    vec3 pos;
    vec3 normal;
    vec2 uv0;
} outVs;

struct ShadowReceiverData { mat4 texViewProj; vec2 shadowDepthRange; vec4 invShadowMapSize; };
struct Light { vec3 position; vec3 diffuse; vec3 specular; };

layout(binding = 0) uniform PassBuffer
{
    mat4 viewProj;
    mat4 view;
    mat3 invViewMatCubemap;
    Light lights[1];
} pass;

layout(binding = 0) uniform samplerBuffer worldMatBuf;

vec3 xAxis( vec4 qQuat )
{
    float fTy  = 2.0 * qQuat.y;
    float fTz  = 2.0 * qQuat.z;
    float fTwy = fTy * qQuat.w;
    float fTwz = fTz * qQuat.w;
    float fTxy = fTy * qQuat.x;
    float fTxz = fTz * qQuat.x;
    float fTyy = fTy * qQuat.y;
    float fTzz = fTz * qQuat.z;

    return vec3( 1.0-(fTyy+fTzz), fTxy+fTwz, fTxz-fTwy );
}

void main()
{
    mat4x3 worldMat  = UNPACK_MAT4x3( worldMatBuf, drawId << 1u );
    mat4   worldView = UNPACK_MAT4( worldMatBuf, (drawId << 1u) + 1u );

    vec4 worldPos = vec4( (worldMat * vertex).xyz, 1.0f );
    vec3 normal   = xAxis( normalize( qtangent ) );

    outVs.pos    = (worldView * vertex).xyz;
    outVs.normal = mat3(worldView) * normal;

    gl_Position  = pass.viewProj * worldPos;
    outVs.uv0    = uv0;
    outVs.drawId = drawId;
}
The APIs discussed here are primarily the OpenGL ones; DX has corresponding APIs.