"Batch,batch,batch": what does it really mean?

Source: Internet
Author: User


Date: 2016/06/18

Source: CSDN

Topics: Batch, draw call, performance


Recently I wondered why, in 3D graphics programming, the number of draw calls is so often used to estimate performance. What does a draw call actually do, and how does it relate to the GPU and the CPU? With that question in mind I searched for related articles and found a discussion of it on Stack Overflow. Reading that discussion closely led me to an NVIDIA presentation, "Batch, Batch, Batch: What Does It Really Mean?". This post mainly records my understanding of that presentation. As has been pointed out, the material is quite old and may no longer match the current situation, but overall it gave me a first understanding of draw calls. Enough preamble; on to the content.

Let's go

Figure 1

Figure 2
The presentation begins by defining a batch. Every time we call an API function such as DrawPrimitive, we submit (usually triangle) data to the GPU, and that submission is one batch. Everything in a batch shares the same render state, the same texture, and the same transform.
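To make the definition concrete, here is a minimal sketch that counts how many batches a scene would need if objects can only share a batch when their render state, texture, and transform all match. The scene, the object names, and the string labels used for state, texture, and transform are all made up for illustration; a real engine would compare actual API state, not strings.

```python
from collections import defaultdict


def group_into_batches(objects):
    """Group objects by (render_state, texture, transform).

    Each distinct key corresponds to one batch, i.e. one draw call.
    """
    batches = defaultdict(list)
    for obj in objects:
        key = (obj["render_state"], obj["texture"], obj["transform"])
        batches[key].append(obj["name"])
    return dict(batches)


# Hypothetical scene; every name and label here is an assumption.
scene = [
    {"name": "rock_a", "render_state": "opaque", "texture": "rock.png", "transform": "T0"},
    {"name": "rock_b", "render_state": "opaque", "texture": "rock.png", "transform": "T0"},
    {"name": "tree", "render_state": "opaque", "texture": "bark.png", "transform": "T0"},
    {"name": "window", "render_state": "alpha_blend", "texture": "glass.png", "transform": "T1"},
]

batches = group_into_batches(scene)
print(len(batches))  # prints 3: the two rocks share one batch
```

Only the two rocks can be drawn together; a single differing texture or state is enough to force an extra draw call.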

Figure 3
Here the author poses a question: which is faster, a game that draws 1 million objects of 10 triangles each, or one that draws 10 objects of 1 million triangles each? From books and other sources we all know that the latter is more efficient. But why? The presentation lists some common wrong guesses:
(1) State changes on the GPU are not fast enough (wrong)
(2) Organizing triangles on the GPU is very resource-intensive (wrong)
(3) Data transfer in the kernel is slow (wrong)
Reading this, I realized that my own understanding had been wrong. Besides the wrong guesses above, the presentation also asks: "Will future GPUs draw both of these cases equally efficiently?"

Figure 4

Figure 5
The author says: don't speculate; write the code, measure, and the answer will be clear. The test program draws only very simple triangle data, with no lighting, no textures, and no other unnecessary overhead; it just submits batches. Figure 5 shows the measured results. The horizontal axis is triangles/batch, the number of triangles in each batch. The vertical axis is million triangles/s, the number of triangles (in millions) drawn per second. With different GPUs and the same CPU, drawing throughput clearly differs across batch sizes, and the relationship is not the simple linear one we might imagine: at a certain point the curve changes abruptly and rendering throughput jumps.

Figure 6
Based on this we can derive an optimization: choosing an appropriate batch size can greatly improve performance.

Figure 7
Figure 7 presents a conclusion: when the batch size is below 130 triangles, the GPU is far from fully loaded. The inefficiency is on the CPU side, because the CPU cannot submit batches to the GPU fast enough to keep it busy.

Figure 8
From the previous experiment and its results, we can see that the bottleneck of the whole drawing process is on the CPU side, not the GPU. In other words, the CPU cannot feed the GPU enough data, and that, rather than the batch size as such, is the direct cause of the bottleneck. Next, let's count how many batches can be submitted per second.

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13
From the charts above we can conclude that, for a given CPU, the number of batches that can be submitted per second is essentially constant across different GPUs and batch sizes. Different CPUs submit different numbers of batches per second. This means that the number of batches the whole system can draw per second depends only on the CPU, and not on the batch size, the type of GPU, and so on.
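The measurements can be summarized with a very simple model: triangle throughput is the smaller of what the CPU can submit (batches per second times triangles per batch) and what the GPU can process at peak. The sketch below is my own simplification, and both constants are hypothetical values picked only so that the break-even point lands at 130 triangles per batch, the figure mentioned earlier; they are not measurements from the presentation.

```python
def triangles_per_second(tris_per_batch, batches_per_sec, gpu_peak_tris_per_sec):
    """Throughput is limited by whichever side is slower."""
    cpu_limited = tris_per_batch * batches_per_sec
    return min(cpu_limited, gpu_peak_tris_per_sec)


def break_even_batch_size(batches_per_sec, gpu_peak_tris_per_sec):
    """Batch size at which the CPU limit meets the GPU limit."""
    return gpu_peak_tris_per_sec / batches_per_sec


# Hypothetical numbers, chosen for illustration only.
BATCHES_PER_SEC = 100_000   # what the CPU can submit per second
GPU_PEAK = 13_000_000       # triangles per second the GPU can process

for size in (10, 130, 1000):
    rate = triangles_per_second(size, BATCHES_PER_SEC, GPU_PEAK)
    print(f"{size:>5} tris/batch -> {rate:,} tris/s")

print(break_even_batch_size(BATCHES_PER_SEC, GPU_PEAK))
```

Below the break-even size, throughput grows with batch size because the CPU is the limit; above it, the curve flattens at the GPU's peak rate.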

Figure 14
From the analysis so far, we know that the bottleneck really is entirely on the CPU side: the CPU is kept busy submitting batches to the GPU.

Figure 15
This chart shows how a CPU's time is distributed when it processes batches of 2 triangles: 78% is spent in the driver, another 14% in D3D, and the rest in other parts. The driver does very little work for each draw call and state change, but when these happen very frequently the total cost becomes large. Graphics drivers are continually optimized, yet however well optimized they are, they still consume a lot of resources. On the CPU, the cycles spent are linearly related to the number of batches submitted, and it is hard to make this cost independent of the batch count.

Figure 16
As mentioned earlier, the bottleneck of the whole drawing system is on the CPU side, not the GPU; the GPU is much faster than the CPU. If your batches are small, the CPU still pays the driver cost for every batch while the GPU finishes each one almost immediately, so the GPU sits idle and the total number of triangles drawn per second drops. If your game needs a fixed number of triangles, drawing them then takes longer, which naturally lowers the frame rate. So, to avoid wasting the CPU time spent in the driver, we try to supply more triangle data in each batch. Because the GPU is fast, this does not make the CPU wait for the GPU to finish processing (at least in this experiment; with complex shader computation the situation can be different). The number of triangles the system can draw rises accordingly, and the game scene is drawn faster.
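The question from the beginning (1 million objects of 10 triangles versus 10 objects of 1 million triangles) can be worked through with a rough frame-time model. This is only a sketch under my own assumptions: the CPU pays a fixed cost per batch, the GPU processes triangles at a fixed rate, and the two run in parallel so the slower one determines the time. The cost constants are hypothetical, not measured.

```python
def frame_time_seconds(num_batches, tris_per_batch, cpu_sec_per_batch, gpu_tris_per_sec):
    """Time to draw the scene when CPU submission and GPU work overlap."""
    cpu_time = num_batches * cpu_sec_per_batch
    gpu_time = num_batches * tris_per_batch / gpu_tris_per_sec
    return max(cpu_time, gpu_time)


# Hypothetical costs, for illustration only.
CPU_SEC_PER_BATCH = 10e-6       # 10 microseconds of driver/runtime work per batch
GPU_TRIS_PER_SEC = 100_000_000  # 100 million triangles per second

# Same total triangle count (10 million), very different batch counts.
many_small = frame_time_seconds(1_000_000, 10, CPU_SEC_PER_BATCH, GPU_TRIS_PER_SEC)
few_large = frame_time_seconds(10, 1_000_000, CPU_SEC_PER_BATCH, GPU_TRIS_PER_SEC)

print(f"1,000,000 batches x 10 tris:  {many_small:.3f} s")
print(f"10 batches x 1,000,000 tris:  {few_large:.3f} s")
```

With these made-up numbers the GPU work is identical in both cases; the whole difference comes from the per-batch CPU cost, which is the point the presentation makes.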

Figure 17
Since the CPU time for submitting batches is spent mainly in the driver and the D3D runtime, optimizing the driver and the D3D runtime, or increasing the CPU's speed, raises the number of batches we can submit per second. The GPU is what actually processes the triangle data, so the batch size affects the GPU: as GPUs get faster, we can use larger batches. It is also important to understand that the GPU is much faster than the CPU, which is why the batch size has little effect on the CPU.

Figure 18

Figure 19
So, up to a point, batch size and performance have little to do with each other. When writing a game we cannot organize every object into large batches, and small batches do no harm in themselves. The batch size should be chosen based on the number of triangles the scene ultimately needs, the number of batches submitted per frame, and the speed of the GPU. The number of batches submitted per frame in turn depends on the speed of the CPU, the target frame rate, and the share of CPU cycles we reserve for submitting batches.

Figure 20
So, once we know the number of batches N that a CPU can submit per second at full load, the fraction r of CPU resources reserved for submission, and the target frame rate f (in fps), we can use the formula
X = N * r / f
to get the number of batches X that should be submitted per frame. We can then design the game accordingly, arrange the data sensibly, submit batches of an appropriate size, and draw the scene.
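As a worked example of X = N * r / f, the sketch below plugs in some numbers. The values of N, r, and f are illustrative assumptions of mine, not the measurements from the presentation, and the function name is made up.

```python
def batches_per_frame(batches_per_sec_at_full_load, cpu_fraction_reserved, target_fps):
    """Batch budget per frame: X = N * r / f."""
    if not 0.0 <= cpu_fraction_reserved <= 1.0:
        raise ValueError("cpu_fraction_reserved must be between 0 and 1")
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return batches_per_sec_at_full_load * cpu_fraction_reserved / target_fps


# Illustrative inputs (assumptions):
N = 50_000  # batches/s the CPU can submit at 100% load
r = 0.2     # 20% of CPU time reserved for submitting batches
f = 30      # target frame rate in fps

budget = batches_per_frame(N, r, f)
print(f"about {int(budget)} batches per frame")
```

Doubling the target frame rate or halving the CPU share halves the budget, which is why the per-frame batch count has to be planned together with everything else the CPU does.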

Figure 21

Figure 22

Figure 23
How large to make each batch is therefore up to us. If the GPU has spare capacity, we can increase the batch size to get smoother, more detailed scenes and characters. If the models and scenes are already detailed enough, we can spend the spare GPU capacity on shaders instead, performing finer-grained shader calculations to get more realistic, more compelling scenes.

Figure 24
Using the formula above, we find that in this experimental setup about 300 batches can be submitted per frame. Typically one batch corresponds to one object in the game, so each frame can draw only about 300 objects; even setting complexity aside, it is hard to build interesting game content that way. So, if we want to draw many things, we need to use the GPU to pack the batches of different objects into a single batch and submit them together, reducing the batch count.

Figure 25
So what stops us from packing multiple objects into one batch? Textures. As we saw earlier, a batch has a single consistent render state; if different objects use different textures, the usual approach cannot draw them in one batch. To get around this, we can find ways to pass in more textures while letting each object's triangles identify their own texture. For example, vertex positions are usually stored as XYZW with W always 1, so we could store a texture index for the vertex in W and then bind all the textures to the texture units. (Note: this is just a scheme I guessed at. It may not actually work, because vertex data is interpolated and W may have changed by the time it reaches the pixel shader; it only illustrates that some technique of this kind is possible.) Alternatively, we can put the textures of different objects into a single texture map and simply assign the correct texture coordinates to the vertex data, so that one batch submission draws all the objects. This method is effective and widely used. For example, the textures of a character model are usually placed in one texture, with different parts of the character mapped to different texture coordinates, so the complete model can be drawn with a single draw call.
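The single-texture-map approach comes down to remapping each object's texture coordinates into the sub-rectangle that its texture occupies in the combined map. Here is a minimal sketch of that remapping; the 2x2 atlas layout, the texture names, and the function name are assumptions for illustration.

```python
def remap_uv_to_atlas(u, v, region):
    """Map a UV in an object's own [0, 1] space into its atlas sub-rectangle.

    region is (u_offset, v_offset, u_scale, v_scale) in atlas UV space.
    """
    u_offset, v_offset, u_scale, v_scale = region
    return (u_offset + u * u_scale, v_offset + v * v_scale)


# Hypothetical 2x2 atlas: each source texture occupies one quadrant.
ATLAS_REGIONS = {
    "rock": (0.0, 0.0, 0.5, 0.5),
    "bark": (0.5, 0.0, 0.5, 0.5),
    "glass": (0.0, 0.5, 0.5, 0.5),
}

# The centre of the rock texture lands in the centre of the first quadrant.
print(remap_uv_to_atlas(0.5, 0.5, ATLAS_REGIONS["rock"]))  # (0.25, 0.25)
# The top-right corner of the bark texture maps to the corner of its quadrant.
print(remap_uv_to_atlas(1.0, 1.0, ATLAS_REGIONS["bark"]))  # (1.0, 0.5)
```

Once every vertex carries remapped coordinates, all these objects sample from one texture and no longer force a texture change between them. Note that this simple mapping assumes the original coordinates stay within [0, 1]; textures that rely on tiling need extra handling.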

Figure 26
Another thing that prevents batches from being merged is the transform matrix. Here too, we can work around it with suitable techniques and then draw N objects with one draw call.
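The presentation's exact technique is not described here, so the following is just one common way to remove the per-object transform, not necessarily the one the author has in mind: for objects that never move, apply each object's matrix to its vertices once on the CPU and concatenate the results into a single world-space vertex list. All function names below are made up for the sketch.

```python
def translation(tx, ty, tz):
    """Build a 4x4 row-major translation matrix."""
    return [
        [1, 0, 0, tx],
        [0, 1, 0, ty],
        [0, 0, 1, tz],
        [0, 0, 0, 1],
    ]


def transform_point(matrix, point):
    """Apply a 4x4 row-major matrix to a 3D point, treating w as 1."""
    x, y, z = point
    return tuple(
        matrix[row][0] * x + matrix[row][1] * y + matrix[row][2] * z + matrix[row][3]
        for row in range(3)
    )


def merge_meshes(instances):
    """Bake each (vertices, matrix) pair into world space and concatenate."""
    merged = []
    for vertices, matrix in instances:
        merged.extend(transform_point(matrix, v) for v in vertices)
    return merged


triangle = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
merged = merge_meshes([
    (triangle, translation(10, 0, 0)),
    (triangle, translation(0, 5, 0)),
])
print(len(merged))  # prints 6: two triangles in one vertex list
print(merged[1])    # prints (11, 0, 0)
```

The merged list can be submitted with an identity transform in one draw call. The trade-off is that it only suits static geometry, since moving any one object would mean rebuilding the merged data.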

Figure 27
Materials can also prevent batches from being merged. For more on these optimizations, search for them yourself, or look for answers in the Graphics Pipeline Performance chapter of GPU Gems. Some may wonder whether these optimizations will make the GPU very slow. They do consume GPU resources, but as we saw earlier, the GPU is much faster than the CPU and most of the time is not fully loaded; our task is to get the GPU running at full capacity so that drawing is faster.


A batch is one draw call. The GPU is much faster than the CPU, so the drawing bottleneck is often that the CPU does not submit large enough batches. The CPU spends a lot of time in the driver and the 3D runtime, and that time is not directly related to the batch size. So, to draw more triangles, merge batches in every way available and draw more objects with each one.
I hope this article helps you understand some basics of optimization in 3D game programming, and that more people come to enjoy game development and make more excellent works!
