Deep Understanding of DirectX: Direct3D 9


In-depth understanding of d3d9 is of great significance to graphic programmers. I have summarized some of my previous study notes and hope to help my friends. This is because of the scattered notes and complicated ideas. Please forgive me.

In fact, once you fully understand concepts such as D3DLOCK, D3DUSAGE, D3DPOOL, lost devices, queries, Present(), BeginScene(), and EndScene(), you essentially understand D3D9 — I wonder if you feel the same way. If you can answer the following questions, consider yourself to have passed the exam :).
1. What are the essential differences between D3DPOOL_DEFAULT, D3DPOOL_MANAGED, D3DPOOL_SYSTEMMEM, and D3DPOOL_SCRATCH?
2. How should D3DUSAGE be used?
3. What is an adapter? What is a D3D device? What is the difference between a HAL device and a REF device? How is the device type related to the vertex-processing type?
4. How do the application (CPU), the runtime, the driver, and the GPU work together? Are the D3D APIs synchronous or asynchronous functions?
5. What happens when a device is lost? Why must D3DPOOL_DEFAULT resources be recreated after the device is lost?

D3D has three central objects: the D3D object, the D3D adapter, and the D3D device. The D3D object is a COM object that serves as the entry point to D3D; it provides the functions for creating devices and enumerating adapters. An adapter is an abstraction of the capabilities of the computer's graphics hardware and software, and it contains devices. The device is the core of D3D: it encapsulates the entire graphics pipeline, including transformation, lighting, and rasterization (shading). The pipeline varies with the D3D version; for example, D3D10 adds a new geometry-shader (GS) stage. All functions of the graphics pipeline are provided by the driver, and drivers come in two kinds — GPU hardware drivers and software drivers — which is why D3D has two device types, REF and HAL. With a REF device, the rasterization stage of the pipeline is simulated by a software driver on the CPU; as the name suggests, the REF (reference) device is meant as a reference for hardware vendors, so it is implemented entirely in software and supports every feature of the DX standard. With a HAL device, the runtime uses the HAL (hardware abstraction layer) to drive the GPU for transformation, lighting, and rasterization. In addition, only the HAL device offers both hardware vertex processing and software vertex processing (the REF device generally does not use hardware vertex processing, except with special drivers such as PerfHUD). There is also a rarely used software device type: using the DDI you can write your own software graphics driver, register it with the system, and use it from your program.
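To make the object/adapter/device relationship concrete, here is a minimal sketch (Windows-only, error handling abbreviated) of creating the D3D object and then a HAL device on the default adapter, falling back to the REF device; the window handle `hWnd` and the presentation parameters are assumptions supplied by the application:

```cpp
#include <d3d9.h>

// Hypothetical helper: create a HAL device with hardware vertex processing
// on the default adapter; fall back to the REF (software) device on failure.
IDirect3DDevice9* CreateDeviceSketch(HWND hWnd)
{
    IDirect3D9* pD3D = Direct3DCreate9(D3D_SDK_VERSION);  // the D3D object
    if (!pD3D) return NULL;

    D3DPRESENT_PARAMETERS pp = {};
    pp.Windowed         = TRUE;
    pp.SwapEffect       = D3DSWAPEFFECT_DISCARD;
    pp.BackBufferFormat = D3DFMT_UNKNOWN;    // use the current display format

    IDirect3DDevice9* pDevice = NULL;
    HRESULT hr = pD3D->CreateDevice(
        D3DADAPTER_DEFAULT,                  // the adapter
        D3DDEVTYPE_HAL,                      // HAL device: HAL driver + GPU
        hWnd,
        D3DCREATE_HARDWARE_VERTEXPROCESSING, // T&L done on the GPU
        &pp, &pDevice);

    if (FAILED(hr))                          // e.g. no hardware acceleration
        pD3D->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_REF, hWnd,
                           D3DCREATE_SOFTWARE_VERTEXPROCESSING, &pp, &pDevice);

    pD3D->Release();                         // device keeps its own reference
    return pDevice;
}
```

A real application would also try HAL with software vertex processing before dropping all the way to REF; the sketch collapses that ladder for brevity.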

Checking system software and hardware capabilities

At program startup we need to assess the capabilities of the target machine. The main steps are:
Decide which buffer formats to use
GetAdapterCount() // get the number of adapters
GetAdapterIdentifier() // get the adapter description
CheckDeviceType() // check whether the device on the given adapter supports hardware acceleration for the chosen formats
GetDeviceCaps() // query the capabilities of the given device, e.g. whether hardware vertex processing (T&L) is supported
GetAdapterModeCount() // get the number of display modes available on the adapter for the given buffer format
EnumAdapterModes() // enumerate all the display modes
For details, see the DirectX documentation.
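The sequence above might be sketched as follows (Windows-only; `pD3D` is an assumed, already-created IDirect3D9 object, and the X8R8G8B8 formats are illustrative choices):

```cpp
#include <d3d9.h>
#include <stdio.h>

// Sketch of the capability-check sequence described above.
void CheckCapsSketch(IDirect3D9* pD3D)
{
    UINT adapterCount = pD3D->GetAdapterCount();
    for (UINT a = 0; a < adapterCount; ++a)
    {
        D3DADAPTER_IDENTIFIER9 id;
        pD3D->GetAdapterIdentifier(a, 0, &id);        // adapter description
        printf("Adapter %u: %s\n", a, id.Description);

        // Does a HAL device on this adapter support these buffer formats?
        HRESULT hr = pD3D->CheckDeviceType(a, D3DDEVTYPE_HAL,
                                           D3DFMT_X8R8G8B8,  // display format
                                           D3DFMT_X8R8G8B8,  // back buffer
                                           TRUE);            // windowed
        if (FAILED(hr)) continue;

        D3DCAPS9 caps;
        pD3D->GetDeviceCaps(a, D3DDEVTYPE_HAL, &caps);
        BOOL hwTnL = (caps.DevCaps & D3DDEVCAPS_HWTRANSFORMANDLIGHT) != 0;

        // Enumerate the display modes available for the chosen format.
        UINT modes = pD3D->GetAdapterModeCount(a, D3DFMT_X8R8G8B8);
        for (UINT m = 0; m < modes; ++m)
        {
            D3DDISPLAYMODE mode;
            pD3D->EnumAdapterModes(a, D3DFMT_X8R8G8B8, m, &mode);
            // mode.Width, mode.Height, mode.RefreshRate ...
        }
        (void)hwTnL;
    }
}
```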

The Windows graphics system is divided into four main layers: the graphics application, the D3D runtime, the driver, and the GPU. The division is by function; in practice the boundaries between the layers are not so clear — for example, the runtime actually contains the user-mode part of the driver. The detailed structure is not described here. Inside the runtime there is a very important structure called the command buffer. When an application calls a D3D API, the runtime converts the call into a device-independent command and appends it to the command buffer. The buffer's size changes dynamically with the workload. When the buffer is full, the runtime flushes all of its commands to the kernel-mode driver; the driver also has a buffer, which holds the commands after they have been translated into hardware-specific form. D3D generally allows at most about three frames' worth of graphics commands to be buffered, and both the runtime and the driver optimize the commands in their buffers where they can. For example, if a program sets the same render state several times in a row, the debug output shows the message "Ignoring redundant SetRenderState - X": the runtime has automatically discarded the useless state-setting command.

In D3D9, the query mechanism can be used to work asynchronously with the GPU. A query is a command used to interrogate the state of the runtime, the driver, or the GPU. A query object in D3D9 has three states: signaled, building, and issued. When idle, a query object sits in the signaled state. Queries are bracketed by an issue-begin and an issue-end: issuing begin tells the object to start recording the data the application needs, and after the application issues end, the object returns to the signaled state once the queried work is idle again. GetData is used to fetch the query result; if it returns D3D_OK the result is available, and if the D3DGETDATA_FLUSH flag is passed, all commands in the command buffer are pushed to the driver.
Now we know that most of the D3D APIs are asynchronous functions: after the application calls them, the runtime simply appends a command to the command buffer and returns. Some may wonder: how, then, is the frame rate limited, and how can GPU time be measured? For the first question, the issue is whether the call to Present() at the end of a frame blocks. The answer is that it may or may not block, depending on how many buffered commands the runtime allows; if the limit is exceeded, Present() blocks. And Present() cannot be allowed never to block: when the GPU is executing a heavy drawing workload, the CPU would otherwise run far ahead of the GPU, letting the game logic outpace the graphics display, which is clearly unacceptable. Measuring GPU working time is more troublesome, because the synchronization problem must be solved first: to time the GPU we must make the CPU and GPU work asynchronously, which the query mechanism in D3D9 can do. Here is the example from "Accurately Profiling Direct3D API Calls":

IDirect3DQuery9* pQueryEvent;

// 1. Create an event-type query object.
m_pD3DDevice->CreateQuery(D3DQUERYTYPE_EVENT, &pQueryEvent);

// 2. Add a query end marker to the command buffer
//    (an event query implicitly begins at CreateDevice).
pQueryEvent->Issue(D3DISSUE_END);

// 3. Flush all commands in the command buffer to the driver, then poll until
//    the event object reaches the signaled state; the state changes once the
//    GPU has completed every command that was in the buffer.
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH))
    ;

LARGE_INTEGER start, stop;
QueryPerformanceCounter(&start);
SetTexture();
DrawPrimitive();
pQueryEvent->Issue(D3DISSUE_END);
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH))
    ;
QueryPerformanceCounter(&stop);

1. The first GetData call uses the D3DGETDATA_FLUSH flag, which flushes the drawing commands in the command buffer to the driver; once the GPU has finished processing all of them, the query object's state becomes signaled.
2. SetTexture adds a device-independent command to the runtime's command buffer.
3. DrawPrimitive adds a device-independent command to the runtime's command buffer.
4. Issue adds a device-independent query-end command to the runtime's command buffer.
5. GetData flushes all the commands in the buffer to the driver. Note that this call does not wait for the GPU to finish executing the commands before returning; it involves a switch from user mode to kernel mode.
6. The call waits until the driver has translated all the commands into hardware-specific form and placed them in the driver's buffer, then returns from kernel mode to user mode.
7. GetData polls the object's state in a loop; once the GPU has finished all the commands in the driver's buffer, the query object's state changes to signaled.

The runtime's command buffer may be flushed, with a mode switch, in the following cases:
1. A Lock call (under certain conditions and with certain lock flags)
2. Creating a device, vertex buffer, index buffer, or texture
3. Fully releasing a device, vertex buffer, index buffer, or texture resource
4. Calling ValidateDevice
5. Calling Present
6. The command buffer becoming full
7. Calling GetData with the D3DGETDATA_FLUSH flag

I cannot fully make sense of the documentation's explanation of D3DQUERYTYPE_EVENT ("query for any and all asynchronous events that have been issued from API calls"). What matters is that for an event-type query, the object's state is set to signaled only after the GPU has completed everything up to the D3DISSUE_END marker added to the command buffer, so the CPU's wait on the query is asynchronous with respect to the GPU. For efficiency, BeginScene/EndScene should be used properly around drawing before Present. Why does this affect efficiency? One can only guess: EndScene may cause a command-buffer flush, which triggers a switch to kernel mode, and it may also trigger some runtime work on managed resources. Note that EndScene is not a synchronous method: it does not wait for the driver to finish executing all the buffered commands before returning.

The memory used by the D3D runtime falls into three kinds: video memory (VM), AGP memory (AM), and system memory (SM). Every D3D resource is created in one of these three, and at creation time we can specify a pool flag: D3DPOOL_DEFAULT, D3DPOOL_MANAGED, D3DPOOL_SYSTEMMEM, or D3DPOOL_SCRATCH. VM is the memory on the graphics card; the CPU can reach it only over the AGP or PCI-E bus, so CPU reads and writes are very slow, with sequential CPU writes slightly faster than reads: when the CPU writes to VM, a 32- or 64-byte write buffer (depending on the cache-line length) is used, and when that buffer is full it is written out to VM in one burst. SM is ordinary system memory; CPU reads and writes are very fast because SM is cached (up to the L2 cache), but the GPU cannot access system memory directly, so resources created in SM cannot be used by the GPU as they are. AM is the most troublesome kind: AM physically lives in system memory, but that region is not cached by the CPU, so every CPU read or write of AM misses the cache and goes over the memory bus. CPU access to AM is therefore slower than to SM, although sequential writes are again slightly faster than reads, because CPU writes to AM use write combining; and the GPU can access AM directly over the AGP or PCI-E bus.

If we create a resource with D3DPOOL_DEFAULT, the D3D runtime automatically chooses the storage type based on the usage we specify, generally VM or AM; the system keeps no backup copy elsewhere, and when the device is lost these resources are lost with it. Note that the runtime does not silently substitute D3DPOOL_SYSTEMMEM or D3DPOOL_MANAGED at creation time — they are entirely different pool types. Textures created in D3DPOOL_DEFAULT cannot be locked by the CPU unless they are dynamic textures; however, vertex buffers, index buffers, render targets, and back buffers created in D3DPOOL_DEFAULT can be locked. When a D3DPOOL_DEFAULT resource is created and memory has run out, managed resources are swapped out to free enough space. D3DPOOL_SYSTEMMEM and D3DPOOL_SCRATCH both live in SM; the difference is that with D3DPOOL_SYSTEMMEM the resource format is constrained by the device's capabilities, because the resource may later be transferred to AM or VM for use by the graphics system, whereas SCRATCH resources are constrained only by the runtime and therefore cannot be used by the graphics system at all. The runtime optimizes resources created with D3DUSAGE_DYNAMIC, generally placing them in AM, though this is not fully guaranteed. As for why static textures cannot be locked while dynamic textures can, that follows from the design of the D3D runtime and is discussed later in the section on D3DLOCK.
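To make the pool and usage combinations concrete, here is a hedged sketch (Windows-only; `pDevice`, the sizes, and the formats are illustrative assumptions):

```cpp
#include <d3d9.h>

// Sketch: typical pool/usage combinations discussed above.
void CreateResourceSketch(IDirect3DDevice9* pDevice)
{
    // Static texture in DEFAULT: lives in VM/AM, cannot be locked by the
    // CPU, and is lost (must be recreated) when the device is lost.
    IDirect3DTexture9* pStaticTex = NULL;
    pDevice->CreateTexture(256, 256, 1, 0, D3DFMT_A8R8G8B8,
                           D3DPOOL_DEFAULT, &pStaticTex, NULL);

    // Dynamic texture in DEFAULT: lockable; the runtime generally places
    // D3DUSAGE_DYNAMIC resources in AM (not guaranteed).
    IDirect3DTexture9* pDynamicTex = NULL;
    pDevice->CreateTexture(256, 256, 1, D3DUSAGE_DYNAMIC, D3DFMT_A8R8G8B8,
                           D3DPOOL_DEFAULT, &pDynamicTex, NULL);

    // Managed vertex buffer: the runtime keeps an SM backup copy and
    // restores the VM copy automatically during Reset after a lost device.
    IDirect3DVertexBuffer9* pVB = NULL;
    pDevice->CreateVertexBuffer(1024 * 3 * sizeof(float),
                                D3DUSAGE_WRITEONLY,   // avoids readback on Lock
                                D3DFVF_XYZ,
                                D3DPOOL_MANAGED, &pVB, NULL);
}
```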

D3DPOOL_MANAGED means the resource is managed by the D3D runtime. Two copies of the resource exist, one in SM and one in VM/AM; creation places it in SM, and when the GPU needs the resource, the runtime automatically copies the data to VM. If the resource is modified by the GPU, the runtime updates the SM copy when necessary, and changes made in SM are propagated to VM. Therefore do not use managed resources for data that the CPU or GPU modifies frequently — the synchronization cost is very high. When the device is lost, the runtime automatically uses the SM copies to restore the VM data during Reset. Since not all of the data backed up in SM has to be resident in VM at once, the total backing data can be much larger than the VM capacity; as the number of resources grows, the backing data may even be paged out to disk, in which case Reset becomes a very slow process. The runtime keeps a timestamp for each managed resource. When the runtime needs to copy backing data into VM, it first tries to allocate VM space; if the allocation fails, VM has no free space, and the runtime uses an LRU algorithm, based on the timestamps, to evict resources. SetPriority can be used to set a resource's priority: recently and frequently used resources should get higher priority, so that the runtime can evict resources sensibly and the chance of needing a resource again right after evicting it stays small. The application can also call EvictManagedResources to force all managed resources out of VM. If the next frame needs those managed resources, the runtime must reload them, which hurts performance badly, so this is generally not done; during a level transition, however, the call is very useful and can also eliminate VM fragmentation.
The LRU algorithm has a performance defect in certain situations: when the resources required to draw a single frame exceed what the VM can hold, LRU can cause severe performance fluctuation. For example:

BeginScene();
Draw(box0);
Draw(box1);
Draw(box2);
Draw(box3);
Draw(circle0);
Draw(circle1);
EndScene();
Present();

Assume the VM can hold the data of only five of the six geometries. Then under LRU, before circle1 can be drawn, something must be evicted, and the least recently used resource is box0 — but box0 is exactly the first resource needed in the next frame, so each eviction removes precisely what the upcoming draw calls require, and every resource ends up being reloaded every frame. Evicting a recently used resource instead is clearly more reasonable, so the runtime switches to an MRU policy for the remaining draw calls of the frame, which effectively removes the performance fluctuation. However, whether a resource has been used is tracked per frame, not per draw call, and the frame is delimited by a BeginScene/EndScene pair; therefore the proper use of BeginScene/EndScene can greatly improve performance when VM is scarce. Following the hints in the DX documentation, we can also use the query mechanism to obtain more information about the runtime's managed resources, though this appears to work only with the debug runtime. Understanding how the runtime manages resources is important, but do not let your program depend on these details, because they change often. Finally, note that resource management may be implemented not only by the runtime but also by the driver: the D3DCAPS2_CANMANAGERESOURCE cap reports whether the driver implements resource management, and D3DCREATE_DISABLE_DRIVER_MANAGEMENT can be specified at device creation to disable the driver's resource management.
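The thrashing can be illustrated with a small, portable simulation (plain C++, not D3D code; the resource names, the capacity of five, and the pure-MRU policy are simplifying assumptions — the real runtime only switches to MRU within a frame). With a cache that holds five of the six resources, LRU reloads everything every frame, while MRU settles into one reload per frame:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simulated video memory holding at most `capacity` resources.
// useMru = false: evict least recently used; true: evict most recently used.
struct VideoMemSim {
    size_t capacity;
    bool   useMru;
    std::vector<std::string> cache;  // front = least recently used
    int    loads = 0;                // uploads from SM to VM (thrashing metric)

    void draw(const std::string& res) {
        auto it = std::find(cache.begin(), cache.end(), res);
        if (it != cache.end()) {               // hit: re-mark as most recent
            cache.erase(it);
        } else {                               // miss: must upload, maybe evict
            ++loads;
            if (cache.size() == capacity)
                cache.erase(useMru ? cache.end() - 1 : cache.begin());
        }
        cache.push_back(res);
    }
};

// Draw the six geometries from the example for `frames` frames.
int runFrames(bool useMru, int frames) {
    VideoMemSim vm{5, useMru};
    for (int f = 0; f < frames; ++f)
        for (const char* r : {"box0", "box1", "box2", "box3",
                              "circle0", "circle1"})
            vm.draw(r);
    return vm.loads;  // LRU reloads all 6 per frame; MRU only 1 per frame
}
```

Over three frames, LRU pays 6 cold loads plus 6 reloads in each later frame (18 total), while MRU pays the 6 cold loads plus a single reload per subsequent frame (8 total).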

D3DLOCK: exploring how the D3D runtime works

What happens when you lock a DEFAULT resource? A DEFAULT resource may be in VM or AM. If it is in VM, a temporary buffer must be allocated in system memory and returned to the application; after the application fills that buffer, Unlock causes the runtime to transfer the data back to VM. If the resource's D3DUSAGE does not include WRITEONLY, the system must also first copy the original data from VM into the temporary buffer — this is why omitting WRITEONLY reduces performance. CPU writes to AM also deserve attention, because they generally use write combining: writes are buffered into a cache line, and the line is flushed to AM when full. The first caveat is that the written data must tolerate weak ordering (graphics data generally does); it is said that the D3D runtime and the NV driver once had a small bug where the GPU began drawing before the CPU's write-combining buffers had been flushed to AM, corrupting the resource, in which case an instruction such as SFENCE must be used to flush the cache line. Second, try to fill an entire cache line at a time, or there is extra latency: the CPU flushes a whole line to the target each time, so if we write only part of a line, the CPU must first read the full line's worth of data back from AM, combine it, and then flush again. Third, write sequentially as much as possible; random writes defeat write combining. If a resource must be written randomly, do not create it with D3DUSAGE_DYNAMIC — use D3DPOOL_MANAGED so that the writes happen in SM.

Ordinary (static) textures in D3DPOOL_DEFAULT cannot be locked, because they live in VM; they can only be reached through UpdateSurface and UpdateTexture. Why does D3D forbid locking static textures but allow locking static VBs and IBs? I can only guess at two reasons: first, texture data is generally large, and textures are laid out in the GPU in a two-dimensional fashion; second, textures are stored inside the GPU in a native format, not as plain linear RGBA. Because dynamic textures need frequent modification, D3D stores them specially; textures modified at a very high frequency are not suitable even for the dynamic attribute. There are two scenarios for dynamic textures: render targets written by the GPU, and video-style textures written by the CPU. We know dynamic resources are generally placed in AM, and the GPU reaches AM over the AGP/PCI-E bus, which is much slower than VM, while the CPU reaches AM much more slowly than SM. So if a resource is dynamic, both GPU and CPU accesses keep paying a penalty. For such resources it is often best to create two copies, one with D3DPOOL_DEFAULT and one with D3DPOOL_SYSTEMMEM, and perform the bidirectional updates manually yourself. Never create a render target with D3DPOOL_MANAGED; that is extremely inefficient, and the reason is left for you to analyze. For resources that change less frequently, we recommend creating them in DEFAULT and updating them manually: the cost of an occasional update is far smaller than the cost of the GPU persistently reading from AM.
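One way to realize the two-copy scheme suggested above — a SYSTEMMEM staging texture written by the CPU and pushed to a DEFAULT texture for the GPU — is sketched here (Windows-only; the function name, the `frame` source buffer, and the parameters are illustrative assumptions, and both textures must share the same format and size):

```cpp
#include <d3d9.h>
#include <string.h>

// Sketch: CPU writes a video frame into the SYSTEMMEM copy, then
// UpdateTexture transfers it to the DEFAULT (GPU-visible) copy.
void UpdateVideoFrame(IDirect3DDevice9* pDevice,
                      IDirect3DTexture9* pSysMemTex,   // D3DPOOL_SYSTEMMEM
                      IDirect3DTexture9* pDefaultTex,  // D3DPOOL_DEFAULT
                      const void* frame, UINT height, UINT pitchBytes)
{
    D3DLOCKED_RECT lr;
    if (SUCCEEDED(pSysMemTex->LockRect(0, &lr, NULL, 0)))
    {
        const char* src = (const char*)frame;
        char* dst = (char*)lr.pBits;
        for (UINT y = 0; y < height; ++y)   // row by row: pitches may differ
            memcpy(dst + y * lr.Pitch, src + y * pitchBytes, pitchBytes);
        pSysMemTex->UnlockRect(0);
    }
    pDevice->UpdateTexture(pSysMemTex, pDefaultTex);  // SM -> VM transfer
}
```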

Unreasonable locking can seriously hurt performance, because a Lock generally has to wait for all the draw commands ahead of it in the command buffer to finish executing before it returns — otherwise it might modify a resource still in use. From the return of Lock to the Unlock after modification, the GPU sits entirely idle, wasting the potential concurrency of GPU and CPU. DX8.0 introduced a new lock flag, D3DLOCK_DISCARD, which declares that the resource will not be read, only completely overwritten. The driver and runtime then cooperate to immediately return a pointer to a fresh region of VM to the application; the old region is discarded after this Unlock and never used again. This way the CPU's Lock does not have to wait for the GPU to finish with the resource and can keep operating on graphics resources (vertex buffers and index buffers). This technique is called VB/IB renaming.
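The renaming flag is typically combined with D3DLOCK_NOOVERWRITE in the classic append pattern for dynamic vertex buffers, sketched here (Windows-only; the `Vertex` type, the helper name, and the offset bookkeeping are illustrative assumptions):

```cpp
#include <d3d9.h>
#include <string.h>

struct Vertex { float x, y, z; };   // illustrative vertex format

// Sketch: append `count` vertices to a dynamic VB without stalling the GPU.
// D3DLOCK_NOOVERWRITE promises we touch only regions the GPU is not reading,
// so no synchronization is needed; when the buffer is full, D3DLOCK_DISCARD
// lets the driver hand back a fresh buffer (renaming) and we start over.
void AppendVertices(IDirect3DVertexBuffer9* pVB, UINT capacity,
                    UINT* pOffset, const Vertex* src, UINT count)
{
    DWORD flags = D3DLOCK_NOOVERWRITE;
    if (*pOffset + count > capacity) {   // wrap: discard and restart at 0
        *pOffset = 0;
        flags = D3DLOCK_DISCARD;
    }

    void* pDst = NULL;
    if (SUCCEEDED(pVB->Lock(*pOffset * sizeof(Vertex),
                            count * sizeof(Vertex), &pDst, flags)))
    {
        memcpy(pDst, src, count * sizeof(Vertex));  // sequential: WC-friendly
        pVB->Unlock();
        *pOffset += count;
    }
}
```

The sequential memcpy also respects the write-combining advice from the previous section: fill whole cache lines, in order.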

Much of this confusion comes from the lack of first-hand low-level information. I believe that if Microsoft opened the D3D source code and the driver interface specification, and NV/ATI opened their driver and hardware architecture documentation, all of this could be understood without difficulty.

