"Reprint" in-depth understanding Direct3D9

Source: Internet
Author: User

Original: effulgent, "In-depth understanding of Direct3D9", edited edition (repost)

An in-depth understanding of D3D9 matters a great deal to graphics programmers. I have put together some of my earlier study notes in the hope that they help some of you. Because these are scattered notes, the ideas are somewhat disorganized; please bear with me.

In fact, once you thoroughly understand D3DLOCK, D3DUSAGE, D3DPOOL, lost devices, queries, Present(), BeginScene(), EndScene() and related concepts, you can be said to understand D3D9; I wonder whether you feel the same. Here are a few questions; if you can answer them all, consider yourself passed :).

1. What are the essential differences between D3DPOOL_DEFAULT, D3DPOOL_MANAGED, D3DPOOL_SYSTEMMEM and D3DPOOL_SCRATCH?
2. How exactly should D3DUSAGE be used?
3. What is an adapter? What is a D3D device? What is the difference between a HAL device and a REF device? How does the device type relate to the vertex processing type?
4. How do the application (CPU), runtime, driver and GPU work together? Are D3D API calls synchronous or asynchronous?
5. What happens when a device is lost? Why do D3DPOOL_DEFAULT resources need to be recreated after the device is lost?

There are three main objects in D3D: the D3D object, the D3D adapter, and the D3D device.

The D3D object is simply the COM object through which D3D is used; it provides the ability to create devices and enumerate adapters.

An adapter is an abstraction of the capabilities of the computer's graphics hardware and software; it contains devices.

The device is the core of D3D. It wraps the entire graphics pipeline, including transform, lighting and rasterization (shading).

DEVICE

The pipeline differs between D3D versions; the latest, D3D10, for example, adds the new geometry shader (GS) stage.

All features of the graphics pipeline are provided by the driver.

Drivers fall into two categories: GPU hardware drivers and software drivers.

This is why D3D has two main device types, REF and HAL.

With the REF device, the rasterization function of the graphics pipeline is emulated on the CPU by the software driver. As the name suggests, the REF (reference) device serves as a functional reference for hardware manufacturers, so it is naturally a full software implementation that supports all standard DX features.

With the HAL device, the runtime uses the HAL (hardware abstraction layer) to drive the GPU for transform, lighting and rasterization. Only the HAL device offers both hardware vertex processing and software vertex processing (the REF device generally cannot use hardware vertex processing unless you tamper with the driver, as PerfHUD does).

There is also a rarely used software device, which lets users write their own software graphics driver against the DDI, register it with the system and then use it from a program.

Direct3D System Integration

Checking the system's software and hardware capabilities.
At program start-up we determine the capabilities of the target machine. The main steps are:

    1. Determine the buffer format to use
    2. GetAdapterCount
    3. GetAdapterDisplayMode
    4. GetAdapterIdentifier // get the adapter description
    5. CheckDeviceType // determine whether the device on the specified adapter supports hardware acceleration
    6. GetDeviceCaps // get the capabilities of the specified device, mainly whether hardware vertex processing (T&L) is supported
    7. GetAdapterModeCount // get the number of display modes available on the adapter for the specified buffer format
    8. EnumAdapterModes // enumerate all display modes
    9. CheckDeviceFormat
    10. CheckDeviceMultiSampleType

Please refer to the DX documentation for detailed usage.

The Windows graphics system is divided into four main layers:

    • Graphics applications
    • D3D RUNTIME
    • Software DRIVER
    • GPU

These four layers are divided by function; in practice the boundaries between them are not so sharp. The runtime, for example, also contains the user-mode part of the software driver. The detailed structure is not covered here.

Inside the runtime there is a very important structure called the command buffer. When the application calls a D3D API, the runtime converts the call into a device-independent command and appends it to the command buffer. The size of this buffer changes dynamically with the workload. When the buffer is full, the runtime flushes all commands to the kernel-mode driver, and the driver in turn has its own buffer that stores the commands after they have been converted to hardware-specific form.

D3D generally allows at most 3 frames of graphics commands to be buffered. The runtime and driver also optimize the buffered commands where appropriate. For example, if the program sets a render state to the value it already has, the debug output shows "Ignoring redundant SetRenderState - X": the runtime has automatically discarded the useless state-setting command.
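
As a rough illustration of the buffering and redundant-state filtering described above, here is a toy model in C++. `ToyCommandBuffer` and everything in it are hypothetical names invented for this sketch; the real runtime's data structures are not public, and its buffer resizes with load rather than having the fixed capacity assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model only: these types are invented for illustration and are not
// part of Direct3D.
struct ToyCommand {
    std::uint32_t state;
    std::uint32_t value;
};

class ToyCommandBuffer {
public:
    // Assumption: a fixed capacity (the real buffer resizes with load).
    explicit ToyCommandBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Queues a state change. Returns false when the state already has this
    // value, mimicking "Ignoring redundant SetRenderState".
    bool setRenderState(std::uint32_t state, std::uint32_t value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++dropped_;
            return false;
        }
        current_[state] = value;
        pending_.push_back(ToyCommand{state, value});
        if (pending_.size() >= capacity_) {
            flush();  // buffer full: hand everything to the "driver"
        }
        return true;
    }

    // Stands in for the hand-off from the runtime to the kernel-mode driver.
    void flush() {
        if (pending_.empty()) return;
        submitted_ += pending_.size();
        pending_.clear();
        ++flushes_;
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t submitted() const { return submitted_; }
    int flushes() const { return flushes_; }
    int dropped() const { return dropped_; }

private:
    std::size_t capacity_;
    std::vector<ToyCommand> pending_;
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::size_t submitted_ = 0;
    int flushes_ = 0;
    int dropped_ = 0;
};
```

With a capacity of 2, setting state 7 to the value 1 twice queues only one command; the second call is dropped, and the buffer is flushed as soon as a second distinct command arrives.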

In D3D9, the query mechanism lets the application work asynchronously with the GPU.

A query is a command used to ask about the state of the runtime, the driver or the GPU. A query object in D3D9 has three states:

    • Signaled
    • Building
    • Issued

An idle query is in the signaled state. A query has a begin and an end: issuing the begin tells the queried object to start recording the data the application asked for.

After the application issues the end of the query, the queried object puts the query object back into the signaled state once it has finished the work covered by the query.

GetData retrieves the result of the query. A return value of D3D_OK means the result is available (S_FALSE means it is not ready yet). If the D3DGETDATA_FLUSH flag is passed, all commands in the command buffer are sent to the driver.
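
The three query states can be pictured with a small state machine. This is a simplified sketch, not the D3D9 implementation: `ToyQuery`, `gpuTick` and `spinUntilSignaled` are hypothetical names, and the GPU is reduced to a counter of work units that must complete before the end marker is reached.

```cpp
#include <cassert>

enum class ToyQueryState { Signaled, Building, Issued };

// Toy stand-in for a query object; not a Direct3D type.
class ToyQuery {
public:
    ToyQueryState state() const { return state_; }

    // Analogue of Issue(D3DISSUE_BEGIN): start recording.
    void issueBegin() { state_ = ToyQueryState::Building; }

    // Analogue of Issue(D3DISSUE_END). gpuWorkAhead is the assumed number
    // of work units the GPU must finish before it reaches the end marker.
    void issueEnd(int gpuWorkAhead) {
        assert(gpuWorkAhead >= 0);
        remaining_ = gpuWorkAhead;
        state_ = ToyQueryState::Issued;
    }

    // One unit of simulated GPU progress.
    void gpuTick() {
        if (remaining_ > 0) --remaining_;
    }

    // Analogue of GetData: true plays the role of S_OK, false of S_FALSE.
    bool getData() {
        if (state_ == ToyQueryState::Issued && remaining_ == 0) {
            state_ = ToyQueryState::Signaled;
        }
        return state_ == ToyQueryState::Signaled;
    }

private:
    ToyQueryState state_ = ToyQueryState::Signaled;
    int remaining_ = 0;
};

// Polls in the style of `while (S_FALSE == GetData(...))` and returns the
// number of unsuccessful polls. The query must not be in the Building state.
int spinUntilSignaled(ToyQuery& query) {
    assert(query.state() != ToyQueryState::Building);
    int polls = 0;
    while (!query.getData()) {
        query.gpuTick();  // the simulated GPU advances while the CPU spins
        ++polls;
    }
    return polls;
}
```

In this model a query issued with three units of GPU work ahead of it needs three unsuccessful polls before it reports signaled.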

So most D3D API calls are asynchronous with respect to the GPU: when the application makes a call, the runtime simply appends the command to the command buffer and returns, and the work is executed later.

    • That raises a question: how do we measure the frame rate?
    • And how do we analyze GPU time?

For the first question we need to know when a frame is finished, that is, whether the Present() call blocks. The answer is that it may or may not block, depending on how many commands the runtime allows to be buffered. If that limit is exceeded, Present blocks. If Present never blocked, then whenever the GPU had a heavy drawing load the CPU would run far ahead of the GPU and the game logic would outpace what is shown on screen, which is clearly unacceptable.
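
To make the blocking behaviour concrete, here is a small timing model in C++. It is only a sketch under stated assumptions: every frame costs a fixed `cpuCost` to build and a fixed `gpuCost` to render, and Present for frame i blocks until frame i - `maxQueuedFrames` has finished on the GPU. The function name and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the total time the CPU spends blocked in the simulated Present().
long blockedTimeInPresent(int frames, long cpuCost, long gpuCost,
                          int maxQueuedFrames) {
    assert(frames >= 0 && maxQueuedFrames >= 1);
    std::vector<long> gpuDone(static_cast<std::size_t>(frames), 0L);
    long cpuNow = 0;
    long blocked = 0;
    for (int i = 0; i < frames; ++i) {
        cpuNow += cpuCost;  // CPU builds frame i, then calls Present
        if (i >= maxQueuedFrames) {
            // Present may not return until an older frame has completed.
            long mustWaitFor = gpuDone[i - maxQueuedFrames];
            if (mustWaitFor > cpuNow) {
                blocked += mustWaitFor - cpuNow;
                cpuNow = mustWaitFor;
            }
        }
        // The GPU starts frame i once it is submitted and frame i-1 is done.
        long previousDone = (i > 0) ? gpuDone[i - 1] : 0L;
        gpuDone[i] = std::max(cpuNow, previousDone) + gpuCost;
    }
    return blocked;
}
```

With a limit of 3 queued frames, a CPU cost of 1 and a GPU cost of 4, five frames leave the CPU blocked for 4 time units in total; swap the two costs and Present never blocks.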

Measuring GPU time is much more troublesome. The first problem is synchronization: to time the GPU we first have to bring the CPU and GPU into step, and in D3D9 the query mechanism can do this. Here is the example from "Accurately Profiling Direct3D API Calls":

IDirect3DQuery9* pQueryEvent;

// 1. Create a query of the event type.
m_pD3DDevice->CreateQuery(D3DQUERYTYPE_EVENT, &pQueryEvent);
// 2. Add an end-of-query marker to the command buffer. An event query has
//    no explicit begin; it is started implicitly at CreateDevice.
pQueryEvent->Issue(D3DISSUE_END);
// 3. Flush all commands in the command buffer to the driver and spin until
//    the query is signaled, which happens once the GPU has finished every
//    command in the command buffer.
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH));

LARGE_INTEGER start, stop;
QueryPerformanceCounter(&start);
SetTexture();     // the calls being profiled
DrawPrimitive();
pQueryEvent->Issue(D3DISSUE_END);
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH));
QueryPerformanceCounter(&stop);

1. The first GetData call uses the D3DGETDATA_FLUSH flag, so the draw commands in the command buffer are flushed to the driver; once the GPU has processed all of them, the query object becomes signaled.
2. The device-independent SetTexture command is added to the runtime's command buffer.
3. The device-independent DrawPrimitive command is added to the runtime's command buffer.
4. The device-independent Issue command is added to the runtime's command buffer.
5. GetData flushes all commands in the buffer to the driver. Note that GetData does not wait for the GPU to finish executing them before returning. This step involves a switch from user mode to kernel mode.
6. The call waits for the driver to convert all commands into hardware-specific instructions and place them in the driver buffer, then returns from kernel mode to user mode.
7. The loop keeps calling GetData to poll the query object's state. When the GPU has finished all instructions in the driver buffer, it changes the state of the query object.

The runtime's command buffer may be flushed, causing a mode switch, in the following situations:

1. A Lock call (under certain conditions and with certain lock flags)
2. Creating a device, vertex buffer, index buffer or texture
3. Completely releasing a device, vertex buffer, index buffer or texture
4. Calling ValidateDevice
5. Calling Present
6. The command buffer is full
7. Calling GetData with D3DGETDATA_FLUSH

I do not fully understand the documentation's explanation of D3DQUERYTYPE_EVENT ("Query for any and all asynchronous events that have been issued from API calls"); if you do, please tell me. All I know is that when the GPU finishes processing the D3DISSUE_END marker that a D3DQUERYTYPE_EVENT query added to the command buffer, the query object becomes signaled, so waiting on the query is how the CPU synchronizes with the GPU.

For efficiency, use as few BeginScene/EndScene pairs as possible before each Present. Why would they affect efficiency?

I can only guess at the reasons:

    • EndScene may trigger a command buffer flush, which means a mode switch.
    • It may also trigger some D3D runtime operations on managed resources.
    • Note that EndScene is not a synchronous method: it does not wait for the driver to finish executing all commands before returning.

The D3D runtime distinguishes three types of memory:

    • Video memory (VM)
    • AGP memory (AM)
    • System memory (SM)

Memory

  • VM is the video memory on the graphics card. The CPU can reach it only over the AGP or PCI-E bus, so both reads and writes are very slow. Writes are somewhat faster than reads, because when the CPU writes to VM it allocates a 32- or 64-byte write buffer (depending on the cache line length) and writes to VM once when that buffer is full.
  • SM is system memory. CPU reads and writes are very fast because SM is cached in the L2 cache, but the GPU cannot access it directly, so resources created in SM cannot be used directly by the GPU.
  • AM is the most troublesome type. It physically lives in system memory, but this region is not cached by the CPU, so every CPU read or write of AM misses the cache and goes over the memory bus. CPU access to AM is therefore slower than to SM, although sequential writes are somewhat faster than reads because the CPU uses write combining for AM. The GPU can access AM directly over the AGP or PCI-E bus.

All D3D resources are created in these three types of memory. When creating a resource we can specify one of the following pool flags:

    • D3DPOOL_DEFAULT
    • D3DPOOL_MANAGED
    • D3DPOOL_SYSTEMMEM
    • D3DPOOL_SCRATCH

Creating a resource with D3DPOOL_DEFAULT means the D3D runtime picks the memory type automatically from the usage we specify, typically VM or AM. The system keeps no backup copy anywhere else, so when the device is lost the contents of these resources are lost too. The system will not substitute D3DPOOL_SYSTEMMEM or D3DPOOL_MANAGED at creation time; note that these are entirely different pool types. Textures created in D3DPOOL_DEFAULT cannot be locked by the CPU unless they are dynamic textures, but vertex buffers, index buffers, render targets and back buffers created in D3DPOOL_DEFAULT can be locked. When a resource is created with D3DPOOL_DEFAULT and video memory is already full, managed resources are evicted from video memory to free enough space.

D3DPOOL_SYSTEMMEM and D3DPOOL_SCRATCH both live in SM. The difference is that with D3DPOOL_SYSTEMMEM the resource format is limited by the device's capabilities, because the resource may be uploaded to AM or VM for use by the graphics system. A SCRATCH resource is limited only by the runtime, so it cannot be used by the graphics system.

The D3D runtime optimizes D3DUSAGE_DYNAMIC resources, usually placing them in AM, although this is not guaranteed. Why static textures cannot be locked while dynamic textures can has to do with the design of the D3D runtime; it is covered in the D3DLOCK section later.

D3DPOOL_MANAGED means the D3D runtime manages the resource. A managed resource has two copies, one in SM and one in VM/AM. When created it exists only in SM; when the GPU needs the resource, the runtime automatically copies the data to VM. When the resource is modified by the GPU, the runtime updates the SM copy when necessary, and changes made in SM are likewise propagated to VM. Data that is modified frequently by the CPU or GPU must therefore never use the managed pool, because the synchronization cost is very high.

When a device is lost, the runtime automatically uses the copy in SM to restore the data in VM. Because not all of the data backed up in SM is resident in VM, the backup data can be far larger than VM capacity; as resources grow, the backup data may be paged out to disk, which is why Reset can become unusually slow.

The runtime keeps a time stamp for every managed resource. When it needs to copy backup data to VM, it allocates space in VM; if the allocation fails, VM has no free space, and the runtime uses an LRU algorithm based on the time stamps to release resources. Resource priority (see SetPriority) is determined through the time stamp: the most recently used resources get high priority. This lets the runtime release resources sensibly by priority, so that a resource is unlikely to be needed again right after it is released.

The application can also call EvictManagedResources to force all managed resources out of VM. If the next frame uses managed resources, the runtime has to reload them, which hurts performance badly, so this is not normally done. At a level transition, however, the function is very useful for eliminating memory fragmentation in VM.

In some cases the LRU algorithm has a performance flaw. When the resources needed to draw one frame do not all fit in VM (managed), LRU causes severe performance fluctuations, as in the following example:

BeginScene();
Draw(Box0);
Draw(Box1);
Draw(Box2);
Draw(Box3);
Draw(Circle0);
Draw(Circle1);
EndScene();
Present();

Assume VM can hold the data of only 5 of these objects. Under LRU, some data must be evicted before Box3 is drawn, and the victim is bound to be Circle0, which is needed again immediately afterwards; evicting Box2 would obviously be the most sensible choice. For this reason the runtime switches to an MRU algorithm for the subsequent draw calls, which handles the performance fluctuation well. However, whether a resource is in use is tracked per frame, not per draw call, and a frame is delimited by BeginScene/EndScene, so using BeginScene/EndScene sensibly in this situation can considerably improve performance when VM is insufficient.

According to the DX documentation, the query mechanism can also provide more information about the runtime's managed resources, but this seems to work only with the debug runtime. Understanding how the runtime manages resources is important, but do not rely on these details when writing programs, because they change often.

Finally, note that not only the runtime manages resources; the driver may implement this as well. The D3DCAPS2_CANMANAGERESOURCE flag tells us whether the driver implements resource management, and D3DCREATE_DISABLE_DRIVER_MANAGEMENT can be passed to CreateDevice to turn driver resource management off.
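
The difference between LRU and MRU eviction in the Box/Circle example can be checked with a short simulation. This is a sketch under simplifying assumptions, not the runtime's actual policy: every object occupies one slot, time stamps are updated per draw call (the text notes the real runtime tracks use per frame), and `EvictionPolicy` and `uploadsPerFrame` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

enum class EvictionPolicy { LRU, MRU };

// Replays the same draw order for several frames against a video memory
// that holds `capacity` objects, and returns how many uploads (misses)
// each frame needed.
std::vector<int> uploadsPerFrame(EvictionPolicy policy, std::size_t capacity,
                                 const std::vector<std::string>& drawOrder,
                                 int frames) {
    assert(capacity >= 1);
    std::unordered_map<std::string, long> resident;  // name -> last-use tick
    std::vector<int> result;
    long tick = 0;
    for (int f = 0; f < frames; ++f) {
        int uploads = 0;
        for (const std::string& name : drawOrder) {
            ++tick;
            auto it = resident.find(name);
            if (it != resident.end()) {
                it->second = tick;  // already in video memory
                continue;
            }
            if (resident.size() >= capacity) {
                // Pick the oldest (LRU) or the newest (MRU) time stamp.
                auto victim = resident.begin();
                for (auto cur = resident.begin(); cur != resident.end(); ++cur) {
                    bool better = (policy == EvictionPolicy::LRU)
                                      ? cur->second < victim->second
                                      : cur->second > victim->second;
                    if (better) victim = cur;
                }
                resident.erase(victim);
            }
            resident[name] = tick;
            ++uploads;
        }
        result.push_back(uploads);
    }
    return result;
}
```

For the six objects above and a capacity of 5, LRU uploads all six objects in every frame, while MRU uploads six in the first frame and only one per frame afterwards.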

D3DLOCK: exploring how the D3D runtime works

What happens when a DEFAULT resource is locked? The resource may be in VM or AM. If it is in VM, the runtime must create a temporary buffer in system memory and return that to the application; after the application has filled the temporary buffer, Unlock makes the runtime copy its contents back to VM. If the resource's D3DUSAGE does not include WRITEONLY, the system must also first copy the original data from VM into the temporary buffer, which is why omitting WRITEONLY degrades performance.

CPU writes to AM also need care, because they normally use write combining: writes are buffered into a cache line, and the line is flushed to AM when it is full. First, the written data must tolerate weak ordering (graphics data generally does). It is said that the D3D runtime and the NV driver have a small bug here: the GPU may start drawing with a resource before the CPU has flushed it to AM, producing errors; in that case use an instruction such as SFENCE to flush the cache line. Second, try to write a whole cache line at a time, otherwise there is extra latency: the CPU always flushes an entire cache line to the target, so if we write only some bytes of a line, the CPU must first read the whole line from AM, combine, and then flush. Third, write sequentially as far as possible; random writes turn write combining into a burden. For resources that are written randomly, do not create them with D3DUSAGE_DYNAMIC; use D3DPOOL_MANAGED so that the writes complete entirely in SM.

An ordinary texture (D3DPOOL_DEFAULT) cannot be locked because it lives in VM; it can only be updated through UpdateSurface and UpdateTexture. Why does D3D not let us lock static textures but let us lock static VBs and IBs? I can think of two reasons. First, texture data is generally very large, and textures are stored two-dimensionally inside the GPU. Second, textures inside the GPU are stored in a native format, not as plain RGBA.

A dynamic texture declares that the texture will be modified frequently, so D3D stores it specially. Textures that are modified at high frequency are nevertheless not suited to the dynamic attribute. There are two cases: a render target written by the GPU, and a video texture written by the CPU. Dynamic resources are generally placed in AM; GPU access to AM goes over the AGP/PCI-E bus and is much slower than VM, while CPU access to AM is much slower than SM. With the dynamic attribute, both GPU and CPU would suffer this latency on every access, so for such resources it is better to create one copy in D3DPOOL_DEFAULT and one in D3DPOOL_SYSTEMMEM and update between them manually.

Do not create a render target with the D3DPOOL_MANAGED attribute; it is very inefficient, and the reason is left for you to work out. For resources that change infrequently, creating them in DEFAULT and updating them manually is recommended, because the cost of an occasional update is much smaller than the cost of the GPU continually accessing AM.

A badly placed Lock can seriously hurt performance. In general, Lock cannot return until all drawing commands ahead of it in the command buffer have been executed, otherwise it might modify a resource that is still in use. From the time Lock returns until the modification is finished and Unlock is called, the GPU sits idle, so the parallelism of GPU and CPU is wasted. DX8.0 introduced a new lock flag, D3DLOCK_DISCARD, which states that the resource will not be read and will be completely overwritten. The driver and runtime can then cooperate on a trick: they immediately return a pointer to a different block of VM to the application, and the original pointer is discarded after this Unlock. The CPU's Lock no longer has to wait for the GPU to finish with the resource and can keep working on the graphics resource (vertex buffer or index buffer). This technique is called VB/IB renaming.
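
The renaming idea can be sketched with a toy buffer that tracks which memory blocks the GPU is still reading. This is an illustrative model only: `ToyDynamicBuffer` and its methods are hypothetical, the GPU is reduced to a per-block busy flag, and a real driver's allocation strategy is not known from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one dynamic vertex/index buffer backed by several blocks.
class ToyDynamicBuffer {
public:
    ToyDynamicBuffer() : gpuBusy_(1, false) {}

    // Returns the index of the block the CPU may write to.
    // discard == true plays the role of D3DLOCK_DISCARD.
    std::size_t lock(bool discard) {
        if (gpuBusy_[current_]) {
            if (discard) {
                // Renaming: hand out another block instead of waiting.
                current_ = findOrAllocateIdleBlock();
                ++renames_;
            } else {
                // Plain lock: the CPU stalls until the GPU is done.
                ++stalls_;
                gpuBusy_[current_] = false;
            }
        }
        return current_;
    }

    // A draw call makes the GPU read the current block until it finishes.
    void draw() { gpuBusy_[current_] = true; }

    // The simulated GPU completes all outstanding work.
    void gpuFinishAll() { gpuBusy_.assign(gpuBusy_.size(), false); }

    int stalls() const { return stalls_; }
    int renames() const { return renames_; }
    std::size_t blockCount() const { return gpuBusy_.size(); }

private:
    std::size_t findOrAllocateIdleBlock() {
        for (std::size_t i = 0; i < gpuBusy_.size(); ++i) {
            if (!gpuBusy_[i]) return i;
        }
        gpuBusy_.push_back(false);
        return gpuBusy_.size() - 1;
    }

    std::vector<bool> gpuBusy_;
    std::size_t current_ = 0;
    int stalls_ = 0;
    int renames_ = 0;
};
```

In this model, locking a buffer the GPU is still drawing from costs a stall without the discard flag, while with it the caller simply receives a fresh block and idle blocks are reused once the GPU has caught up.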

Much of the confusion comes from a lack of low-level information. I believe that if MS opened the D3D source code and the driver interface specification, and NV/ATI opened their display drivers and hardware architecture documentation, all of this would be easy to understand.
