DirectX Performance Optimization

Source: Internet
Author: User

1. Only Clear when required.
IDirect3DDevice9: the Clear function usually takes a lot of time, so call it as rarely as possible, and only Clear the buffers that actually need clearing.
2. Minimize state switching, and combine the state switches that must be performed. The relevant states are RenderState, SamplerState, and TextureStageState.
3. Make textures as small as possible.
4. Render scene objects from front to back.
Front-to-back rendering allows occluded objects and pixels to be rejected as early as possible.
5. Use triangle strips instead of triangle lists and triangle fans. To make the best use of the vertex cache, arrange the strips so that vertices are reused as soon as possible.
6. Order the work so that the system resources consumed decrease gradually from the root node onward.
7. Check program performance regularly.
This makes it easier to find the change that caused a sudden drop in performance.
8. Minimize vertex buffer switches.
9. Use static vertex buffers whenever possible.
10. For static objects, use one large static vertex buffer per FVF to hold the vertex data of multiple objects, rather than one vertex buffer per object.
The purpose is to reduce vertex buffer switches.

11. If the program needs to randomly access a vertex buffer in AGP memory, the vertex size should be a multiple of 32 bytes (that is, 8 floats or 2 four-component vectors). Otherwise, choose the smallest format that fits.
12. Use indexed rendering to make better use of the vertex cache.
13. If the depth buffer format includes a stencil buffer, always Clear the two together.
14. Merge the calculation and the output in a single shader instruction:
// Rather than doing a multiply-add and then outputting the data
// in two instructions:
mad r2, r1, v0, c0
mov oD0, r2
// combine both in a single instruction, because this eliminates
// an additional register copy:
mad oD0, r1, v0, c0

When building the scene object database, start with the lowest-detail models and move to higher-detail models only as long as performance allows. Pay close attention to the total number of triangles being rendered.

http://nvidia.e-works.net.cn/document/200910/article9305_1.htm

Draw primitives that use the same render states and textures together, to minimize vertex buffer and state switches. In addition, group the state-switching operations into one batch of settings.

Minimize the number of light sources and use ambient light to raise brightness. Directional lights are more efficient than point lights and spotlights because the light direction is fixed. Use the light's range parameter to cull objects not affected by the light. Specular highlights almost double the lighting cost, so use them only when needed; otherwise set D3DRS_SPECULARENABLE to FALSE, set the material's specular power to 0, and set the material's specular color to 0.

Reduce texture sizes as much as possible to increase the chance that a texture stays cached. Minimize texture switches by drawing objects that share a texture together. Use square textures whenever possible; the fastest textures are 256x256. Consider packing four 128x128 textures into one 256x256 texture.

Concatenate the World and View matrices, and set the View matrix to identity, to reduce matrix multiplications.

Dynamic textures: first, check D3DCAPS2_DYNAMICTEXTURES to determine whether the hardware supports them.
Second, dynamic textures cannot be placed in the MANAGED pool. A dynamic texture can always be locked, even in D3DPOOL_DEFAULT, and D3DLOCK_DISCARD is valid for it.

DrawProceduralTexture(pTex)
{
    // pTex should not be very small, because the overhead of
    // calling the driver for every D3DLOCK_DISCARD will not
    // justify the performance gain. Experimentation is encouraged.
    pTex->Lock(D3DLOCK_DISCARD);
    // ... fill in the new texture contents ...
    pTex->Unlock();

    pDev->SetTexture();
    pDev->DrawPrimitive();
}

When a vertex or index buffer must be locked every frame, use a dynamic buffer (D3DUSAGE_DYNAMIC). Locking a dynamic buffer with D3DLOCK_DISCARD reduces latency, and locking with D3DLOCK_NOOVERWRITE appends new data to the unused part of the buffer without modifying the data already written.

When Effects are used, arrange the rendering order by Effect and Technique; that is, draw objects that share the same Effect and Technique together. This reduces state-switching overhead.

In general, the way to locate a pipeline bottleneck is to vary the workload of each stage of the rendering pipeline; if the throughput changes with it, that stage is the bottleneck. Once the bottleneck is found, eliminate it by reducing that stage's workload, which may mean shifting work onto other stages.
A bottleneck before rasterization is called "transform bound"; a bottleneck after triangle setup is called "fill bound". To locate the bottleneck:
1. Change the color depth (16-bit vs. 32-bit). If the frame rate changes, the bottleneck is the render target's fill rate.
2. Otherwise, change the texture size and texture filter settings. If the frame rate changes, the bottleneck is texturing.
3. Otherwise, change the resolution. If the frame rate changes, change the number of pixel shader instructions; if the frame rate changes again, the bottleneck is the pixel shader, otherwise it is rasterization.
4. Otherwise, change the vertex format. If the frame rate changes, the bottleneck is video card bandwidth.
5. If none of the above apply, the bottleneck is on the CPU side.
36 optimization methods:

1. Remove useless vertex data such as unused texture coordinates. If an object has two sets of vertex data but only one set is used, do not put both in the vertex buffer; this reduces the amount of data transferred.
2. Use multiple stream sources. In SkinMesh rendering, for example, put the vertex positions and normals that must be modified every frame into a dynamic VB, and the data that never changes (such as texture coordinates) into a static VB. This reduces the amount of data transferred.

3. Use 16-bit index buffers and avoid 32-bit ones: 32-bit indices waste bandwidth, and not all video cards support 32-bit index buffers.

4. Use the vertex shader to compute on static VB data. For example, SkinMesh skinning can be done in the vertex shader; this avoids transferring data from AGP memory to video memory every frame, and it also lets you keep using a static VB.
5. Do not use the Draw*UP family of functions to draw polygons.

6. Plan the video memory usage before designing the program, to ensure that the frame buffer, textures, and static VBs fit in the video card's local memory.

7. Make the vertex size a multiple of 32 bytes. Consider using a compressed vertex format and decompressing it in the vertex shader, or pad the vertex with redundant data until its size is a multiple of 32 bytes.
8. Order the vertices in the vertex buffer as close as possible to the draw order. Use strips instead of lists.

9. If possible, use static vertex buffer instead of dynamic vertex buffer.

10. Lock a dynamic VB with the DISCARD flag when replacing its contents and with NOOVERWRITE when appending. Try not to call Lock with no flags (0).

11. Reduce the number of Lock calls. Some things need not be updated every frame; for character animation, for example, updating the VB 30 times per second is enough.

12. If there is too much vertex data to draw, consider replacing distant geometry with billboards. Today's graphics cards render very quickly, though, so weigh whether this actually pays off; overusing it may shift the bottleneck to the CPU.
13. Avoid excessive per-vertex computation, such as too many light sources or over-complicated lighting (a complex lighting model). Automatic texture-coordinate generation also adds vertex work, as does a texture transformation matrix that is not the identity; so once a texture transformation is finished, remember to reset the texture matrix to the identity and adjust the texture coordinates accordingly.
14. Avoid too many vertex shader instructions or too many branches; minimize vertex shader length and complexity. Use swizzling instead of mov whenever possible.
15. If a complex pixel shader covers a large screen area, experiment with the full-screen antialiasing settings; the result may be faster.
16. Draw in front-to-back order as much as possible.

17. Testing the Z value in the shader can avoid drawing invisible pixels, but NVIDIA recommends against doing this in simple shaders ("don't do this in a simple shader").
18. If possible, use the vertex shader instead of the pixel shader, turning per-pixel computation into per-vertex computation.

19. Minimize texture sizes. An oversized texture can overload the texture cache and lower its hit rate, and it can overload video memory, at which point textures must be fetched from system memory.
20. Use 16-bit textures whenever possible, for example for environment maps or shadow maps; 32-bit textures are a waste there.

21. Consider using DXT texture compression.
22. If possible, use simple texture filtering and mip maps; do not use trilinear or anisotropic filtering unless necessary. Light maps and environment maps basically do not need them.

23. Use the dynamic usage only for textures that really need modification, and lock them with DISCARD and WRITEONLY.

24. When frame-buffer reads and writes are too heavy, consider disabling later passes of a multi-pass rendering, or translucent geometry such as particle systems (if possible).

25. If possible, use alpha test instead of alpha blending.
26. If you do not need a stencil buffer, prefer a 16-bit Z buffer.
27. Reduce the size of render-target textures, such as shadow maps and environment maps; a smaller size may look perfectly fine if the full resolution is not actually needed.
28. Clear the stencil and Z buffers together; they are physically one buffer.

29. Minimize render-state switches and draw as many polygons as possible in one call. (The practical maximum depends on the video card's capability, but in general you will not exceed it unless you never need to change textures or render states at all.)
30. Use shader instead of Fixed Pipeline whenever possible.

31. Whenever possible, use shaders to replace multi-pass rendering effects.
32. Create important resources first: render targets, shaders, textures, VBs, IBs, and so on. Resources created after video memory is exhausted end up in system memory.
33. Do not call resource-creation functions inside the render loop.

34. Group draw calls by shader and texture before rendering: group by shader first, then by texture.
35. The color, stencil, and Z buffers should be cleared in a single Clear call whenever possible.
36. The best size for a vertex buffer is 2-4 MB.
Reposted:
In-depth understanding of D3D9
Source: http://www.cnblogs.com/effulgent/archive/2009/02/10/1387438.html

A deep understanding of D3D9 matters a great deal to graphics programmers. I have organized some of my old study notes here in the hope that they help. The notes are scattered and the train of thought jumps around; please bear with me.

In fact, once you can fully understand concepts such as D3DLOCK, D3DUSAGE, D3DPOOL, LOST DEVICE, QUERY, Present(), BeginScene(), and EndScene(), you understand D3D9; I wonder if you feel the same way. Here are a few questions; if you can answer them all satisfactorily, you pass the exam :).
1. What are the essential differences among D3DPOOL_DEFAULT, D3DPOOL_MANAGED, D3DPOOL_SYSTEMMEM, and D3DPOOL_SCRATCH?
2. How is D3DUSAGE used?
3. What is an Adapter? What is a D3D Device? What is the difference between a HAL Device and a Ref Device? What is the relationship between the device type and the vertex processing type?
4. How do the APP (CPU), RUNTIME, DRIVER, and GPU work together? Are the D3D APIs synchronous or asynchronous functions?
5. What happens on a Lost Device? Why must D3DPOOL_DEFAULT resources be re-created after the device is lost?
There are three main objects in D3D: the D3D OBJECT, the D3D ADAPTER, and the D3D DEVICE. The D3D OBJECT is the COM object through which D3D is used; it provides the functions for creating DEVICEs and enumerating ADAPTERs. An ADAPTER is an abstraction of the computer's graphics hardware and software capabilities, and it contains DEVICEs. The DEVICE is the core of D3D: it wraps the entire graphics pipeline, including transformation, lighting, and rasterization (shading). The pipeline varies with the D3D version; for example, D3D10 adds the new GS geometry-processing stage. All pipeline functionality is supplied by the DRIVER, and drivers come in two kinds, GPU hardware drivers and software drivers, which is why D3D has two device types, REF and HAL. With a REF DEVICE, the pipeline's rasterization is simulated on the CPU by the software driver; as the name suggests, the REF DEVICE provides a functional reference for hardware vendors, so it is implemented entirely in software and supports every DX standard feature. With a HAL DEVICE, the RUNTIME uses the HAL hardware layer to drive the GPU for transformation, lighting, and rasterization, and only the HAL DEVICE offers both hardware and software vertex processing (a REF DEVICE generally cannot use hardware vertex processing, unless the driver does something special, as PerfHUD does).
There is also an uncommon SOFTWARE DEVICE: with the DDI you can write your own software graphics driver and register it with the system for programs to use.
Checking system software and hardware capabilities

At program startup we need to assess the target machine's capabilities. The main process is:
Determine the buffer formats to use
GetAdapterCount()
GetAdapterDisplayMode
GetAdapterIdentifier // get the adapter description
CheckDeviceType // determine whether the device on the specified adapter supports hardware acceleration
GetDeviceCaps // query device capabilities, e.g. whether hardware vertex processing (T&L) is supported
GetAdapterModeCount // get the number of display modes available on the adapter for the specified buffer format
EnumAdapterModes // enumerate all display modes
CheckDeviceFormat
CheckDeviceMultiSampleType
For details, see the DX documentation.

The WINDOWS graphics system is divided into four layers by function: the application, the D3D RUNTIME, the SOFTWARE DRIVER, and the GPU. In practice the boundaries are not so sharp; for example, the RUNTIME actually contains the user-mode SOFTWARE DRIVER. The detailed structure is not covered here. Inside the RUNTIME there is a very important structure called the COMMAND BUFFER. When the application calls a D3D API, the RUNTIME converts the call into a device-independent command and caches it in this COMMAND BUFFER, whose size changes dynamically with the load. When the BUFFER is full, the RUNTIME flushes all of its commands to the kernel-mode driver, which keeps a buffer of its own for commands already converted into hardware-specific form. D3D generally allows at most three frames of graphics commands to be buffered, and the RUNTIME and DRIVER optimize the buffered commands; for example, if the program sets the same RENDER STATE repeatedly, the debug output shows "Ignoring redundant SetRenderState - X", meaning the RUNTIME automatically discarded the useless state-setting commands. In D3D9 the QUERY mechanism provides asynchronous cooperation with the GPU.

A QUERY is a command used to ask the RUNTIME, DRIVER, or GPU about its status. A QUERY object in D3D9 has three states: SIGNALED, BUILDING, and ISSUED. When idle, a QUERY object is in the SIGNALED state. A query is split into a begin and an end: the begin tells the object to start recording the data the application needs, and after the application issues the end, the queried object returns to the SIGNALED state as soon as it goes idle. GetData retrieves the query result: if it returns D3D_OK, the result is available, and if the D3DGETDATA_FLUSH flag is passed, all commands in the COMMAND BUFFER are submitted to the DRIVER.

Now we know that most D3D APIs are asynchronous: when the application calls them, the RUNTIME simply appends commands to the COMMAND BUFFER. Some may then wonder how we can measure the frame rate, and how we can profile GPU time. For the first question, the issue is whether the PRESENT() call that ends a frame blocks. The answer is that it may or may not, depending on the number of buffered commands the RUNTIME allows; if the limit is exceeded, PRESENT blocks. If PRESENT never blocked at all, then whenever the GPU executed a heavy drawing load the CPU would run far ahead of the GPU, making the game logic outpace the graphics display, which is obviously unacceptable. Measuring GPU working time is more troublesome: we must first address synchronization, making the CPU and GPU work asynchronously so the GPU interval can be isolated, and the QUERY mechanism in D3D9 can do this. Consider the example from "Accurately Profiling Direct3D API Calls":

IDirect3DQuery9* pQueryEvent;

// 1. Create an event-type query object.
m_pD3DDevice->CreateQuery(D3DQUERYTYPE_EVENT, &pQueryEvent);

// 2. Add an end-of-query marker to the command buffer.
//    (This query implicitly begins at CreateDevice.)
pQueryEvent->Issue(D3DISSUE_END);

// 3. Flush all commands in the COMMAND BUFFER to the DRIVER, then poll the
//    query object until it enters the SIGNALED state, which happens once the
//    GPU has finished every command in the buffer.
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH))
    ;

LARGE_INTEGER start, stop;
QueryPerformanceCounter(&start);
SetTexture();
DrawPrimitive();
pQueryEvent->Issue(D3DISSUE_END);
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH))
    ;
QueryPerformanceCounter(&stop);

1. The first GetData call uses the D3DGETDATA_FLUSH flag, submitting the draw commands in the COMMAND BUFFER to the DRIVER; after the GPU has processed all of them, the query object's status becomes SIGNALED.
2. SetTexture adds a device-independent command to the RUNTIME's command buffer.
3. DrawPrimitive adds a device-independent command to the RUNTIME's command buffer.
4. Issue adds a device-independent command to the RUNTIME's command buffer.
5. GetData with D3DGETDATA_FLUSH flushes all commands in the BUFFER to the DRIVER; this involves a switch from user mode to kernel mode.
6. The call waits until the DRIVER has converted all commands into hardware-specific ones and filled its own buffer, then returns from kernel mode to user mode.
7. GetData is then called in a loop to poll the query object; once the GPU finishes all the commands in the driver's buffer, the query object's status changes to SIGNALED.

The RUNTIME's command buffer may be flushed (with a kernel-mode switch) in the following cases:
1. A Lock call (under certain conditions and with certain LOCK flags)
2. Creating a device, vertex buffer, index buffer, or texture
3. Completely releasing a device, vertex buffer, index buffer, or texture
4. Calling ValidateDevice
5. Calling Present
6. The command buffer becoming full
7. Calling GetData with D3DGETDATA_FLUSH
I still cannot fully understand D3DQUERYTYPE_EVENT ("Query for any and all asynchronous events that have been issued from API calls"); I hope someone who understands it will explain it to me. Only after the GPU finishes processing up to the D3DISSUE_END marker added to the CB does a D3DQUERYTYPE_EVENT query become SIGNALED, so the CPU's wait on the query is an asynchronous wait. For efficiency, use as few BEGINSCENE/ENDSCENE pairs as possible before PRESENT. Why does this affect efficiency? I can only guess at the cause: EndScene may trigger a COMMAND BUFFER flush, which costs a mode switch, and it may also trigger D3D RUNTIME bookkeeping on MANAGED resources. Moreover, EndScene is not a synchronous method; it returns without waiting for the DRIVER to finish executing all the commands.
D3D RUNTIME memory comes in three kinds: video memory (VM), AGP memory (AM), and system memory (SM); all D3D resources are created in one of these three. When creating a resource we can specify one of the pool flags D3DPOOL_DEFAULT, D3DPOOL_MANAGED, D3DPOOL_SYSTEMMEM, or D3DPOOL_SCRATCH. VM sits on the graphics card; the CPU can reach it only across the AGP or PCI-E bus, so reads and writes are very slow, with sequential CPU writes slightly faster than reads, because CPU writes to VM are buffered in a 32- or 64-byte block (depending on the CACHE LINE length) and written out in one burst when the block fills. SM is system memory; CPU reads and writes are very fast because SM is cached in the L2 cache, but the GPU cannot access system memory directly, so resources created in SM cannot be used by the GPU directly. AM is the most troublesome kind: it physically resides in system memory, but that region is never cached by the CPU, so every CPU read or write of AM is a cache miss that goes across the memory bus. CPU access to AM is therefore slower than SM, though again sequential writes are slightly faster than reads because CPU writes to AM use "write combining"; the GPU, for its part, can access AM directly across the AGP or PCI-E bus.
If we create a resource with D3DPOOL_DEFAULT, the D3D RUNTIME chooses the storage type automatically based on the usage we specify, generally VM or AM, and the system keeps no backup copy elsewhere; when the device is lost, these resources are lost too. The system does not silently substitute D3DPOOL_SYSTEMMEM or D3DPOOL_MANAGED at creation time; they are entirely different pool types. Note that a texture created in D3DPOOL_DEFAULT cannot be locked by the CPU unless it is a dynamic texture, whereas VBs, IBs, render targets, and back buffers created in D3DPOOL_DEFAULT can be locked. When creating a D3DPOOL_DEFAULT resource, if video memory is exhausted, managed resources are swapped out to free enough space.

D3DPOOL_SYSTEMMEM and D3DPOOL_SCRATCH both live in SM. The difference is that with D3DPOOL_SYSTEMMEM the resource format is limited by device capabilities, because the resource may later be uploaded to AM or VM for use by the graphics system, whereas SCRATCH resources are limited only by the RUNTIME and therefore cannot be used by the graphics system. The D3D RUNTIME optimizes D3DUSAGE_DYNAMIC resources, generally placing them in AM, though this is not fully guaranteed. As for why static textures cannot be locked while dynamic textures can, it all follows from the design of the D3D RUNTIME, described in the discussion of D3DLOCK below.
D3DPOOL_MANAGED means the D3D RUNTIME manages the resource. A created resource has two copies, one in SM and one in VM/AM. Creation happens directly in SM; when the GPU needs the resource, the RUNTIME automatically copies it into VM. If the resource is modified by the GPU, the RUNTIME updates the SM copy when necessary, and modifications made in SM are updated back to VM. So do not put data that the CPU or GPU modifies frequently into the managed pool; that creates a very expensive synchronization burden.

After a LOST DEVICE occurs, the RUNTIME automatically restores the data in VM from the SM copies during RESET. Since not all of the SM backups are ever committed to VM at once, the total backing data can be much larger than VM capacity; as resources grow, the backup data may even be paged to disk, and then RESET can become abnormally slow. The RUNTIME keeps a timestamp for each MANAGED resource. When the RUNTIME needs to copy backup data into VM, it allocates VM space; if the allocation fails, VM is out of space, and the RUNTIME uses an LRU algorithm, based on these timestamps, to release resources. SetPriority sets a resource's priority (recently and frequently used resources get high priority) so the RUNTIME can release resources sensibly: a resource released this way is unlikely to be needed again immediately. The application can also call EvictManagedResources to force all MANAGED resources out of VM; then, if the next frame uses MANAGED resources, the RUNTIME must reload them, which hurts performance badly. It is rarely useful in ordinary play, but during level transitions it is very useful and can eliminate VM fragmentation. The LRU algorithm has a performance defect in some situations: when the resources required to draw a single frame exceed what VM can hold, LRU causes severe performance fluctuations, as in the following example:
BeginScene();
Draw(Box0);
Draw(Box1);
Draw(Box2);
Draw(Box3);
Draw(Circle0);
Draw(Circle1);
EndScene();
Present();

Suppose the VM can hold the data of only five of the six geometries. Under the LRU algorithm, some data must be evicted before the frame can complete, drawing Circle0 evicts earlier data, and so on through the frame, so nearly every draw triggers an eviction. Clearly, evicting the most recently used geometry (Box2, say) would be more reasonable here, and this is why the RUNTIME switches to an MRU algorithm for the subsequent draw calls: it effectively solves the performance-fluctuation problem. Whether a resource has been used, however, is tracked per FRAME rather than per DRAW CALL, and the frame boundary is the BEGINSCENE/ENDSCENE pair; so placing BEGINSCENE/ENDSCENE sensibly in this situation can greatly improve performance when VM is insufficient. According to hints in the DX documentation, the QUERY mechanism can also report more information about the RUNTIME's managed resources, but this appears to work only with the debug runtime.
Understanding how the RUNTIME manages resources is important, but do not let your program depend on these details, because they change often. One last note: it is not only the RUNTIME that can manage resources; the DRIVER may implement resource management too. The D3DCAPS2_CANMANAGERESOURCE flag tells you whether the DRIVER implements resource management, and you can pass D3DCREATE_DISABLE_DRIVER_MANAGEMENT at device creation to disable the DRIVER's resource management.
Exploring how the D3D RUNTIME works through D3DLOCK

What happens if we LOCK a DEFAULT resource? A DEFAULT resource may sit in VM or AM. If it is in VM, a temporary buffer must be allocated in system memory and returned to the application; after the application fills this temporary buffer, the RUNTIME transfers its contents back to VM. If the resource's D3DUSAGE is not WRITEONLY, the system must additionally copy the original data from VM into the temporary buffer first, which is why omitting WRITEONLY reduces program performance. CPU writes to AM also deserve attention. A CPU write to AM generally uses write combining: writes are buffered into a cache line and flushed to AM when the LINE fills. The first thing to note is that write-combined data is only weakly ordered (graphics data generally tolerates this); it is said that the D3D RUNTIME and an NV DRIVER once had a small BUG in which the GPU began drawing from a resource before the CPU cache had been flushed to AM, causing errors, the remedy being an instruction such as SFENCE to flush the cache line. Second, try to fill one whole cache line per write; otherwise there is extra latency, because the CPU flushes an entire CACHE LINE at a time, and if only some bytes of a LINE were written, the CPU must first read the full LINE of data back from AM and then flush again. Third, write sequentially; random writes make WRITE COMBINING a burden. If a resource must be written randomly, do not create it with D3DUSAGE_DYNAMIC; use D3DPOOL_MANAGED so that the writes are completed in SM.

Ordinary (non-dynamic) textures in D3DPOOL_DEFAULT cannot be locked, because they sit in VM and can only be updated through UPDATESURFACE and UPDATETEXTURE. Why does D3D let us lock static VBs and IBs but not static textures? I can only guess at two reasons: first, textures are usually large, and the GPU stores them in a two-dimensional layout; second, the GPU keeps textures in its NATIVE FORMAT, not in RGBA. A dynamic texture, by contrast, declares that it will be modified frequently, so D3D stores it specially. Textures modified at very high frequency are not suited to the dynamic attribute either; there are two such scenarios, a RENDERTARGET written by the GPU and a texture written by the CPU every frame (video playback, say). We know dynamic resources are generally placed in AM; GPU access to AM must cross the AGP/PCI-E bus and is much slower than VM, while CPU access to AM is much slower than SM. If the resource is dynamic, both the GPU and the CPU will keep touching it, so for such resources it is best to create one copy in D3DPOOL_DEFAULT and one in D3DPOOL_SYSTEMMEM and perform the two-way updates manually. Never create a RENDERTARGET with the D3DPOOL_MANAGED attribute; it is extremely inefficient (the analysis is left to the reader). For resources that change less frequently, the recommendation is to create them in DEFAULT and update them manually, because the efficiency lost to an occasional update is far less than the cost of the GPU continuously accessing AM.

An ill-considered LOCK seriously hurts program performance, because in general a LOCK must wait until all rendering commands issued before it have finished executing; otherwise a resource still in use might be modified. From the moment the LOCK returns to the moment UNLOCK completes, the GPU sits idle with respect to that resource, so CPU/GPU parallelism goes unused. DX8.0 introduced a new LOCK flag, D3DLOCK_DISCARD, which promises that the resource will not be read, only completely overwritten. The DRIVER and RUNTIME then cooperate to immediately return a pointer to a fresh region of VM, and the old region is discarded after the UNLOCK; the CPU's LOCK thus need not wait for the GPU to finish using the resource and can continue operating on the graphics resource (vertex buffers and index buffers). This technique is called VB/IB "renaming".
Much of the confusion comes from the lack of information about the underlying layers. I believe that if MS opened the D3D source code and the driver interface specification, and NV/ATI opened their driver and hardware architecture information, all of this would be easy to understand.

By the way, a book advertisement: the Chinese edition of "Artificial Intelligence: A Modern Approach" is now in stock at Joyo (Amazon China). It is an AI masterpiece, although reading it requires a considerable foundation; it is very enlightening. Don't miss it if you were planning to buy. In the future I will be shifting my study focus from graphics to AI; interested friends are welcome to exchange ideas.

This article is from a CSDN blog; when reproducing it, please cite the source: http://blog.csdn.net/udking/archive/2010/12/01/6048211.aspx
