DirectCompute: An Overview of DirectX 11 Compute Shader Programming

Contents
  • Sample code
  • Device management
  • Selecting the video card to use
  • Running the compute shader
  • Resources in DirectCompute
  • Compute shader (CS) HLSL programming

Note: DirectX has always been a core technology for Windows and for game development. DirectX provides programs that run on the graphics card, called shaders. Before DirectX 11, each shader was bound to a specific rendering stage, such as the pixel shader and the vertex shader. DirectX 11 added the compute shader, which is designed for general-purpose computing unrelated to graphics, and this makes DirectX a general-purpose GPU computing platform. Given the GPU's enormous parallel computing power, learning DirectCompute is worthwhile. Most people have never used the DirectX graphics interfaces and have no DirectX experience. This article introduces general-purpose computing with DirectCompute from scratch and requires no graphics programming experience, so I translated it. Take a look if you are interested.

Address: http://openvidia.sourceforge.net/index.php/DirectCompute

This article introduces DirectCompute programming. It aims to present the basic concepts from the beginning, for readers who have no DirectX programming experience, and it also introduces the compute shader. I hope it helps you get started with general-purpose GPU computing using DirectCompute.

Sample Code

Complete sample code: this link contains the minimal sample code (.cpp) needed for a complete console-based DirectCompute program. It is a console program with no window or graphics code.

Compute shader code: this link contains the complete shader code (.hlsl).

The example program walks through the steps required to run a compute shader:

  1. Initialize the device and context.
  2. Load the shader program from the HLSL file and compile it.
  3. Create and initialize resources (such as buffers) for the shader.
  4. Set the shader state and execute the shader.
  5. Read back the results.

We discuss each step in turn below.

Device management

DirectCompute is fully available only through the Compute Shader 5.0 (CS 5.0) programming model, and CS 5.0 requires DirectX 11 hardware. Without DirectX 11 hardware (which was rare when this article was written: only the ATI HD5000 series of graphics cards), we can still write DirectCompute programs. One option is the reference driver (software emulation). The other is a downlevel configuration: on DirectX 10 hardware (for example, NVIDIA G80, G92, and GT200 series graphics cards), run Compute Shader 4.0, which implements part of the compute shader feature set. To do this:

  • Write the program against the DirectX 11 API (that is, call the ID3D11 ... interfaces).
  • Create a DX11 device, but specify the DX10 feature level (CS 4.0) when creating it.

The following code shows how to call D3D11CreateDevice() several times, each time with a different driver type (the reference type for software emulation, or the hardware type for real GPU acceleration) and a list of feature levels (DX10, DX10.1, or DX11).

D3D_FEATURE_LEVEL levelsWanted[] =
{
    D3D_FEATURE_LEVEL_11_0,
    D3D_FEATURE_LEVEL_10_1,
    D3D_FEATURE_LEVEL_10_0
};
UINT numLevelsWanted = sizeof(levelsWanted) / sizeof(levelsWanted[0]);

D3D_DRIVER_TYPE driverTypes[] =
{
    D3D_DRIVER_TYPE_REFERENCE,
    D3D_DRIVER_TYPE_HARDWARE,
};
UINT numDriverTypes = sizeof(driverTypes) / sizeof(driverTypes[0]);

// Loop over the driver types: try the reference driver first, then the
// hardware driver, and leave the loop as soon as one is created successfully.
// You can change the order above to try various configurations.
// Here the reference device is enough to demonstrate the API calls.
for (UINT driverTypeIndex = 0; driverTypeIndex < numDriverTypes; driverTypeIndex++)
{
    g_driverType = driverTypes[driverTypeIndex];
    UINT createDeviceFlags = 0;
    hr = D3D11CreateDevice(NULL, g_driverType, NULL, createDeviceFlags,
                           levelsWanted, numLevelsWanted, D3D11_SDK_VERSION,
                           &g_pD3DDevice, &g_D3DFeatureLevel, &g_pD3DContext);
    if (SUCCEEDED(hr))
        break;
}

When the call succeeds, it yields a device pointer, a context pointer, and a value that reports the feature level that was obtained.

NOTE: For simplicity, the code above omits many variable declarations. See the complete sample code for the missing parts. The snippets here are only meant to show what happens in the program.
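
To make the feature-level fallback concrete, here is a minimal sketch of the selection logic as I understand it: the requested levels are tried in array order, and the first one the device supports is reported back. The sketch is plain C++ so that it compiles without the Windows SDK. FeatureLevel, pickFeatureLevel, and hasFullComputeShader5 are hypothetical stand-ins and not part of the D3D11 API; only the numeric values mirror the real D3D_FEATURE_LEVEL enum.

```cpp
#include <cassert>
#include <vector>

// Stand-in for D3D_FEATURE_LEVEL (assumption: used only so the sketch
// builds without d3d11.h). The values mirror the real enum.
enum FeatureLevel : unsigned {
    FEATURE_LEVEL_10_0 = 0xa000,
    FEATURE_LEVEL_10_1 = 0xa100,
    FEATURE_LEVEL_11_0 = 0xb000
};

// Hypothetical helper: walk the wanted levels in order and return the
// first one that does not exceed what the hardware supports.
// Returns 0 when none of the wanted levels is available.
inline unsigned pickFeatureLevel(const std::vector<FeatureLevel>& wanted,
                                 FeatureLevel hardwareMax) {
    for (FeatureLevel level : wanted) {
        if (level <= hardwareMax) {
            return level;
        }
    }
    return 0;
}

// Full CS 5.0 needs feature level 11_0; lower levels only offer CS 4.x.
inline bool hasFullComputeShader5(unsigned level) {
    return level >= FEATURE_LEVEL_11_0;
}
```

With the list ordered from 11_0 down to 10_0, a DX10-class card would end up at FEATURE_LEVEL_10_0, which tells the program to restrict itself to the CS 4.x subset.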

Select the video card to use

You can use an IDXGIFactory1 object to enumerate the video cards installed in the system, as shown in the code below. Create an IDXGIFactory1 object and call EnumAdapters1, passing an integer index that identifies the adapter to enumerate. If no adapter exists at that index, the call returns DXGI_ERROR_NOT_FOUND.

// Collect all installed graphics adapters
std::vector<IDXGIAdapter1*> vAdapters;
IDXGIFactory1* factory;
CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&factory);
IDXGIAdapter1* pAdapter = 0;
UINT i = 0;
while (factory->EnumAdapters1(i, &pAdapter) != DXGI_ERROR_NOT_FOUND)
{
    vAdapters.push_back(pAdapter);
    ++i;
}

Next, when you call D3D11CreateDevice to create the device, pass the pointer to the desired adapter as the first parameter and set the driver type to D3D_DRIVER_TYPE_UNKNOWN. For details, see the D3D11CreateDevice documentation.

g_driverType = D3D_DRIVER_TYPE_UNKNOWN;
hr = D3D11CreateDevice(vAdapters[devNum], g_driverType, NULL, createDeviceFlags,
                       levelsWanted, numLevelsWanted, D3D11_SDK_VERSION,
                       &g_pD3DDevice, &g_D3DFeatureLevel, &g_pD3DContext);

Running the compute shader

Note: A shader is a program that runs on the video card, unlike a program that executes on the CPU. It can be written in HLSL (see the HLSL section below).

A compute shader in a DirectCompute program is executed through the Dispatch function:

// Dispatch (run) the compute shader with 16 x 16 thread groups
g_pD3DContext->Dispatch(16, 16, 1);

The statement above dispatches 16 x 16 thread groups.

Note that the inputs of a shader are treated as "state". That is, you set the state before dispatching the shader program, and once the shader is dispatched, that state determines the values of its input variables. Shader dispatch code therefore usually looks like this:

pd3dImmediateContext->CSSetShader( ... );
pd3dImmediateContext->CSSetConstantBuffers( ... );
pd3dImmediateContext->CSSetShaderResources( ... );       // CS input
// CS output
pd3dImmediateContext->CSSetUnorderedAccessViews( ... );
// Run the CS
pd3dImmediateContext->Dispatch( dimx, dimy, 1 );

Everything the shader program sees, such as constant buffers and other buffers, is set up by calling the CSSet...() functions before the threads are dispatched.

Synchronizing with the CPU

Note that all of the calls above are asynchronous: they return to the CPU immediately, before the work has actually executed. Only when a buffer "mapping" operation is called later (see the buffer section below) does the calling CPU thread stall, if necessary, and wait for all asynchronous operations to complete.

Events: basic profiling and synchronization

DirectCompute provides a query-based event API. You can create a query, insert it, and wait on it to determine when a shader (or another asynchronous call) has actually executed. The following example creates a query, waits on it to make sure that everything issued up to that point has finished, then dispatches the shader, and finally waits on the query again to confirm that the shader has finished.

Create a query object:

D3D11_QUERY_DESC pQueryDesc;
pQueryDesc.Query = D3D11_QUERY_EVENT;
pQueryDesc.MiscFlags = 0;
ID3D11Query *pEventQuery;
g_pD3DDevice->CreateQuery( &pQueryDesc, &pEventQuery );

Insert a "fence" into the stream of calls and wait on it. GetData() returns S_FALSE while the query result is not yet available.

g_pD3DContext->End(pEventQuery);   // insert a fence into the push buffer
while (g_pD3DContext->GetData(pEventQuery, NULL, 0, 0) == S_FALSE) {}   // spin until the event finishes

g_pD3DContext->Dispatch(x, y, 1);  // launch the compute shader

g_pD3DContext->End(pEventQuery);   // insert a fence into the push buffer
while (g_pD3DContext->GetData(pEventQuery, NULL, 0, 0) == S_FALSE) {}   // spin until the event finishes

Finally, use this statement to release the query object:

pEventQuery->Release();

Create and release query objects with care so that you do not end up with too many of them (especially if you create them every frame).

See the MSDN documentation on queries for details.

Resources in DirectCompute

A resource is data that can be accessed by the GPU or the CPU. Resources are the inputs and outputs of a shader, and they include buffers, textures, and other types.

In DirectX, resources are created according to the following steps:

  1. Create a resource descriptor that describes the resource to be created. The descriptor is a struct containing flags and the required information about the resource.
  2. Call one of the Create...() methods, passing the descriptor as a parameter, to create the resource.

Communication between CPU and GPU

The g_pD3DContext->CopyResource() function reads or copies resources. Copying here means copying between two resources. To copy between the CPU and the GPU (between system memory and video memory), you first create a "staging" resource on the CPU side. A staging resource can be mapped to a CPU memory pointer so that data can be read from it or written to it. The staging resource is then unmapped, and CopyResource() is used to copy to or from the GPU.

Performance of buffer copies between CPU and GPU

In CUDA C (CUDA is NVIDIA's general-purpose GPU computing platform), you can allocate pinned host pointers and write-combined host pointers, which give the best performance for copying data to and from the GPU. In DirectCompute, the "usage" attribute of a buffer determines the type of memory allocated and the access performance.

  • D3D11_USAGE_STAGING resources are system memory that the GPU can read and write directly. However, they can only be used as the source or destination of copy operations (CopyResource(), CopySubresourceRegion()), not directly in a shader.

    • If the D3D11_CPU_ACCESS_WRITE flag is specified when the resource is created, CPU-to-GPU copies get the best performance.
    • If D3D11_CPU_ACCESS_READ is specified, the resource is CPU-cached and performance is lower (but reading back is supported).
    • If both flags are specified, READ takes precedence over WRITE.
  • D3D11_USAGE_DYNAMIC (only for buffer resources, not for texture resources) is used for fast CPU->GPU memory transfers. These resources can be used not only as copy sources and destinations, but can also be read in a shader as a texture (in D3D terms, through a ShaderResourceView). The shader cannot write to them, however. These resources are versioned by the driver: each time you map the memory with the DISCARD flag, if the memory is still in use by the GPU, the driver hands out a new block of memory instead of waiting for the GPU to finish. This provides a streaming way to deliver data to the GPU.

See the MSDN documentation on usage for details.

Structured buffers and unordered access views

A very important feature of compute shaders is the structured buffer together with the unordered access view. A structured buffer can be accessed like an array in the compute shader, and any thread can read or write any location (that is, the scatter and gather operations of parallel programs). An unordered access view (UAV) is a mechanism that binds resources created by the caller to the shader and allows ... unordered access.
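
For readers new to the terms, the following is a small CPU-side sketch of what "gather" and "scatter" mean, with each loop iteration standing in for one shader thread. It is an illustration in plain C++ only; the function names gather and scatter are hypothetical and nothing here calls DirectCompute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather: "thread" i reads from an arbitrary location indices[i]
// and writes the value to its own output slot i.
inline std::vector<int> gather(const std::vector<int>& src,
                               const std::vector<std::size_t>& indices) {
    std::vector<int> out(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i) {
        out[i] = src[indices[i]];
    }
    return out;
}

// Scatter: "thread" i reads its own input slot i and writes the value
// to an arbitrary location indices[i] in the output.
inline std::vector<int> scatter(const std::vector<int>& src,
                                const std::vector<std::size_t>& indices,
                                std::size_t outSize) {
    assert(indices.size() == src.size());
    std::vector<int> out(outSize, 0);
    for (std::size_t i = 0; i < src.size(); ++i) {
        out[indices[i]] = src[i];
    }
    return out;
}
```

Scatter is the operation that needs a writable view such as a UAV, because the write location is chosen by the thread at run time.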

Declaring a structured buffer

We create a structured buffer with D3D11_RESOURCE_MISC_BUFFER_STRUCTURED. The bind flags specified below allow unordered access from the shader. The default usage specified below means that the buffer can be read and written by the GPU, but must be copied to a staging resource before the CPU can read or write it.

// Create a structured buffer
// D3DXVECTOR4 is declared in D3DX10Math.h, see
// http://msdn.microsoft.com/en-us/library/bb205130(VS.85).aspx
D3D11_BUFFER_DESC sbDesc;
sbDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
sbDesc.Usage = D3D11_USAGE_DEFAULT;
sbDesc.CPUAccessFlags = 0;
sbDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
sbDesc.StructureByteStride = sizeof(D3DXVECTOR4);
sbDesc.ByteWidth = sizeof(D3DXVECTOR4) * w * h;
hr = g_pD3DDevice->CreateBuffer(&sbDesc, NULL, &pStructuredBuffer);

Declaring an unordered access view

Next, we declare an unordered access view. Note that it must be given a pointer to the structured buffer.

// Create an unordered access view that points to the structured buffer
D3D11_UNORDERED_ACCESS_VIEW_DESC sbUAVDesc;
sbUAVDesc.Buffer.FirstElement = 0;
sbUAVDesc.Buffer.Flags = 0;
sbUAVDesc.Buffer.NumElements = w * h;
sbUAVDesc.Format = DXGI_FORMAT_UNKNOWN;
sbUAVDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
hr = g_pD3DDevice->CreateUnorderedAccessView(pStructuredBuffer, &sbUAVDesc, &g_pStructuredBufferUAV);
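
The buffer descriptor and the view descriptor have to agree on the element count and size. As a rough sketch of that bookkeeping, the plain C++ below computes the three numbers that appear in the two descriptors. Float4 is an assumed stand-in for D3DXVECTOR4 (four floats), and StructuredBufferLayout and layoutFor are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed stand-in for D3DXVECTOR4: four 32-bit floats.
struct Float4 {
    float x, y, z, w;
};
static_assert(sizeof(Float4) == 16, "element stride is expected to be 16 bytes");

// The values that must stay consistent between the two descriptors.
struct StructuredBufferLayout {
    std::size_t strideBytes;  // goes into StructureByteStride
    std::size_t numElements;  // goes into the UAV's Buffer.NumElements
    std::size_t byteWidth;    // goes into ByteWidth
};

// Hypothetical helper: layout for a w x h grid with one element per cell.
inline StructuredBufferLayout layoutFor(std::size_t w, std::size_t h) {
    StructuredBufferLayout layout;
    layout.strideBytes = sizeof(Float4);
    layout.numElements = w * h;
    layout.byteWidth = layout.strideBytes * layout.numElements;
    return layout;
}
```

For a 64 x 64 grid this gives 4096 elements and a 65536-byte buffer, so a view that declares w * h elements covers the whole buffer.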

Then, before dispatching the shader threads, we bind the structured buffer so that the shader can use it:

g_pD3DContext->CSSetUnorderedAccessViews( 0, 1, &g_pStructuredBufferUAV, &initCounts );

After the dispatch, if you are running on CS 4.x hardware, you must unbind the view, because CS 4.x supports only one UAV bound to a pipeline at a time.

// When running on D3D10 hardware, only one UAV can be bound to a pipeline at a time;
// set it to NULL to unbind
ID3D11UnorderedAccessView *pNullUAV = NULL;
g_pD3DContext->CSSetUnorderedAccessViews(0, 1, &pNullUAV, &initCounts);

Constant buffers in DirectCompute

A constant buffer is a set of data that cannot change while the compute shader is running. In a graphics program, a constant buffer might hold a view matrix or color constants. In a general-purpose compute program, it might hold data such as signal filter weights or image processing parameters.

To use a constant buffer:

  • Create the buffer
  • Initialize its data through memory mapping (you can also use the effects interface)
  • Use CSSetConstantBuffers to bind the constant buffer

The following code creates a constant buffer. Pay attention to the buffer size: here we know that in HLSL it holds a single four-element vector.

// Create the constant buffer
// D3DXVECTOR4 is declared in D3DX10Math.h, see
// http://msdn.microsoft.com/en-us/library/bb205130(VS.85).aspx
D3D11_BUFFER_DESC cbDesc;
cbDesc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
cbDesc.Usage = D3D11_USAGE_DYNAMIC;              // CPU writable, so that we can update the data every frame
cbDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
cbDesc.MiscFlags = 0;
cbDesc.ByteWidth = sizeof(D3DXVECTOR4);
hr = g_pD3DDevice->CreateBuffer(&cbDesc, NULL, &pConstantBuffer);

Next, we send data to the constant buffer through memory mapping. Typically, programmers define a struct in the CPU program that matches the one in HLSL, use sizeof to get its size, and then cast the mapped buffer pointer to that struct to fill in the data.

// D3D11_MAP_WRITE_DISCARD must be used, see
// http://msdn.microsoft.com/en-us/library/bb205318(VS.85).aspx
D3D11_MAPPED_SUBRESOURCE mappedResource;
g_pD3DContext->Map(pConstantBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
unsigned int *data = (unsigned int *)(mappedResource.pData);
for (int i = 0; i < 4; i++)
    data[i] = i;
g_pD3DContext->Unmap(pConstantBuffer, 0);
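
The "matching struct" approach mentioned above might look like the sketch below. It is plain C++ that only models the CPU side: Consts is an assumed mirror of an HLSL cbuffer holding one uint4, and writeConsts is a hypothetical helper that copies the struct into whatever pointer Map() returned. The static_assert reflects the D3D11 requirement that a constant buffer's ByteWidth be a multiple of 16 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed CPU-side mirror of an HLSL cbuffer with a single uint4 member.
struct Consts {
    std::uint32_t const_color[4];
};
static_assert(sizeof(Consts) % 16 == 0,
              "constant buffer size should be a multiple of 16 bytes");

// Hypothetical helper: copy the struct into the mapped memory.
// mappedData is assumed to point at a region of at least sizeof(Consts) bytes,
// such as the pData pointer obtained from Map().
inline void writeConsts(void* mappedData, const Consts& values) {
    assert(mappedData != nullptr);
    std::memcpy(mappedData, &values, sizeof(Consts));
}
```

Using sizeof(Consts) for both ByteWidth and the copy keeps the two sides in step when members are added later, provided the struct is padded to the next multiple of 16 bytes.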

Note that a shader's input variables (here, the constant buffer) are treated as "state". The CSSet...() functions must therefore be used to set that state before the shader is dispatched, so that the variables can be accessed while the compute shader executes.

// Bind the shader and the constant buffer
g_pD3DContext->CSSetShader(g_pComputeShader, NULL, 0);
g_pD3DContext->CSSetConstantBuffers(0, 1, &pConstantBuffer);

Then we can declare some variables in the shader as constant buffer variables:

cbuffer consts {
    uint4 const_color;
};

Finally, you can use them in the shader code:

uint4 color = uint4( groupID.x, groupID.y , const_color.x, const_color.z );

Multiple constant buffers

You can also define several constant buffers. Typically, the values grouped in one constant buffer are those that need to be updated together. For a single group, you can simply write:

cbuffer consts {
    uint4 const_color_0;
    uint4 const_color_1;
};

You can also split them up like this:

cbuffer consts {
    uint4 const_color_0;
};
cbuffer more_consts {
    uint4 const_color_1;
};

Splitting is useful if const_color_0 needs to be updated on every call while const_color_1 needs to be updated only once every 100 calls. To set them separately, create a buffer for each one and map its memory as before. Then, before dispatching the compute shader, assign a "slot" to each constant buffer. The slot determines where the data appears in HLSL.

g_pD3DContext->CSSetConstantBuffers( 0, 1, &pConstantBuffer );
g_pD3DContext->CSSetConstantBuffers( 1, 1, &pVeryConstantBuffer );
g_pD3DContext->CSSetShader( g_pComputeShader, NULL, 0 );

Finally, when the compute shader runs, the shader pointed to by g_pComputeShader can access both constant buffers.

Compute shader (CS) HLSL programming

The compute shader that runs on the graphics card is written in HLSL (High Level Shader Language). In our example it exists as text and is compiled dynamically at run time. A compute shader is a single program executed by many threads in parallel. The threads are divided into "thread groups", and the threads within a thread group can share data and synchronize with one another.

Thread groups are created by calling Dispatch. For example, Dispatch(16, 16, 1) creates 16 x 16 thread groups. The number of threads in each thread group is specified in the shader code with this syntax:

[numthreads( 4, 4, 1)]

We recommend using #define to define the thread group dimensions (here 4 and 4) as constants, so that you can also use the values in the shader code.

// Thread group size
#define thread_group_size_x 4
#define thread_group_size_y 4

RWStructuredBuffer<BufferStruct> g_OutBuff;

/* This specifies the number of threads in a thread group,
   in this example 4x4x1 = 16 threads */
// equivalent to [numthreads(4, 4, 1)]
[numthreads(thread_group_size_x, thread_group_size_y, 1)]
void main(uint3 threadIDInGroup : SV_GroupThreadID,
          uint3 groupID : SV_GroupID,
          uint groupIndex : SV_GroupIndex,
          uint3 dispatchThreadID : SV_DispatchThreadID)
{
    int N_THREAD_GROUPS_X = 16;  // assumed to be 16 because the dispatch is Dispatch(16, 16, 1)
    int stride = thread_group_size_x * N_THREAD_GROUPS_X;  // buffer stride; assumes the stride equals the data width (no padding)
    int idx = dispatchThreadID.y * stride + dispatchThreadID.x;
    float4 color = float4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
    g_OutBuff[idx].color = color;
}
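
The index arithmetic in the shader can be checked on the CPU. The sketch below is plain C++ that replays the same idx computation for every thread of a Dispatch(16, 16, 1) with 4 x 4 threads per group and counts how often each buffer element is written. The name countWrites and the constant N_THREAD_GROUPS_Y are assumptions made for this illustration; the nested loops merely simulate the threads and run sequentially.

```cpp
#include <cassert>
#include <vector>

const int THREAD_GROUP_SIZE_X = 4;
const int THREAD_GROUP_SIZE_Y = 4;
const int N_THREAD_GROUPS_X = 16;  // from Dispatch(16, 16, 1)
const int N_THREAD_GROUPS_Y = 16;  // assumed name for the second dispatch dimension

// Replays idx = dispatchThreadID.y * stride + dispatchThreadID.x for every
// simulated thread and returns the number of writes per buffer element.
inline std::vector<int> countWrites() {
    const int stride = THREAD_GROUP_SIZE_X * N_THREAD_GROUPS_X;  // 64 elements per row
    const int height = THREAD_GROUP_SIZE_Y * N_THREAD_GROUPS_Y;  // 64 rows
    std::vector<int> writes(stride * height, 0);
    for (int gy = 0; gy < N_THREAD_GROUPS_Y; ++gy) {
        for (int gx = 0; gx < N_THREAD_GROUPS_X; ++gx) {
            for (int ty = 0; ty < THREAD_GROUP_SIZE_Y; ++ty) {
                for (int tx = 0; tx < THREAD_GROUP_SIZE_X; ++tx) {
                    // Global thread coordinates across all groups
                    const int dispatchX = gx * THREAD_GROUP_SIZE_X + tx;
                    const int dispatchY = gy * THREAD_GROUP_SIZE_Y + ty;
                    const int idx = dispatchY * stride + dispatchX;
                    ++writes[idx];
                }
            }
        }
    }
    return writes;
}
```

Every element of the 64 x 64 buffer should be written exactly once, which is what makes it safe for the threads to write without any synchronization in this example.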

All threads execute the same function. Each thread has its own unique thread ID, and each thread group has its own group ID. These IDs are usually used to compute the location to access in an array. In this way you can launch any number of threads and let them access the elements of an array in parallel. The entry function of a compute shader can take any combination of the following parameters, each of which has a special meaning:

  • uint3 threadIDInGroup : SV_GroupThreadID (the ID of the thread within its group, in three dimensions)
  • uint3 groupID : SV_GroupID (the ID of the thread group, in three dimensions)
  • uint groupIndex : SV_GroupIndex (the flattened, linear index of the thread within its group, computed from the three dimensions in raster order)
  • uint3 dispatchThreadID : SV_DispatchThreadID (the ID of the thread among all dispatched threads across groups, in three dimensions)
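
The four values are related by simple arithmetic. As a sketch, the plain C++ below reproduces the relationships documented for HLSL: the dispatch thread ID is the group ID times the group size plus the thread ID within the group, and the group index flattens the in-group thread ID in x, then y, then z order. UInt3 and the two function names are hypothetical stand-ins for the HLSL types and semantics.

```cpp
#include <cassert>

// Stand-in for HLSL's uint3.
struct UInt3 {
    unsigned x, y, z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID
inline UInt3 dispatchThreadID(UInt3 groupID, UInt3 numThreads, UInt3 groupThreadID) {
    UInt3 id;
    id.x = groupID.x * numThreads.x + groupThreadID.x;
    id.y = groupID.y * numThreads.y + groupThreadID.y;
    id.z = groupID.z * numThreads.z + groupThreadID.z;
    return id;
}

// SV_GroupIndex = z * nx * ny + y * nx + x, using the in-group thread ID
inline unsigned groupIndex(UInt3 groupThreadID, UInt3 numThreads) {
    return groupThreadID.z * numThreads.x * numThreads.y
         + groupThreadID.y * numThreads.x
         + groupThreadID.x;
}
```

For example, with [numthreads(4, 4, 1)], thread (1, 2, 0) of group (2, 3, 0) has dispatch thread ID (9, 14, 0) and group index 9.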

 

Note: This article covers a good deal of the basics of compute shader programming. However, it leaves out many important topics, including much of the HLSL syntax, atomic operations, synchronization, registers, shader resource views (SRVs), and 2D textures with read and write access. To make full use of DirectCompute, there is still a lot to learn.
