"Original Translation": A First Look at Unity Compute Shaders

Last Update:2015-03-27 Source: Internet

Author: User


I have always wanted to try translating something myself, and now I find that translation is really not easy. If you render the author's text directly, following the English way of thinking, the Chinese reads very awkwardly. But if you translate fully into natural Chinese, you worry that your own understanding falls short and that you have gone against the author's intent. So I suspect translators in China are often in a bind, and the next time I read a translation I will do so with more empathy. This is my first time translating a whole article and my ability is limited, so please point out anything that is translated badly.

Compute shaders have actually been in Unity for a long time. I have always been interested in shaders, so I have recently been trying to learn compute shaders as well. Discussions on foreign forums show that many people use them, but they see little practical use in China. Unity's official documentation is, as ever, vague and contains little useful information, so I turned to Google and found this article. It does not go very deep into compute shaders, and I think the original English is unclear in places, but it covers the basics and makes good reference material.

Original link: http://kylehalladay.com/blog/tutorial/2014/06/27/Compute-Shaders-Are-Nifty.html

The translation follows:

I really like the simplicity of v&f shaders (that is, the familiar vertex and fragment shaders; translator's note). They do only one thing (get vertices and colors onto the screen), and they do it very well. But sometimes that simplicity feels like a constraint, and you find yourself staring at a pile of matrix math running on the CPU, desperately trying to work out how to squeeze it into a texture.

Maybe I'm the only one who worries about this. Either way, compute shaders solve that problem, and they are very simple to use, so today I'm going to cover the basics of compute shaders. First I'll go through the compute shader code that Unity generates for you automatically, then finish with an example compute shader that uses a structured buffer.

Compute shader can be used to control the position of particles.

What exactly is a compute shader?

In short, a compute shader is a program that runs on the GPU and does not need to operate on mesh or texture data. It works inside the OpenGL or DirectX memory space (unlike OpenCL, which has its own memory space), can output buffers or textures, and can share memory between threads of execution.

Right now Unity only supports DX11 (DirectX 11) compute shaders, but once OpenGL 4.3 support catches up, we Mac fans can hope to use them too.

This also means that this is my first Windows-only tutorial to date. So if you are not on a Windows machine, it may not be much help to you.

So what are they good at? (And what are they not good at?)

Two words: math and parallelism. Any problem that involves performing the same set of operations (with no conditional branching) on every element of a data set is a good fit. And the larger that set of operations, the more you gain from running it on the GPU.
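To make the "same operations on every element" idea concrete, here is a small CPU-side Python sketch. It is not Unity or GPU code, and the names `kernel` and `run_all` are invented for illustration. The point is only that each call depends on nothing but its own index, which is the property that lets a GPU run all of the calls at once.

```python
def kernel(index, data):
    """Per-element work: reads only data[index], never its neighbours."""
    return data[index] * 2.0 + 1.0


def run_all(data):
    # A GPU would run every index at the same time; here we simply loop.
    return [kernel(i, data) for i in range(len(data))]


values = [0.0, 1.0, 2.5]
result = run_all(values)
```

Because no element depends on another, the order in which the indices are processed does not change the result.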

Conditional branching can seriously hurt performance, because GPUs are not good at handling it. That is no different from writing v&f shaders, though, so if you have some experience with those, this will not trouble you much.

There is also the problem of latency. Passing data from GPU memory back to the CPU takes time, which can become a bottleneck when you use compute shaders. You can reduce the cost by optimizing your kernels to work on the smallest buffers possible, but it can never be avoided entirely.

Got that? Good, let's get started.

Since we are working with DirectX, Unity's compute shaders are written in HLSL. It is not much different from other shader languages, so if you have written Cg or GLSL you will be fine (this was my first time writing HLSL too).

The first thing to do is create a compute shader. Unity's Project panel already has an option for this, so this step should be very simple. If you open the newly created file, you will see the following automatically generated code (I have removed the comments for brevity):

    #pragma kernel CSMain

    RWTexture2D<float4> Result;

    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
    }

This code is a great starting point for understanding compute shaders, so let's go through it line by line:

    #pragma kernel CSMain

This declares the entry point of the program (the equivalent of a main function for the compute shader). A single compute shader file can declare more than one of these functions, and you can call whichever one you need from a script.

    RWTexture2D<float4> Result;

This declares a variable containing the data the shader will work with. Since we are not working with mesh data, you have to declare explicitly what data your compute shader reads and writes. The "RW" in front of the data type indicates that the shader can both read from and write to the variable.

    [numthreads(8,8,1)]

This line specifies the size of the thread groups that the compute shader will spawn. GPUs get their massive parallel processing power from running many threads at the same time, and thread groups specify how those spawned threads are organized. In the code above, each thread group contains 64 threads, which can be addressed like a two-dimensional 8 × 8 array.

Determining the optimal size of a thread group is a complex question that depends heavily on your target hardware. In general, think of your GPU as a collection of stream processors, each of which can execute X threads simultaneously. Each processor runs one thread group at a time, so ideally you want a thread group to contain X threads to make full use of the processor. The values I set here only reflect my own setup, so rather than giving you advice on the optimal value, I'll let you Google it yourself (and then share what you find on Twitter :D).
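The "X threads at a time" mental model can be turned into a quick back-of-the-envelope check. The Python sketch below is not GPU code, and the hardware widths of 32 and 64 are assumed example values rather than figures from the article. It only shows what fraction of a processor's lanes a given group size would keep busy under that simplified model.

```python
def lane_utilization(group_threads, hardware_width):
    """Fraction of lanes kept busy when one thread group runs on one processor.

    Simplified model: threads are issued in batches of `hardware_width`,
    and a partly filled batch still costs a whole batch.
    """
    batches = -(-group_threads // hardware_width)  # ceiling division
    return group_threads / (batches * hardware_width)


full = lane_utilization(8 * 8 * 1, 64)   # the 8 x 8 x 1 group on an assumed width of 64
small = lane_utilization(16, 64)         # a 16-thread group on the same assumed width
```

Real hardware scheduling is more involved than this, so treat the numbers as intuition rather than a prediction.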

The rest of the shader is ordinary code. The kernel function works out which pixel it should handle from the ID of the thread running it, then writes some data into the Result buffer. Simple, right?
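To see what the generated kernel writes for a given thread, its expression can be mirrored on the CPU. This Python sketch is only an illustration: `cs_main` is a made-up name, and any clamping or format conversion the texture applies on write is not modeled.

```python
def cs_main(x, y):
    """Mirror of the generated kernel body for the thread with ID (x, y)."""
    red = float(x & y)          # bitwise AND of the two coordinates
    green = (x & 15) / 15.0     # repeats every 16 pixels along x
    blue = (y & 15) / 15.0      # repeats every 16 pixels along y
    return (red, green, blue, 0.0)


pixel = cs_main(5, 3)
```

For thread (5, 3), `5 & 3` is 1, so the first component is 1.0, while the other two fall between 0 and 1 and repeat every 16 pixels.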

Actually running the shader

Because a compute shader does not work with mesh data, we obviously cannot just attach it to a mesh and have it run. A compute shader has to be set up and invoked from a script, like this:

    public ComputeShader shader;

    void RunShader()
    {
        int kernelHandle = shader.FindKernel("CSMain");

        // 256 x 256 matches the 32 x 32 thread groups of 8 x 8 threads dispatched below.
        // The depth value was lost in the source text; 24 is assumed here.
        RenderTexture tex = new RenderTexture(256, 256, 24);
        tex.enableRandomWrite = true;
        tex.Create();

        shader.SetTexture(kernelHandle, "Result", tex);
        shader.Dispatch(kernelHandle, 256/8, 256/8, 1);
    }

There is a lot to explain here. First, the enableRandomWrite property has to be set before the RenderTexture is created. This gives your compute shader write access to the texture. If you do not set this flag, you cannot use the texture as a write target in your shader.

Next we need a way to identify which kernel in the compute shader we want to call. The FindKernel method takes a string name that corresponds to one of the kernel names, like the one declared at the top of our shader. Remember that a single compute shader file can contain multiple kernels.

The ComputeShader.SetTexture call lets us move the data the shader needs from CPU memory to GPU memory. Moving data between the two memory spaces introduces latency into the program, and the performance cost grows in proportion to the amount of data you transfer. For that reason, if you intend to run your shader every frame, you should think seriously about how much data you really need to operate on.

The three integers passed to Dispatch specify the number of thread groups we want to spawn. Recall that the size of each thread group is set by the numthreads attribute in the compute shader, so in the example above the total number of threads spawned is:

32 × 32 thread groups × 64 threads per group = 65,536 threads.

That works out to one thread per pixel of the RenderTexture we are processing, which makes sense because each invocation of the kernel handles only one pixel.
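The thread-count arithmetic, and the way each thread ends up with its own pixel, can be checked with a few lines of Python. This is a CPU-side sketch with invented helper names. The formula for the thread ID follows HLSL's definition of SV_DispatchThreadID (group ID times group size, plus the thread's ID within its group).

```python
NUM_THREADS = (8, 8, 1)   # from [numthreads(8,8,1)] in the shader
GROUPS = (32, 32, 1)      # the 32 x 32 thread groups passed to Dispatch


def total_threads(groups, num_threads):
    per_group = num_threads[0] * num_threads[1] * num_threads[2]
    group_count = groups[0] * groups[1] * groups[2]
    return group_count * per_group


def dispatch_thread_id(group_id, thread_in_group, num_threads):
    """Global ID of one thread: group ID * group size + ID within the group."""
    return tuple(g * n + t for g, n, t in zip(group_id, num_threads, thread_in_group))


count = total_threads(GROUPS, NUM_THREADS)
first_id = dispatch_thread_id((0, 0, 0), (0, 0, 0), NUM_THREADS)
last_id = dispatch_thread_id((31, 31, 0), (7, 7, 0), NUM_THREADS)
```

The last thread of the last group lands on (255, 255), so the IDs cover a 256 × 256 grid exactly once.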

Now that we know how to write a compute shader and how to work with texture memory, let's see what else we can do with these things.

Structured buffers are a good thing

Working with texture data is a bit too much like our old v&f shaders to get me excited. It is time to unshackle the GPU and let it work on arbitrary data. Yes, that is possible, and it is as good as it sounds.

A structured buffer is a sequence of data containing a single data type. You can create a structured buffer of floats, or one of ints, but not one that stores both floats and ints. You declare a structured buffer in a compute shader like this:

    StructuredBuffer<float> floatBuffer;
    RWStructuredBuffer<int> readWriteIntBuffer;

What makes structured buffers really interesting is that the data type can be a struct, which is what we will do in the second example.

For our example, we are going to pass our compute shader some points, each with a matrix that we want to transform it by. We could do this with two separate buffers (one of Vector3s and one of Matrix4x4s), but it is easy to wrap the two into a single point/matrix pair struct, so let's do that.

In our C# script, we define the following data type:

    struct VecMatPair
    {
        public Vector3 point;
        public Matrix4x4 matrix;
    }

We also have to define a matching struct in the shader. HLSL does not provide Matrix4x4 or Vector3 types, but it does have data types with the same memory layout. Our shader ends up looking like this:

    #pragma kernel Multiply

    struct VecMatPair
    {
        float3 pos;
        float4x4 mat;
    };

    RWStructuredBuffer<VecMatPair> dataBuffer;

    // The x size of the thread group was lost in the source text; 16 is assumed here.
    [numthreads(16,1,1)]
    void Multiply (uint3 id : SV_DispatchThreadID)
    {
        // mul returns a float4; keep only xyz for the float3 position.
        dataBuffer[id.x].pos = mul(dataBuffer[id.x].mat, float4(dataBuffer[id.x].pos, 1.0)).xyz;
    }

Notice that our thread groups are now organized as a one-dimensional array. The dimensionality of the thread group makes no difference to performance, so you are free to choose whatever makes the most sense for your program.
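What the Multiply kernel does to a single element can be followed on the CPU. The Python sketch below is an illustration only: it stores the matrix as a list of rows and treats the point as a column vector with w = 1, and it does not model how Unity and HLSL lay matrices out in memory. The translation matrix is made-up sample data.

```python
def mul(mat, vec):
    """4 x 4 matrix (list of rows) times a 4-component column vector."""
    return [sum(row[c] * vec[c] for c in range(4)) for row in mat]


def multiply_kernel(i, data_buffer):
    """Transform element i in place, as one GPU thread would."""
    pair = data_buffer[i]
    x, y, z = pair["pos"]
    transformed = mul(pair["mat"], [x, y, z, 1.0])
    pair["pos"] = tuple(transformed[:3])  # keep xyz, drop w


translate = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]
data_buffer = [{"pos": (1.0, 2.0, 3.0), "mat": translate}]
multiply_kernel(0, data_buffer)
```

With a pure translation by (10, 20, 30), the point (1, 2, 3) ends up at (11, 22, 33).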

Setting up a structured buffer in a script is a little different from the texture example earlier. For a buffer, you need to specify how many bytes each element occupies. For the struct in our example, that is the number of floats it holds (3 for the vector plus 16 for the matrix) multiplied by the size of a float (4 bytes), so (3 + 16) × 4 = 76 bytes per element. The buffer is created like this:
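The stride arithmetic is easy to double-check. This short Python sketch uses the standard `struct` module to confirm that 19 tightly packed 32-bit floats occupy the size computed by hand. The constant names are chosen for illustration.

```python
import struct

FLOAT_SIZE = struct.calcsize("<f")   # size of one 32-bit float in bytes
VEC3_FLOATS = 3                      # the point
MATRIX_FLOATS = 16                   # the 4 x 4 matrix

stride = (VEC3_FLOATS + MATRIX_FLOATS) * FLOAT_SIZE
packed_size = struct.calcsize("<19f")  # 19 floats with no padding
```

Both values come out to 76, the per-element size described above.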

    public ComputeShader shader;

    void RunShader()
    {
        VecMatPair[] data = new VecMatPair[5];
        // INITIALIZE DATA HERE

        // Stride: (3 + 16) floats * 4 bytes = 76 bytes per element.
        ComputeBuffer buffer = new ComputeBuffer(data.Length, 76);
        buffer.SetData(data); // upload the array so the shader has input to work on
        int kernel = shader.FindKernel("Multiply");
        shader.SetBuffer(kernel, "dataBuffer", buffer);
        shader.Dispatch(kernel, data.Length, 1, 1);
    }
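As with the texture example, the arguments to Dispatch count thread groups, not threads. If a kernel's group size along x is larger than one, the number of groups needed to cover N elements is a ceiling division, and any surplus threads in the last group would need a bounds check in the kernel. The Python sketch below illustrates that arithmetic. The group size of 16 is an assumed example value, and the helper names are invented.

```python
def groups_needed(element_count, group_size):
    """Smallest number of thread groups whose threads cover every element."""
    return (element_count + group_size - 1) // group_size


def surplus_threads(element_count, group_size):
    """Threads launched beyond the last element."""
    return groups_needed(element_count, group_size) * group_size - element_count


groups = groups_needed(5, 16)     # 5 elements, assumed group size of 16
extra = surplus_threads(5, 16)
```

For the texture example the same formula gives `groups_needed(256, 8) == 32`, matching the 32 × 32 groups dispatched there.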

Now we need to get the modified data back into a form we can use in the script. Unlike the RenderTexture example above, a structured buffer has to be explicitly copied from GPU memory back to the CPU. In my experience, this is where you take the biggest performance hit when using compute shaders. The only ways I have found to mitigate it are to keep your buffers as small as you can without hurting usability, and to pull data out of the shader only when you really need it.

The code to get the data back to the CPU is very simple: all you need is an array of the same type to receive it. We modify the script above so that it writes the shader's results into a second array, as follows:

    public ComputeShader shader;

    void RunShader()
    {
        VecMatPair[] data = new VecMatPair[5];
        VecMatPair[] output = new VecMatPair[5];

        // INITIALIZE DATA HERE

        ComputeBuffer buffer = new ComputeBuffer(data.Length, 76);
        buffer.SetData(data); // upload the array so the shader has input to work on
        int kernel = shader.FindKernel("Multiply");
        shader.SetBuffer(kernel, "dataBuffer", buffer);
        shader.Dispatch(kernel, data.Length, 1, 1);
        buffer.GetData(output); // copy the results from GPU memory into the second array
        buffer.Release();       // free the GPU buffer once we are done with it
    }
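One way to picture the round trip is as a copy of raw bytes: the array is flattened into fixed-size records on the way in and rebuilt from them on the way out. The Python sketch below models that idea with the standard `struct` module. It is a conceptual illustration with made-up sample data, not a description of how Unity implements the transfer.

```python
import struct

RECORD = struct.Struct("<19f")   # 3 floats for the point + 16 for the matrix


def pack_pairs(pairs):
    """Flatten (point, matrix) pairs into one contiguous byte string."""
    return b"".join(RECORD.pack(*point, *matrix) for point, matrix in pairs)


def unpack_pairs(raw):
    """Rebuild the (point, matrix) pairs from the raw bytes."""
    return [(fields[:3], fields[3:]) for fields in RECORD.iter_unpack(raw)]


identity = (1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0,
            0.0, 0.0, 0.0, 1.0)
pairs = [((1.0, 2.0, 3.0), identity)] * 5
raw = pack_pairs(pairs)
restored = unpack_pairs(raw)
```

Five elements at 76 bytes each make 380 bytes to move, which is why keeping buffers small reduces the cost of the copy.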

　　
That's all there is to it. You should open the profiler to get a sense of how much time it takes to transfer data from the GPU to the CPU. But I find that if you use your compute shader on a large enough data set, the cost is worth it.

If you have any questions (or want to point out an error in this article), send me a message on Twitter. I won't write your shaders for you, but I'm happy to point you in the right direction. Happy shading!

Out of respect for others' work: reprints are welcome, but please credit the author, Esfog, and the original address: http://www.cnblogs.com/Esfog/p/Translation_BeginStart_ComputeShader.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
