CUDA-based Ray Tracing Algorithm

Last Update:2018-12-05 Source: Internet

Author: User


Ray tracing is one of today's mainstream rendering techniques. It simulates complex lighting effects naturally and produces high-quality images, and it is widely used in realistic rendering, virtual reality, visualization, and computer animation. However, the algorithm is computationally expensive, which limits its practical use. Most of the rendering time is spent computing intersections between rays and the scene, and many techniques have been studied to accelerate this step. Building a spatial data structure over the scene, such as a scene-based spatial hierarchy, has proved especially effective.

A hierarchical tree structure partitions and organizes the scene so that empty regions can be represented by large nodes, which reduces the cost of a ray passing through empty space. Commonly used spatial partitioning methods include uniform grids, bounding volume hierarchies, octrees, and KD-trees. These methods have been studied in earlier papers and achieve good speedups. The KD-tree is the most widely used and has been shown to give better results.

1. Scene Space Partitioning Based on KD-tree

To accelerate intersection tests between rays and the scene, the scene space must be partitioned. The main spatial partitioning techniques today are uniform grids, octrees, KD-trees, and hybrid structures built from them. Each structure has its own characteristics, and the method should be chosen according to the scene and the rendering task; no single partitioning method is optimal for every scene. The basic principle is nevertheless the same in all of them: the geometric primitives in the scene are abstracted and organized into higher-level groups, for example bounding boxes. A ray is first tested against a box, and if it misses the box, every primitive inside is excluded without being tested individually.
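The box test described above is commonly implemented as a "slab" test. The following C++ sketch is one possible version, not the article's own code; the types `Vec3`, `Ray`, `AABB` and the function `rayHitsBox` are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>

// Hypothetical helper types for illustration only.
struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };   // dir does not have to be normalized
struct AABB { Vec3 lo, hi; };        // axis-aligned bounding box

// Slab test: true if the ray enters the box for some t in [0, tMax].
bool rayHitsBox(const Ray& r, const AABB& b,
                float tMax = std::numeric_limits<float>::max()) {
    const float o[3]  = {r.origin.x, r.origin.y, r.origin.z};
    const float d[3]  = {r.dir.x, r.dir.y, r.dir.z};
    const float lo[3] = {b.lo.x, b.lo.y, b.lo.z};
    const float hi[3] = {b.hi.x, b.hi.y, b.hi.z};
    float tNear = 0.0f, tFar = tMax;
    for (int a = 0; a < 3; ++a) {
        if (d[a] == 0.0f) {
            // Ray is parallel to this slab: it misses unless the origin is inside.
            if (o[a] < lo[a] || o[a] > hi[a]) return false;
            continue;
        }
        float t0 = (lo[a] - o[a]) / d[a];
        float t1 = (hi[a] - o[a]) / d[a];
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);   // latest entry over all slabs
        tFar  = std::min(tFar, t1);    // earliest exit over all slabs
        if (tNear > tFar) return false;
    }
    return true;
}
```

If `rayHitsBox` returns false, none of the primitives stored under that box needs an individual ray-triangle test.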

1.1 Advantages of KD-tree

Different scenes may call for different partitioning methods, but the KD-tree is the mainstream choice for ray tracing, mainly for the following reasons:

1. A KD-tree is a special case of a BSP tree: it restricts the arbitrary splitting planes of BSP partitioning to axis-aligned planes, which reduces the primitive-splitting work needed during construction. It still keeps the advantages of a BSP tree: to a certain extent, heuristic partitioning can make the resulting binary tree as balanced as possible, which reduces the worst-case number of intersection tests between a ray and the partition structure and so improves efficiency.

2. Because a KD-tree is a binary tree, traversal is simpler than for a non-binary tree such as an octree, which broadens its range of use.

3. Finally, and importantly, the simplicity of binary tree traversal makes it easier to port to GPUs that do not support stack-based recursion.

Given these advantages, this article also uses a KD-tree to partition the scene.

1.2 KD-tree Creation

A KD-tree is built top-down and recursively: starting at the root node, space is subdivided level by level. The key operation during subdivision is choosing the splitting plane. The splitting plane of a KD-tree node has more freedom than the fixed planes of an octree or a uniform subdivision, but less than the arbitrarily oriented planes of a BSP tree. Because the plane is axis-aligned, its normal vector is fixed directly, and what remains is to position the plane along the chosen axis. The guiding principle is to keep the whole KD-tree balanced, which requires distributing the geometric primitives in the current node as evenly as possible between the two child nodes. For 3D scenes made of meshes, the primitives are mainly triangles, so the problem reduces to distributing the triangles in the current node.

Consider a KD-tree node containing N triangles, and suppose a splitting plane perpendicular to the X axis must be positioned. Relative to a candidate plane, a triangle is in one of three positions: it intersects the plane, it lies in front of the plane (in the direction of the normal), or it lies behind the plane. Non-intersecting triangles can be assigned directly when the child nodes are created; intersecting triangles must be split. Splitting a triangle by a plane may produce several triangles, and how it is split depends on the position of the plane. Taking the number of resulting sub-triangles into account when choosing the plane would add a great deal of computation. Within a node, the triangles that intersect the splitting plane are generally far fewer than those that do not, so their influence can be ignored when choosing the plane, and the plane can be found quickly when child nodes are created.

Along a given normal direction there are countless possible plane positions, so in practice only an approximation is feasible. Here we use a common method: the faces of each primitive's axis-aligned bounding box (AABB) serve as the candidate positions of the splitting plane. This fixes both the set of candidate planes and the selection criterion. The remaining step is to choose the best plane among the candidates.

Once the set of candidate splitting planes for the current node is complete, and the position of every triangle relative to each candidate plane is known, the optimal splitting plane Pb can be determined by the following formula:
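The formula itself is not reproduced in this copy of the article. A form consistent with the balancing criterion described in this section, offered as an assumption and not as the authors' exact expression, is shown here; the symbols N_front and N_back (the numbers of triangles lying entirely in front of and behind a candidate plane) are introduced for illustration.

```latex
P_b = \operatorname*{arg\,min}_{P_i,\; 1 \le i \le 2N}
      \bigl|\, N_{\mathrm{front}}(P_i) - N_{\mathrm{back}}(P_i) \,\bigr|
```

Triangles that straddle a candidate plane are left out of both counts, in line with the simplification described earlier.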

In the preceding formula, the number of candidate splitting planes is twice the number of triangles in the current node, because the AABB of each triangle contributes one candidate plane at each of its two ends along the split axis. Once the splitting plane is chosen, the split of the current node is effectively determined. Each triangle is then assigned to a child node according to its position relative to the plane, either directly or after being split. The same operation is applied recursively to the child nodes until leaf nodes are reached. Leaf nodes can be defined in several ways: every node beyond a fixed tree depth can be a leaf, or a node can become a leaf once it contains fewer than a given number of triangles. This article uses the second criterion. When a leaf node is reached, splitting stops.
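The selection step can be sketched in C++ as follows. This is a minimal illustration that assumes the balancing criterion described above (minimize the difference between the triangle counts on the two sides, ignoring straddling triangles); `Interval`, `chooseSplit`, and `isLeaf` are hypothetical names, and the brute-force loop is quadratic in N, whereas a practical builder would sort the candidates first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <initializer_list>
#include <vector>

// Extent of one triangle's AABB along the split axis (hypothetical type).
struct Interval { float lo, hi; };

// Returns the candidate position that best balances the triangles lying
// entirely behind and entirely in front of the plane.
float chooseSplit(const std::vector<Interval>& tris) {
    assert(!tris.empty());
    float best = tris[0].lo;
    long bestCost = -1;
    for (const Interval& c : tris) {
        for (float p : {c.lo, c.hi}) {          // 2N candidate planes in total
            long behind = 0, front = 0;
            for (const Interval& t : tris) {
                if (t.hi <= p)      ++behind;   // entirely behind the plane
                else if (t.lo >= p) ++front;    // entirely in front of it
                // otherwise the triangle straddles the plane and is ignored
            }
            const long cost = std::labs(front - behind);
            if (bestCost < 0 || cost < bestCost) { bestCost = cost; best = p; }
        }
    }
    return best;
}

// Leaf criterion used in the article: fewer triangles than a threshold.
bool isLeaf(std::size_t triangleCount, std::size_t maxLeafTriangles) {
    return triangleCount < maxLeafTriangles;
}
```

A recursive builder would call `chooseSplit` on each node, distribute the triangles to the two children, and stop recursing wherever `isLeaf` holds.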

2. Stackless Traversal of the Spatial Binary Tree

Once spatial partitioning has organized the 3D scene into a tree, ray-scene intersection benefits from the tree's ability to discard irrelevant nodes quickly, and the intersection algorithm becomes a tree traversal problem. In ray tracing, what is usually needed is the intersection with the primitive closest to the ray origin, so the partition tree is traversed depth-first: when the current node has both a child and a sibling, the child is visited next, not the sibling. Tree traversal is naturally recursive, applying the same operation to every node, and is easy to implement on a CPU with the help of a stack. The GPU targeted here (the GTX 260 generation), however, does not support such recursive calls because it has no comparable stack structure, so the traversal cannot be implemented with conventional recursion. This article uses a stackless method to traverse the spatial binary tree on the GPU.

2.1 Stackless Solution

The stackless solution in this article can also be seen as a simulated stack. Because traversal of the partition tree is depth-first, it involves backtracking from the current node to choose the next search path; that is, the parent node must repeatedly be recovered from a child node. On the CPU, recursion pushes each parent onto the stack in turn, so that when a child returns, its parent is obtained by popping the stack. No such hardware support is available on the GPU, but the process can be simulated: a stack-like structure built from a few variables records the current search path, and the recorded path is used to derive the path to search next. For example, a set of data can store the label of the current node and the state of its children. If one child has already been traversed, it need not be searched again when the traversal backtracks to this node; the same operation is then applied to the other child, which reproduces the backtracking behavior of depth-first search.

Figure 1. Stackless traversal of the spatial binary tree.

2.2 Implementation Details

The main strategy for stackless traversal is to build a virtual stack-like structure that records node information and steers the traversal. The state of each node must be recorded. During traversal, each child of a node is in one of two mutually exclusive states, traversed or not yet traversed, which is easily marked with a Boolean flag. Since each node of a binary tree has at most two children, there are at most two traversal directions at the current node, so two flags per node are enough to record the traversal progress there. The most compact and efficient representation of such a flag is a single bit, and the depth of a binary tree obtained by scene partitioning is bounded, so a fixed-width variable can represent the simulated stack for a tree up to a predictable depth. In the GPU-based stackless traversal used here, two 64-bit integers represent two simulated stacks of depth 64. Each bit records the state of one child of the node at the corresponding depth; at the current node, the child states are read from the simulated stacks with bitwise operations, and these control whether the traversal descends or backs up.
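The bit-stack idea can be sketched in C++ as follows. This is one possible reading of the scheme, not the article's code: it assumes each node stores the index of its parent, and the names `Node` and `traverse` are hypothetical. Bit d of each 64-bit word describes the node at depth d on the current path.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical node layout: indices into an array, -1 means "none".
struct Node { int parent, left, right; };

// Depth-first (pre-order) traversal without a stack or recursion.
// Returns the order in which nodes are visited.
std::vector<int> traverse(const std::vector<Node>& tree, int root) {
    std::vector<int> order;
    std::uint64_t leftDone = 0, rightDone = 0;   // the two simulated stacks
    int node = root, depth = 0;
    order.push_back(node);
    while (true) {
        const std::uint64_t bit = std::uint64_t{1} << depth;
        int next = -1;
        if (!(leftDone & bit)) {
            leftDone |= bit;                     // mark left child as handled
            next = tree[node].left;
        } else if (!(rightDone & bit)) {
            rightDone |= bit;                    // mark right child as handled
            next = tree[node].right;
        } else {
            // Both children handled: clear this level and back up.
            leftDone &= ~bit;
            rightDone &= ~bit;
            if (node == root) break;
            node = tree[node].parent;
            --depth;
            continue;
        }
        if (next >= 0) {
            assert(depth < 63);                  // 64 levels fit in 64 bits
            node = next;
            ++depth;
            order.push_back(node);
        }
    }
    return order;
}
```

In a ray tracer, the point where a node is appended to `order` is where the ray would be tested against that node's box; a subtree that the ray misses would be skipped by treating both of its child bits as already set.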

3. GPU-based Accelerated Ray Tracing Algorithm

Ray tracing is a classic algorithm in computer graphics. It simulates how images are formed in the real world and computes a color for each pixel. This raises three questions: how light propagates, where it travels, and how the point it reaches is shaded. When light strikes the surface of an object, it is reflected, transmitted, absorbed, and refracted, and handling all of these cases requires a complex illumination model. To simplify the implementation, this article uses the Phong illumination model familiar from the hardware rendering pipeline: at the intersection of a ray and the scene, specular and diffuse terms are computed from the object's material to determine the pixel color. Ray tracing therefore computes the direction of the ray through each screen pixel, traverses the scene along that direction, and then performs shading. There are two main problems:

1. Ray generation.

2. Intersection testing between rays and the scene.

A traditional CPU ray tracer can only handle these two problems serially: for each ray, the basic ray parameters are generated, the spatial partition structure is used to test the ray against the scene, and the pixel is shaded at the hit point. For reflection and refraction, secondary and tertiary rays are spawned at the hit point, the shading process is repeated, and the result is blended into the color of the original pixel. This yields high-quality images with reflection and refraction effects. The analysis above shows that pixels are shaded independently of one another, that is, rays of the same generation do not depend on each other, so the problem is highly parallel. The GPU implementation of ray tracing exploits exactly this property.
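The Phong shading step mentioned above can be sketched in C++ as follows. This is a minimal single-light, scalar-intensity version for illustration; the `Material` fields and the function `phongIntensity` are hypothetical names, and all direction vectors are assumed to be unit length.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical material: coefficients of the Phong model.
struct Material { float ambient, diffuse, specular, shininess; };

float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// n: surface normal, l: direction to the light, v: direction to the viewer.
// Returns ambient + diffuse + specular intensity for a light of unit strength.
float phongIntensity(const Material& m, const Vec3& n, const Vec3& l, const Vec3& v) {
    const float nl = dot(n, l);
    if (nl <= 0.0f) return m.ambient;            // light is behind the surface
    // Reflection of l about n: r = 2 (n . l) n - l
    const Vec3 r{2.0f * nl * n.x - l.x, 2.0f * nl * n.y - l.y, 2.0f * nl * n.z - l.z};
    const float rv = std::max(0.0f, dot(r, v));
    return m.ambient + m.diffuse * nl + m.specular * std::pow(rv, m.shininess);
}
```

A color version would apply the same expression per channel, and contributions from reflected and refracted rays would be blended in afterwards.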

Figure 2 shows the structure of the parallel GPU ray tracing algorithm:

Figure 2. GPU-based parallel ray tracing algorithm structure.

In general, the GPU implementation of ray tracing has the following parallel parts:

1. Parallel ray generation. The basic parameters of each ray are generated independently.

2. Parallel ray-scene intersection testing and shading. Since rays of the same generation do not depend on one another during shading, each ray can be processed in parallel.
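The independence of ray generation can be illustrated with a host-side C++ sketch. The pinhole camera at the origin looking down the negative z axis, and the names `primaryRay` and `generateRays`, are assumptions made for this example; in a CUDA version each loop iteration would typically become one thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };

// Ray through the center of pixel (px, py). It depends only on its own pixel
// coordinates, so every pixel can be processed independently.
Ray primaryRay(int px, int py, int width, int height, float fovY) {
    const float aspect = static_cast<float>(width) / static_cast<float>(height);
    const float halfH  = std::tan(0.5f * fovY);
    const float u = ((px + 0.5f) / width * 2.0f - 1.0f) * halfH * aspect;
    const float v = (1.0f - (py + 0.5f) / height * 2.0f) * halfH;
    const float len = std::sqrt(u * u + v * v + 1.0f);
    return Ray{{0.0f, 0.0f, 0.0f}, {u / len, v / len, -1.0f / len}};
}

std::vector<Ray> generateRays(int width, int height, float fovY) {
    std::vector<Ray> rays(static_cast<std::size_t>(width) * height);
    // No iteration reads another iteration's result.
    for (int i = 0; i < width * height; ++i)
        rays[i] = primaryRay(i % width, i / width, width, height, fovY);
    return rays;
}
```

The intersection and shading stage has the same shape: one independent unit of work per ray of the current generation.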

4. Optimization details

After the GPU-based ray tracing algorithm has been implemented as described, it can be optimized for specific requirements. Here we focus on a few details. Although the overall efficiency of an algorithm is determined by its architecture, detail-level optimization, especially optimization targeted at the GPU architecture, can also improve execution efficiency considerably.

1. When the KD-tree is built, uniform splitting is used first; once the number of triangles in a node drops to a given threshold, optimal splitting is used. This speeds up KD-tree construction.

2. The KD-tree data and the original scene geometry are stored by category in separate textures, and data of the same type is accessed through a texture reference. The texture cache can then be used to speed up data access. In addition, packing related data together reduces repeated accesses to the same data and lowers access latency.

5. Experimental Results and Conclusions

Finally, after implementing the algorithm and applying the optimizations above, we tested its overall performance experimentally. The test PC ran a 32-bit Windows operating system with an Intel Core Duo 2.8 GHz CPU, 2.0 GB of RAM, and a GTX 260 GPU. The core of the algorithm is the same in the GPU and CPU versions, and the code of each is optimized accordingly. Figure 3 shows ray-traced images for several typical test scenes:

Figure 3. Rendering results of GPU-based ray tracing for several typical scenes. The GPU-based ray tracing algorithm produces high-quality rendering results.

The efficient floating-point capability of the GPU thus yields high-quality rendered images. The algorithm is also considerably faster than ray tracing on a traditional CPU. Timing the rendering gives the execution times and speedup ratios listed in Table 1:

Table 1. Rendering times measured for several typical scenes. The GPU platform used in the test is a GTX 260. The table shows that, once properly parallelized and ported to the GPU, the algorithm achieves a promising speedup.

The experiments show that, once the ray tracing algorithm is reasonably parallelized, the GPU acceleration provided by CUDA can significantly improve the performance of traditional ray tracing, breathing new life into a classic algorithm that consumes large amounts of computing resources.

6. Future work

The efficient implementation of ray tracing on the GPU strongly supports the wider adoption of GPUs. Future work on implementing and improving ray tracing and other rendering algorithms will focus on the following aspects:

1. GPU memory access latency. Although textures mitigate this problem to some extent, latency still limits the algorithm's efficiency. One solution is to group rays into packets, so that several rays are traversed together instead of one at a time. This avoids repeated accesses to large amounts of data stored in global memory and improves execution efficiency.

2. Because the GPU accesses much of its data through textures, the hardware texture cache helps access efficiency, and making better use of it can reduce execution latency. Analysis shows that adjacent rays are often highly coherent, that is, they follow roughly the same route through the KD-tree, so resources can be reused when adjacent rays traverse the tree, which also improves efficiency. One direction for future work is to study this problem: by analyzing the rays and organizing them more sensibly, for example by sorting them by direction, the efficiency of the algorithm can be improved.

3. Using the highly parallel processing capability of the GPU, we will try to use CUDA to accelerate more computer graphics algorithms in future research and practice, such as photon mapping, radiosity, and other classical, time-consuming physically based algorithms.
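As a rough illustration of the ray-sorting idea in item 2, the following C++ sketch groups rays by the octant of their direction. The article does not specify a sorting key, so the octant key and the names `octant` and `sortByDirection` are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };

// Coarse direction key: one bit per negative component (values 0 to 7).
int octant(const Vec3& d) {
    return (d.x < 0.0f ? 1 : 0) | (d.y < 0.0f ? 2 : 0) | (d.z < 0.0f ? 4 : 0);
}

// Groups rays that point into the same octant next to each other, so that
// neighboring rays are more likely to visit the same KD-tree nodes.
void sortByDirection(std::vector<Ray>& rays) {
    std::stable_sort(rays.begin(), rays.end(),
                     [](const Ray& a, const Ray& b) {
                         return octant(a.dir) < octant(b.dir);
                     });
}
```

A finer key, such as a quantized direction combined with the ray origin, would group rays more tightly at the cost of a more expensive sort.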

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
